How to extract text from corrupt PAGES file

I'd really appreciate it if any of you know a way to extract text from a 9 MB Pages file that contains text and images and will not open:

Error message: Not a valid File format.


Tried changing file type and opening in various programs (Word, Acrobat, Google doc converter) but nothing will open it.

I have a backup at home, but I am away for several weeks and have done a lot of work on this file since the last backup.


Hope there's a solution! Thanks

JR

MacBook Pro, OS X Mavericks (10.9.5), Pages 09 v.4.3

Posted on Dec 19, 2015 5:25 PM


Dec 19, 2015 7:44 PM in response to julianDC

If this is actually a Pages v5 document, and it won't open in Preview, or open in Pages v5 while holding the shift key down — then you are done. The internal content in Pages v5 is in an unreadable/undecipherable, scrambled document format that nothing on earth can read except Pages v5.


If this is a Pages v5.5.2 or later document version, then it is by default in a single file format (compressed/renamed zip file), and no version of Pages v5 thru v5.2.2 on Mavericks was designed to read single file format documents that are not of Pages '09 origin. If it is a Pages document from Yosemite or El Capitan, you will need Pages v5.5.3 or later to change the file type to package format — which can be read by Pages v5+ on Mavericks.


If this is a Pages '09 document, there is a good possibility that your attempting to open, modify, or convert this document with all of those mentioned applications has damaged the document beyond repair. Try Preview, or the free LibreOffice (v5 or later), which can open very simple Pages '09 single-file format (not package folder) documents, sometimes with images. No guarantees though.

Dec 20, 2015 2:44 PM in response to julianDC

There is another possibility. Inside every Pages '09 document there is an index.xml file. It is mostly XML styling commands that are interspersed with the text content of your Pages document. To get at your actual text will require some Terminal commands, and the free TextWrangler. The latter will open the index.xml file as color-coded text, where black text is the actual content that you want to copy/paste into a new Pages document. The Terminal commands are necessary to isolate the index.xml file outside of the Pages document, so you can open it.
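If picking out the black text by eye proves tedious, a rough alternative is to strip the tags in Terminal. The following is only a sketch: strip_tags is a hypothetical helper name, it assumes each tag sits on a single line, and it leaves XML entities such as &amp; untouched. It may also keep short strings that are not part of your document body, so treat the output as raw material to review.

```shell
#!/bin/sh
# Sketch only: strip_tags is a hypothetical helper, not part of Pages or TextWrangler.
# Reads XML on standard input and writes the text found between tags to standard output.
strip_tags() {
    # 1. Break the stream at every "<", so each line begins with a tag body.
    # 2. Delete everything up to and including the first ">" on each line.
    # 3. Drop lines that are empty or contain only whitespace.
    tr '<' '\n' | sed -e 's/^[^>]*>//' -e '/^[[:space:]]*$/d'
}
```

For example, once index.xml is available on standard output, something like `gunzip -c ./damaged.pages/index.xml.gz | strip_tags > recovered.txt` would leave the plain text in recovered.txt (a file name chosen here for illustration).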


The usual caveats exist. If your damaged Pages document is an archive, and the archive is damaged, then you are likely done before the following steps would apply. If for some reason, the index.xml file is damaged, the same applies.


  1. Get TextWrangler v5.0.2 (or later) from the vendor site, not the OS X App Store.
    1. Reason: The vendor site version has a command-line tools installer on the TextWrangler menu that Apple prevents them from uploading to the App Store. Once the command-line tools are installed, you can invoke TextWrangler from the Terminal as edit index.xml.
    2. Launch the TextWrangler installer and install it.
    3. Launch TextWrangler from Launchpad, and when you get that unrecognized developer dialog, click Open.
    4. From the TextWrangler menu, choose Install command line tools (edit, twdiff, etc.). These go into /usr/local/bin.

  2. In Finder, just drag your "damaged" Pages document to the Desktop.
  3. From Launchpad, click Other, and then click Terminal.
    1. In Terminal, type the following to make the Desktop your current directory:
      cd ~/Desktop
    2. Now, determine whether that Pages document is a zipped file or a folder:
      file ./damaged.pages
    3. If it is a directory, then type the following to uncompress and open the index.xml file in TextWrangler:
      gunzip -c ./damaged.pages/index.xml.gz | edit -

      Document images will be inside of the damaged.pages directory, and can be copied to the Desktop.

    4. If it is a Zip Archive then type the following to extract the index.xml file and open it with TextWrangler:
      unzip -p ./damaged.pages index.xml | edit -
      1. To locate document images in the archive:
        unzip -l ./damaged.pages
      2. And to extract multiple images to the Desktop:
        unzip ./damaged.pages "*.jpg"
    5. Select and copy only the black text of your original content to your new Pages document. You will have to reformat it as needed.
  4. When you are done, use the File menu : Close Document in TextWrangler, or it will open that file the next time you launch TextWrangler.
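Steps 3.2 through 3.4 can also be combined into one small shell function. This is a minimal sketch, not a tested recovery tool: extract_index is a hypothetical name, and it assumes (as in the steps above) that a package folder holds index.xml.gz while a Zip-format document holds index.xml. It tests the archive first with unzip -t, so a damaged archive is reported instead of producing an empty document.

```shell
#!/bin/sh
# Sketch only: extract_index is a hypothetical helper name.
# Usage: extract_index path/to/document.pages > index.xml
extract_index() {
    doc=$1
    if [ -d "$doc" ]; then
        # Package (folder) format: index.xml is stored gzip-compressed inside.
        gunzip -c "$doc/index.xml.gz"
    elif [ -f "$doc" ] && unzip -tq "$doc" >/dev/null 2>&1; then
        # Single-file format whose Zip archive passes the integrity test.
        unzip -p "$doc" index.xml
    else
        echo "extract_index: $doc is missing or not a readable Zip archive" >&2
        return 1
    fi
}
```

Because the function writes to standard output, it can be piped the same way as the commands above, for example `extract_index ./damaged.pages | edit -`.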

Dec 20, 2015 2:49 PM in response to VikingOSX

Incredible advice, Viking. Thank you very much.

Downloaded TextWrangler as advised.

Typed as you advised:

got the following error message


"This file doesn’t appear to contain a valid ‘shebang’ line (application error code: 13304)"


Not sure if this means a permanent corruption, and trying to learn a little about "shebang", which I never connected with computers before now!


Thank you so much


Julian

Dec 20, 2015 3:17 PM in response to VikingOSX

Well, I completely misunderstood your instructions!

But I got part way, and then encountered this:

FSG-L30456:Desktop julianraby$ file Lionproblemversioncopy/damaged.pages

Lionproblemversioncopy/damaged.pages: cannot open `Lionproblemversioncopy/damaged.pages' (No such file or directory)


I wonder if I have made too many changes to the filename and type, a problem you raised yesterday?

Anyway, couldn't get to see if file is zipped file or folder.


Don't want to trouble you further, unless there seems to be a simpler solution. I'll try rewriting the article! But thanks for all the help.


Julian

Dec 21, 2015 12:54 AM in response to VikingOSX

Thanks, Viking. You're very patient!


Got this:

FSG-L30456:Desktop julianraby$ file ./Lionproblemversioncopy.pages

./Lionproblemversioncopy.pages: Zip archive data, at least v2.0 to extract


Then this:

FSG-L30456:Desktop julianraby$ unzip -p ./Lionproblemversioncopy.pages index.xml.gz | edit -

[./Lionproblemversioncopy.pages]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.

Dec 21, 2015 1:48 PM in response to VikingOSX

Thanks, Viking. No luck, sadly:

Got this message


FSG-L30456:Desktop julianraby$ unzip -p ./Lionproblemversioncopy.pages index.xml | edit -

[./Lionproblemversioncopy.pages]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
edit: Warning: no data available on standard in


Never mind. Thank you for all your help.


Julian
