Convert PDF to Numbers or Excel spreadsheet?

Question

Level 1

7 points

I have a PDF file that's over 200 pages long, with a list of tens of thousands of items. The file needs to be converted to a Numbers '09 or an Excel 2011 spreadsheet, but the software I've demoed so far either messes up the OCR scanning or doesn't understand where the columns need to be. Is there a way I can convert the PDF (which was originally an Excel file that I no longer have access to) to a spreadsheet? Someone suggested MacVim, but I'm unfamiliar with it. Thanks for any advice!

Posted on Apr 18, 2011 5:38 PM

Answer 1

Best reply

Barry

Level 9

59,861 points

Apr 18, 2011 7:22 PM in response to smactrox

Adobe Acrobat or Adobe Distiller might be able to do that, but I've had no experience with either, and the price of admission is pretty high.

"PDF OCR X" comes in both free and paid versions. The free version is restricted to single-page documents, but that should be sufficient to test its OCR capabilities and handling of columns. I checked this out on a single-page Numbers table printed to PDF, and wasn't happy with the results.

An alternative might be to print the PDF document and use a 'regular' OCR application to scan the printout.

Or, if the text itself is selectable and copyable, copy it from the PDF, paste it into Pages (or TextEdit) to clean up the columns, then transfer it to your spreadsheet via copy/paste. I had reasonable success with this using the same PDF as noted above, opened in Preview. Not perfect, but most columns were separated by two spaces and were thus distinguishable from single spaces within a cell. The header column, though, came out as separate rows.

Not what I'd call a solution, but I hope this helps you find one.

Regards,

Barry

Answer 2

smactrox Author

Level 1

7 points

Apr 18, 2011 9:12 PM in response to Barry

I'd love to just copy and paste but it's limited to the first column in a Numbers or Excel spreadsheet. Any suggestions to have it match the format of the original PDF?

Answer 3

Barry

Level 9

59,861 points

Apr 18, 2011 10:49 PM in response to smactrox

I'm not sure I understand what you mean by "it's limited to the first column in a Numbers or Excel spreadsheet." When describing actions, it's useful to avoid pronouns and use the actual term to identify what 'it' is.

My guess is you mean that when you copy the text from the PDF, then paste it into a spreadsheet, the text all goes into the first column of the spreadsheet. If so, that's expected behaviour. The text in the PDF has no 'special' delimiter telling where one cell's content ends and the next cell's content begins.

That's the reason for initially pasting not into a spreadsheet, but into a text editor where you can determine if there is a discernible separator (in my test document, most but not all data items were separated from the one following by two spaces) that can be replaced with a tab character using Find/Replace.

If your data itself contains no spaces that are part of the actual data, your task is made somewhat easier. Use Find/Replace (in the text editor) to find all occurrences of two spaces together and replace them with a single space. Repeat until zero occurrences are found, then Find the single spaces and Replace them with single tabs. Select All, copy, go to your spreadsheet, click once on the top left cell where the data is to start and Paste.
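For a file this size, the same Find/Replace steps could be scripted instead of done by hand. What follows is only a rough sketch in Python, assuming the copied text has been saved as plain text and that no cell contains a space. The function name and the sample rows are made up for illustration; they are not from the original PDF. Trimming spaces at the ends of each line is an extra step not mentioned above, added so that stray leading spaces don't produce empty cells.

```python
import re


def spaces_to_tabs(text: str) -> str:
    """Turn every run of spaces into a single tab, line by line.

    Equivalent to collapsing repeated spaces down to one space and then
    replacing each remaining space with a tab. Only safe when no cell
    contains a space of its own.
    """
    rows = []
    for line in text.splitlines():
        # Trim spaces at the ends so they don't become empty cells.
        trimmed = line.strip(" ")
        rows.append(re.sub(r" +", "\t", trimmed))
    return "\n".join(rows)


# Made-up sample rows with inconsistent spacing between columns.
sample = "1001  Widget   4.50\n1002 Gadget  12.00"
print(spaces_to_tabs(sample))
```

The tab-separated result can then be pasted into the top left cell of the spreadsheet as described above.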

If your data itself contains spaces (e.g., one cell contains the data "Thornton W. Burgess"), your task is more difficult, especially if you find (as I did) that spacing between 'cells' was not consistent, and sometimes was only a single space character. In this situation, you'll have some initial searching and manual insertion of an extra space to ensure that each piece of data has at least two spaces separating it from the next. When you've done that, use Find/Replace to reduce all series of contiguous spaces to a maximum of two, then Find/Replace to replace all occurrences of two contiguous spaces with a single tab character. Then Select All, Copy and Paste as above.
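The two-space version can be sketched the same way. This is a rough Python illustration, not a complete solution: it assumes columns are separated by two or more spaces, and it cannot repair rows where only a single space separates two cells. What it can do is point out which rows ended up with the wrong number of columns, so the manual searching is limited to those. The function names, the expected column count, and the sample rows are all hypothetical.

```python
import re


def double_spaces_to_tabs(text: str) -> str:
    """Replace each run of two or more spaces with one tab.

    Single spaces are left alone, so a cell such as
    "Thornton W. Burgess" stays in one piece.
    """
    rows = []
    for line in text.splitlines():
        rows.append(re.sub(r" {2,}", "\t", line.strip(" ")))
    return "\n".join(rows)


def rows_needing_attention(tsv: str, expected_columns: int) -> list:
    """Return 1-based numbers of non-blank rows with the wrong column count."""
    flagged = []
    for number, line in enumerate(tsv.splitlines(), start=1):
        if line and line.count("\t") + 1 != expected_columns:
            flagged.append(number)
    return flagged


# Made-up sample: the second row has only one space after the author,
# so that separator cannot be detected automatically.
sample = (
    "Thornton W. Burgess  Old Mother West Wind   1910\n"
    "Beatrix Potter The Tale of Peter Rabbit  1902"
)
converted = double_spaces_to_tabs(sample)
print(converted)
print(rows_needing_attention(converted, expected_columns=3))
```

Rows reported by the second function still need the extra space inserted by hand in the text file before converting again.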

Easier by far, of course, if it can be done, is to arrange to get a copy of the original Excel file. But failing that, you need to work with what you've got, and what you've got in a PDF is a visual representation of what the original would look like on paper. The representation may contain the original text, but won't include the delimiters in a form which you can extract easily.

Regards,

Barry
