Converting PDF's to Text

I have a huge collection of documents I want to digitize. I just bought an Epson scanner, so I can scan the documents in a variety of formats, including .jpg, .tiff and pdf. Unfortunately, I can't get the OCR software that came with the scanner (ABBYY FineReader Sprint) to work.


Then I remembered seeing PDF converters online, so I figured I could just scan everything as a PDF, then convert it to text. But I'm confused. I tried Adobe Acrobat's export function, but that didn't do anything. I read that you can open a PDF in Preview and copy the text, but that doesn't work.


It sounds like there are two ways to create a PDF. With OCR software you can create a text-PDF, whereas I apparently have a scanned-image-PDF, if I understand correctly.
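A rough way to tell the two kinds of PDF apart is sketched below. It is only a heuristic, under the assumption that a scanned-image PDF embeds image objects but no fonts, while a text PDF (including one with an OCR text layer) references at least one font. The function name is made up for illustration, and PDFs that store their objects in compressed streams can fool a raw byte search like this.

```python
import re
from pathlib import Path


def looks_like_image_only_pdf(data: bytes) -> bool:
    """Heuristic guess: True if the PDF bytes mention images but no fonts.

    Assumption: a scan without OCR contains /Subtype /Image objects and no
    /Font entries. This is a byte search, not a real PDF parser.
    """
    has_font = b"/Font" in data
    has_image = re.search(rb"/Subtype\s*/Image", data) is not None
    return has_image and not has_font


def check_file(path: str) -> bool:
    """Convenience wrapper that reads a file from disk and applies the heuristic."""
    return looks_like_image_only_pdf(Path(path).read_bytes())
```

If this returns True for a file, copying text out of it in Preview will most likely give nothing, and the file needs an OCR pass first.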


Anyway, I'm confused. Can anyone recommend a software program or online service that will convert PDF's to text on a Mac? I'm also interested in learning how to batch process PDF's. I'm going to have hundreds of documents, maybe a few thousand.


Thanks for any tips.

Posted on Apr 5, 2013 5:26 PM

Question marked as Best reply

Posted on Apr 5, 2013 5:55 PM

Not sure on the PDF to Text solution, however once you have that squared away...


I'm also interested in learning how to batch process PDF's. I'm going to have hundreds of documents, maybe a few thousand.

You should use Automator for this. Here is an example to do what you want in a batch process:


[Two user-uploaded screenshots; images not preserved]
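For anyone who prefers a script to Automator, the same batch idea can be sketched in Python. This is not the workflow from the screenshots above; it is a minimal sketch that assumes you already have some way to turn one PDF into text, passed in as the `convert` argument. The names `batch_convert` and `convert` are hypothetical.

```python
from pathlib import Path
from typing import Callable, List


def batch_convert(src_dir: str, dst_dir: str,
                  convert: Callable[[Path], str]) -> List[Path]:
    """Run `convert` on every PDF under src_dir and save the text under dst_dir.

    `convert` is whatever PDF-to-text step you have working (an assumption
    here, not a specific product). Files that already have a .txt output are
    skipped, so an interrupted run over thousands of files can be resumed.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    # Match .pdf and .PDF alike, and keep the folder layout of the source.
    pdfs = sorted(p for p in src.rglob("*") if p.suffix.lower() == ".pdf")
    for pdf in pdfs:
        out = dst / pdf.relative_to(src).with_suffix(".txt")
        if out.exists():
            continue
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(convert(pdf), encoding="utf-8")
        written.append(out)
    return written
```

The function returns the list of text files it wrote, which makes it easy to see how far a run got before rerunning it.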

9 replies

Apr 5, 2013 8:00 PM in response to David Blomstrom

David Blomstrom wrote:


...Can anyone recommend a software program or online service that will convert PDF's to text on a Mac? I'm also interested in learning how to batch process PDF's. I'm going to have hundreds of documents, maybe a few thousand.


Thanks for any tips.

Since being able to scan combined with OCR will be the easiest approach, getting the OCR software to work would seem to be the best solution. Which version of FineReader Sprint do you have? There's a version in the App Store (https://itunes.apple.com/app/abbyy-finereader-express/id412310371?mt=12) which is supposed to be compatible with Lion and Mountain Lion and if that's not the version you have, perhaps you can upgrade to it. Check out http://www.abbyy.com/checkforupdates/?PartNumber=71817&product=FineReader%20Express%20Edition%20for%20Mac which might do the trick.


Unless the scanning process has the OCR step built in, what you'll get is an image, usually JPG, which can be turned into a PDF file but it's still just a picture. If you could turn it into a PDF that has actual text in it, then you can get into text extraction. There are a number of programs which are supposed to be able to do that. The only one I've tried that works pretty well is MS Word in the Office 2013 suite for Windows. I have it running in a Windows 8 Virtual Machine on the Mac, but that's a long and expensive way around to begin to do what you need.

Apr 5, 2013 9:01 PM in response to FatMac-MacPro

Thanks for the tip; I'll have to check out the App Store's version.


I downloaded a trial version of ABBYY Express. I can open it, but it apparently won't connect with my printer. I don't understand why people sell scanners, then make you jump through all kinds of hoops figuring out what kinds of drivers and utilities you have to download to make'em work. Sheez.

Apr 5, 2013 9:04 PM in response to FatMac-MacPro

I should also mention that ABBYY Express can extract text from an image. However, as the previous poster suggested, that's a much more tedious process than simply scanning it as text the first time around.


On the other hand, the fastest way to do a project like this in the long run might be to just scan everything as images or PDF's, then convert them to text by batch processing them.


Right now, it's incredibly confusing, though.

Apr 6, 2013 9:23 AM in response to David Blomstrom

David Blomstrom wrote:


...On the other hand, the fastest way to do a project like this in the long run might be to just scan everything as images or PDF's, then convert them to text by batch processing them...

If your purpose is to archive pre-existing documents rather than modify them, why is it necessary to convert them to editable text to begin with? If the scanner software can't output as PDF directly, Preview can open a JPG and export it in PDF format. It can also import from a scanner, although its help system doesn't say how.

Apr 6, 2013 10:14 AM in response to David Blomstrom

David Blomstrom wrote:


...I downloaded a trial version of ABBYY Express. I can open it, but it apparently won't connect with my printer. I don't understand why people sell scanners, then make you jump through all kinds of hoops figuring out what kinds of drivers and utilities you have to download to make'em work. Sheez.

I don't know if this is relevant or exactly what it means but I ran into the following yesterday while researching ABBYY OCR software:


"Important! ABBYY FineReader Sprint 8.0 Mac Edition does not support scanners, cameras and fax modems that use emulated drivers. On the Intel platform, FineReader Sprint will only work with devices for which Intel drivers are installed."


That may be saying it's more likely that the OCR software would work with whatever drivers Epson provides with the scanner than with what Apple provides. I've been that route with older printers which have been abandoned by their manufacturer (Epson, as a matter of fact) for software updates and the older software will barely work with newer Mac OS's. The Apple software will make the printer work but not with all the features that Epson built into its software when the printers were new. As a result, I've avoided the Apple printer updates, and stuck with the aging Epson software.

Feb 19, 2014 8:47 AM in response to David Blomstrom

Adobe's Acrobat Pro has built-in OCR and it does, like all OCR software I've used, work to a degree. OCR is an imperfect process. The accuracy depends in large part on the quality of the original scan, and it can be a long, difficult process to get an image of the text that the software can recognize accurately. You will have the best luck with a 1-bit scan (black and white only, no gray or other color) at a fairly large size with high DPI. Even then, you can expect to spend a lot of time proofreading and correcting the extracted text.
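To make "1-bit" and "high DPI" concrete, here is a small sketch. It assumes grayscale pixels in the range 0 to 255 and a cutoff of 128, and it uses 300 DPI as the example resolution; none of those numbers come from the post above, and scanner software normally does this step for you.

```python
def to_one_bit(gray_rows, threshold=128):
    """Turn grayscale pixels (0-255) into 1-bit pixels: 0 = black, 1 = white.

    The threshold of 128 is an assumed midpoint, not a recommended setting.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray_rows]


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page: inches multiplied by dots per inch."""
    return (round(width_in * dpi), round(height_in * dpi))


# A US Letter page at an assumed 300 DPI:
# page_pixels(8.5, 11, 300) -> (2550, 3300)
```

Every gray value ends up as either pure black or pure white, which is why faint or smudged originals can lose detail at this step and need a careful choice of scan settings.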
