
Find/identify pdfs with corrupted text layers?

Does anyone know a good way to check PDF files and find out which ones have corrupted text layers?


For example, when perfectly good text has been replaced with a mess like this:




.1918., « » .!

, " # , " # $ % - "» # # "«& ' »( ,

# " ( .&..).


Sometimes the originals have bad text layers and require OCR. Sometimes other applications, such as Preview or Ghostscript, corrupt the text layers, but it's hard to tell that this has happened until I need to search the text or copy and paste it into translation software.

MacBook Air (11-inch Mid 2013), macOS Sierra (10.12.6)

Posted on Mar 17, 2018 11:30 AM

Question marked as Best reply

Mar 17, 2018 2:26 PM in response to Marja E

A PDF has no separate text layer, only indexed objects of various content types that can occur anywhere in the file. Short of writing a custom application that opens the PDF and walks the document index requesting a specific object type, I am not aware of any way to isolate the “corrupted text” PDF objects.

Some of the content in a PDF can be encrypted, or stored in a different encoding, which makes the approach in the first paragraph even more onerous.
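
A cheaper check than walking the PDF objects is to judge the extracted text itself. The following is only a rough sketch in Python, not a tested tool: it assumes the text has already been pulled out of the PDF by some other means, and the names `letter_ratio` and `looks_garbled`, as well as the 0.6 threshold, are made up for illustration.

```python
import unicodedata


def letter_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    # Unicode general categories starting with "L" are letters, "N" are numbers.
    wordlike = sum(1 for ch in visible if unicodedata.category(ch)[0] in ("L", "N"))
    return wordlike / len(visible)


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """Flag text that is mostly punctuation and symbols.

    The threshold is an assumption and needs tuning on real files.
    Empty text is flagged too, since it suggests the PDF needs OCR.
    """
    return letter_ratio(text) < threshold


# The corrupted sample from the question, joined onto one line.
sample = '.1918., « » .! , " # , " # $ % - "» # # "«& \' »( , # " ( .&..).'
print(looks_garbled(sample))
print(looks_garbled("Perfectly good text reads like this."))
```

Because it counts letters in any script, this only catches text that has collapsed into punctuation, like the sample above. Text that was remapped to the wrong letters would pass.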

Mar 17, 2018 2:26 PM in response to VikingOSX

I can search for text content (such as "the") with Finder search, and for PDF producer and encoding information (such as "Ghostscript 9.22" or "jpxdecode") with EasyFind.


If I could search for the absence of common English words (such as "the" or "their") and the presence of certain PDF producer or encoding strings (such as "Ghostscript 9.22") in a single search, I could immediately screen out most English-language PDFs whose text is intact, and I would have an easier time checking the remaining ones.


I don't know whether any tool supports all of those criteria in one search, though.
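
The combined search described above could be scripted. Here is a minimal sketch in Python, under a few assumptions: text is extracted with Poppler's `pdftotext` command-line tool (which would have to be installed separately), the marker strings are searched in the raw file bytes much as a file-content search would do, and the word list and the names `suspects`, `lacks_common_words`, and `has_marker` are invented for this example.

```python
import re
import subprocess
from pathlib import Path

# Assumed screening criteria; adjust both to the collection being checked.
COMMON_WORDS = {"the", "and", "of", "their"}
MARKERS = (b"ghostscript 9.22", b"jpxdecode")  # compared case-insensitively


def pdftotext(path: Path) -> str:
    """Extract text by running `pdftotext FILE -`, which writes to stdout."""
    result = subprocess.run(["pdftotext", str(path), "-"], capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def lacks_common_words(text: str, words=COMMON_WORDS) -> bool:
    """True when none of the common English words occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens.isdisjoint(words)


def has_marker(raw: bytes, markers=MARKERS) -> bool:
    """True when any marker string occurs in the raw PDF bytes."""
    lowered = raw.lower()
    return any(marker in lowered for marker in markers)


def suspects(paths, extract=pdftotext) -> list:
    """Return the PDFs that carry a marker but contain no common words."""
    flagged = []
    for path in paths:
        if has_marker(path.read_bytes()) and lacks_common_words(extract(path)):
            flagged.append(path)
    return flagged


# Example use on the current folder:
#     for pdf in suspects(sorted(Path(".").glob("*.pdf"))):
#         print(pdf)
```

The raw byte search can miss markers that sit inside compressed streams, and a PDF in another language will be flagged even when its text is fine, so the result is a shortlist to check by hand, not a verdict.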
