Find/identify pdfs with corrupted text layers?

Question

Marja E Author

Does anyone know a good way to check PDF files and find out which ones have corrupted text layers?

For example, when perfectly good text has been replaced with a mess like this:

.1918., « » .!

, " # , " # $ :«% - "» # # "«& ' »( ,

# " ( .&..).

Sometimes the originals have bad text layers and need OCR. Sometimes other applications, such as Preview or Ghostscript, corrupt the text layers, but it's hard to tell that this has happened until I need to search a file or copy and paste from it into translation software.

MacBook Air (11-inch Mid 2013), macOS Sierra (10.12.6)

Posted on Mar 17, 2018 11:30 AM

Answer 1

Best reply

VikingOSX

Mar 17, 2018 2:26 PM in response to Marja E

There are no layers in a PDF, only random indexes to types of content that can occur anywhere in the PDF. Short of writing a custom application that opens the PDF and walks down the document index requesting a specific object type, I am not aware of any other way to isolate the “corrupted text” PDF objects.

Some of the content in the PDF can be encrypted, or can appear in a different encoding, which makes the task described above even more onerous.

Answer 2

Marja E Author

Mar 17, 2018 2:26 PM in response to VikingOSX

I can search for text contents (such as "the") using Finder search, and search for PDF encoding information (such as "Ghostscript 9.22" or "jpxdecode") using EasyFind.

If I could combine both in one search, looking for the absence of common English words (such as "the" or "their") in the text together with the presence of certain PDF encoding information (such as "Ghostscript 9.22"), then I could immediately screen out most English-language PDFs whose text is not corrupted, and I should have an easier time checking the remaining PDFs.

I don't know whether any tool supports all of those parameters, though.
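
The combined search described above could be approximated with a short Python script. This is only a rough sketch under several assumptions: that Poppler's pdftotext command is installed and on the PATH, that the marker string (here "Ghostscript 9.22") is stored uncompressed in the file so a raw byte search can see it, and that a handful of common words is enough to recognise readable English. The function names, the word list, and the min_hits threshold are all made up for illustration.

```python
import re
import subprocess
from pathlib import Path

# Assumed word list and marker strings; adjust both for your own files.
COMMON_WORDS = {"the", "and", "of", "to", "in", "their"}
MARKERS = (b"Ghostscript 9.22",)


def common_word_hits(text, words=COMMON_WORDS):
    """Count how many tokens in the text are common English words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in words)


def has_marker(raw, markers=MARKERS):
    """Return True if any marker byte string occurs in the raw file bytes."""
    return any(marker in raw for marker in markers)


def is_suspect(text, raw, min_hits=3):
    """Flag a file that carries a marker but has almost no common words."""
    return has_marker(raw) and common_word_hits(text) < min_hits


def extract_text(path):
    """Extract text with pdftotext; '-' sends the output to stdout."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True, text=True, errors="replace", check=False,
    )
    return result.stdout


def scan(folder):
    """Return the PDFs under folder that look like they need a manual check."""
    suspects = []
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        if is_suspect(extract_text(pdf), pdf.read_bytes()):
            suspects.append(pdf)
    return suspects
```

Calling scan() on a folder returns the files that contain the marker but almost none of the common words, which is the short list that still has to be checked by hand. It will also flag non-English PDFs and scans with no text at all, and it will miss files where the producer information is stored in a compressed stream.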
