Question: Find/identify pdfs with corrupted text layers?

Does anyone know a good way to check PDF files and find out which ones have corrupted text layers?


For example, when perfectly good text has been replaced with a mess like this:




.1918., « » .!

, " # , " # $ % - "» # # "«& ' »( ,

# " ( .&..).


Sometimes the originals have bad text layers and require OCR. Sometimes other applications, such as Preview or Ghostscript, corrupt the text layer, but it's hard to tell that this has happened until I need to search the file or copy and paste from it into translation software.
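One rough way to describe the symptom in code: text that has already been extracted from a page and is mostly punctuation, like the sample above, is suspect. The sketch below is only a heuristic. The function names and the 0.6 threshold are assumptions chosen for illustration, and it does not read PDFs itself; it only scores a string.

```python
def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    wordlike = sum(1 for ch in visible if ch.isalnum())
    return wordlike / len(visible)


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """Flag text dominated by punctuation and symbols.

    The threshold is an arbitrary starting point, not a tested value.
    Empty text is flagged too, since it may mean there is no text layer.
    """
    return letter_ratio(text) < threshold
```

A page whose extracted text scores below the threshold would still need a manual look, since tables of figures or non-Latin scripts may behave differently.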

MacBook Air (11-inch Mid 2013), macOS Sierra (10.12.6)

Mar 17, 2018 2:26 PM in response to Marja E

There are no layers in a PDF, only indexes to objects of various content types, which can occur anywhere in the file. Short of writing a custom application that opens the PDF and walks the document index, requesting a specific object type, I am not aware of any means of isolating the “corrupted text” PDF objects.


Some of the content in the PDF can be encrypted, or can appear in a different encoding, which makes the approach described above an even more onerous task.

Mar 17, 2018 2:26 PM in response to VikingOSX

I can search for text contents (such as "the") using Finder search, and I can search for PDF encoding information (such as "Ghostscript 9.22" or "jpxdecode") using EasyFind.


What I would like is a single search that tests for the absence of common English words in the text (such as "the" or "their") together with the presence of certain PDF encodings (such as "Ghostscript 9.22"). That would immediately screen out most English-language PDFs whose text is not corrupted, and I should have an easier time checking the remaining PDFs.


I don't know whether any tool supports all of those parameters, though.
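The combined search could be sketched in a short script. This is a minimal sketch under several assumptions, not a finished tool: all function names and the word list are hypothetical, the text extraction step assumes the `pdftotext` command from Poppler is installed separately, and the producer check is a plain byte search of the file, which will miss a string stored inside a compressed stream or in a different encoding.

```python
import re
import subprocess
from pathlib import Path

# Small illustrative word list; an assumption, extend as needed.
COMMON_WORDS = {"the", "and", "of", "to", "their"}


def has_common_words(text: str, minimum: int = 2) -> bool:
    """True if at least `minimum` distinct common English words appear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & COMMON_WORDS) >= minimum


def mentions_producer(pdf_path, marker: bytes = b"Ghostscript 9.22") -> bool:
    """Raw byte search of the file, similar to a file-contents search."""
    return marker in Path(pdf_path).read_bytes()


def extract_text_with_pdftotext(pdf_path) -> str:
    """Assumes Poppler's pdftotext is on PATH; "-" sends text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def needs_manual_check(pdf_path, extracted_text: str,
                       marker: bytes = b"Ghostscript 9.22") -> bool:
    """Producer string present, but no common English words in the text."""
    return mentions_producer(pdf_path, marker) and not has_common_words(extracted_text)
```

To screen a folder, one could loop over `Path(folder).glob("*.pdf")`, pass each file's extracted text to `needs_manual_check`, and open only the flagged files by hand. Non-English PDFs would be flagged as well, so the word list would need adjusting for them.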
