Calibre, GhostScript and creating searchable PDFs...?

Question

Apologies in advance for the stretch on category: Calibre and Ghostscript are not, per se, Mac technologies, but the following questions are Mac-based, specifically on Catalina. In a search of the Web, I was unable to locate a community or forum specific to Ghostscript, although there is one for Calibre.

The Web search for Ghostscript did find numerous questions and answers in this Apple subforum, so here I am. If any of you can direct me to a Ghostscript forum, please do.

One of the processes I use as a researcher is to place all of my source material into a digital archive, subdivided into one directory for books and another for papers. That categorization may seem arbitrary, as many papers are as long as books. The separation has more to do with the process required to get everything into a single, consistent format that allows global searches across the entire archive.

Papers are rarely a problem, but digital books are an entirely different species: they require a good deal of work before they can be searched accurately, primarily to ensure that a digital work is not missed.

99 percent of the time, research papers come in the form of PDFs. Don't get me wrong, they do have problems, but these are mostly internal, small and predictable. Digital books come in eBook formats, overwhelmingly .epub, as well as .mobi, .djvu and a smattering of PDFs.

Two tests will quickly indicate whether a PDF can be searched. First, choose any page and double-click a word: in an unsearchable PDF the word is not selected, just a single character. Second, type a word you see on the page into the PDF search field: an unsearchable PDF returns "No Results Found."
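For anyone who would rather run a version of these checks over many files than click through each one, here is a rough sketch. It assumes the pdftotext tool from Poppler is installed (for example through Homebrew), which the original post does not use. The names count_alnum and check_pdf and the 20-character threshold are made up for illustration. It only looks at the first five pages and only counts ASCII letters and digits, so treat the result as a hint rather than a verdict.

```shell
#!/bin/sh
# Sketch: guess whether a PDF has an extractable text layer.
# Assumption: pdftotext (Poppler) is on PATH. MIN_CHARS is an arbitrary cutoff.

MIN_CHARS=20

# Count the ASCII letters and digits arriving on stdin.
count_alnum() {
    LC_ALL=C tr -cd '[:alnum:]' | wc -c | tr -d ' '
}

# Print "searchable" or "not searchable" for the PDF named in $1.
check_pdf() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found; install Poppler first" >&2
        return 1
    fi
    # Extract text from pages 1-5 to stdout and count what comes back.
    chars=$(pdftotext -l 5 "$1" - 2>/dev/null | count_alnum)
    if [ "${chars:-0}" -ge "$MIN_CHARS" ]; then
        echo "searchable: $1"
    else
        echo "not searchable: $1"
    fi
}

# Check every PDF given on the command line.
for pdf in "$@"; do
    check_pdf "$pdf"
done
```

Saved as, say, check-pdf.sh, it could be run as: sh check-pdf.sh *.pdf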

Converting the eBook file into a readable, roughly formatted PDF that would be searchable required two steps:

1. Open the eBook file in Calibre, then run the output process that converts the eBook file into a PDF.
2. Open the Terminal, locate your input/output folder with pwd and/or cd, and run the Ghostscript command:

gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="output.pdf" "input.pdf"
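If the same Ghostscript pass has to be run on more than one file, a small wrapper along these lines might save some typing. It uses exactly the flags from the command above. The "-gs" suffix and the function names output_name and rewrite_pdf are illustrative choices, not anything Ghostscript requires. It only automates the same command and does not change what Ghostscript does to the file.

```shell
#!/bin/sh
# Sketch: run the pdfwrite pass from the post over several PDFs.
# Assumption: gs is on PATH. Output goes next to the input with a "-gs" suffix.

# Derive an output name so the original file is never overwritten.
# "dir/book.pdf" becomes "dir/book-gs.pdf".
output_name() {
    printf '%s-gs.pdf\n' "${1%.pdf}"
}

# Rewrite one PDF with the same flags as the command in the post.
rewrite_pdf() {
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -sOutputFile="$(output_name "$1")" "$1"
}

# Process every PDF given on the command line, reporting failures.
for pdf in "$@"; do
    rewrite_pdf "$pdf" || echo "gs failed on: $pdf" >&2
done
```

Saved as, say, gs-rewrite.sh, it could be run as: sh gs-rewrite.sh book1.pdf book2.pdf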

When Calibre produced the PDF from the eBook file, the PDF was unsearchable; the final Ghostscript step corrected this deficiency.

Then I purchased a new MacBook Pro, and once my work and system were on the new machine I upgraded to Catalina. That upgrade required me to update Calibre. The Catalina upgrade also required me to remove my Homebrew installation and, with it, Ghostscript. Once the dust settled, I had the most recent versions of Calibre, Homebrew and Ghostscript installed.

There is both good news and bad news.

Now, when I run the Calibre formatting / conversion of the eBook file into a PDF, Calibre does it all. It not only produces a formatted PDF, as before, but the output is now searchable. This halves the time it takes to get what I need, a formatted, readable and searchable PDF, which is very good news.

But there is a new problem. Sometimes, especially for older works, the original eBook file will come as a PDF. In the majority of cases the PDF is good to go (functioning, readable and searchable), but in some cases it is not searchable. When this happened in my pre-Catalina install, I would skip the Calibre step and run the PDF through Ghostscript.

I rarely needed to run this process, but the aforementioned Ghostscript command would fix the PDF every time, making no other visible change but rendering the PDF searchable.

Unfortunately, the Ghostscript command no longer works. When you execute it in the Terminal, there is no indication that anything is wrong. Sample output:

GPL Ghostscript 9.53.3 (2020-10-01)
Processing pages 1 through 351.
Page 1
Page 2...

Ghostscript then ticks off the pages as it processes them until it is complete, just like before, except that the PDF output is not searchable.

Although Calibre can generate a spanking good PDF file from just about any eBook file, it is very poor at starting with PDFs. In my experience, it makes a hash of them. Calibre describes the reason for this in their documentation; search for 'Convert PDF documents.'

The problem, and hopefully the solution, lies in Ghostscript.

I am embarrassed to say that on a technical level I did not understand why Ghostscript worked in the first place; I cobbled the command together from reading bits and pieces around the Web. Having so little knowledge of how Ghostscript works and how it made a PDF searchable hampers my ability to solve the problem.

Searching the Web for a solution has also been hindered by not knowing what to call this situation. After all, if you have two visually identical copies of a PDF, why is one searchable and the other not?

As I said earlier, this situation crops up infrequently, but my bad luck is that it often involves important work that I cannot find elsewhere, usually because it is an older, more obscure work.

If anyone can provide some guidance, that would be much appreciated.

Thank you.

MacBook Pro 16″, macOS 10.15

Posted on Nov 10, 2020 3:26 PM

Answer 1

Nov 10, 2020 10:45 PM in response to LatriciaP

LatriciaP: I used Adobe Acrobat for 15 years. I produced interactive PDFs utilizing the internal JavaScript engine, with InDesign as the layout / design foundation. For a time, Acrobat was a great piece of software.

I owned an entire suite of Adobe software, but I no longer use any Adobe products. I purchase and own software, or I support open-source technologies; I do not rent software or hardware.

Thanks for your help.

Answer 2

Yer_Man

Level 10

166,378 points

Nov 10, 2020 11:41 PM in response to Garrett Cobarr

Besides the 400 to 500 books, there are almost 5000 research papers and about 16,000 to 18,000 news articles, all PDFs. As the material is all in PDF and searchable, this allows detailed search of the whole archive or sections of it. This can be done with Spotlight or other methods of search; I also use HoudahSpot.

Using different search platforms and multiple formats introduces greater complexity and risks missing the full view of targets sought.

Like I say, my research database reaches in excess of 20k texts, approx 60% of which are books.

I'm not suggesting that you introduce multiple platforms, I'm suggesting that you replace your current system with a better one.

Answer 3

Nov 10, 2020 10:33 PM in response to VikingOSX

VikingOSX: Thanks, I will check these places out and ask the same questions.

Answer 4

Yer_Man

Level 10

166,378 points

Nov 10, 2020 3:47 PM in response to Garrett Cobarr

Essentially, a scanned PDF is a photo/facsimile of a document: a graphic file. But it can be run through OCR (Optical Character Recognition), which "reads" the images and adds a text element to the file. Basically, you then have a graphic (the PDF) and a text file squeezed together. Searches read the text layer.

My workaround was: don't convert the EPUBs. They're just a really easily searchable text format.

There are a lot of apps that will OCR a PDF, for the ones that haven't already been done.

Then I used DEVONthink Pro to both index and create a searchable database for the EPUBs and PDFs.

(It also does OCR.)

When that got unwieldy - 20k+ texts - I switched to FoxTrot Professional. It indexes my Calibre library plus my articles and makes them both instantly searchable - and brings me to the exact pages in the texts where the search term is found.
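As one command-line illustration of the OCR step described above, here is a sketch using OCRmyPDF, a tool not mentioned elsewhere in this thread and assumed to be installed (Homebrew carries it). The "-ocr" suffix and the names ocr_name and ocr_pdf are illustrative. The --skip-text flag leaves pages that already contain text alone; OCRmyPDF also has --force-ocr, which rasterizes every page and OCRs it, and that may be worth trying on a PDF that looks fine but will not search. The language is assumed to be English.

```shell
#!/bin/sh
# Sketch: add an OCR text layer to PDFs with OCRmyPDF.
# Assumption: ocrmypdf is on PATH and the documents are in English.

# Derive an output name so the original file is never overwritten.
# "dir/scan.pdf" becomes "dir/scan-ocr.pdf".
ocr_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# OCR one PDF, skipping pages that already have a text layer.
ocr_pdf() {
    ocrmypdf --skip-text -l eng "$1" "$(ocr_name "$1")"
}

# Process every PDF given on the command line, reporting failures.
for pdf in "$@"; do
    ocr_pdf "$pdf" || echo "ocrmypdf failed on: $pdf" >&2
done
```

Saved as, say, ocr-pdf.sh, it could be run as: sh ocr-pdf.sh scan1.pdf scan2.pdf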

Answer 5

LatriciaP

Level 1

123 points

Nov 10, 2020 5:51 PM in response to Garrett Cobarr

What exactly are you asking? If you want to work with PDFs, have you considered Adobe Acrobat? You can get a free one-week trial.

Just as a side note, as someone who routinely uses several hundred GB of PDFs every day: the majority of most PDFs' file size is unnecessary, wasted data you don't want or need. I say this because the size of a PDF can limit how searchable it actually is in a reasonable amount of time, especially depending on the program you use to search it. In Acrobat you can analyze exactly what is using the space in a PDF and optimize PDFs to a much smaller file size without losing any usability. The smaller size can often make it much easier to search text inside the files from Finder and elsewhere. You can also create an index of your files, which is a file that speeds up searching their contents. Everything can be done in batches, on multiple files at once.

Not sure if this is helpful, but since you're working with PDFs it may be useful to see what Adobe has to offer, since they own the copyrights to a lot of useful functionality and also invented the PDF. I took a screenshot of a file analysis before and after optimization for a large, all-text PDF. The file was reduced from 93.8 MB to 1.3 MB, which really speeds everything up when working with PDFs.

Answer 6

Nov 10, 2020 10:32 PM in response to Yer_Man

Terence Devlin: Thanks for the input, all is appreciated. But the method you describe adds a good deal of complexity to my situation.

Besides the 400 to 500 books, there are almost 5000 research papers and about 16,000 to 18,000 news articles, all PDFs. As the material is all in PDF and searchable, this allows detailed search of the whole archive or sections of it. This can be done with Spotlight or other methods of search; I also use HoudahSpot.

Using different search platforms and multiple formats introduces greater complexity and risks missing the full view of targets sought.

Apple itself has been introducing these kinds of problems in searching your volumes; their goals and motives are beyond my understanding. Starting in Sierra, Notes are no longer included in Spotlight searches. As far as I know, Stickies have never been included in Spotlight. Both Notes and Stickies have to be opened and searched with their own search systems.

Answer 7

VikingOSX

Level 10

126,548 points

Nov 10, 2020 3:48 PM in response to Garrett Cobarr

Stack Overflow has a ghostscript tag where people post questions and receive answers. And here is the Ghostscript developers' site. Do not expect a stampede of help here for Ghostscript questions.
