Calibre, GhostScript and creating searchable PDFs...?
Apologies in advance for the stretch on category. Calibre and Ghostscript are not, per se, Mac technologies, but the following questions are Mac based, specifically on Catalina. In a search of the Web I was unable to locate a community or forum specific to Ghostscript, although there is one for Calibre.
The Web search for Ghostscript did find numerous questions and answers in this Apple sub forum, so here I am. If any of you can direct me to a Ghostscript forum, please do.
One of the processes I use as a researcher is to place all of my source material into a digital archive, subdivided into one directory for books and another for papers. That categorization may seem arbitrary, as many papers are as long as books. The separation has more to do with the process required to get everything into a single, consistent format that allows global searches across the entire archive.
Papers are rarely a problem, but digital books are an entirely different species. They require a good deal of work before they can be searched accurately, primarily to ensure that a digital work is not missed.
Ninety-nine percent of the time, research papers come in the form of PDFs. Don't get me wrong, they do have problems, but those are mostly internal, small and predictable. Digital books come in eBook formats, overwhelmingly .epub, as well as .mobi, .djvu and a smattering of PDFs.
Two quick tests will show that a PDF cannot be searched. First, choose any page and double-click a word: instead of the whole word, only a single character is selected. Second, type a word you can see on the page into the PDF search field: it returns "No Results Found."
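The same check can be sketched from the Terminal. Ghostscript has a text-extraction device, txtwrite, that writes out whatever text it can recover from a PDF, so a word count near zero (or output that is plainly garbled) suggests there is nothing for a search to match. This is only a minimal sketch: the function names are hypothetical, and it assumes gs is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the number of words read on stdin,
# with the padding that wc adds on macOS stripped off.
count_words() {
    wc -w | tr -d '[:space:]'
}

# Hypothetical helper: count the words Ghostscript can extract from a PDF.
# -q keeps Ghostscript's banner out of stdout; -sOutputFile=- sends the text to stdout.
extracted_word_count() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$1" 2>/dev/null | count_words
}

# Usage:
#   extracted_word_count "input.pdf"
```

A count in the thousands for a full-length book would point to a working text layer; a count at or near zero would match the "No Results Found" symptom.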
Converting the eBook file into a readable, roughly formatted PDF that would be searchable required two steps:
- Open the eBook file in Calibre, then run the output process that converts the eBook file into a PDF.
- Open the Terminal, move to your input/output folder with pwd and/or cd, and run the Ghostscript command:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="output.pdf" "input.pdf"
When Calibre produced the PDF from the eBook file, the PDF was unsearchable; the final Ghostscript step corrected this deficiency.
Then I purchased a new MacBook Pro, and once my work and system were on the new machine I upgraded to Catalina. That upgrade required me to update Calibre. The Catalina upgrade also required me to remove my Homebrew installation and, with it, Ghostscript. Once the dust settled I had the most recent versions of Calibre, Homebrew and Ghostscript installed.
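For repeated use, the Ghostscript step above could be wrapped in a small shell function so that the source file is never overwritten. This is a sketch under assumptions, not a tested workflow: the function names and the "-gs" suffix are hypothetical, and the gs options are simply the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: derive an output name next to the input,
# e.g. "book.pdf" becomes "book-gs.pdf".
output_name() {
    printf '%s-gs.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run the pdfwrite pass on one file.
# pdfwrite re-emits the input as a new PDF; the original is left untouched.
rewrite_pdf() {
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$(output_name "$1")" "$1"
}

# Usage (assumes gs is on the PATH):
#   rewrite_pdf "input.pdf"
```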
There is both good news and bad news.
Now, when I run the Calibre formatting / conversion of the eBook file into a PDF, Calibre does it all. It not only produces a formatted PDF, as before, but the output is now searchable. This halves the time it takes to get what I need, a formatted, readable and searchable PDF. Very good news.
But there is a new problem. Sometimes, especially for older works, the original eBook file will come as a PDF. In the majority of cases the PDF is good to go: functioning, readable and searchable. In some cases, though, the PDF is not searchable. When this happened on my pre-Catalina install, I would skip the Calibre step and run the PDF through Ghostscript.
I rarely needed to do this, but the aforementioned Ghostscript command fixed the PDF every time, making no other visible change but rendering the PDF searchable.
Unfortunately, the Ghostscript command no longer works. When you execute it in the Terminal there is no indication that anything is wrong. Sample output:
GPL Ghostscript 9.53.3 (2020-10-01)
Processing pages 1 through 351.
Page 1
Page 2...
Ghostscript then ticks off the pages as it processes them until it is complete, just like before, except that the PDF output is not searchable.
Although Calibre can generate a spanking good PDF file from just about any eBook file, it is very poor at starting with PDFs. In my experience, it makes a hash of them. Calibre describes the reason for this in its documentation; search for 'Convert PDF documents.'
The problem, and hopefully the solution, lies in Ghostscript.
I am embarrassed to say that, on a technical level, I did not understand why Ghostscript worked in the first place. I cobbled the command together from reading bits and pieces around the Web. Knowing so little about how Ghostscript works, and how it made a PDF searchable, hampers my ability to solve the problem.
Searching the Web for a solution has also been hindered by not knowing what to call this situation. After all, if you have two visually identical copies of a PDF, why is one searchable and the other not?
As I said earlier, this situation crops up infrequently, but my bad luck is that it often involves important work that I cannot find elsewhere, usually because it is an older, more obscure work.
If anyone can provide some guidance, that would be much appreciated.
Thank you.
MacBook Pro 16″, macOS 10.15