Help with Searching .pdf's

Question

Level 1

69 points

I have been trying to make an Automator workflow that will search .pdf's for a long list of words and then display the words found.

Firstly, can anyone help with using Automator to do this? If not, I thought I could put the long list of words (they never change) in a text document and, when I want, copy and paste them into the Preview .pdf search bar. But this has not worked well. I don't know if there need to be " " around each term, a comma between words, " | ", or nothing at all?

Thank you

MacBook Pro 16″, macOS 12.5

Posted on Aug 18, 2022 11:23 AM

Answer 1

VikingOSX

Level 10

123,674 points

Aug 20, 2022 12:43 PM in response to mrchntmarine

1. It sounds as if you have a single named file containing all of the search words. If true, I won't have to prompt you for that file.
2. When you match a word from the preceding file to words in the PDF, do you care about duplicate matches in the PDF, or just a list of unique words that may or may not have duplicate matches?
   1. If you care about duplicate matches, do you want a concordance (count of individual word matches) too?

I will finish this when I get responses to the questions.

Answer 2

mrchntmarine Author

Level 1

69 points

Aug 21, 2022 6:24 AM in response to VikingOSX

I get a different .pdf each month and would like to select it, right click and run the Automator workflow. I’m looking for the same words each time for now, but would add or remove some words as time goes by. If words are found, I’d like to see a list of which ones were found. I can then go to the document, search for the found word(s) and see the context. Most times no word will be found, I know that. But I have a large list, maybe 100 words, and when one is found, I need to know how it is used.

For instance, say the ship name Enterprise, from the list, is found. I can then open the .pdf and search for that one word, instead of searching for all 100 words, and see where it is used and, more importantly, for what purpose: named in a story, a stamp auction (my interest here is in philatelic uses), etc.

Does this help? Need more info? Thanks much!

Answer 3

mrchntmarine Author

Level 1

69 points

Aug 21, 2022 2:01 PM in response to VikingOSX

Question - I just noticed that in TextEdit I only have the option to save as .rtf and not .txt. Problem? Also, when using more than one word to search, as I will be doing, do I use commas in the text file as a separator? Just a "space", " | " or what? Tks

Answer 4

VikingOSX

Level 10

123,674 points

Aug 21, 2022 2:59 PM in response to mrchntmarine

Make that a plain text file named search_words.txt. This solution does not work with Rich Text documents. Try the Quick Action again after that file format and name fix.
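The thread does not spell out how search_words.txt is laid out. As a minimal sketch, assuming one search term per line in a UTF-8 plain-text file (so no commas, quotes, or " | " separators are needed), the list could be read like this in Python 3. The function name load_search_words is hypothetical and is not part of the posted Quick Action.

```python
from pathlib import Path


def load_search_words(path):
    """Return the unique, non-empty terms from a plain-text word list.

    Assumption (not confirmed in the thread): one term per line, UTF-8.
    Duplicates are dropped case-insensitively; the first spelling is kept.
    """
    words = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        term = line.strip()
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            words.append(term)
    return words
```

With that layout, a multi-word term such as a full ship name can sit on its own line without any quoting.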

What version of macOS are you running this on? Big Sur and later use Finder, not Finder.app, but that should not matter. If you are running this on an older operating system, the Run AppleScript may behave differently, though that syntax that you red-arrowed should not be happening as long as the action was receiving some file.

Answer 5

VikingOSX

Level 10

123,674 points

Aug 21, 2022 2:53 PM in response to mrchntmarine

That item 1 is your search_words.txt file being passed into the Run AppleScript action. There must be no whitespace in the filename in the Get Specified Finder Items action. Click the hammer icon in your Run AppleScript action and save. Try again.

Answer 6

VikingOSX

Level 10

123,674 points

Aug 21, 2022 3:25 PM in response to mrchntmarine

It would appear that you either did not get any text from the PDF that you selected in the Finder, or no text was being passed to the split_words handler as the value mytext is empty and that is borking the componentsSeparatedByString method.
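The failure mode described here, an empty text value reaching the word splitter, can be guarded against before any splitting happens. The following is only a Python 3 analogue of that idea, not the AppleScript/Objective-C handler from the Quick Action; it reuses the names split_words and mytext from the post for readability, and the word pattern is an assumption.

```python
import re

# Assumed definition of a "word": runs of letters, optionally joined by an
# apostrophe or hyphen (e.g. "ship's", "well-known").
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def split_words(mytext):
    """Split extracted text into words, returning [] for empty or missing text."""
    if not mytext or not mytext.strip():
        # Nothing was extracted from the PDF, so there is nothing to split.
        return []
    return WORD_PATTERN.findall(mytext)
```

Returning an empty list lets the caller report "no text found in this PDF" instead of failing inside the splitting step.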

Answer 7

VikingOSX

Level 10

123,674 points

Aug 21, 2022 4:01 PM in response to mrchntmarine

A Quick Action does not provide any interactive results as you have shown, but an Automator application does. The code was written and tested specifically to be run as a Quick Action, just as I posted the complete example.

Answer 8

mrchntmarine Author

Level 1

69 points

Aug 21, 2022 4:18 PM in response to VikingOSX

It's a Quick Action. So I'm not allowed to run each action and view the results from inside Automator? There is a "Results" option under each action... It's finding the text. It doesn't seem to like this, I think:

Answer 9

VikingOSX

Level 10

123,674 points

Aug 21, 2022 4:51 PM in response to mrchntmarine

When you run a Quick Action internally through its Run button, only the text file is passed to it and no PDF, because you haven't selected one in the Finder. In that case, input is just input, and cannot be referenced as item n of input. On its best day, Automator is a PITA to attempt to debug code. That Automator Quick Action should behave differently when you select a PDF and run the QA from the Finder's Quick Action menu, as it was designed to do.

Answer 10

VikingOSX

Level 10

123,674 points

Aug 22, 2022 6:17 AM in response to mrchntmarine

When I tested this QA solution, I used the full text of Steve Jobs's Stanford commencement address exported to PDF from Pages. The search word text file was about ten words, some known to be in the PDF and others that were not. This QA works on all of my PDFs here created from plain text by Pages, MacTeX, or Apple's PDFContext. When I run it against PDF content from a variety of other sources and creation tools, and with content of varying complexity, the QA fails. When I replaced that Zsh word-splitting handler, I stopped getting those related errors, but word matches also stopped occurring.

My time is limited and I have already invested a day in the development, testing, and refinement of the code I have already posted. For the reasons given in the first paragraph, I doubt that I can make this QA sufficiently resilient to work correctly for all PDF text content, and I cannot continue with it.

Answer 11

mrchntmarine Author

Level 1

69 points

Aug 22, 2022 8:52 AM in response to VikingOSX

Ok. Thanks for trying! Appreciate it.

I did convert the .pdf file(s) to different kinds of .pdf and got the same error. I also tried multiple other .pdf's using different words and also got the same error... Who knows?

Tks again.

PS. I googled the speech site, Stanford, downloaded the speech as .pdf in Safari and searched for the word "lucky".

Same error.... Ugh.

Answer 12

VikingOSX

Level 10

123,674 points

Aug 22, 2022 3:08 PM in response to mrchntmarine

I spent some more time with this today, and even when converting the text encoding from ISO-8859-1 to UTF-8, I still run into issues and Automator just gives up. Think I am now actually done working on this beast.

Answer 13

mrchntmarine Author

Level 1

69 points

Aug 22, 2022 8:20 PM in response to VikingOSX

Well, I tried converting to different .pdfs (I think, haha), with no luck. All this has made me remember my college coding professor from a class in C. He was tough and that’s about the last time I’ve coded. I have no clue what the error means… There is a process in Automator that searches .pdf on desktop, but I couldn’t make that work either.

I threw this up too on another board, with no replies.

What a drag - I’ve got about 100 words I’d like to search for in .pdf’s w/o having to type them all each month.

Not giving up yet!

Tks for the help.

Answer 14

VikingOSX

Level 10

123,674 points

Aug 23, 2022 7:48 AM in response to VikingOSX

I have a Python script here that searches a PDF for a word and returns that word when matched and a list of unique page numbers where it is found. That cannot be implemented on macOS 12.3.1 or later without special instructions, and my challenge now is to translate that code functionality to AppleScript/Objective-C as that presently is supported in Monterey.

I have tested some crude code in AppleScript that gets at the found word, and all pages that it occurs on in multipage PDFs as a proof of concept. The trick is to get that formatted output in a text file as word: 1, 5, 7, 10, 40 and collapse multiple matches on the same page into a single occurrence of that page number.
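The output format described above (word: 1, 5, 7, 10, 40, with repeated matches on a page collapsed to one page number) can be sketched in Python 3 as follows. This is an illustration only, not the Python or AppleScript code mentioned in the post. It assumes the text of each page has already been extracted into a list of strings, one per page, and that matching is case-insensitive and on whole words; the names pages_by_word and format_report are hypothetical.

```python
import re


def pages_by_word(page_texts, search_words):
    """Map each matched search word to the sorted, unique 1-based pages it occurs on."""
    hits = {}
    for page_number, text in enumerate(page_texts, start=1):
        for word in search_words:
            # Whole-word, case-insensitive match (an assumption about the desired behavior).
            pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
            if re.search(pattern, text or "", flags=re.IGNORECASE):
                # A set collapses several matches on one page into a single entry.
                hits.setdefault(word, set()).add(page_number)
    return {word: sorted(pages) for word, pages in hits.items()}


def format_report(hits):
    """Render one 'word: p1, p2, ...' line per matched word."""
    return "\n".join(
        f"{word}: {', '.join(map(str, pages))}" for word, pages in hits.items()
    )
```

Words that never match are simply absent from the report, which fits the case where most of the list is not expected to appear in a given month's PDF.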

Answer 15

mrchntmarine Author

Level 1

69 points

Aug 23, 2022 10:14 AM in response to VikingOSX

Yes, I don't like the problem controlling either... I'm going to mess around too, but I don't have much experience here, so it is slower for me. I can try to post a link to one of the files I get each month, if that would help... Hmm, I'll do it anyway, as I don't get notifications and will have to check back. Trying this...

[Link Edited by Moderator]
