Help with Searching .pdf's

I have been trying to make an Automator workflow that will search .pdfs for a long list of words, then display the words found.


Firstly, can anyone help with Automator to do this? If not, I thought I could put the long list of words (they never change) in a text document and, when I want, copy and paste them into the Preview .pdf search bar. But this has not worked well. I don't know if each term needs " " around it, or whether to use a comma between words, or " | ", or nothing at all?


Thank you

MacBook Pro 16″, macOS 12.5

Posted on Aug 18, 2022 11:23 AM

Question marked as Top-ranking reply

Posted on Aug 25, 2022 6:01 PM

It would appear that the copy/paste to the hosting community changed the double quote to its HTML equivalent, which is &quot;.


Try copy/pasting this code into a Cocoa-AppleScript Applet template per the instructions in the header comment:


(*
	pdfsearch.applescript
	
	Given a text file containing single, compound, or quoted word strings, and without any empty lines,
	use that as a wordlist to find all occurrences of those words, and the pages found,
	in a given PDF document.
	
	Produce a text report of the words that match, and the found page numbers. That text
	report is written to the original PDF Document location with a "_matched.txt" suffix.
	
	Code does not work reliably in Automator's Run AppleScript workflow. On M1/M2 Macs, won't work
	as a saved AppleScript application. In Script Editor, use the following to build it:
	  1) File menu > New From Template > Cocoa-AppleScript Applet.
	  2) Paste the following AppleScript/Objective-C into the preceding template
	  3) Click the hammer icon to compile the code
	  4) Option-key + File menu > Save As…
	        a) File format: Script Bundle
	        b) Set filename to pdfsearch.scptd
	        c) No options set
	        d) Save
	  5) Double-click pdfsearch.scptd to open it in Script Editor, then click the Run button
	
	Reference: https://discussions.apple.com/thread/254122468
	Tested: macOS 11.6.8, 12.5.1
	Version: 1.1
	Author: VikingOSX, 2022-08-25, Apple Support Communities, no warranties expressed or implied.
*)

use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4" -- macOS Yosemite or later
use scripting additions

property PDFDocument : a reference to current application's PDFDocument
property NSString : a reference to current application's NSString
property NSUTF8StringEncoding : a reference to current application's NSUTF8StringEncoding
property NSURL : a reference to current application's NSURL
property NSArray : a reference to current application's NSArray
property NSMutableOrderedSet : a reference to current application's NSMutableOrderedSet
property NSMutableString : a reference to current application's NSMutableString
property NSMutableArray : a reference to current application's NSMutableArray
property NSLiteralSearch : a reference to current application's NSLiteralSearch
property NSCaseInsensitiveSearch : a reference to current application's NSCaseInsensitiveSearch

property WORDFILE : "~/Desktop/search_words.txt"

-- prompt for one or more PDFs. This script does not perform Optical Character Recognition of scans.
set PDFs to (choose file of type {"com.adobe.pdf"} with multiple selections allowed)

-- expand the tilde path to an absolute path
set xwordfile to (NSString's stringWithString:WORDFILE)'s stringByStandardizingPath()

-- get the WORDFILE into an array
set searchFile to NSString's alloc()'s initWithContentsOfFile:xwordfile encoding:NSUTF8StringEncoding |error|:0
set searchWords to (NSArray's arrayWithArray:(searchFile's componentsSeparatedByString:linefeed))'s mutableCopy()

set outfilename to NSMutableString's alloc()'s init()
-- "exact case matches" from the WORDFILE can omit the addition of the NSCaseInsensitiveSearch
set findOptions to (NSLiteralSearch as integer) + (NSCaseInsensitiveSearch as integer)

-- these ordered Set data structures automatically discard duplicates and keep insertion order.
-- each matched word is recorded once, and each page label once, in the order they are found
set nameset to NSMutableOrderedSet's alloc()'s init()
set pageset to NSMutableOrderedSet's alloc()'s init()

repeat with apdf in PDFs
	-- add the following suffix to the original PDF path and name.
	(outfilename's setString:((NSString's stringWithString:(POSIX path of apdf))'s stringByDeletingPathExtension()))
	(outfilename's appendString:"_matched.txt")
	
	set pdf to (PDFDocument's alloc()'s initWithURL:(NSURL's fileURLWithPath:(POSIX path of apdf)))
	
	repeat with aword in searchWords
		set found to (pdf's findString:aword withOptions:findOptions)
		if not (count of (found as list)) = 0 then
			repeat with selection in found
				(nameset's addObject:((selection's |string|()) as text))
				repeat with apage in selection's pages()
					(pageset's addObject:(apage's label()))
				end repeat
			end repeat
			
			set aname to (nameset's allObjects()'s firstObject()) as text
			set pageStr to ((pageset's allObjects()'s componentsJoinedByString:", ") as text)'s quoted form
			
			if not (aname contains missing value) = true then
				set args to (outfilename as text)'s quoted form & space & aname's quoted form & space & pageStr
				my format_and_write(args)
			end if
			
			# reset these orderedSets for next word search results
			nameset's removeAllObjects()
			pageset's removeAllObjects()
		end if
	end repeat
	
end repeat
return

on format_and_write(args)
	-- allow for left-justified 20 character word matches and 40 character page sequences
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh
printf '%-20s%-40s\\n' $2 $3 >> $1
EOF")
end format_and_write
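
The header comment asks for a word list with one search term per line and no empty lines, and the script splits that file on linefeeds. Below is a minimal shell sketch for tidying a list before use, assuming it may contain blank lines or Windows-style line endings. The helper name clean_wordlist and the file names are hypothetical and not part of the script above.

```shell
#!/bin/sh
# clean_wordlist (hypothetical helper): print a word list with carriage
# returns removed and blank or whitespace-only lines dropped.
clean_wordlist() {
    # $1 = path to the raw word list
    tr -d '\r' < "$1" | grep -v '^[[:space:]]*$'
}

# Example usage (file names are placeholders):
#   clean_wordlist raw_words.txt > ~/Desktop/search_words.txt
```

Two-part names such as "Gulf Trader" stay on one line, so they are searched as a single phrase.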


69 replies

Aug 24, 2022 1:23 PM in response to VikingOSX

VikingOSX wrote:

Would you be content with a text file containing the output of a PDF word search that looks like the following when searching a four-page PDF with the following words:

• bogus
• his
• Hall
• stranger
• portmanteau


Yes


with the output results showing the name of the matched word, and a list of pages on which that word was found:

his                 1, 2, 3, 4
Hall                1, 2, 3, 4
stranger            1
portmanteau         1


Yes



I am allowing for twenty-character search words and 40 characters for the page number matches. These are tweakable.

Right now, I have it working on a fixed search_word.txt file and a fixed four-page PDF, but the devil is in the details getting that formatting to work, and I still have to revise the code to process multiple PDFs, and adapt it to an Automator Quick Action. The cursing part is done, and I should have that QA posted this afternoon.



So the twenty-character search per word should be fine. Some examples of the ship names: Margaret Thompson, Harry Culbreath, Gulf Trader... Lots of the ships have two-part names, as you see here. It would work if I were to search only for "Culbreath" or "Trader" if it's too much to support two-part names. Not an issue. Also, I am content with searching only one .pdf at a time. Finally, if it's a lot of work to get the page number to show in the results, skip that part too. Even with the results and pages listed, I will still have to open the .pdf and find the page. So, at this point, I think it's easier for me, and less work for you, if I open the .pdf and search for the word(s) found, rather than take the page number, go to the page, and then scan for the word. I do not want to create more work for you than necessary.


The bulk of what I am trying to "automate" is having 75-100 names to search for, i.e., typing each one singly to search month after month.



Does this help?



Aug 25, 2022 3:16 PM in response to VikingOSX

So I know I said this before, haha, but now I really got swamped. I'll try to get to this tonight to test. If not, tomorrow sometime for sure. I'll keep you posted though. I know I said I don't get notifications, so I went and changed a setting; then I got one right when you posted, but none since. So I'll do this test and post tomorrow. Thanks again for all your time!!

Aug 25, 2022 8:08 PM in response to VikingOSX

So, I think we are close... I did the new copy and paste, etc., per the instructions. pdfsearch.scptd is on my desktop. When I double-click to run it, Script Editor opens and I have to click the play button to run. When I do, Finder opens, I select a .pdf, the script runs, and I get this error (see below). I also do not know how the script gets the text list of names. From your post above,


"but does allow you to use a fixed name text file containing your search words"


What do I name the file??


Aug 25, 2022 8:36 PM in response to VikingOSX

Couldn't resist. I made a text file with a few more words and did not alphabetize them (just to see). Here is the output:



I opened the .pdf to check the words and all is good and the output is correct. Also, for instance, the word Independence appears on p. 25 once and p. 29 two times... But this is good for me the way it is - much better than typing all the words over and over each month.


Will test more tomorrow. GREAT!!


PS - I think you mentioned this earlier ?? Is there a limit to the number of search terms I can have in the text file?

Aug 26, 2022 5:03 AM in response to mrchntmarine

To get those results, you apparently figured out that the WORDFILE property is the tilde path and name of the text file containing your search words. My expectation, and my testing, place this words file on your local drive, though the name and location are for you to decide. The tilde is simply the operating system's abbreviation for /Users/yourname. I don't know of a limit on the number of search words you can use, though I expect it is far larger than you would want to type, or wait for processing to complete.
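
To illustrate the tilde abbreviation, here is a small shell sketch that expands a leading "~/" to the current user's home directory, which is roughly what the script's stringByStandardizingPath call does for WORDFILE. The helper name expand_tilde is hypothetical, and it only handles the current user's "~", not the "~otheruser" form.

```shell
#!/bin/sh
# expand_tilde (hypothetical helper): replace a leading "~" or "~/" with $HOME.
# Paths that do not start with a tilde are printed unchanged.
expand_tilde() {
    case "$1" in
        "~")   printf '%s\n' "$HOME" ;;
        # ${1#??} strips the first two characters, i.e. the leading "~/"
        "~/"*) printf '%s/%s\n' "$HOME" "${1#??}" ;;
        *)     printf '%s\n' "$1" ;;
    esac
}

# Example usage:
#   expand_tilde '~/Desktop/search_words.txt'
```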


As I mention in the code comments, I am using Set data structures which inherently remove duplicates, so words that may occur multiple times on a page only reflect once per page in the printed results. Preview will show you those occurrences during the individual word search.
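
The once-per-page behavior can be sketched outside the script as well. The following shell example is only an illustration of the idea, not the script's own mechanism (which uses NSMutableOrderedSet): it keeps the first occurrence of each page number and joins the result with ", ". The helper name unique_pages is hypothetical.

```shell
#!/bin/sh
# unique_pages (hypothetical helper): read one page number per line on stdin,
# keep only the first occurrence of each, and join them with ", ".
unique_pages() {
    awk '!seen[$0]++ { out = out sep $0; sep = ", " } END { print out }'
}

# Example: a word found once on page 25 and twice on page 29
#   printf '25\n29\n29\n' | unique_pages    # prints: 25, 29
```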


One aesthetic issue that I don't know can be fixed: when the page numbers wrap, they do so under the words rather than indented under the preceding page numbers. In Python, the standard-library textwrap module lets one control indented line wrap, but here one would have to write this in either the Zsh shell or AppleScript/Objective-C, and life is too short.

Aug 26, 2022 11:38 AM in response to VikingOSX

Tested again with 3 pages of names. This is good. Thank you again.


I see we cannot PM here. This script was done with AppleScript, yes? If I wanted to learn a little of a language, is this one that can be used as a starting point? If so, can you recommend a book? If it is not a good starting point, can you tell me what I should look at first? This query has got me wanting to know more. Tks again.

Aug 29, 2022 8:15 PM in response to VikingOSX

Again, tks. Question: if I want to move the search file from my desktop to another location, can I go to the compiled version of the script and just change the path, or do I have to work with the text copy of the script, paste it back into the editor, and re-compile?


I.e., the procedure: I'm guessing that I cannot just change the path to the new location in the compiled version I am running?

Sep 9, 2022 8:30 AM in response to VikingOSX

Update:


I have created a replacement handler that will wrap the second column of page numbers for each found word.



Replace:


on format_and_write(args)
	-- allow for left-justified 20 character word matches and 40 character page sequences
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh
printf '%-20s%-40s\\n' $2 $3 >> $1
EOF")
end format_and_write


with:


on format_and_write_columns(args)
	
	-- allow for left-justified 20 character word matches and 60 character page sequences
	-- that now wrap on the second column.
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh

# column widths
col_1=20
col_2=60

# the indent width
printf -v spacing '%*s' $col_1
# print out both dynamic (*) columns with column wrap on page numbers
printf '%-*s%-*s\\n\\n' $col_1 ${2} $col_2 \"$(fold -s -w $col_2 <<<${3} | sed -e \"2,\\$s/^/$spacing/\")\" >> ${1}
EOF")
end format_and_write_columns


Don't forget to replace the handler call from:


my format_and_write(args)


to:


my format_and_write_columns(args)


Sep 12, 2022 12:47 PM in response to VikingOSX

Very nice. Tks! I'll try it when I'm back home. I have never had a word return so many hits, though, but I'll make the change when I can. An improvement. I do, though, as I think you already mentioned, get hits when my searched word is part of another. For instance, if I search for Pitt, I get a hit on Pittsburgh. Or if I search for Dale, Scottsdale was a hit.


I may have it backward - can’t remember now - anyway, one way or another, a word that’s part of another word…. Rolling down the highway on a phone!!


BUT, it’s still much better than what I had.


Many thanks again; I'll keep you posted.

Sep 12, 2022 5:17 PM in response to VikingOSX

Tks for the explanation. Skip that as far as I'm concerned... I've still got something much better than what I had before you helped. It's great as is. When I get home I'm going to insert the new code and re-compile.


Also, tks again for the tip on the books and other links. I read a little but haven't gotten much into it yet. I blew a gasket on the "Learn AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X (Learn (Apress)) 3rd ed. Edition" and returned it... I clicked on the 1st link to an Apple site and it didn't work! Nice touch for a programming book, and the CS with the editor wasn't any help either. Anyhow, it looked like a good book; I just lost my cool.


Sep 13, 2022 4:05 AM in response to mrchntmarine

I am looking at some coding possibilities here to match all of your search words to the unique words on a specific PDF page — in one line of code, without the issue of substring matches, and without regular expression dependency. I know that works, and it will just be a matter of associating the matching page numbers to the matched words. This is a work in progress…
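
As a rough illustration of whole-word matching (and not the approach described above, which works on the PDF directly and avoids regular expressions), here is a shell sketch using grep's whole-word option. It assumes the PDF text has already been extracted to a plain-text file by some other means. The helper name whole_word_hits and the file arguments are hypothetical.

```shell
#!/bin/sh
# whole_word_hits (hypothetical helper): print each search term that occurs
# in a plain-text file as a whole word or phrase, ignoring case.
whole_word_hits() {
    # $1 = word list (one term per line), $2 = plain-text file to search
    while IFS= read -r term || [ -n "$term" ]; do
        [ -n "$term" ] || continue
        # -F: fixed string, -i: ignore case, -w: whole words only, -q: quiet
        if grep -Fiwq -- "$term" "$2"; then
            printf '%s\n' "$term"
        fi
    done < "$1"
}

# Example usage (file names are placeholders):
#   whole_word_hits ~/Desktop/search_words.txt extracted_text.txt
```

With this, a search for Pitt would not report a hit on Pittsburgh, while a two-part name such as Gulf Trader still matches as a phrase.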


Most of the AppleScript books and online materials may be quite old by now and any links in that material can be incorrect, or gone entirely due to time lapse.

Sep 14, 2022 9:44 AM in response to VikingOSX

Part 2


	-- matching search words in order of occurrence
	-- repeat with key in (muDict's allKeys())
	
	-- matching search words sorted alphabetically by key name
	repeat with key in (muDict's allKeys()'s sortedArrayUsingSelector:{("localizedCompare:")})
		set aname to key as text
		set pageStr to ((muDict's objectForKey:key)'s allObjects()'s componentsJoinedByString:", ") as text
		set args to (outfilename as text)'s quoted form & space & aname's quoted form & space & pageStr's quoted form
		my format_and_write_columns(args)
	end repeat
	muDict's removeAllObjects()
end repeat
-- cleanup
searchSet's removeAllObjects()
muDict's removeAllObjects()
return

on format_and_write_columns(args)
	-- reference: https://unix.stackexchange.com/questions/233085/print-columns-of-data-that-wrap-internally
	-- allow for left-justified 20 character word matches and 60 character page sequences
	-- that now wrap on the second column.
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh

# column widths
col_1=20
col_2=60

# the indent width
printf -v spacing '%*s' $col_1
# print out both columns with column wrap on second page numbers
printf '%-*s%-*s\\n\\n' $col_1 $2 $col_2 \"$(fold -s -w $col_2 <<<${3} | sed -e \"2,\\$s/^/$spacing/\")\" >> $1
EOF")
end format_and_write_columns


Sep 15, 2022 3:07 PM in response to VikingOSX

Okay. So, maybe a snafu. Attached is a .txt of results. The Final Version results is the 1 word at the top of the text file, "Arlington". I knew there to be more words in the .pdf from my list, at least "Otus", and I then ran the original script and you can see the couple of lines appended to the file.... So it seems the latest version is not picking up all the words.


