Help with Searching .pdf's

I have been trying to make an Automator workflow that will search .pdf's for a long list of words, then display the words found.


Firstly - can anyone help with Automator to do this? If not, I thought I could put the long list of words (they never change) in a text document and, when I want, copy and paste them into the Preview .pdf search bar. But this has not worked well - I don't know if there need to be " " around each term, a comma between words, " | ", or nothing at all?


Thank you

MacBook Pro 16″, macOS 12.5

Posted on Aug 18, 2022 11:23 AM

Question marked as Top-ranking reply

Posted on Aug 25, 2022 6:01 PM

It would appear that the copy/paste to the hosting community changed the double quote to its HTML equivalent, which is &quot;.


Try copy/pasting this code into an existing AppleScript-Cocoa app template per the instructions:


(*
	pdfsearch.applescript
	
	Given a text file containing single, compound, or quoted word strings, and without any empty lines,
	use that as a wordlist to find all occurrences of those words, and the pages found,
	in a given PDF document.
	
	Produce a text report of the words that match, and the found page numbers. That text
	report is written to the original PDF Document location with a "_matched.txt" suffix.
	
	Code does not work reliably in Automator's Run AppleScript workflow. On M1/M2 Macs, won't work
	as a saved AppleScript application. In Script Editor, use the following to build it:
	  1) File menu > New From Template > Cocoa-AppleScript Applet.
	  2) Paste the following AppleScript/Objective-C into the preceding template
	  3) Click the hammer icon to compile the code
	  4) Option-key + File menu > Save As…
	        a) File format: Script Bundle
	        b) set filename to pdfsearch.scptd
	        c) no options set
	        d) Save
	  5) Double-click pdfsearch.scptd to run it
	
	Reference: https://discussions.apple.com/thread/254122468
	Tested: macOS 11.6.8, 12.5.1
	Version: 1.1
	Author: VikingOSX, 2022-08-25, Apple Support Communities, no warranties expressed or implied.
*)

use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4" -- macOS Yosemite or later
use scripting additions

property PDFDocument : a reference to current application's PDFDocument
property NSString : a reference to current application's NSString
property NSUTF8StringEncoding : a reference to current application's NSUTF8StringEncoding
property NSURL : a reference to current application's NSURL
property NSArray : a reference to current application's NSArray
property NSMutableOrderedSet : a reference to current application's NSMutableOrderedSet
property NSMutableString : a reference to current application's NSMutableString
property NSMutableArray : a reference to current application's NSMutableArray
property NSLiteralSearch : a reference to current application's NSLiteralSearch
property NSCaseInsensitiveSearch : a reference to current application's NSCaseInsensitiveSearch

property WORDFILE : "~/Desktop/search_words.txt"

-- prompt for one or more PDFs. This script does not perform Optical Character Recognition of scans.
set PDFs to (choose file of type {"com.adobe.pdf"} with multiple selections allowed)

-- make the tilde path as absolute path
set xwordfile to (NSString's stringWithString:WORDFILE)'s stringByStandardizingPath()

-- get the WORDFILE into an array
set searchFile to NSString's alloc()'s initWithContentsOfFile:xwordfile encoding:NSUTF8StringEncoding |error|:0
set searchWords to (NSArray's arrayWithArray:(searchFile's componentsSeparatedByString:linefeed))'s mutableCopy()

set outfilename to NSMutableString's alloc()'s init()
-- "exact case matches" from the WORDFILE can omit the addition of the NSCaseInsensitiveSearch
set findOptions to (NSLiteralSearch as integer) + (NSCaseInsensitiveSearch as integer)

-- these ordered sets automatically drop duplicates and keep insertion order (they do not sort),
-- so each matched word is stored once and page labels stay in the order they were found
set nameset to NSMutableOrderedSet's alloc()'s init()
set pageset to NSMutableOrderedSet's alloc()'s init()

repeat with apdf in PDFs
	-- add the following suffix to the original PDF path and name.
	(outfilename's setString:((NSString's stringWithString:(POSIX path of apdf))'s stringByDeletingPathExtension()))
	(outfilename's appendString:"_matched.txt")
	
	set pdf to (PDFDocument's alloc()'s initWithURL:(NSURL's fileURLWithPath:(POSIX path of apdf)))
	
	repeat with aword in searchWords
		set found to (pdf's findString:aword withOptions:findOptions)
		if not (count of (found as list)) = 0 then
			repeat with selection in found
				(nameset's addObject:((selection's |string|()) as text))
				repeat with apage in selection's pages()
					(pageset's addObject:(apage's label()))
				end repeat
			end repeat
			
			set aname to (nameset's allObjects()'s firstObject()) as text
			set pageStr to ((pageset's allObjects()'s componentsJoinedByString:", ") as text)'s quoted form
			
			if not (aname contains missing value) = true then
				set args to (outfilename as text)'s quoted form & space & aname's quoted form & space & pageStr
				my format_and_write(args)
			end if
			
			# reset these orderedSets for next word search results
			nameset's removeAllObjects()
			pageset's removeAllObjects()
		end if
	end repeat
	
end repeat
return

on format_and_write(args)
	-- allow for left-justified 20 character word matches and 40 character page sequences
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh
printf '%-20s%-40s\\n' $2 $3 >> $1
EOF")
end format_and_write
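
For anyone who wants to see the report layout outside AppleScript, here is a minimal Python sketch of the same two steps: deriving the "_matched.txt" path from the PDF path, and producing one left-justified line per word. The names `report_path` and `format_row` are hypothetical and are not part of the script above; the sketch only mirrors the `%-20s%-40s` layout used by the zsh handler.

```python
from __future__ import annotations

from pathlib import Path


def report_path(pdf_path: str) -> Path:
    # Swap the PDF extension for the "_matched.txt" suffix, same folder.
    p = Path(pdf_path)
    return p.with_name(p.stem + "_matched.txt")


def format_row(word: str, pages: list[str]) -> str:
    # Left-justify the word to 20 columns and the page list to 40 columns.
    # Like printf, this pads short values but never truncates long ones.
    return f"{word:<20}{', '.join(pages):<40}\n"


if __name__ == "__main__":
    print(report_path("/tmp/newsletter.pdf"))
    print(format_row("Enterprise", ["3", "7", "12"]), end="")
```

Page labels are kept as strings here because PDF page labels are not always plain integers.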


69 replies

Aug 20, 2022 12:43 PM in response to mrchntmarine

  1. It sounds as if you have a single named file containing all of the search words. If true, I won't have to prompt you for that file.
  2. When you match a word from the preceding file to words in the PDF, do you care about duplicate matches in the PDF, or just a list of unique words that may or may not have duplicate matches?
    1. If you care about duplicate matches, do you want a concordance (count of individual word matches) too?


I will finish this when I get responses to the questions.
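
To make the concordance question concrete, here is a rough Python sketch of what a count of individual word matches could look like. It assumes the PDF text has already been extracted into a plain string (the extraction itself is not shown), and the function name `concordance` is made up for this illustration. Matching is a case-insensitive substring search, so "ship" would also count inside "shipping".

```python
from __future__ import annotations

from collections import Counter


def concordance(text: str, words: list[str]) -> Counter:
    # Count non-overlapping, case-insensitive occurrences of each search word.
    lowered = text.lower()
    counts: Counter = Counter()
    for word in words:
        if not word:
            continue  # an empty search term would match everywhere
        n = lowered.count(word.lower())
        if n:
            counts[word] = n
    return counts


if __name__ == "__main__":
    sample = "The Enterprise sailed. The enterprise returned. Liberty ship."
    for word, n in concordance(sample, ["Enterprise", "Liberty", "Victory"]).items():
        print(f"{word}: {n}")
```

Words with no match are simply left out, which fits the case where most of the list is not found in a given month's PDF.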

Aug 21, 2022 6:24 AM in response to VikingOSX

I get a different .pdf each month and would like to select it, right click and run the Automator. I’m looking for the same words each time for now, but would add or remove some words as time goes by. If words are found, I’d like to see a list of which ones were found. I can then go to the document and search for the found word(s) and see the context. Most times there will be no word, I know that. But I have a large list, maybe 100 words, and when one is found, I need to know its use.

For instance - out of the list, the ship name Enterprise is found. I can then go and open the .pdf and do a search for that one word, instead of doing all 100 words, and see where it is used and, more importantly, for what purpose - named in a story, stamp auction (interest is in Philatelic uses here), etc.


Does this help? Need more info? Thanks much!

Aug 21, 2022 2:59 PM in response to mrchntmarine

Make that a plain text file named search_words.txt. This solution does not work with Rich Text documents. Try the Quick Action again after that file format and name fix.


What version of macOS are you running this on? Big Sur and later use Finder, not Finder.app, but that should not matter. If you are running this on an older operating system, the Run AppleScript may behave differently, though that syntax that you red-arrowed should not be happening as long as the action was receiving some file.
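
As an illustration of the plain text requirement, here is a small Python sketch that reads a word list with one search term per line, skips blank lines, and refuses a Rich Text file. The function name `load_words` is hypothetical. The RTF check relies only on the fact that RTF files begin with `{\rtf`; it is a sketch, not how the Quick Action itself reads the file.

```python
from __future__ import annotations

from pathlib import Path


def load_words(path: str) -> list[str]:
    # Expand "~" the same way the AppleScript standardizes its tilde path.
    raw = Path(path).expanduser().read_text(encoding="utf-8")
    if raw.lstrip().startswith("{\\rtf"):
        raise ValueError("word list is Rich Text; save it as plain text")
    # One search term per line; surrounding whitespace and blank lines are dropped.
    return [line.strip() for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    # Matches the WORDFILE location used in the AppleScript; adjust as needed.
    print(load_words("~/Desktop/search_words.txt"))
```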

Aug 21, 2022 4:51 PM in response to mrchntmarine

When you run a Quick Action internally through its Run button, only the text file is passed to it and no PDF, because you haven't selected one in the Finder. input is just input, and cannot be referenced as item n of input. Even on its best day, Automator is a PITA for debugging code. That Automator Quick Action should behave differently when you select a PDF and run the QA from the Finder's Quick Action menu, as it was designed to do.

Aug 22, 2022 6:17 AM in response to mrchntmarine

When I tested this QA solution, I used the full text of Steve Jobs's Stanford commencement address exported to PDF from Pages. The search word text file held about ten words, some known to be in the PDF and others that were not. This QA works on all of my PDFs here created from plain text by Pages, MacTeX, or Apple's PDFContext. When I run it against a litany of PDFs from other sources and creation tools, and with content of varying complexity, the QA fails. When I replaced that Zsh word splitting handler, I stopped getting those related errors, and word matches also stopped occurring.
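
For context on why shell word splitting is fragile, here is a short Python sketch of the general quoting problem. It is not a diagnosis of the failures described above. It assumes three values (report path, matched word, page list) have to reach a POSIX shell as exactly three arguments, and the helper name `build_command_args` is made up for this example.

```python
import shlex


def build_command_args(outfile: str, word: str, pages: str) -> str:
    # Quote each value so a POSIX shell sees exactly three arguments,
    # even when a value contains spaces or an apostrophe.
    return " ".join(shlex.quote(v) for v in (outfile, word, pages))


if __name__ == "__main__":
    values = ("/tmp/My File_matched.txt", "Steve Jobs's", "1, 5, 7")
    quoted = build_command_args(*values)
    print(quoted)
    # shlex.split parses the string the way a POSIX shell would.
    print(shlex.split(quoted))
    # Naive whitespace splitting breaks the same values into more pieces.
    print(" ".join(values).split())
```

Text pulled from real-world PDFs can contain quotes, apostrophes, and other punctuation, so any hand-rolled splitting has to cope with those characters.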


My time is limited and I have already invested a day in the development, testing, and refinement of the code I have already posted. For the reasons given in the first paragraph, I doubt that I can make this QA sufficiently resilient to work correctly for all PDF text content, and I cannot continue with it.

Aug 22, 2022 8:52 AM in response to VikingOSX

Ok. Thanks for trying! Appreciate it.


I did convert the .pdf file(s) to different kinds of .pdf and got the same error. I also tried multiple other .pdf's using different words and also got the same error... Who knows?


Tks again.


PS. I googled the speech site, Stanford, downloaded the speech as .pdf in Safari and searched for the word "lucky".


Same error.... Ugh.

Aug 22, 2022 8:20 PM in response to VikingOSX

Well, I tried converting to different .pdfs (I think, haha), with no luck. All this has made me remember my college coding professor from a class in C. He was tough and that’s about the last time I’ve coded. I have no clue what the error means. There is a process in Automator that searches .pdf on desktop, but I couldn’t make that work either.


I threw this up too on another board, with no replies.


What a drag - I’ve got about 100 words I’d like to search for in .pdf’s w/o having to type them all each month.


Not giving up yet!


Tks for the help.

Aug 23, 2022 7:48 AM in response to VikingOSX

I have a Python script here that searches a PDF for a word and returns that word when matched and a list of unique page numbers where it is found. That cannot be implemented on macOS 12.3.1 or later without special instructions, and my challenge now is to translate that code functionality to AppleScript/Objective-C as that presently is supported in Monterey.


I have tested some crude code in AppleScript that gets at the found word, and all pages that it occurs on in multipage PDFs as a proof of concept. The trick is to get that formatted output in a text file as word: 1, 5, 7, 10, 40 and collapse multiple matches on the same page into a single occurrence of that page number.
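
The output format described above can be sketched in a few lines of Python. This is not the Python script mentioned in this post, just an illustration under one assumption: the text of each page has already been extracted into a list of strings, one per page. The names `pages_with_word` and `report_line` are hypothetical.

```python
from __future__ import annotations


def pages_with_word(word: str, page_texts: list[str]) -> list[int]:
    # Each page is tested once, so several matches on one page
    # collapse into a single page number. Pages are numbered from 1.
    needle = word.lower()
    return [n for n, text in enumerate(page_texts, start=1) if needle in text.lower()]


def report_line(word: str, page_texts: list[str]) -> str | None:
    # Build "word: 1, 5, 7" or return None when the word is not found.
    pages = pages_with_word(word, page_texts)
    if not pages:
        return None
    return f"{word}: {', '.join(str(n) for n in pages)}"


if __name__ == "__main__":
    pages = ["lucky, very lucky", "nothing here", "A Lucky day"]
    print(report_line("lucky", pages))
    print(report_line("Enterprise", pages))
```

Because the check is per page rather than per match, no separate de-duplication step is needed.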



Aug 23, 2022 10:14 AM in response to VikingOSX

Yes, I don't like the problem controlling either..... I'm going to mess around too - but I don't have much experience here - so it is slower for me. I can, if it helps, try to post a link to one of the files I get each month.... Hmm, I'll do it anyway, as I don't get notifications and will have to check back. Trying this...


[Link Edited by Moderator]

