Help with Searching .pdf's

I have been trying to make an Automator workflow that will search PDFs for a long list of words and then display the words found.


Firstly, can anyone help with using Automator to do this? If not, I thought I could put the long list of words (they never change) in a text document and, when I need it, copy and paste them into the Preview search bar. But this has not worked well. I don't know whether each term needs to be wrapped in " ", separated by a comma or " | ", or separated by nothing at all.


Thank you

MacBook Pro 16″, macOS 12.5

Posted on Aug 18, 2022 11:23 AM

Question marked as Top-ranking reply

Posted on Aug 25, 2022 6:01 PM

It would appear that the copy/paste into the hosting community changed the double quote to its HTML equivalent, which is &quot;.


Try copy/pasting this code into a new Cocoa-AppleScript Applet template, per the instructions in the script's header comment:


(*
	pdfsearch.applescript
	
	Given a text file containing single, compound, or quoted word strings, and without any empty lines,
	use that as a wordlist to find all occurrences of those words, and the pages found,
	in a given PDF document.
	
	Produce a text report of the words that match, and the found page numbers. That text
	report is written to the original PDF Document location with a "_matched.txt" suffix.
	
	Code does not work reliably in Automator's Run AppleScript action. On M1/M2 Macs, it won't work
	as a saved AppleScript application. In Script Editor, use the following to build it:
	  1) File menu > New From Template > Cocoa-AppleScript Applet.
	  2) Paste the following AppleScript/Objective-C into the preceding template
	  3) Click the hammer icon to compile the code
	  4) Option-key + File menu > Save As…
	        a) File format: Script Bundle
	        b) set filename to pdfsearch.scptd
	        c) no options set
	        d) Save
	  5) Double-click pdfsearch.scptd to run it
	
	Reference: https://discussions.apple.com/thread/254122468
	Tested: macOS 11.6.8, 12.5.1
	Version: 1.1
	Author: VikingOSX, 2022-08-25, Apple Support Communities, no warranties expressed or implied.
*)

use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4" -- macOS Yosemite or later
use scripting additions

property PDFDocument : a reference to current application's PDFDocument
property NSString : a reference to current application's NSString
property NSUTF8StringEncoding : a reference to current application's NSUTF8StringEncoding
property NSURL : a reference to current application's NSURL
property NSArray : a reference to current application's NSArray
property NSMutableOrderedSet : a reference to current application's NSMutableOrderedSet
property NSMutableString : a reference to current application's NSMutableString
property NSMutableArray : a reference to current application's NSMutableArray
property NSLiteralSearch : a reference to current application's NSLiteralSearch
property NSCaseInsensitiveSearch : a reference to current application's NSCaseInsensitiveSearch

property WORDFILE : "~/Desktop/search_words.txt"

-- prompt for one or more PDFs. This script does not perform Optical Character Recognition of scans.
set PDFs to (choose file of type {"com.adobe.pdf"} with multiple selections allowed)

-- expand the tilde path to an absolute path
set xwordfile to (NSString's stringWithString:WORDFILE)'s stringByStandardizingPath()

-- get the WORDFILE into an array
set searchFile to NSString's alloc()'s initWithContentsOfFile:xwordfile encoding:NSUTF8StringEncoding |error|:0
set searchWords to (NSArray's arrayWithArray:(searchFile's componentsSeparatedByString:linefeed))'s mutableCopy()

set outfilename to NSMutableString's alloc()'s init()
-- for exact-case matching of the WORDFILE entries, omit NSCaseInsensitiveSearch from the sum below
set findOptions to (NSLiteralSearch as integer) + (NSCaseInsensitiveSearch as integer)

-- these ordered sets automatically drop duplicates and keep insertion order (they do not sort).
-- matched words are reported in WORDFILE order, and page labels in the order the matches are found
set nameset to NSMutableOrderedSet's alloc()'s init()
set pageset to NSMutableOrderedSet's alloc()'s init()

repeat with apdf in PDFs
	-- add the following suffix to the original PDF path and name.
	(outfilename's setString:((NSString's stringWithString:(POSIX path of apdf))'s stringByDeletingPathExtension()))
	(outfilename's appendString:"_matched.txt")
	
	set pdf to (PDFDocument's alloc()'s initWithURL:(NSURL's fileURLWithPath:(POSIX path of apdf)))
	
	repeat with aword in searchWords
		set found to (pdf's findString:aword withOptions:findOptions)
		if not (count of (found as list)) = 0 then
			repeat with selection in found
				(nameset's addObject:((selection's |string|()) as text))
				repeat with apage in selection's pages()
					(pageset's addObject:(apage's label()))
				end repeat
			end repeat
			
			set aname to (nameset's allObjects()'s firstObject()) as text
			set pageStr to ((pageset's allObjects()'s componentsJoinedByString:", ") as text)'s quoted form
			
			if not (aname contains missing value) = true then
				set args to (outfilename as text)'s quoted form & space & aname's quoted form & space & pageStr
				my format_and_write(args)
			end if
			
			# reset these orderedSets for next word search results
			nameset's removeAllObjects()
			pageset's removeAllObjects()
		end if
	end repeat
	
end repeat
return

on format_and_write(args)
	-- allow for left-justified 20 character word matches and 40 character page sequences
	-- 1 = outfilename 2 = aname 3 = pageStr
	return (do shell script "/bin/zsh -s <<'EOF' - " & args & "
#!/bin/zsh
printf '%-20s%-40s\\n' $2 $3 >> $1
EOF")
end format_and_write
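
For anyone wondering what a line in the "_matched.txt" report looks like, here is a small standalone sketch of the same printf format that format_and_write uses. The format_line helper and the sample words and page numbers are made up for illustration; they are not part of the script above or taken from a real PDF.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): applies the same
# printf format that format_and_write uses for each report line.
format_line() {
    # $1 = matched word, $2 = comma-separated page labels
    # %-20s left-justifies the word in 20 columns,
    # %-40s left-justifies the page list in 40 columns.
    printf '%-20s%-40s\n' "$1" "$2"
}

# Made-up sample values, for illustration only
format_line "invoice" "2, 5, 11"
format_line "purchase order" "7"
```

The arguments are quoted here because plain sh splits unquoted variables on spaces. The original heredoc runs under zsh, which does not word-split unquoted variables by default, so it gets away with bare $2 and $3. Words longer than 20 characters are not truncated; they simply push the page column to the right.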

