Rename pdf files using text within the pdf itself on macOS Sonoma

I have several thousand PDFs that I need to rename. Each PDF contains an account number, and I want to rename the file to that account number. The account number is in the same location in every file. I know nothing about scripts or Automator, but I am willing to learn. This is a process that I have to complete each year when we send out renewals. I did see a few other threads, but I am lost and need someone to explain it in simple layman's terms. I can add an image of the document if needed. TIA


[Re-Titled by Moderator]

iMac (M1, 2021)

Posted on Jan 4, 2024 7:17 AM


Jan 5, 2024 8:46 AM in response to Jenpower1

It is always beneficial to have live data (sample PDF) to test, but you cannot post a PDF here, and it would need to be on a public web location (e.g. dropbox, etc) so I could download it. Five different PDFs would be useful. I right-clicked on your invoice image above and after downloading it, opened and exported it to a PDF. I had mixed testing results due to the low resolution of the image within the PDF.


When I create a regular PDF with a Customer #: nnnnnn string in the body text, I can extract the 4 - 6 digit number. The same is true when the customer number is in a text box. With different code that uses optical character recognition (OCR), I can reveal the Customer #: nnnnnn text string in a PDF where that data is in an image. Testing on your sample PDF will help me choose which code I ultimately use.


Some further questions:

  1. These PDFs are in English, and not in additional languages?
  2. Always a one-page PDF?
  3. What is more convenient for you:
    1. Renamed PDF in same folder?
    2. Renamed PDF in separate folder?


I also plan to have the solution write a report showing the mapping of the original to renamed filenames, and listing the original filename of any PDF that, for some reason, could not be renamed.

Jan 5, 2024 10:09 AM in response to VikingOSX

The following is not a final solution but to test if the customer invoice number is found as text in the PDF.


Steps:

  1. Click once on the Desktop and then press shift+cmd+U to open the /Applications/Utilities folder
    1. Double-click on the application "Script Editor" in that location to launch it.
    2. Set the location folder at the top to your Desktop and then click New Document button.
  2. Copy and paste the following AppleScript code into the Script Editor
    1. Click the hammer icon on the toolbar to compile the pasted AppleScript. This is just checking the syntax and if no errors, it is ready to run.
    2. Now, click the Run ▶︎ button.
      1. A dialog will appear where you select one of your sample invoice PDFs
      2. If the customer number is found, it will be displayed in another dialog


AppleScript code (copy/paste the following into Script Editor):


use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4" -- macOS Yosemite and later
use scripting additions

property ca : current application

set thePDF to POSIX path of (choose file of type "PDF") as text
set pdf to ca's PDFDocument's alloc()'s initWithURL:(ca's NSURL's fileURLWithPath:thePDF)
set pdf_text to ca's NSString's stringWithString:((pdf's pageAtIndex:0)'s |string|())

-- ICU regular expression syntax to capture invoice number in the PDF
set regexPat to ca's NSString's stringWithString:"(?<=Customer #:)\\s+([[:digit:]]{4,6})"

set match to pdf_text's rangeOfString:regexPat options:(ca's NSRegularExpressionSearch)
try
	set capture to (pdf_text's substringWithRange:match)'s stringByTrimmingCharactersInSet:(ca's NSCharacterSet's whitespaceCharacterSet)
	if capture is not missing value then
		display dialog capture as text with title "Customer Invoice Number"
	end if
on error
	display dialog "Unable to find customer invoice # in PDF.
It may be an internal image format and not text." with title "Error dialog"
end try
return


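Two possible dialog results: the customer number shown under the title "Customer Invoice Number", or the "Error dialog" message if no number was found.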



Jan 6, 2024 1:49 PM in response to Jenpower1

Of course, Murphy's law…


Certainly nothing I encountered here in my testing. Maybe I should have tested on a sampling of your PDFs. 😉


Did it rename any PDFs before you received this error or none at all? Anything in the report? I asked that in case it processed some PDFs before the error, and perhaps there was something different about the PDF that caused this error.


I can make some sense of that NSRange, and the floating point number (e.g. NSNotFound), but I don't know where this occurred in the code as it is a runtime and not a syntax error. I am taking a look now…


Can you reply with the AppleScript code that you copied and pasted into the Script Editor, using the Additional Text tool in this editor? Sometimes the code isn't entirely copied, and I have to verify it works here without those errors. If it does, then there is something in one of your PDFs that is triggering the error you received.


Jan 6, 2024 11:50 AM in response to Jenpower1

I hope that this is finished and works to your expectations for processing the folder of PDF documents. The first thing you will see when you run it is perhaps the following, and you should just click Allow.



The next thing will be a folder chooser where you pick that folder with your 1000 invoice PDFs. Single-click the folder name, do not double-click into it. Then, the application will set about capturing invoice digits and renaming your existing invoice PDFs to these digits in the same folder location. All totally silent. When it is done processing, you will see the following with different numbers:


PDFs may be rejected because either the invoice number was not found, or it was outside the range of 4 - 6 digits. The filename is not changed for a rejected PDF, and there will be an entry in the report indicating the invoice number (if it is not 4 - 6 digits). The report is now in alphabetical order by original filename, making it easier to read.



You will be opening up the Script Editor as before, and copy/pasting the code below into it. Click the compile button, and run it on a folder containing 50 - 100 of your PDFs. Just to see how well it processed them.


Next, you will want to use File menu : Save to preserve the AppleScript source and on that save panel:

  1. File Format: Text (do not change Line Endings)
  2. Save As: process_pdfs.applescript
  3. The location can be your Documents folder or even the Desktop. Your choice.
  4. Save


Follow this with an option-key + File menu : Save As… because we want to make this a double-clickable application on the Desktop:

  1. File Format: Application
  2. Options: unset
  3. Save As: process_pdfs.app
  4. Save to your Desktop
  5. And now Quit Script Editor


You now have a double-clickable application on your Desktop that will process a folder of PDF invoices as you requested.


Code:






Jan 4, 2024 4:33 PM in response to Jenpower1

If the Service Address and Customer # information are in a text box when exported to PDF, then I have code that I have tested, and extracts that customer # right now. If that is an image background with text flattened onto it, then I don't have a working solution at this time that can extract the text from the image while processing the PDF.


What I can do is prompt you for the name of the folder containing these PDFs and then process and rename each PDF file in the folder.


Let's revisit this tomorrow, Jan 5.

Jan 4, 2024 10:53 AM in response to Jenpower1

Are these PDFs generated by an application, or the result of scanning documents to PDF? If the latter, have the PDFs been OCR'd so that the content is not an image, but rather has a searchable text layer?


Yes, an image of a sample PDF (no confidential content) posted here would be helpful.


One must be able to look inside the binary PDF content and find that account number.

  • What does an example account number look like, and do the account numbers vary in length, or characters in the account number?
  • Do the account numbers always appear at the beginning of the PDF, left-aligned, or are they embedded in surrounding text, or right-aligned?
  • Are the PDFs all in one folder location or several?
  • Are the renamed PDFs to be written to a separate folder location?



Jan 4, 2024 1:34 PM in response to VikingOSX

Thank you so much for responding!! To answer your questions:


The PDFs are exported as PDFs from a graphic design program

The account numbers are all numeric and vary in length: 4 to 6 characters

The account numbers are embedded at the end of the page - always in the same location.

All of the PDFs are in the same folder.

The renamed PDFs can either remain in the same folder or be moved if necessary.


I am attaching 2 screenshots - one of the entire page and the one of the enlarged area where the account number is located.



Jan 5, 2024 12:34 PM in response to Jenpower1

That screenshot eliminates a lot of potential code for me, so eureka. Although five invoice PDFs with varying customer numbers would be grand, the result above suggests I can create my own test data here that will serve that purpose.


Now to finish the code… though likely not done today (1/05). I hope to have it done by Sat (1/06) so you can test it on a smaller folder of these invoices to confirm it continues to work.

Jan 6, 2024 9:06 AM in response to Jenpower1

Just an update for Sat 01/06:


Here is what is working so far.


  1. Select a folder containing PDFs
    1. Not using the Finder due to the larger quantity of PDFs stated earlier
  2. Capture invoice number from each PDF and create new PDF name based on invoice number
    1. Write renamed PDFs into original folder
    2. Check whether a PDF named for that invoice number already exists and, if so, skip that processing.
  3. Write a text report to the Desktop showing the mapping of original to invoice # renamed PDF files
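
For readers following along, here is a minimal sketch of how the three steps above could fit together. It is not the code from this thread. It assumes the corrected pattern shown later in the thread, one-page PDFs with a text layer, and a report written to the Desktop. The handler name invoiceNumber, the report filename rename_report.txt, and the report line format are all hypothetical. Try it on a copy of a small folder first.

```applescript
use framework "Foundation"
use framework "PDFKit"
use AppleScript version "2.4" -- macOS Yosemite and later
use scripting additions

property ca : current application
-- Pattern taken from this thread; everything else below is an illustrative assumption
property regexPat : "(?<=Customer #:)\\s*([[:digit:]]{4,})"

set folderPath to POSIX path of (choose folder with prompt "Select the folder of invoice PDFs")
set fm to ca's NSFileManager's defaultManager()
-- List the folder with NSFileManager (not the Finder) and sort names the way the Finder would
set allNames to fm's contentsOfDirectoryAtPath:folderPath |error|:(missing value)
set fileNames to (allNames's sortedArrayUsingSelector:"localizedStandardCompare:") as list
set reportLines to {}

repeat with aName in fileNames
	set oldName to aName as text
	if oldName ends with ".pdf" then
		set oldPath to folderPath & oldName
		set newName to my invoiceNumber(oldPath)
		if newName is "" then
			set end of reportLines to oldName & " -> not renamed (no number found)"
		else
			set newPath to folderPath & newName & ".pdf"
			if (fm's fileExistsAtPath:newPath) as boolean then
				-- Never overwrite an existing file; just note it in the report
				set end of reportLines to oldName & " -> skipped (" & newName & ".pdf already exists)"
			else
				fm's moveItemAtPath:oldPath toPath:newPath |error|:(missing value)
				set end of reportLines to oldName & " -> " & newName & ".pdf"
			end if
		end if
	end if
end repeat

-- Join the report lines and write them to a text file on the Desktop
set AppleScript's text item delimiters to linefeed
set reportText to reportLines as text
set AppleScript's text item delimiters to ""
set reportPath to (POSIX path of (path to desktop)) & "rename_report.txt"
(ca's NSString's stringWithString:reportText)'s writeToFile:reportPath atomically:true encoding:(ca's NSUTF8StringEncoding) |error|:(missing value)

-- Returns the digits found on page 1 of the PDF, or "" if nothing matched
on invoiceNumber(pdfPath)
	try
		set pdf to ca's PDFDocument's alloc()'s initWithURL:(ca's NSURL's fileURLWithPath:pdfPath)
		set pageText to ca's NSString's stringWithString:((pdf's pageAtIndex:0)'s |string|())
		set hitRange to pageText's rangeOfString:regexPat options:(ca's NSRegularExpressionSearch)
		set hit to (pageText's substringWithRange:hitRange)'s stringByTrimmingCharactersInSet:(ca's NSCharacterSet's whitespaceCharacterSet)
		return hit as text
	on error
		return ""
	end try
end invoiceNumber
```

The existence check before the move is what keeps two PDFs with the same number from overwriting each other. The second one keeps its original name and shows up as skipped in the report.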


Work remaining:

  1. Alpha sort the original filenames for a more readable report.
  2. Additional testing, documentation, and code cleanup
  3. Currently stuck on something that is eating up a lot of time, but I will power through it today.


The report example:



I will be house- and dog-sitting offsite from tomorrow through Tues eve, and will be without my 32-in display, so I need to get this done today if at all possible.

Jan 6, 2024 3:06 PM in response to Jenpower1

Fixed the problem.


It turns out that I was assuming there was a space after Customer #: <space>nnnn and your PDFs have no space in that location, so all I had to do was alter the regexPat line to:


set regexPat to ca's NSString's stringWithString:"(?<=Customer #:)\\s*([[:digit:]]{4,})"


Where the \\s* means 0 or more whitespace characters, while the earlier \\s+ required at least one.


Fix that one line in the source AppleScript and do another option+Save As: to a Desktop application and you should be good to go. Thanks for helping with this solution.
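
If you want to see the difference between the two patterns for yourself, the small test below may help. It is only a sketch: the two sample strings and the handler name captureDigits are made up for illustration, and it uses the same rangeOfString approach as the earlier test script.

```applescript
use framework "Foundation"
use AppleScript version "2.4" -- macOS Yosemite and later
use scripting additions

property ca : current application

-- Hypothetical sample text: one with a space after the colon, one without
set samples to {"Customer #: 12345", "Customer #:12345"}
set oldPat to "(?<=Customer #:)\\s+([[:digit:]]{4,6})" -- needs at least one whitespace character
set newPat to "(?<=Customer #:)\\s*([[:digit:]]{4,})" -- whitespace is optional

set resultText to ""
repeat with sample in samples
	set sampleText to contents of sample
	set resultText to resultText & quote & sampleText & quote & linefeed
	set resultText to resultText & "   \\s+ pattern: " & my captureDigits(sampleText, oldPat) & linefeed
	set resultText to resultText & "   \\s* pattern: " & my captureDigits(sampleText, newPat) & linefeed
end repeat
display dialog resultText with title "Pattern comparison"

-- Returns the matched digits as text, or "(no match)" when the pattern is not found
on captureDigits(sourceText, pat)
	set s to ca's NSString's stringWithString:sourceText
	set hitRange to s's rangeOfString:pat options:(ca's NSRegularExpressionSearch)
	try
		set hit to (s's substringWithRange:hitRange)'s stringByTrimmingCharactersInSet:(ca's NSCharacterSet's whitespaceCharacterSet)
		return hit as text
	on error
		return "(no match)"
	end try
end captureDigits
```

The dialog should show the \\s+ pattern failing on the sample without a space, while the \\s* pattern finds 12345 in both.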
