AppleScript or workflow to extract text from PDF and rename PDF with the results

Hi Everyone,


I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.


What I need to do is name each PDF with the code which is in the text on the PDF.


It would work like this in an ideal world:


1. Split PDF into single pages


2. Extract text from PDF


3. Rename PDF using the extracted text


I'm struggling with part 3!


I can get a text file containing just the code (I extract the code with a call to BBEdit).


I did think about using a variable for the name, but the rename function doesn't let me use variables.
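
For step 3, one option is to do the rename outside Automator's rename action altogether. The following is only a minimal Ruby sketch under stated assumptions: the helper name `rename_pdf_with_code` and the file names are hypothetical, and it assumes the text file holds nothing but the code.

```ruby
require 'fileutils'

# Hypothetical helper: rename the PDF at pdf_path after the code stored
# in the text file at code_txt_path. Returns the new path.
def rename_pdf_with_code(pdf_path, code_txt_path)
  # Assumption: the text file contains only the code, possibly with a newline.
  code = File.read(code_txt_path).strip
  raise ArgumentError, "no code found in #{code_txt_path}" if code.empty?

  # Replace characters that cause trouble in file names (assumption: / and :).
  safe = code.gsub(%r{[/:]}, '_')
  target = File.join(File.dirname(pdf_path), "#{safe}.pdf")

  # Refuse to overwrite an existing file with the same code.
  raise "#{target} already exists" if File.exist?(target)

  FileUtils.mv(pdf_path, target)
  target
end
```

Called as `rename_pdf_with_code("page1.pdf", "code.txt")`, this would move `page1.pdf` to a file named after the code in the same folder.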

Posted on Feb 3, 2014 8:45 AM


Feb 4, 2014 8:22 AM in response to Tony T1

Hi Tony,


Now that it's working better, I have found some problems with using BBEdit as my tool for extracting the text I want.


Because I can only extract a line of text containing "...", it extracts the whole line, which in some cases contains too much information or characters that are bad for a filename, and BBEdit has no action to clean it up any further.


Do you know a way of extracting specific text from the output of the Extract PDF Text action, maybe using an AppleScript called from Automator?


The codes I'm looking to extract all begin with HB- and are this format: HB-.._......



Sample text here:


Selected Cleaning*

*See individual labels for items included

in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456



Feb 4, 2014 12:04 PM in response to Phillip Briggs

The codes I'm looking to extract all begin with HB- and are this format: HB-.._......


Sample text here:


Selected Cleaning*

*See individual labels for items included

in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456




As VikingOSX suggested, use regular expressions. I don't use BBEdit, so you'll need to look at the documentation, but the expression in grep is:

"HB-.*[0-9]"

This assumes that the last character is a number.
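
To see what that assumption means in practice, here is a small Ruby sketch of the same pattern (Ruby rather than grep is my choice, and the helper name `match_to_last_digit` is hypothetical). Because `.*` is greedy, the match runs from HB- to the last digit on the line, which is not always the end of the code.

```ruby
# The grep pattern above, written as a Ruby regex.
TO_LAST_DIGIT = /HB-.*[0-9]/

# Hypothetical helper: return the matched text, or nil if the line has no match.
def match_to_last_digit(line)
  line[TO_LAST_DIGIT]
end

sample = "in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456"
puts match_to_last_digit(sample)                     # HB-PC_123456
puts match_to_last_digit("HB-PC_123456_E")           # HB-PC_123456 (the _E is dropped)
puts match_to_last_digit("HB-PC_123456 see page 2")  # HB-PC_123456 see page 2
```

So the pattern is fine when the code is the last thing on the line and ends in a digit, but it picks up extra text if any digit follows the code.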

If you want from HB- to the last char use:

"HB-.*$"

Feb 5, 2014 12:10 AM in response to Phillip Briggs

Hello


You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. It then scans the text of each page of each PDF for the predefined pattern and saves the page as a new PDF file in the destination directory, named with the string the pattern extracted. Pages that contain no string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)


Currently the regex pattern is set to:


/HB-.._[0-9]{6}/


which means HB- followed by two characters and _ and 6 digits.
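
As a rough illustration of how that pattern decides which pages get saved and which are ignored, here is a plain Ruby sketch that works on page text only (no PDFKit). The helper name `codes_for_pages` and the sample page strings are made up for the example.

```ruby
CODE_PATTERN = /HB-.._[0-9]{6}/

# Hypothetical helper: given the text of each page, return a hash of
# page number => output file name, plus the list of pages with no match.
def codes_for_pages(page_texts)
  names = {}
  skipped = []
  page_texts.each_with_index do |text, i|
    code = text[CODE_PATTERN] # matched string, or nil when nothing matches
    if code
      names[i + 1] = "#{code}.pdf"
    else
      skipped << i + 1
    end
  end
  [names, skipped]
end

pages = [
  "Offer valid from 09/03/14 to 27/04/14 HB-PC_123456",
  "a page with no code on it",
  "HB-AB_000001_E"
]
names, skipped = codes_for_pages(pages)
puts names.inspect   # page 1 and page 3 get names; the _E on page 3 is not captured
puts skipped.inspect # [2]
```

Note that a trailing part such as _E is not captured, because the pattern stops after the six digits.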


Minimally tested under 10.6.8.


Hope this may help,

H



_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed
        
        set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
            default location (path to desktop)
        
        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat
        
        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering
        
        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = ARGV.shift.chomp('/')

ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._[0-9]{6}/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end
end
EOF"
    end script
    tell o to run
end _main

Feb 5, 2014 4:09 AM in response to Tony T1

Tony,


I have this working now - but I used the shell script:


echo "$1" | grep -o "HB-\S*"


instead, because some codes didn't follow the accepted format and had an extra bit on the end, such as _E or _V2. Using this, I am able to pull out the code up to the end of the string.


The problems I have now relate to what happens when a PDF has totally the wrong code: the workflow leaves the "WORKING.PDF" and matching text file in the folder, which messes up the next files, even though "Replace Existing Files" is ticked wherever possible.


I've got away with not using BBEdit as well; TextEdit is OK for reading the extracted text.


I'll keep at it to see if I can clean up the WORKING files after the workflow.
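
One possible way to do that cleanup is a short script run as the last step of the workflow. This is only a sketch: the helper name `remove_working_files` is hypothetical, and it assumes the leftovers are all named WORKING with some extension (for example WORKING.PDF and a WORKING text file) in a single folder.

```ruby
require 'fileutils'

# Hypothetical helper: delete leftover WORKING.* files from the given folder
# and return the list of paths that were removed.
# Assumption: every leftover is named WORKING.<something>, in any letter case.
def remove_working_files(folder)
  leftovers = Dir.glob(File.join(folder, 'WORKING.*'), File::FNM_CASEFOLD)
  FileUtils.rm_f(leftovers) # rm_f does not complain if a file is already gone
  leftovers
end
```

Running `remove_working_files("/path/to/folder")` before each new PDF is processed should stop a failed file from affecting the next one.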


Thanks again for your help - I certainly am closer now and have more options.

Feb 5, 2014 4:16 AM in response to Hiroto

Hi Hiroto,


Your script is great and works really well.


I have a couple of questions though:


1. If I wanted to extend the regex to include extra characters where possible, i.e. to match HB-123456_E, what would I use? I have tried "HB-\S*", which seems to work OK in Automator for matching everything up to the first space.


2. If I wanted to run the script as a folder action, or hot folder script I would not need the dialog boxes to select source and destination. Can it be altered easily for that purpose?
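
On the first question, here is a hedged Ruby sketch comparing two candidate patterns. `LOOSE` is the "HB-\S*" already tried; `WITH_SUFFIX` is an assumption of mine that keeps the HB-.._ plus six digits format and allows one optional trailing part such as _E or _V2. Both constant names are made up for the example.

```ruby
# Everything from HB- up to the next whitespace.
LOOSE = /HB-\S*/

# Assumed stricter variant: the original format plus an optional
# _suffix made of letters and digits.
WITH_SUFFIX = /HB-.._[0-9]{6}(?:_[A-Za-z0-9]+)?/

text = "Offer valid to 27/04/14 HB-PC_123456_V2, see label"
puts text[LOOSE]        # HB-PC_123456_V2,   (the comma is included)
puts text[WITH_SUFFIX]  # HB-PC_123456_V2
```

The loose pattern also picks up punctuation that touches the code, which may not be wanted in a file name. The stricter one will not match a code written as HB-123456_E, since that has no two-character part before the underscore. If either pattern is pasted into the script above, any backslash needs doubling (for example `\\S`) because the Ruby code sits inside an AppleScript string, as the script already does with `\\.pdf`.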


Thank you so much for your help so far!
