AppleScript or workflow to extract text from PDF and rename PDF with the results

Question

Hi Everyone,

I am supplied with hundreds of PDFs, each containing a stock code, but the PDFs themselves are not named consistently, or they arrive as multi-page PDFs.

What I need to do is name each PDF with the code which is in the text on the PDF.

It would work like this in an ideal world:

1. Split PDF into single pages

2. Extract text from PDF

3. Rename PDF using the extracted text

I'm struggling with part 3!

I can get a text file containing just the code (I extract the code with a call to BBEdit).

I did think about using a variable for the name, but the rename function doesn't let me use variables.

Posted on Feb 3, 2014 8:45 AM

Answer 1

Tony T1

Level 6

10,247 points

Feb 3, 2014 9:37 AM in response to Phillip Briggs

.... but the rename functions doesn't let me use variables.

It should. This test worked for me:

Answer 2

Feb 4, 2014 1:28 AM in response to Tony T1

You're right, it does work when I use a simplified workflow like that. My actual one is more complicated, so I must have it wrong somewhere?

It fails at the last step, where I try to select the PDF and name it using the variable I defined earlier. Strangely, the text file at stage 11 is saved and named correctly.

Answer 3

Tony T1

Level 6

10,247 points

Feb 4, 2014 5:41 AM in response to Phillip Briggs

That is strange. Do you get an error message?

What does it show in the last Action when you click: [Results]

Answer 4

Feb 4, 2014 5:50 AM in response to Tony T1

It says more than one item was passed?

but I'm just trying to select the PDF which was previously named "Working"

I named it like that earlier in the workflow so I could pick it up again at the end.

Answer 5

Tony T1

Level 6

10,247 points

Feb 4, 2014 6:35 AM in response to Phillip Briggs

Phillip Briggs wrote:

It says more than one item was passed?

but I'm just trying to select the PDF which was previously named "Working"

Click [Results] in the Get Specified Finder Items action to see what was passed.

Answer 6

Feb 4, 2014 6:46 AM in response to Tony T1

Aah - for some reason it passed the text file from earlier.

I just ticked "Ignore this action's input" and it passed only the PDF.

So far so good!

Thanks.

Answer 7

Feb 4, 2014 8:22 AM in response to Tony T1

Hi Tony,

Now that it's working better, I have found some problems with using BBEdit as my tool for extracting the text I want.

Because I can only extract a line of text containing "...", it extracts the whole line, which in some cases contains too much information or characters that are not suitable for a filename, and there isn't any BBEdit action to clean it up further.

Do you know a way of extracting certain text from the output of the Extract PDF Text action, maybe using an AppleScript called from Automator?

The codes I'm looking to extract all begin with HB- and are this format: HB-.._......

Sample text here:

Selected Cleaning*

*See individual labels for items included

in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456

Answer 8

VikingOSX

Level 10

123,144 points

Feb 4, 2014 9:33 AM in response to Phillip Briggs

Tony and Phillip,

Perhaps a regular expression grouping within BBEdit will return just the text string you want and not the whole line.

Answer 9

Tony T1

Level 6

10,247 points

Feb 4, 2014 12:04 PM in response to Phillip Briggs

The codes I'm looking to extract all begin with HB- and are this format: HB-.._......

Sample text here:

Selected Cleaning*

*See individual labels for items included

in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456

As VikingOSX suggested, use regular expressions. I don't use BBEdit, so you'll need to look at its documentation, but the expression in grep is:

"HB-.*[0-9]"

This assumes that the last character is a number.

If you want from HB- to the last char use:

"HB-.*$"
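
The difference between the two expressions is easiest to see when something follows the code on the same line. The following is only a minimal sketch: the variable and function names are made up for illustration, and the "while stocks last" text is a hypothetical variant of the sample, not something taken from the real PDFs.

```shell
#!/bin/sh
# Sample line from the thread, plus a hypothetical variant with trailing words.
sample='in this offer. EPU5 Offer valid from 09/03/14 to 27/04/14 HB-PC_123456'
trailing='HB-PC_123456 while stocks last'

# Greedy match from "HB-" up to the LAST digit on the line.
to_last_digit() { printf '%s\n' "$1" | grep -o 'HB-.*[0-9]'; }

# Match from "HB-" to the end of the line, whatever follows the code.
to_line_end() { printf '%s\n' "$1" | grep -o 'HB-.*$'; }

to_last_digit "$sample"     # HB-PC_123456
to_line_end "$sample"       # HB-PC_123456
to_last_digit "$trailing"   # HB-PC_123456
to_line_end "$trailing"     # HB-PC_123456 while stocks last
```

Because `.*` is greedy, the first form would also swallow any later digits on the same line, so it is only safe when the code is the last thing containing a number.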

Answer 10

Tony T1

Level 6

10,247 points

Feb 4, 2014 12:45 PM in response to Phillip Briggs

If you're not familiar with BBEdit regular expressions (I'm not), you can:

Add a Run Shell Script action (with Pass Input set to [as arguments]) after Set Value of Variable, containing:

echo "$1" | grep -o "HB-.*$"

Then add Set Value of Variable again:

Answer 11

Tony T1

Level 6

10,247 points

Feb 4, 2014 1:03 PM in response to Tony T1

I guess you can put this Run Shell Script before your Set Value of Variable (no need to set the variable again).

Answer 12

Hiroto

Level 5

7,467 points

Feb 5, 2014 12:10 AM in response to Phillip Briggs

Hello

You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. It then scans the text of each page of the PDF files for the predefined pattern and saves the page as a new PDF file in the destination directory, named with the string matched by the pattern. Pages that do not contain a string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)

Currently the regex pattern is set to:

/HB-.._[0-9]{6}/

which means HB- followed by two characters and _ and 6 digits.
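
For anyone who wants to try the same pattern from a Run Shell Script action before running the full script, here is a rough equivalent using grep. It is only a sketch: `first_code` is a made-up helper name, and the second input line is a hypothetical code with a suffix.

```shell
#!/bin/sh
# Print the first stock code of the form HB-XX_NNNNNN found in the given text.
# grep -E is needed for the {6} repetition; head -n 1 keeps only the first hit.
first_code() {
    printf '%s\n' "$1" | grep -E -o 'HB-.._[0-9]{6}' | head -n 1
}

first_code 'Offer valid from 09/03/14 to 27/04/14 HB-PC_123456'   # HB-PC_123456
first_code 'HB-PC_123456_V2'                                      # HB-PC_123456 (suffix dropped)
```

Text with no matching code produces no output, which a workflow can test for before attempting a rename.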

Minimally tested under 10.6.8.

Hope this may help,

H

_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed
        
        set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
            default location (path to desktop)
        
        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat
        
        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering
        
        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = ARGV.shift.chomp('/')

ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._[0-9]{6}/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end
end
EOF"
    end script
    tell o to run
end _main

Answer 13

Feb 5, 2014 12:57 AM in response to Phillip Briggs

Wow what great responses - I will try them out today and let you know how I get on!

Answer 14

Feb 5, 2014 4:09 AM in response to Tony T1

Tony,

I have this working now - but I used the shell script:

echo "$1" | grep -o "HB-\S*"

instead, because some codes didn't follow the accepted format and had an extra bit on the end, such as _E or _V2. Using this I am able to pull out the code up to the next whitespace character.
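
A slightly more defensive variant of that command is sketched below. It assumes that only letters, digits, underscores and hyphens are wanted in the filename (an assumption, not something stated in the thread), uses the POSIX class `[^[:space:]]` in place of `\S`, and keeps only the first match. `safe_code` is a made-up name and the sample inputs are hypothetical.

```shell
#!/bin/sh
# Extract the first "HB-" code up to the next whitespace, then delete every
# character that is not a letter, digit, underscore, hyphen or newline.
safe_code() {
    printf '%s\n' "$1" |
        grep -o 'HB-[^[:space:]]*' |
        head -n 1 |
        tr -cd 'A-Za-z0-9_\n-'
}

safe_code 'valid to 27/04/14 HB-PC_123456_V2 more text'   # HB-PC_123456_V2
safe_code 'HB-PC_123456/E.'                               # HB-PC_123456E
```

Stripping the characters afterwards keeps slashes and punctuation out of the name that is later passed to the rename step.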

The problems I have now relate to what happens when a PDF has totally the wrong code: the workflow leaves the "WORKING.PDF" and its matching text file in the folder, which messes up the next files, even though "Replace Existing Files" is ticked wherever possible.

I've got away with not using BBEDIT as well, Text Edit is OK for reading the extracted text.

I'll keep at it to see if I can cleanup the WORKING files after the workflow.
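
One possible way to do that cleanup is a final Run Shell Script step along the following lines. This is only a sketch under assumptions: the function name `finish_page` is made up, the text file is assumed to be called WORKING.txt, and the folder path and extracted code are assumed to be passed in as the two arguments.

```shell
#!/bin/sh
# Rename the working PDF if a code was found, then remove any leftovers so
# the next PDF starts from a clean folder.
#   $1 = folder holding the working files (assumed layout)
#   $2 = extracted code, or an empty string if nothing matched
finish_page() {
    dir=$1
    code=$2
    if [ -n "$code" ] && [ -f "$dir/WORKING.pdf" ]; then
        mv -f "$dir/WORKING.pdf" "$dir/$code.pdf"
    fi
    # -f keeps rm quiet when a file has already been moved or never existed.
    rm -f "$dir/WORKING.pdf" "$dir/WORKING.txt"
}

# Example call (hypothetical folder):
# finish_page "$HOME/Desktop/Output" "HB-PC_123456"
```

With an empty code the working PDF is simply deleted, so a page with the wrong code no longer interferes with the next file. If such pages need to be kept for checking, the `rm` could be swapped for a move into a separate folder.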

Thanks again for your help - I certainly am closer now and have more options.

Answer 15

Feb 5, 2014 4:16 AM in response to Hiroto

Hi Hiroto,

your script is great and works really well.

I have a couple of questions though:

1. If I wanted to extend the regex to include extra characters where possible, i.e. to match HB-123456_E, what would I use? I have tried "HB-\S*", which seems to work OK in Automator for matching everything up to the first space.

2. If I wanted to run the script as a folder action, or hot folder script I would not need the dialog boxes to select source and destination. Can it be altered easily for that purpose?
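
For the first question, one option is to keep the strict pattern and add an optional suffix group, rather than accepting everything up to the next space. The sketch below tests that idea with grep. The assumption that a suffix is an underscore followed by letters or digits comes from the _E and _V2 examples earlier in the thread, and `code_with_suffix` is a made-up name. The same expression is also valid Ruby regex syntax, so it should be usable in place of the pattern in the script, though that has not been tested here.

```shell
#!/bin/sh
# Match HB-XX_NNNNNN, optionally followed by one suffix such as _E or _V2.
code_with_suffix() {
    printf '%s\n' "$1" |
        grep -E -o 'HB-.._[0-9]{6}(_[A-Za-z0-9]+)?' |
        head -n 1
}

code_with_suffix 'to 27/04/14 HB-PC_123456'       # HB-PC_123456
code_with_suffix 'to 27/04/14 HB-PC_123456_V2'    # HB-PC_123456_V2
code_with_suffix 'HB-PC_123456_E, see label'      # HB-PC_123456_E
```

Compared with "HB-\S*", this stops at punctuation such as a trailing comma, at the cost of missing codes that do not follow the two-character, six-digit layout.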

Thank you so much for your help so far!
