AppleScript or workflow to extract text from PDF and rename PDF with the results

Question

Hi Everyone,

I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.

What I need to do is name each PDF with the code which is in the text on the PDF.

It would work like this in an ideal world:

1. Split PDF into single pages

2. Extract text from PDF

3. Rename PDF using the extracted text

I'm struggling with part 3!

I can get a text file containing just the code (I'm extracting the code with a call to BBEdit).

I did think about using a variable for the name, but the rename function doesn't let me use variables.

Posted on Feb 3, 2014 8:45 AM

Answer 1

Tony T1

Level 6

10,247 points

Feb 5, 2014 4:34 AM in response to Phillip Briggs

1. If I wanted to extend the regex to include extra characters where possible, i.e. to match HB-123456_E what would I use? I have tried "HB-\S*" which seems to work OK in Automator for matching everything up to the first space.

To match from HB- to _ followed by whitespace: "HB-.*_[[:space:]]"

To match from HB- to the end of that line: "HB-.*$"

To match from HB- to _ followed by a letter: "HB-.*_[A-Za-z]"

To match from HB- to _ followed by a letter or digit: "HB-.*_[A-Za-z0-9]"

See: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v2/png
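
To see how these patterns differ in practice, here is a minimal Ruby sketch that runs each one against a single made-up line of text. The sample string and the first_match helper are assumptions for illustration; the actual text in the PDFs is not shown in this thread. The whitespace class is written in its bracketed POSIX form, [[:space:]]. Note that .* is greedy, so on a line with several underscores it extends to the last one that still satisfies the rest of the pattern.

```ruby
# Hypothetical sample text; real PDF page text will differ.
SAMPLE = "Item HB-123456_E blue widget"

# Return the first substring matching pattern, or nil if there is none.
def first_match(text, pattern)
  m = pattern.match(text)
  m ? m[0] : nil
end

PATTERNS = [
  ['up to the first whitespace',    /HB-\S*/],
  ['through _ plus whitespace',     /HB-.*_[[:space:]]/],
  ['through _ plus a letter',       /HB-.*_[A-Za-z]/],
  ['through _ plus a letter/digit', /HB-.*_[A-Za-z0-9]/],
  ['to the end of the line',        /HB-.*$/],
]

PATTERNS.each do |label, pattern|
  puts "#{label}: #{first_match(SAMPLE, pattern).inspect}"
end
```

On this sample the whitespace variant finds nothing, because the underscore is followed by a letter rather than a space.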

Answer 2

Feb 5, 2014 6:34 AM in response to Tony T1

Hi All,

I really want to use: HB-.._\S*

as this captures all eventualities - codes that include an extension like "HB-12_345678_E" and also codes that don't. It captures everything up to the first space which I think should cover everything.

It works in Automator in this shell script:

echo "$1" | grep -o "HB-.._\S*"

But in Hiroto's script if I replace: "page.string.to_s =~ /HB-.._[0-9]{6}/"

with: page.string.to_s =~ /HB-.._\S*/

I get a Syntax error:

Expected “"” but found unknown token.

Any ideas?

Answer 3

Tony T1

Level 6

10,247 points

Feb 5, 2014 8:21 AM in response to Phillip Briggs

Instead of \S try [^ ]

page.string.to_s =~ /HB-.._[^ ]*/

Oh wait, the Ruby code is inside an AppleScript string, so you need to escape the \

page.string.to_s =~ /HB-.._\\S*/
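
The two suggestions are not quite equivalent once the pattern reaches Ruby: [^ ] excludes only the space character, so it keeps matching across tabs and line breaks, while \S stops at any whitespace. The doubled backslash is needed only because the Ruby code sits inside an AppleScript string literal; Ruby itself sees a single \S. Here is a small sketch with a made-up two-line page string (an assumption, on the basis that extracted page text often contains line breaks):

```ruby
# Hypothetical page text: the code is followed by a line break, not a space.
page_text = "HB-12_345678_E\nNext line of the page"

with_negated_space  = page_text[/HB-.._[^ ]*/]  # stops only at a literal space
with_non_whitespace = page_text[/HB-.._\S*/]    # stops at any whitespace

puts with_negated_space.inspect
puts with_non_whitespace.inspect
```

With this input the [^ ] version runs past the line break and picks up the first word of the next line, which would end up in the file name.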

Answer 4

Feb 5, 2014 8:25 AM in response to Tony T1

YES!

That works!

I can certainly use it as it stands now - but can I make the script 'standalone' so I can attach it to folder actions, use it in workflows, etc.? I will try!

Thanks again

Answer 5

VikingOSX

Level 10

123,188 points

Feb 5, 2014 4:25 PM in response to Tony T1

What a royal pain that AppleScript escaping is too.😠

Answer 6

Hiroto

Level 5

7,467 points

Feb 5, 2014 7:25 PM in response to Phillip Briggs

Hello

Here's a revised version crafted for an Automator Run Shell Script action, which can be used to create an Automator Folder Action.

To use this, create a new Automator Folder Action with a Run Shell Script action set as follows:

Shell = /bin/bash

Pass input = as arguments

Contents = code as listed below

Notes on the script.

* The output directory is hard-coded in the script and is currently set to ~/Desktop/testout.

* When used as an Automator Folder Action attached to a watched folder A, this script processes each PDF file added to A and then moves it to a done directory named Done in A. It creates the A/Done directory if not present. (The directory A is obtained as the parent directory of the first item in the argument list, which is the watched folder when the script runs as a Folder Action. If it is not used as a Folder Action, this logic would be inappropriate and the done directory would be better hard-coded in the script.)

* It only processes *.pdf files and leaves others alone.

* It creates a log file named YYYY-MM-DD_log.txt in the done directory.

* In my experience, Folder Actions are non-deterministic and unreliable. They are often very slow to be triggered and can even fail to fire. This is one of the reasons I implemented detailed logging in this script.

Hope this may help,

H

#!/bin/bash
# 
#     for Run Shell Script Action in Automator Folder Action
#         input  = pdf files
#         output = none
# 
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - "$@"
require 'fileutils'
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = File.expand_path('~/Desktop/testout')          # destination directory where resulting pdfs are saved
donedir = File.dirname(ARGV[0]) + '/Done'               # done directory where done files are moved
logf = "#{donedir}/#{Time.now.strftime('%F_log.txt')}"  # log file named after YYYY-MM-DD_log.txt is created in done directory

def log(logf, s)
    File.open(logf, 'a') do |a|
        a.print "%-26s%s\n" % [Time.now.strftime('%F %T%z'), s]
    end
end

# 
#     create outdir, donedir if not present
#     
[outdir, donedir].each do |d|
    d = File.readlink(d) if File.symlink?(d)
    FileUtils.mkdir_p d unless File.exist?(d)
    raise RuntimeError, "#{d}: Not a directory." unless File.directory?(d)
end

# 
#     process each argument
# 
ARGV.select {|f| f =~ /\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    k = 0
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._\S+/
        name = $&
        unless name
            log(logf, "# Skipped page #{i + 1} of #{path}: No matching string.")
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        if doc1.writeToFile("#{outdir}/#{name}.pdf")
            k += 1
            log(logf, "Saved page #{i + 1} of #{path} as #{outdir}/#{name}.pdf")
        else
            log(logf, "# Failed to save page #{i + 1} of #{path}")
        end
    end
    log(logf, "Extracted #{k} out of #{pcnt} page(s) of #{path}")
    FileUtils.mv(f, donedir)
    log(logf, "Moved #{path} -> #{donedir}/#{File.basename(path)}\n")
end
EOF
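
The way the script derives its done directory and daily log file can be tried out separately from PDFKit. This is a minimal sketch of that logic only; the helper names done_dir_for and log_path_for and the example paths are made up for illustration.

```ruby
# Done directory: a folder named "Done" next to the first input file,
# mirroring File.dirname(ARGV[0]) + '/Done' in the script.
def done_dir_for(args)
  File.join(File.dirname(args.first), 'Done')
end

# Log file: YYYY-MM-DD_log.txt inside the done directory.
def log_path_for(done_dir, now = Time.now)
  File.join(done_dir, now.strftime('%F_log.txt'))
end

# Hypothetical input paths, as a Folder Action might pass them.
args = ['/Users/me/Watched/scan1.pdf', '/Users/me/Watched/scan2.pdf']
done = done_dir_for(args)

puts done
puts log_path_for(done, Time.local(2014, 2, 5))
```

When the script runs outside a Folder Action, the parent of the first argument may not be the folder you expect, which is why the notes suggest hard-coding the done directory in that case.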

Answer 7

Feb 6, 2014 3:49 AM in response to Phillip Briggs

That is awesome - thanks to everyone!

Answer 8

ckrach

Level 1

0 points

Oct 24, 2014 2:00 PM in response to Hiroto

This is a great script, Hiroto. I have the exact same need as Phillip except I need the script to keep the pdf(s) intact when renaming them rather than split it up into individual pages (using the extracted text only from page 1). What would need to change to accomplish this? Thanks in advance.

Answer 9

Dec 30, 2014 6:17 AM in response to Tony T1

Can someone help with this? I'd like the text search string to be a variable, so that the script asks me what I want to search for instead of it being pre-set:

So the "VM_" string would be the variable in the line: page.string.to_s =~ /VM_\\S*/

_main()

on _main()
    script o
        property aa : choose file with prompt ("Choose PDF Files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed

        set my aa's beginning to choose folder with prompt ("Choose Destination Folder.") ¬
            default location (path to desktop)

        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat

        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering

        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = ARGV.shift.chomp('/')

ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /VM_\\S*/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end # each page (this end was missing from the pasted script)
end # each file (this end was missing from the pasted script)
EOF"
    end script
    tell o to run
end _main

Answer 10

Tony T1

Level 6

10,247 points

Dec 30, 2014 6:56 AM in response to Phillip Briggs

AppleScript?

Just use display dialog:

set searchText to text returned of (display dialog "Enter Search Text" default answer "")
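
Once the search text is in an AppleScript variable, it still has to reach the Ruby code and become part of the regular expression. One possible approach, sketched here on the Ruby side only, is to build the pattern at run time with Regexp.escape so that characters such as . in the typed text are matched literally. The code_pattern helper and the sample strings are assumptions; how the value is handed to Ruby (for example as an extra argument ahead of the file paths) is left open.

```ruby
# Build a pattern matching the given prefix followed by any run of
# non-whitespace characters. Regexp.escape makes the prefix literal.
def code_pattern(prefix)
  Regexp.new(Regexp.escape(prefix) + '\S*')
end

search_text = 'VM_'                    # would come from the dialog
page_text   = 'Ref VM_1234-A shelf 7'  # hypothetical page text

name = page_text[code_pattern(search_text)]
puts name.inspect
```

In the script, the fixed /VM_\\S*/ match would then be replaced by a match against the pattern built this way.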
