Applescript or workflow to extract text from PDF and rename PDF with the results

Hi Everyone,


I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.


What I need to do is name each PDF with the code which is in the text on the PDF.


It would work like this in an ideal world:


1. Split PDF into single pages


2. Extract text from PDF


3. Rename PDF using the extracted text


I'm struggling with part 3!


I can get a text file containing just the code (I'm extracting it with a call to BBEdit).


I did think about using a variable for the name, but the rename function doesn't let me use variables.

Posted on Feb 3, 2014 8:45 AM


Feb 5, 2014 4:34 AM in response to Phillip Briggs

1. If I wanted to extend the regex to include extra characters where possible, i.e. to match HB-123456_E, what would I use? I have tried "HB-\S*" which seems to work OK in Automator for matching everything up to the first space.


To match from HB- to an _ followed by a whitespace character: "HB-.*_[[:space:]]"

To match from HB- to the end of that line: "HB-.*$"

To match from HB- to an _ followed by any letter: "HB-.*_[A-Za-z]"

To match from HB- to an _ followed by any letter or digit: "HB-.*_[A-Za-z0-9]"


See: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v2/png
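As a rough illustration of how these patterns behave, here is a small Ruby sketch. The two sample lines are invented for the example (they are not from a real PDF), and the POSIX class is written in its bracketed form, [[:space:]]. Note that ".*" is greedy, so each pattern runs to the last "_" on the line that satisfies it.

```ruby
# Sample text is made up for illustration; only the patterns come from the post.
SAMPLES = [
  "Item HB-12_345678_E in stock",
  "Item HB-12_345678 in stock",
]

PATTERNS = {
  "underscore then whitespace"      => /HB-.*_[[:space:]]/,
  "to end of line"                  => /HB-.*$/,
  "underscore then letter"          => /HB-.*_[A-Za-z]/,
  "underscore then letter or digit" => /HB-.*_[A-Za-z0-9]/,
  "up to first whitespace"          => /HB-\S*/,
}

# Return the matched text, or nil when the pattern does not match.
def first_match(pattern, text)
  text[pattern]
end

SAMPLES.each do |line|
  puts line
  PATTERNS.each do |label, pattern|
    puts "  %-32s %s" % [label, first_match(pattern, line).inspect]
  end
end
```

Running it against your own sample text is a quick way to see which variant keeps the "_E" suffix and which one also copes with codes that have no suffix.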

Feb 5, 2014 6:34 AM in response to Tony T1

Hi All,


I really want to use: HB-.._\S*


as this captures all eventualities - codes that include an extension like "HB-12_345678_E" and also codes that don't. It captures everything up to the first space which I think should cover everything.


It works in Automator in this shell script:


echo "$1" | grep -o "HB-.._\S*"



But in Hiroto's script if I replace: "page.string.to_s =~ /HB-.._[0-9]{6}/"


with: page.string.to_s =~ /HB-.._\S*/


I get a Syntax error:


Expected “"” but found unknown token.


Any ideas?
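A possible explanation, offered as a guess: the message reads like an AppleScript compile error rather than a Ruby one, which would fit the Ruby code sitting inside an AppleScript string, where a lone backslash is not accepted and has to be doubled. The script posted later in this thread writes the pattern as \\S* for that reason. The sketch below only demonstrates the same principle in Ruby's own double-quoted strings; the sample text is invented.

```ruby
# The pattern text that must finally reach the regex engine: HB-.._\S*
wanted = 'HB-.._\S*'        # single quotes keep the backslash literally

# Inside a double-quoted string the backslash has to be doubled
# to survive as a single backslash.
escaped = "HB-.._\\S*"

puts wanted == escaped      # both hold the same characters

pattern = Regexp.new(escaped)
text = "Code: HB-12_345678_E rev 2"   # invented sample text
puts text[pattern]
```

If that is the cause, writing the line as page.string.to_s =~ /HB-.._\\S*/ inside the AppleScript string would be the thing to try.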

Feb 5, 2014 7:25 PM in response to Phillip Briggs

Hello


Here's a revised version crafted for an Automator Run Shell Script action, which can be used to create an Automator Folder Action.


To use this, create a new Automator Folder Action with a Run Shell Script action set as follows:

Shell = /bin/bash

Pass input = as arguments

Contents = code as listed below


Notes on script.


* The output directory is hard-coded in the script and is currently set to ~/Desktop/testout.


* When used as an Automator Folder Action attached to a watched folder A, this script will process each pdf file added to A and then move it to a directory named Done inside A. It will create the A/Done directory if not present. (The directory A is obtained as the parent directory of the first item in the argument list, which is the watched folder when the script runs as a Folder Action. If it is not used as a Folder Action, this logic would be inappropriate and the Done directory would be better hard-coded in the script.)


* It only processes *.pdf files and leaves others alone.


* It will create a log file named YYYY-MM-DD_log.txt in the Done directory.


* In my experience, Folder Actions are non-deterministic and unreliable. They are often very slow to trigger and can even fail to fire. That is one of the reasons I implemented detailed logging in this script.
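The directory and log-file conventions in the notes above can be tried out on their own, without PDFKit. The following is only a sketch: the helper names (done_dir_for, log_path, pdfs_only) and the file names are made up for illustration, and a temporary folder stands in for the watched folder A.

```ruby
require 'fileutils'
require 'tmpdir'

# Done directory = "Done" inside the parent folder of the first argument.
def done_dir_for(paths)
  File.join(File.dirname(paths.first), 'Done')
end

# Log file named YYYY-MM-DD_log.txt inside the Done directory.
def log_path(donedir, now = Time.now)
  File.join(donedir, now.strftime('%Y-%m-%d') + '_log.txt')
end

# Keep only paths ending in .pdf (case-insensitive).
def pdfs_only(paths)
  paths.select { |f| f =~ /\.pdf$/i }
end

# Demo in a throwaway folder standing in for the watched folder A.
Dir.mktmpdir do |watched|
  args = ['a.pdf', 'B.PDF', 'notes.txt'].map { |n| File.join(watched, n) }
  args.each { |f| FileUtils.touch(f) }

  donedir = done_dir_for(args)
  FileUtils.mkdir_p(donedir)
  pdfs_only(args).each { |f| FileUtils.mv(f, donedir) }

  puts "Done dir: #{donedir}"
  puts "Log file: #{log_path(donedir)}"
  puts "Moved:    #{(Dir.entries(donedir) - ['.', '..']).sort.inspect}"
end
```

Because the Done directory is derived from the first argument only, files passed in from different folders would all end up in the first file's Done directory, which is why hard-coding it is suggested for use outside a Folder Action.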


Hope this may help,

H



#!/bin/bash
# 
#     for Run Shell Script Action in Automator Folder Action
#         input  = pdf files
#         output = none
# 
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - "$@"
require 'fileutils'                                     # library file name is lowercase; 'FileUtils' only loads on case-insensitive volumes
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = File.expand_path('~/Desktop/testout')          # destination directory where resulting pdfs are saved
donedir = File.dirname(ARGV[0]) + '/Done'               # done directory where done files are moved
logf = "#{donedir}/#{Time.now.strftime('%F_log.txt')}"  # log file named after YYYY-MM-DD_log.txt is created in done directory

def log(logf, s)
    File.open(logf, 'a') do |a|
        a.print "%-26s%s\n" % [Time.now.strftime('%F %T%z'), s]
    end
end

# 
#     create outdir, donedir if not present
#     
[outdir, donedir].each do |d|
    d = File.readlink(d) if File.symlink?(d)
    FileUtils.mkdir_p d unless File.exists?(d)
    raise RuntimeError, "#{d}: Not a directory." unless File.directory?(d)
end

# 
#     process each argument
# 
ARGV.select {|f| f =~ /\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    k = 0
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._\S+/
        name = $&
        unless name
            log(logf, "# Skipped page #{i + 1} of #{path}: No matching string.")
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile("#{outdir}/#{name}.pdf")
            log(logf, "# Failed to save page #{i + 1} of #{path}")
        else
            k += 1
            log(logf, "Saved page #{i + 1} of #{path} as #{outdir}/#{name}.pdf")
        end
    end
    log(logf, "Extracted #{k} out of #{pcnt} page(s) of #{path}")
    FileUtils.mv(f, donedir)
    log(logf, "Moved #{path} -> #{donedir}/#{File.basename(path)}\n")
end
EOF

Dec 30, 2014 6:17 AM in response to Tony T1

Can someone help with this? I'd like the text search string to be a variable, so that the script asks me what I want to search for instead of it being pre-set.

So the "VM_" string would be the variable in the line: page.string.to_s =~ /VM_\\S*/


_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose PDF Files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed

        set my aa's beginning to choose folder with prompt ("Choose Destination Folder.") ¬
            default location (path to desktop)

        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat

        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering

        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

outdir = ARGV.shift.chomp('/')

ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /VM_\\S*/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end
end
EOF"
    end script
    tell o to run
end _main
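
One way the search string could become a variable, sketched on the Ruby side only: build the pattern at run time from a prefix instead of writing /VM_\S*/ literally. This is an untested idea rather than a drop-in change to the script above. The names pattern_for and code_in and the sample text are made up for illustration. On the AppleScript side, the prefix could presumably be collected with display dialog (with a default answer) and passed as one more quoted argument, then read with ARGV.shift just as outdir is.

```ruby
# Build the search pattern from a user-supplied prefix.
# Regexp.escape keeps characters such as "-" or "." in the prefix literal.
def pattern_for(prefix)
  Regexp.new(Regexp.escape(prefix) + '\S*')
end

# Return the first code found in the page text, or nil if none matches.
def code_in(page_text, prefix)
  page_text[pattern_for(prefix)]
end

prefix = 'VM_'                                  # in the real script: ARGV.shift
sample = "Stock item VM_12345_A shelf 3"        # invented stand-in for page.string.to_s
puts code_in(sample, prefix).inspect
```

In the script itself the match line would then read something like name = page.string.to_s[pattern], with pattern built once before the loop. Inside the AppleScript string, the backslash in '\S*' would need doubling, as in the rest of that script.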
