Automator to split PDF and rename it by text

Question

Level 1

4 points

I have a document that I render each day which is approximately 5-10 pages long. This document is in PDF format. I need to split it so that each page becomes its own document, renamed to the text in its first line (i.e., a person's name). Is there a way to do this? I can use Automator to split the files, but I cannot figure out how to rename them. Thank you!!

Brandon

MacBook Pro (Retina, Mid 2012), OS X Mavericks (10.9.5)

Posted on Jan 11, 2015 7:29 PM

Answer 1

Hiroto

Level 5

7,467 points

Jan 12, 2015 10:56 AM in response to drbarney0330

Hello

You may try the following ruby (rubycocoa) script. It will work under OS X 10.5 through 10.9.

* Under OS X 10.10, you need to manually install RubyCocoa 1.2.0, which supports Ruby 2.0 or later:

http://rubycocoa.sourceforge.net/
http://sourceforge.net/projects/rubycocoa/files/RubyCocoa/1.2.0/

and change the Ruby interpreter in the script to

/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby

#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w
#
#   ARGV = pdf [pdf ...] [output_directory]
#
#   * If out_directory is specified and present, pages of every specified pdf are saved in out_directory.
#     Otherwise, pages of each pdf are saved in directory named after pdf followed by "'s pages" in the same directory as pdf.
#   * Each page is named after the text in first line of the page without leading and trailing white spaces.
#
#   Usage e.g.,
#       ./split_pdf.rb *.pdf out
#
require 'osx/cocoa'
OSX.require_framework 'PDFKit'
include OSX

def usage
    $stderr.puts "Usage: #{File.basename($0)} pdf [pdf ...] [output_directory]"
    exit 1
end

usage unless ARGV.length > 0
outdir = File.directory?(ARGV.last) ? ARGV.pop : nil
usage unless ARGV.length > 0

ARGV.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    unless doc
        $stderr.puts "Not a pdf file: %s" % f
        next
    end
    odir = outdir ? outdir : (f + "'s pages")
    Dir.mkdir(odir) unless File.directory?(odir)
    (0 .. (doc.pageCount - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /^[![:space:]]*(.*?)[![:space:]]*(\015\012|\015|\012)/o  # [1]
        name = $1
        unless name
            $stderr.puts "no matching string in page %d of %s" % [i + 1, f]
            next
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation)
        unless doc1.writeToFile(outfile = "#{odir}/#{name}.pdf")
            $stderr.puts "Failed to write page %d of %s to %s" % [i + 1, f, outfile]
        end
    end
end

#
#   [1] ! is present before tab, likely bug of PDFPage -string method
#
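The output-directory rule in the script's header comment can be looked at in isolation. The following is only a sketch of that rule in Python 3, not part of the original script; the function name `resolve_output_dirs` is made up for illustration.

```python
import os
import tempfile


def resolve_output_dirs(args):
    """Return (pdf_path, output_dir) pairs for a list of command-line arguments.

    Follows the rule described in the script's header: if the last argument
    is an existing directory, every PDF shares it; otherwise each PDF gets a
    directory named "<pdf>'s pages" next to the PDF itself.
    (Hypothetical helper; the name is not from the original script.)
    """
    args = list(args)
    # Treat the last argument as the shared output directory only if it exists.
    shared = args.pop() if args and os.path.isdir(args[-1]) else None
    if not args:
        raise ValueError("no PDF files given")
    return [(pdf, shared if shared else pdf + "'s pages") for pdf in args]
```

Called with `["a.pdf", "b.pdf", "out"]` where `out` is an existing folder, both PDFs map to `out`; called with `["a.pdf"]` alone, the result is `a.pdf's pages`.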

Just in case, here's an AppleScript wrapper.

--APPLESCRIPT
_main()
on _main()
    set ff to (choose file of type {"com.adobe.pdf"} with prompt "Choose source pdf file(s)." with multiple selections allowed)
    set d to (choose folder with prompt "Choose destination folder.")
    set args to ""
    repeat with a in ff & d
        set args to args & space & a's POSIX path's quoted form
    end repeat
    do shell script "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -w <<'EOF' - " & args & "
#
#   ARGV = pdf [pdf ...] [output_directory]
#
#   * If out_directory is specified and present, pages of every specified pdf are saved in out_directory.
#     Otherwise, pages of each pdf are saved in directory named after pdf followed by \"'s pages\" in the same directory as pdf.
#   * Each page is named after the text in first line of the page without leading and trailing white spaces.
#
require 'osx/cocoa'
OSX.require_framework 'PDFKit'
include OSX

def usage
    $stderr.puts \"Usage: #{File.basename($0)} pdf [pdf ...] [output_directory]\"
    exit 1
end

usage unless ARGV.length > 0
outdir = File.directory?(ARGV.last) ? ARGV.pop : nil
usage unless ARGV.length > 0

ARGV.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    unless doc
        $stderr.puts \"Not a pdf file: %s\" % f
        next
    end
    odir = outdir ? outdir : (f + \"'s pages\")
    Dir.mkdir(odir) unless File.directory?(odir)
    (0 .. (doc.pageCount - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /^[![:space:]]*(.*?)[![:space:]]*(\\015\\012|\\015|\\012)/o  # [1]
        name = $1
        unless name
            $stderr.puts \"no matching string in page %d of %s\" % [i + 1, f]
            next
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation)
        unless doc1.writeToFile(outfile = \"#{odir}/#{name}.pdf\")
            $stderr.puts \"Failed to write page %d of %s to %s\" % [i + 1, f, outfile]
        end
    end
end

#
#   [1] ! is present before tab, likely bug of PDFPage -string method
#
EOF"
end _main
--END OF APPLESCRIPT
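The wrapper relies on AppleScript's `quoted form` so that paths containing spaces or quotes survive the trip through the shell. As a rough illustration of the same idea (a Python 3 sketch, not part of the wrapper; `build_command` is a made-up name, and `shlex.quote` is only analogous to `quoted form`, not identical in output):

```python
import shlex


def build_command(interpreter, paths):
    """Join an interpreter and file paths into one shell-safe command line.

    Each path is quoted individually, so a shell splits the result back
    into exactly the original arguments. The "-" tells the interpreter to
    read the script from standard input, as the wrapper's here-document does.
    (Hypothetical helper for illustration only.)
    """
    quoted = " ".join(shlex.quote(p) for p in paths)
    return "%s - %s" % (interpreter, quoted)
```

Running `shlex.split` on the result gives back the interpreter, the `-`, and each path unchanged, which is the property the wrapper needs from its argument string.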

Good luck,

H

Answer 2

etnad

Level 1

5 points

Mar 19, 2016 10:29 AM in response to Hiroto

Hello Hiroto (浩人)

I found by chance this suggestion you gave to drbarney0330.

I have exactly the same problem. However, after I downloaded RubyCocoa 1.2.0, El Capitan refuses to install it.

Can I just remove the newer version and save it on my desktop? Will this allow me to downgrade to the previous version?

Or do you have any other suggestions?

Regards

Thank you from the heart, Hiroto-san

奈良市 from Italy

Answer 3

VikingOSX

Level 10

123,458 points

Mar 19, 2016 11:42 AM in response to etnad

Apple has Scripting Bridge technology that allows AppleScript, Python, and previously Ruby to access the Objective-C/Cocoa libraries. The purpose of the RubyCocoa project was to re-enable the “glue” that lets the standard Ruby that ships with OS X access those libraries again. Hiroto's Ruby script requires a functional Ruby/Cocoa bridge, and the RubyCocoa project provided that capability for Mavericks and Yosemite only.

There is, however, no RubyCocoa installer for El Capitan, and as you have observed, the installer checks the release of OS X and won't install on El Capitan. Without the RubyCocoa installer, you can either compile and install RubyCocoa from source, or translate Hiroto's Ruby code into Python, which natively accesses the same libraries via the Scripting Bridge.
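Before porting anything, it may help to confirm that the Python side of the bridge is actually present on a given machine. A minimal sketch, assuming Python 3 and assuming the bridge is exposed as the `Quartz` package (the package a PyObjC PDFKit script would import from); the function name is made up:

```python
import importlib.util


def has_pyobjc_quartz():
    """Return True if a top-level 'Quartz' package can be located.

    This only reports that the package is findable on the import path;
    it does not prove that PDFKit classes load or work correctly.
    (Hypothetical helper for illustration only.)
    """
    return importlib.util.find_spec("Quartz") is not None
```

If this returns `False`, a PyObjC-based script would fail at its import lines.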

Answer 4

etnad

Level 1

5 points

Mar 19, 2016 11:56 AM in response to VikingOSX

Thanks for your explanations, interest and time

I am totally incapable of converting Ruby to Python, but luckily I still keep a drive in my bay with Mavericks and another with Snow Leopard.

Hiroto's script works very well in Mavericks (as do all of his work and suggestions), so I was able to use it successfully.

I simply deleted version 2 from the Ruby framework and kept 1.8, which was also available in the same folder.

Kind Regards

Dan

Answer 5

Hiroto

Level 5

7,467 points

Mar 21, 2016 11:11 PM in response to etnad

Hello

There's no pre-compiled build of rubycocoa 1.2.0 for OS X 10.11 as of 2016-03. You'd need to build and install rubycocoa 1.2.0 from source code by yourself.

Meanwhile, you may try the following PyObjC version of the original RubyCocoa script. It should work without additional installation, although I have only tested it with PyObjC 2.2b3 and Python 2.6.1 under OS X 10.6.8.

#!/usr/bin/python
# coding: utf-8
#
#   file:
#       split_pdf.py
#
#   usage:
#       split_pdf.py pdf [pdf ...] [output_directory]
#       argv[1..] : source pdf file(s)
#       argv[-1]  : output directory
#
#   * If output directory is specified and present, pages of every pdf are saved in the directory.
#     Otherwise, pages of each pdf are saved in directory named after pdf followed by "'s pages" in the same directory as pdf.
#   * Each page is named after text in the first non-blank line of the page without leading and trailing white spaces.
#
import sys, os, re
from Foundation import NSURL
from Quartz.PDFKit import PDFDocument, PDFPage

def usage():
    sys.stderr.write('Usage: %s pdf [pdf ...] [output_directory]\n' % os.path.basename(sys.argv[0]))
    sys.exit(1)

def main():
    if len(sys.argv) < 2: usage()
    outdir = sys.argv.pop().rstrip('/') if os.path.isdir(sys.argv[-1]) else None
    if len(sys.argv) < 2: usage()

    for f in [ a.decode('utf-8') for a in sys.argv[1:] ]:
        url = NSURL.fileURLWithPath_(f)
        doc = PDFDocument.alloc().initWithURL_(url)
        if not doc:
            sys.stderr.write('%s: not a pdf file\n' % f.encode('utf-8'))
            continue
        odir = outdir if outdir else (f + "'s pages")
        if not os.path.isdir(odir):
            os.mkdir(odir)
        path = doc.documentURL().path()
        pcnt = doc.pageCount()
        for i in range(0, pcnt):
            page = doc.pageAtIndex_(i)
            m = re.search(r'^[!\s]*(\S.*?)[!\s]*(\r\n|\r|\n|\Z)', page.string(), re.M)  # [1]
            if not m:
                sys.stderr.write('No matching string in page %d of %s\n' % (i + 1, path.encode('utf-8')))
                continue  # ignore this page
            n = m.group(1)
            n = re.sub(r':', ';', n)  # replace : with ; (: in POSIX name is changed to / in HFS+ name)
            n = re.sub(r'/', ':', n)  # replace / with : (/ is reserved as node separator in POSIX path)
            doc1 = PDFDocument.alloc().initWithData_(page.dataRepresentation())
            if not doc1.writeToFile_('%s/%s.pdf' % (odir, n)):
                sys.stderr.write('Failed to save page %d of %s' % (i + 1, path.encode('utf-8')))

main()
#
#   Notes
#   [1] ! is present before tab in string returned by PDFPage -string method
#
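The naming step of the script above can be tried without PDFKit by feeding it plain strings. This is a sketch in Python 3 that reuses the script's regular expression and its two substitutions; the function name `page_name` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the script: skip leading "!" and whitespace, capture the
# first non-blank line lazily, and drop trailing "!" and whitespace.
FIRST_LINE = re.compile(r'^[!\s]*(\S.*?)[!\s]*(\r\n|\r|\n|\Z)', re.M)


def page_name(text):
    """Return a file-name-safe version of the first non-blank line, or None.

    (Hypothetical helper; it isolates the naming logic from the script.)
    """
    m = FIRST_LINE.search(text)
    if not m:
        return None  # the page has no non-blank text
    name = m.group(1)
    name = name.replace(':', ';')  # ":" would show up as "/" in an HFS+ name
    name = name.replace('/', ':')  # "/" is the POSIX path separator
    return name


if __name__ == "__main__":
    print(page_name("\n  John Smith  \nAddress line"))
    print(page_name("Doe, Jane: A/B\nmore text"))
```

The order of the two replacements matters: original colons are turned into semicolons first, so the colons produced from slashes in the second step are not rewritten again.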

And just in case, here's its AppleScript wrapper, which returns any errors in the result pane/window of (Apple)Script Editor.

--APPLESCRIPT
_main()
on _main()
    set ff to (choose file of type {"com.adobe.pdf"} with prompt "Choose source pdf file(s)." with multiple selections allowed)
    set d to (choose folder with prompt "Choose destination folder.")
    set args to ""
    repeat with a in ff & d
        set args to args & space & a's POSIX path's quoted form
    end repeat
    do shell script "/usr/bin/python <<'EOF' - " & args & " 2>&1
# coding: utf-8
#
#   file:
#       split_pdf.py
#
#   usage:
#       split_pdf.py pdf [pdf ...] [output_directory]
#       argv[1..] : source pdf file(s)
#       argv[-1]  : output directory
#
#   * If output directory is specified and present, pages of every pdf are saved in the directory.
#     Otherwise, pages of each pdf are saved in directory named after pdf followed by \"'s pages\" in the same directory as pdf.
#   * Each page is named after text in the first non-blank line of the page without leading and trailing white spaces.
#
import sys, os, re
from Foundation import NSURL
from Quartz.PDFKit import PDFDocument, PDFPage

def usage():
    sys.stderr.write('Usage: %s pdf [pdf ...] [output_directory]\\n' % os.path.basename(sys.argv[0]))
    sys.exit(1)

def main():
    if len(sys.argv) < 2: usage()
    outdir = sys.argv.pop().rstrip('/') if os.path.isdir(sys.argv[-1]) else None
    if len(sys.argv) < 2: usage()

    for f in [ a.decode('utf-8') for a in sys.argv[1:] ]:
        url = NSURL.fileURLWithPath_(f)
        doc = PDFDocument.alloc().initWithURL_(url)
        if not doc:
            sys.stderr.write('%s: not a pdf file\\n' % f.encode('utf-8'))
            continue
        odir = outdir if outdir else (f + \"'s pages\")
        if not os.path.isdir(odir):
            os.mkdir(odir)
        path = doc.documentURL().path()
        pcnt = doc.pageCount()
        for i in range(0, pcnt):
            page = doc.pageAtIndex_(i)
            m = re.search(r'^[!\\s]*(\\S.*?)[!\\s]*(\\r\\n|\\r|\\n|\\Z)', page.string(), re.M)  # [1]
            if not m:
                sys.stderr.write('No matching string in page %d of %s\\n' % (i + 1, path.encode('utf-8')))
                continue  # ignore this page
            n = m.group(1)
            n = re.sub(r':', ';', n)  # replace : with ; (: in POSIX name is changed to / in HFS+ name)
            n = re.sub(r'/', ':', n)  # replace / with : (/ is reserved as node separator in POSIX path)
            doc1 = PDFDocument.alloc().initWithData_(page.dataRepresentation())
            if not doc1.writeToFile_('%s/%s.pdf' % (odir, n)):
                sys.stderr.write('Failed to save page %d of %s' % (i + 1, path.encode('utf-8')))

main()
#
#   Notes
#   [1] ! is present before tab in string returned by PDFPage -string method
#
EOF"
end _main
--END OF APPLESCRIPT

Regards,

H
