Search PDF contents for ISBN number and save to text file

Question

Level 1

45 points

Hi,

Hoping someone can help me create a script that can find the ISBN inside of a PDF and write that filename and number to a text file.

I've found a way to do this using JavaScript and batch document processing in Acrobat, but I need to do it by pointing at a folder of files.

Format desired would be:

original filename.pdf, isbn#

Any help is appreciated,

John

2 x 2.8 Ghz Quad-Core Intel Xeon Mac Pro, Mac OS X (10.6.7), VM Ware with XP Pro

Posted on Apr 25, 2011 5:15 AM

Answer 1

Apr 25, 2011 6:53 AM in response to John_C

How is the ISBN tagged in the file? That is, is there something like ISBN: 123455677890, or are you just looking for a 10-digit number in the file?

Automator has an action that will convert a PDF file to a text file. You could do that and then grep for the ISBN in the text file. There is also a third-party command-line tool, pdftotext, that you could download and use. You could pipe its output into grep, which eliminates the intermediate text file.
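For anyone who wants to try the pdftotext route from a script, here is a minimal Node.js sketch of the "convert, then grep" idea. It assumes pdftotext is installed and on the PATH. The function names `pdfToText` and `firstIsbnLine` and the file name `book.pdf` are made up for illustration, and it has not been run against the sample files in this thread.

```javascript
// Sketch only: assumes the third-party pdftotext tool is installed and on PATH.
const { execFileSync } = require("child_process");

// Convert a PDF to plain text. Passing "-" as the output file makes
// pdftotext write the text to stdout, so no intermediate file is created.
function pdfToText(pdfPath) {
  return execFileSync("pdftotext", [pdfPath, "-"], {
    encoding: "utf8",
    maxBuffer: 64 * 1024 * 1024, // allow for large documents
  });
}

// Rough equivalent of `grep -m 1 ISBN`: return the first line that
// mentions ISBN, or null when no line does.
function firstIsbnLine(text) {
  for (const line of text.split(/\r?\n/)) {
    if (line.includes("ISBN")) return line.trim();
  }
  return null;
}

// Usage (hypothetical file name):
//   console.log(firstIsbnLine(pdfToText("book.pdf")));
```

This only finds the line that mentions ISBN; pulling the number out of that line is a separate step.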

Answer 2

John_C Author

Apr 25, 2011 7:06 AM in response to Frank Caggiano

It is part of the text of the file, generally on one line as ISBN 1234567890123, and may have dashes between the numbers or periods between the letters.
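As a rough illustration of that format, the sketch below matches "ISBN" with optional periods between the letters and a number with optional dashes, then strips the dashes. The pattern and the name `extractIsbn` are assumptions made for this example, not taken from any tool mentioned in this thread. Numbers whose groups are separated by spaces are deliberately not handled, to avoid swallowing digits that follow the ISBN.

```javascript
// Hypothetical pattern: "ISBN" (optionally "I.S.B.N."), an optional colon
// or whitespace, then 10 to 17 characters made of digits and dashes,
// ending in a digit or X. Spaces inside the number are not supported.
const ISBN_PATTERN = /I\.?S\.?B\.?N\.?[:\s]*([0-9][0-9-]{8,15}[0-9Xx])/i;

// Return the first ISBN in the text with dashes removed, or null when
// nothing that looks like a 10- or 13-character ISBN is found.
function extractIsbn(text) {
  const match = ISBN_PATTERN.exec(text);
  if (!match) return null;
  const isbn = match[1].replace(/-/g, "").toUpperCase();
  return isbn.length === 10 || isbn.length === 13 ? isbn : null;
}

console.log(extractIsbn("ISBN 1234567890123"));
console.log(extractIsbn("I.S.B.N. 978-0-306-40615-7"));
```

The length check at the end is a cheap guard against partial matches; it does not verify the check digit.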

Do you have any examples of how to incorporate the two options you mention?

Not sure if it will help, but I do have a JavaScript that is used in Acrobat (batch/document processing) itself and typically finds the ISBN. However, I do not want to depend on Acrobat for this; it needs to run without Acrobat present. The JavaScript is below, for what it is worth.

/* Extract ISBN numbers From the Document */
// This script will scan all pages of the input document
// and extract valid ISBN numbers into new PDF document.
// Output PDF document will be placed in the same folder
// as input. The name of the output document will be:
// Original filename + "_Extracted_ISBN"
// Visit www.evermap.com for more useful JavaScript samples.

// This is a combination of strict and relaxed versions of ISBN number format
var reISBN=/(ISBN[\:\=\s][\s]*(?=[-0-9xX ]{13})(?:[0-9]+[- ]){3}[0-9]*[xX0-9])|(ISBN[\:\=\s][ ]*\d{9,10}[\d|x])/g;
var strExt = "_Extracted_ISBN.pdf";
var strIntro = "ISBN numbers extracted from document: ";
var strFinal = "Total number of ISBN numbers extracted: " ;

ExtractFromDocument(reISBN,strExt,strIntro,strFinal);

function ExtractFromDocument(reMatch, strFileExt, strMessage1, strMessage2)
{
    var chWord, numWords;

    // construct filename for output document
    var filename = this.path.replace(/\.pdf$/, strFileExt);

    // create a report document
    try {
        var ReportDoc = new Report();
        var Out = new Object(); // object where we collect all matched ISBNs before outputting them

        ReportDoc.writeText(strMessage1 + this.path);
        ReportDoc.divide(1); // draw a horizontal divider
        ReportDoc.writeText(" "); // write a blank line to output

        for (var i = 0; i < this.numPages; i++)
        {
            numWords = this.getPageNumWords(i);
            var PageText = "";
            for (var j = 0; j < numWords; j++) {
                var word = this.getPageNthWord(i,j,false);
                PageText += word;
            }

            var strMatches = PageText.match(reMatch);
            if (strMatches == null) continue;

            // now collect the matches for the report document
            for (j = 0; j < strMatches.length; j++) {
                Out[strMatches[j]] = true; // store each match as a property name, which removes duplicates
            }
        } // end of the page loop (this closing brace was missing from the pasted script)

        var nTotal = 0;
        for (var prop in Out)
        {
            ReportDoc.writeText(prop);
            nTotal++;
        }

        ReportDoc.writeText(" "); // output extra blank line
        ReportDoc.divide(1); // draw a horizontal divider
        ReportDoc.writeText(strMessage2 + nTotal);

        // save report to a document
        ReportDoc.save(
        {
            cDIPath: filename
        });
    }
    catch(e)
    {
        app.alert("Processing error: "+e)
    }
} // end of the function

Answer 3

Apr 25, 2011 10:38 AM in response to John_C

Nice regexp!

PDF files are a pain when it comes to doing something like this with them.

Here is a VERY primitive Automator workflow showing what may be possible: http://dl.dropbox.com/u/13002668/ISBNgrep.workflow.zip

Look at it and give it a try to see if it gets the ISBN out of a file.

If you post one of your PDF files so that I have something to play around with, I'll take a look and see what I can do.

regards

Answer 4

John_C Author

Apr 25, 2011 11:14 AM in response to Frank Caggiano

Take a look at these:

http://dl.dropbox.com/u/27242073/sample1.pdf

http://dl.dropbox.com/u/27242073/sample2.pdf

Although other styles exist, this is a good start.

Let me know what you think.

Thanks

Answer 5

Apr 26, 2011 6:59 AM in response to John_C

John,

I looked at the two files, and unfortunately my idea of using Automator to create a text file from the PDF and then grepping for ISBN worked on one of the files but not the other.

I'm not sure what would be the best way for you to go now. Your biggest hurdle is converting the PDF into something that regular text tools can make sense of, and it's not as easy (at least in my experience) as it first seems it should be. Searching for PDF-to-text tools gets a number of hits. I don't have any direct experience with any of them, so I can't say which might be best. ImageMagick, which I have used, works well but requires Ghostscript to be installed to do the PDF conversions.

How big a project this is, and whether it's a one-time thing or recurring, will of course affect how much time and effort you want to put into it. It's an interesting problem, so I'm going to continue picking at it. If I get anything I'll post back. If you come across anything, let me know.

regards

Answer 6

John_C Author

Apr 26, 2011 9:56 AM in response to Frank Caggiano

Frank,

Thanks for taking the time to look at this. The Automator action you uploaded will work on both of the files I have here. The sample files you looked at were "stripped" down to make them smaller, and I must have done something to them that changed them so that the ISBN is not seen. Your Automator app will see the ISBN in the original, but only if I extract the page with the ISBN on it to a single-page file. If I leave it as a multiple-page file, it will crash during the processing.

I really like how easily this and the JavaScript I uploaded appear to work. The JavaScript does not crash as the file is processed. If it would only write the found value to a simple text file, it would be wonderful.
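Writing the result to a plain text file is the easy part once the ISBN has been found. Below is a minimal Node.js sketch that lists the PDFs in a folder and appends one "original filename.pdf, isbn#" line per file. The function names, the "not found" placeholder, and the example paths are assumptions for illustration. How the ISBN itself is obtained is left out.

```javascript
const fs = require("fs");
const path = require("path");

// List the PDF files directly inside a folder (no recursion).
function listPdfs(folder) {
  return fs
    .readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith(".pdf"))
    .map((name) => path.join(folder, name));
}

// Build one report line in the "original filename.pdf, isbn#" form.
// "not found" is a placeholder chosen for this sketch.
function reportLine(pdfPath, isbn) {
  return `${path.basename(pdfPath)}, ${isbn || "not found"}`;
}

// Append one line to the report file, creating the file if needed.
function appendToReport(reportPath, pdfPath, isbn) {
  fs.appendFileSync(reportPath, reportLine(pdfPath, isbn) + "\n", "utf8");
}

// Usage (hypothetical paths; the ISBN lookup is whatever extraction step you use):
//   for (const pdf of listPdfs("/Users/john/books")) {
//     appendToReport("/Users/john/isbn-report.txt", pdf, lookUpIsbnSomehow(pdf));
//   }
```

Appending one line at a time means a crash partway through a folder still leaves a usable partial report.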

Can a JavaScript be incorporated into an AppleScript?

Thanks for your help,

John

Answer 7

Apr 26, 2011 11:02 AM in response to John_C

Ah, OK, that's a little better. You say it crashes if it is a multi-page file. What is crashing?

If you could put up one of the original files, one that is crashing on you, I'll take a look at it.

So Acrobat can run JavaScript; I didn't know that. As for incorporating the JavaScript into an AppleScript, that's not really the problem. The problem is getting the PDF into a text form so you can grab the ISBN out of it. Once you have that, what's in AppleScript (or on the command line) will be more than sufficient to do the job.

Answer 8

John_C Author

Apr 29, 2011 8:45 AM in response to Frank Caggiano

OK, finally getting back.

It is the automator processing that crashes. On a longer file it will just close shop during the process.

I've uploaded a file that will crash the isbngrep workflow to:

http://dl.dropbox.com/u/27242073/Binder1.pdf

As to Acrobat running JavaScript:

In Advanced/Document Processing/Batch Processing/New Sequence/Select Command

Go to Execute JavaScript, add it, then click Edit. In this window you can add your JavaScript. Then run this internal batch processing to get the script to go into gear.

I'm hoping to add this function to an existing AppleScript. This other AppleScript already gathers other basic info about a PDF and saves it to a text file. If it would just find the ISBN and include that in the report, it would be such a big help.

Thanks,

John

Answer 9

hubionmac

Level 2

435 points

Apr 30, 2011 3:21 AM in response to John_C

You could give the command-line tool ps2ascii a try. It also works with PDF, and you could pipe its output (pure text) to a nice perl/ruby/grep/sed/awk one-liner to extract the ISBNs (I still have to read a book about regular expressions ;-) )

I did a quick & dirty version, but I think you will have to install ps2ascii first. I don't remember exactly, but I think it was installed with Ghostscript (using MacPorts).

--30.04.2011 hubionmac.com
--quick & dirty script to extract the first ISBN of a PDF file... just uses grep, maybe some nice regexp would do a better job!
set myselection to choose file of type {"pdf"} with multiple selections allowed
set myoutput to ""
repeat with pdf_file in myselection
    tell application "Finder" to set pdfname to name of (pdf_file as alias)
    set pdf_file_posix to quoted form of POSIX path of (pdf_file as alias)
    set s to "" --stays empty when no ISBN line is found
    try
        --first add the path, otherwise ps2ascii will fail since it cannot find ghostscript (gs) which is also installed in /usr/local/bin (think by macports)
        set ISBN_String to do shell script "PATH=\"$PATH:/usr/local/bin\"; /usr/local/bin/ps2ascii " & pdf_file_posix & " | grep -m 1 ISBN"
        set foundLine to true
    on error
        display dialog "maybe \"" & pdfname & "\" does not contain an ISBN at all"
        set foundLine to false
    end try
    if foundLine is true then
        --take the first word of the line that can be read as a number
        repeat with s in every word of ISBN_String
            try
                get s as integer
                set s to s as text
                set foundisbn to true
                exit repeat
            on error
                set s to ""
            end try
        end repeat
    end if
    --one line per file: filename, tab, ISBN
    set myoutput to myoutput & pdfname & tab & s & return
end repeat
tell application "TextEdit"
    activate
    set a to make new document
    set text of a to myoutput as text
end tell
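Since the script above takes the first word on the line that reads as a number, it can pick up something that is not an ISBN at all. One way to guard against that is to verify the check digit of each candidate. This is a minimal Node.js sketch of the standard ISBN-10 and ISBN-13 check-digit rules. The function names are made up for this example, and it is not wired into the AppleScript.

```javascript
// ISBN-10: weights run from 10 down to 1 and the total must be divisible
// by 11. A final "X" stands for the value 10.
function isValidIsbn10(isbn) {
  if (!/^[0-9]{9}[0-9X]$/.test(isbn)) return false;
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const value = isbn[i] === "X" ? 10 : Number(isbn[i]);
    sum += value * (10 - i);
  }
  return sum % 11 === 0;
}

// ISBN-13: weights alternate 1, 3, 1, 3, ... and the total must be
// divisible by 10.
function isValidIsbn13(isbn) {
  if (!/^[0-9]{13}$/.test(isbn)) return false;
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    sum += Number(isbn[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return sum % 10 === 0;
}

// Accept a candidate with or without dashes and spaces.
function isValidIsbn(candidate) {
  const isbn = candidate.replace(/[\s-]/g, "").toUpperCase();
  return isValidIsbn10(isbn) || isValidIsbn13(isbn);
}

console.log(isValidIsbn("978-0-306-40615-7"));
```

A check like this rejects page numbers, years, and truncated matches, at the cost of also rejecting ISBNs that were misprinted in the book.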

Answer 10

John_C Author

May 3, 2011 10:38 AM in response to hubionmac

hubionmac,

The Ghostscript approach sounds almost doable. Although the bulk of what you describe makes sense, it is over my head. Thanks for the suggestions.

John
