Extract and Sort Numbers in a PDF?

Question

Level 1

0 points

Hey Support,

I've been given the task of going through a 70-page PDF, pulling out every fact that includes a number, and sorting each one under the category for that number. For example, while reading I found "Research shows that the portion of the brain that assesses risk and danger does not fully develop until the mid 20’s." This has 20 in it, so I would copy and paste it into my document under the "20" category. Another example is "Applicants for home training must train with an instructor who is at least 20 years of age or older." This would also go under the 20 category. So I'm wondering if there is a way to filter out all the lines/facts that have numbers in them and then categorize them. I achieved this partially with an Automator workflow, but I would like to see if there are any other ways, as this is getting kind of messy.

Thanks!

MacBook Pro (13-inch Mid 2010), OS X Mavericks (10.9.2), 8GB RAM, 120GB SSD, 1TB HDD

Posted on Mar 20, 2014 4:13 PM

Answer 1

etresoft

Level 8

46,554 points

Mar 20, 2014 5:03 PM in response to KooKilla

No. The only reliable way to extract data from PDF is to manually type it in. PDF is designed for printing and nothing else.

Answer 2

KooKilla Author

Level 1

0 points

Mar 20, 2014 5:21 PM in response to etresoft

Well, I converted it to RTF. Is there any way I can do it now? Maybe I wasn't clear enough.

Answer 3

Tony T1

Level 6

10,232 points

Mar 20, 2014 5:37 PM in response to KooKilla

KooKilla,

Here's the Workflow:

The Run Shell Script Action is:

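# $1 is the RTF file handed over by the preceding Automator action.
# Convert it to plain text and keep only the lines that contain a digit.
textutil -stdout -convert txt "$1" | grep -E '[0-9]+' > ~/Desktop/Report.txt
# Write each line once for every distinct number it contains, prefixed "N: ".
# IFS= and -r keep leading whitespace and backslashes in the line intact.
while IFS= read -r line
do
    # sort -un lists each number once, so a line that repeats a number is not duplicated
    for i in $(printf '%s\n' "$line" | grep -Eo '[0-9]+' | sort -un)
    do
        printf '%s: %s\n' "$i" "$line"
    done
done < ~/Desktop/Report.txt > ~/Desktop/Report2.txt
# Sort numerically by the leading number, then remove the intermediate files
# (including the RTF passed in as $1).
sort -n ~/Desktop/Report2.txt > ~/Desktop/ReportSorted.txt
rm "$1" ~/Desktop/Report.txt ~/Desktop/Report2.txt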

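The Workflow is (screenshot): Automator extracts the PDF text as RTF, then passes that file to the Run Shell Script action above.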

So, if the PDF is:

"I am 20. My friend is 30. Her husband is 40."

"I am 30. My friend is 40. Her husband is 50."

ReportSorted.txt will be:

20: "I am 20. My friend is 30. Her husband is 40."

30: "I am 20. My friend is 30. Her husband is 40."

30: "I am 30. My friend is 40. Her husband is 50."

40: "I am 20. My friend is 30. Her husband is 40."

40: "I am 30. My friend is 40. Her husband is 50."

50: "I am 30. My friend is 40. Her husband is 50."

...and as etresoft said:

Manually verify with the PDF to make sure you haven't missed anything

Answer 4

etresoft

Level 8

46,554 points

Mar 20, 2014 5:33 PM in response to KooKilla

KooKilla wrote:

Well, I converted it to RTF. Is there any way I can do it now? Maybe I wasn't clear enough.

It would be best to use a plain text format for such things. However, you will have to manually verify with the PDF to make sure you haven't missed anything. If you see text like "20" in the PDF, it may not be "20" after extracting the text. It may look fine in most places, but it could get corrupted in others. If you try it on another PDF, you might not get anything.

Answer 5

Tony T1

Level 6

10,232 points

Mar 20, 2014 5:44 PM in response to etresoft

You're right about PDFs. When I extracted to plain text with Automator, the result was full of hidden characters that grep choked on. I worked around this by having Automator extract to RTF, then using textutil to convert from RTF to plain text, which cleaned up the hidden characters.
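
If the RTF round trip is not an option, the same clean-up can be approximated directly on the extracted text. The following is only a minimal sketch in plain Ruby, not what textutil does internally. It assumes the hidden characters are Unicode control (Cc) or format (Cf) characters, which the thread does not confirm, and strip_hidden is a hypothetical helper name.

```ruby
# Remove invisible characters that text extracted from a PDF may carry.
# Assumption: the "hidden characters" are Unicode control (Cc) or format (Cf)
# characters; the thread does not say exactly which ones were involved.
def strip_hidden(text)
  # Newlines and tabs are control characters too, so put those back as-is.
  text.gsub(/[\p{Cc}\p{Cf}]/) { |c| c == "\n" || c == "\t" ? c : "" }
end

# Sample input with a form feed and a zero-width space hidden in it.
raw = "Must be 20\f years\u200B old.\n"
puts strip_hidden(raw).inspect
```

Note that this also removes soft hyphens and similar format characters, so the result should still be checked against the PDF.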

Answer 6

Hiroto

Level 5

7,461 points

Mar 21, 2014 3:39 PM in response to KooKilla

Hello

You may also try something like the shell script below.

It extracts text per sentence, not per line. However, its definition of a sentence is incomplete enough that it cannot correctly extract complex sentences, such as a quoted sentence inside another sentence. If that is a real problem, we'd need to refine the parsing logic.

Hope this may help,

H

#!/bin/bash

infile=~/Desktop/a.pdf            # input PDF file
outfile=~/Desktop/a.out.txt        # output text file

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby -Ku <<'EOF' - "$infile" > "$outfile"
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'

url = NSURL.fileURLWithPath(ARGV[0])
s = PDFDocument.alloc.initWithURL(url).string.to_s

re_sstop =                                                # re - sentence stop
    /
        (?: 
            [.?!]+            # 1* sentence punctuations
            [ ]*            #  * space
            (?:                # *1 closing quotations            
                ["'”’»]
            )?
        )
        |
        \z                    # end of string
    /xomu
re_delim = /(?: \s+ | \z )/xomu                            # re - token delimiter
re_blankline = /(?: \r\n | \r | \n ){2,}/xomu            # re - blank lines
re_sentence = /( .+? #{re_sstop} ) #{re_delim}/xomu        # re - sentence    

hash = {}
s.gsub!(re_blankline, ' ')    # remove blank lines that OSX::PDFDocument#string method inserts between pages
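s.scan(re_sentence).each do |q|
    sentence = q.first    # scan yields one-element arrays because re_sentence has one capture group
    sentence.scan(/[0-9]+/).each do |n|
        n = n.to_i
        hash[n] ||= []
        hash[n] << sentence unless hash[n].include?(sentence)    # list each sentence once per category
    end
end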
hash.keys.sort.each do |k|
    print "# Category = #{k}\n\n"
    hash[k].each do |v|
        print "# Statement = \n#{v}\n\n"
    end
    print "-" * 70, "\n"
end
EOF

Answer 7

Tony T1

Level 6

10,232 points

Mar 21, 2014 3:50 PM in response to Hiroto

Hiroto,

Did something in my script not work?

It did what the OP needed when I tested it.

Answer 8

Tony T1

Level 6

10,232 points

Mar 21, 2014 4:55 PM in response to Hiroto

Hiroto,

If the PDF is:

"I am 20. My friend is 30. Her husband is 40."

"I am 30. My friend is 40. Her husband is 50."

The OP is looking for the following output (see: https://discussions.apple.com/thread/6012407#25232161):

20: "I am 20. My friend is 30. Her husband is 40."

30: "I am 20. My friend is 30. Her husband is 40."

30: "I am 30. My friend is 40. Her husband is 50."

40: "I am 20. My friend is 30. Her husband is 40."

40: "I am 30. My friend is 40. Her husband is 50."

50: "I am 30. My friend is 40. Her husband is 50."

I'm not sure your script is what's needed (or maybe it is; we need the OP's input on this).
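
For comparison, the per-line grouping in the listing above could be sketched in plain Ruby roughly as follows. This is an illustration under assumptions, not either poster's script: group_lines_by_number is a hypothetical helper name, the text is assumed to be already extracted from the PDF, and each fact is assumed to sit on its own line.

```ruby
# Group whole lines under every number they contain (per-line grouping).
# group_lines_by_number is a hypothetical helper name, not from the thread.
def group_lines_by_number(text)
  groups = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    # uniq keeps a line from being listed twice under the same number
    line.scan(/[0-9]+/).map(&:to_i).uniq.each { |n| groups[n] << line }
  end
  groups.sort.to_h   # categories in ascending numeric order
end

sample = <<~TEXT
  I am 20. My friend is 30. Her husband is 40.
  I am 30. My friend is 40. Her husband is 50.
TEXT

group_lines_by_number(sample).each do |number, lines|
  lines.each { |line| puts "#{number}: #{line}" }
end
```

Swapping each_line for a sentence splitter would give the per-sentence grouping instead, so the choice between the two comes down to what the OP counts as a "fact".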

Answer 9

Hiroto

Level 5

7,461 points

Mar 21, 2014 7:46 PM in response to Tony T1

Hello

An answer is prescribed for the question asked, and the precision of the answer is constrained by the precision of the question.

Since the question is unclear about the origin of the source text, I assumed that a) the source text is a general article consisting of chapters, sections, paragraphs and sentences, and b) the "fact" to be retrieved is a statement represented by a sentence. And since the question does not specify the output format, I used my discretion in defining it. However, should the retrieved data be used in a spreadsheet or database, CSV or XML would have been better.

Regards,

H

PS. Given the source text in pdf:

I am 20. My friend is 30. Her husband is 40.
I am 30. My friend is 40. Her husband is 50.

my script will yield the following output:

# Category = 20

# Statement = 
I am 20.

----------------------------------------------------------------------
# Category = 30

# Statement = 
My friend is 30.

# Statement = 
I am 30.

----------------------------------------------------------------------
# Category = 40

# Statement = 
Her husband is 40.

# Statement = 
My friend is 40.

----------------------------------------------------------------------
# Category = 50

# Statement = 
Her husband is 50.

----------------------------------------------------------------------

Answer 10

Mar 22, 2014 6:29 AM in response to Hiroto

A Bash, Ruby & Cocoa trifle!

That is a lovely, and admirable, piece of work! 🙂

Answer 11

Tony T1

Level 6

10,232 points

Mar 22, 2014 7:29 AM in response to Hiroto

Hiroto wrote:

An answer is prescribed for the question asked, and the precision of the answer is constrained by the precision of the question.

The question in this post did not link to the prior post that did give this information. (However, I did link to that post in my 1st reply in this thread which did clear up the points that you mention.)

And since the question does not specify the output format, I used my discretion in defining it. However, should the retrieved data be used in a spreadsheet or database, CSV or XML would have been better.

OP is looking to use "Excel or Numbers or something else to create a chart? Something that will take anything that has like the number 16 and put it into a 16 category with all other lines that include that number?"

As the OP appears to need every line that includes the selected number, if the line has "20", then

20: "I am 20. My friend is 30. Her husband is 40."

appears preferable to

"I am 20."

(but I'm not 100% sure that's what the OP meant).

But I only output a plain text file, so I was a bit lazy in that regard.

It should be an easy exercise for you to adapt your script to create the chart requested.
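
One possible route to a chart in Excel or Numbers is to write the grouped statements as CSV and open that file in the spreadsheet. Below is a minimal sketch in plain Ruby, assuming the data is already grouped in a hash of number to statements. csv_field and groups_to_csv are hypothetical helper names, and the column headers are made up for illustration. Ruby's csv library could do the quoting as well; it is written out here only to keep the sketch dependency-free.

```ruby
# Turn grouped statements into CSV text that Excel or Numbers can open.
# csv_field and groups_to_csv are hypothetical helper names for this sketch.

# Quote a field and double any embedded quotes, per the usual CSV convention.
def csv_field(value)
  '"' + value.to_s.gsub('"', '""') + '"'
end

# One row per (category, statement) pair, categories in ascending order.
def groups_to_csv(groups)
  rows = [%w[Category Statement]]
  groups.sort.each do |number, statements|
    statements.each { |statement| rows << [number, statement] }
  end
  rows.map { |row| row.map { |field| csv_field(field) }.join(",") }.join("\n") + "\n"
end

groups = {
  30 => ["My friend is 30.", "I am 30."],
  20 => ["I am 20."]
}
puts groups_to_csv(groups)
```

Redirecting the output to a file with a .csv extension gives two columns that can be sorted or filtered by category in the spreadsheet.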

Answer 12

Hiroto

Level 5

7,461 points

Mar 22, 2014 8:34 AM in response to Tony T1

Hello

Unfortunately, I had not read the linked thread, which in fact appears to be the original thread.

So I'd guess the original text is a flat collection of statements, each on its own line, in educational psychology or something? Because if it's a normal article in such a field, extraction per paragraph would only yield data that is too coarse, I'm afraid.

This is another example of how important a precise description of the problem is in the first place. If a problem is well defined, it is almost solved. 😉

All the best,

H

Answer 13

Tony T1

Level 6

10,232 points

Mar 22, 2014 9:15 AM in response to Hiroto

Agreed. Also, the test data supplied by the OP was limited at best.
