Pages and Terminal output different results of word count
If I pass my .rtf file to the wc -w command, it returns a figure of 6909 words, whereas the word counter in Apple's Pages shows 2617. Why?
Mac OS X (10.7.5), MacBook Pro 15.4 mid-2012
Because you are passing the raw RTF syntax in that document to wc, whereas Pages counts only the actual words in your document. RTF control words permeate that .rtf document. Here is an example:
{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
{\fonttbl\f0\fnil\fcharset0 Baskerville;}
{\colortbl;\red255\green255\blue255;\red26\green26\blue26;\red164\green8\blue0;}
\margl1440\margr1440\vieww8820\viewh17160\viewkind0
\deftab720
\pard\pardeftab720\sa200\pardirnatural
{\header \pard\ql\b\f0\fs28 [ zig.rtf ] \par}
\f0\fs38 \cf2 Lorem ipsum dolor sit amet, ligula suspendisse nulla pretium, rhoncus tempor placerat fermentum, enim integer ad vestibulum volutpat. Nisl rhoncus turpis est, vel elit, congue wisi enim nunc ultricies sit, magna tincidunt. Maecenas aliquam maecenas \cf0 ligula\cf2 \cf0 nostra\cf2 , accumsan taciti. Sociis mauris in integer, a dolor netus non dui aliquet, sagittis felis sodales, dolor sociis mauris, vel eu libero cras. Interdum at. Eget habitasse elementum est, ipsum purus pede porttitor class, ut adipiscing, aliquet sed auctor, imperdiet arcu per diam dapibus libero duis. Enim eros in vel, volutpat nec pellentesque leo, {\field{\*\fldinst{HYPERLINK "http://www.hp.com"}}{\fldrslt temporibus}} scelerisque \cf3 nec\cf2 .\
and this is the text that Pages '09 shows you: [screenshot not preserved]
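The effect is easy to reproduce on a tiny sample. The following sketch counts whitespace-separated tokens in a short RTF fragment and in the five visible words it contains. The sample strings are made up for illustration and are not taken from the file above.

```shell
# Hypothetical two-line RTF fragment; single quotes keep the backslashes literal.
rtf='{\rtf1\ansi {\fonttbl\f0\fnil Baskerville;}
\f0\fs38 \cf2 Lorem ipsum dolor sit amet.}'
# The text a word processor would display for that fragment.
plain='Lorem ipsum dolor sit amet.'

# wc -w counts every whitespace-separated token, control words included.
# tr strips the padding that some wc implementations add.
raw_count=$(printf '%s\n' "$rtf" | wc -w | tr -d '[:space:]')
plain_count=$(printf '%s\n' "$plain" | wc -w | tr -d '[:space:]')

echo "raw RTF: $raw_count tokens, visible text: $plain_count words"
```

Here the raw fragment yields twice as many "words" as the visible text, and the ratio grows with the amount of formatting in the document.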
Any ideas on how to make it generate a more relevant count? Some UNIX command? I want to make it part of my Automator workflow, which takes a .PDF file as input, extracts the text into an .rtf or .txt file, and passes that text file to the shell script action. All I need is to set up that script correctly so that I get relevant figures.
Thanks
You can use textutil to convert to txt, output to stdout, and pipe the result to wc:
textutil -stdout -convert txt Untitled.rtf | wc
Tony's solution will work fine, though since you are interested in the word count, I would suggest piping the textutil output to wc -w. This, however, gets you a word count that is left-padded with spaces (7 of them for a single-digit count). To get the word count with leading and trailing spaces trimmed, use the following:
$ textutil -stdout -convert txt five_words.rtf | wc -w | awk '$1=$1'
Result: 5
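One caveat with awk '$1=$1': the assignment doubles as the pattern, so when the count is 0 the pattern is false and nothing is printed. Below is a minimal sketch of a variant that always prints the number. The helper name count_words is hypothetical, and it reads plain text on stdin.

```shell
# count_words: hypothetical helper; reads plain text on stdin and
# prints the word count with no padding.
count_words() {
    # Printing the first field drops any leading spaces from wc.
    wc -w | awk '{print $1}'
}

printf 'one two three four five\n' | count_words   # prints 5
printf '' | count_words                            # prints 0
```

On OS X, the output of textutil -stdout -convert txt can be piped into the helper in the same way.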
Here is an Automator workflow that prompts for a PDF file, extracts text from it as RTF, passes that RTF through the above command sequence, and then pops up a dialog displaying the word count. In my example, this was a one-page PDF.
Intense! Now I'd like to pass that value to a math operation and pop up a dialog displaying the result.
There is the bc calculator, a UNIX tool. See the bc(1) man page. The example below uses the $words variable from my previous Automator example. You already know how to display the result in an AppleScript dialog from the command line.
$ math_result=$(echo "$words * 10" | bc)
math_result = 3610.
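For whole-number math, the shell's own arithmetic expansion is an alternative to bc. Here is a small sketch, assuming words holds 361 (the value implied by the result above):

```shell
# Assumed value for illustration; in the workflow it would come from wc -w.
words=361

# $(( ... )) does integer arithmetic in the shell itself, no external tool needed.
math_result=$(( words * 10 ))

echo "math_result = $math_result"
```

Arithmetic expansion is integer-only, so bc or awk remains the better choice when fractions are involved.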
You could do the math in the 'awk' segment
textutil -stdout -convert txt five_words.rtf | wc -w | awk '{print $1 * 10}'
If you need to pass in additional values for the math, you can do something like
textutil -stdout -convert txt five_words.rtf | wc -w | \
awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}'
You can have as many -v options as you need.
If you need to capture the output you can use
answer=$(textutil -stdout -convert txt five_words.rtf | wc -w | \
awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}')
echo "$answer"
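To see the arithmetic in isolation, the word count can be replaced by a fixed number. In this sketch the values of var1 and var2 are made up, and printf '5\n' stands in for the textutil ... | wc -w part of the pipeline:

```shell
var1=10   # hypothetical multiplier
var2=3    # hypothetical offset

# printf supplies a stand-in word count of 5 on stdin.
answer=$(printf '5\n' | awk -v v1="$var1" -v v2="$var2" '{print ($1 * v1) + v2}')

echo "$answer"   # prints 53
```

Quoting "$var1" and "$var2" keeps the awk command line intact even if a variable is empty or contains spaces.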
Hi Bob,
Thanks for adding the awk calculations.
I can't seem to make it work: when I run it from within Automator, I get the error message "The action “Run Shell Script” encountered an error." The workflow failed with: "The file [the .rtf file converted from the .txt that holds the text extracted from the PDF] doesn't exist". The file does exist, but the message reports only the tail end of the file name as if it were the whole name. I guess that's because the UNIX script can't handle spaces in the file name.
Update. The action "Run Shell Script" still fails. I renamed the file, replacing the spaces with underscores. Now I get the message "execution error: no user interaction allowed".
Update 1. In your example, "Run Shell Script" receives input as arguments. It failed until I changed that to "stdin". However, it doesn't pop up a Finder window displaying the count result.
What would be my next steps?
Also, my workflow is saved as an Application, is set to receive files and folders as input, and consists of only two actions: Extract PDF Text --> Run Shell Script. I just intend to run this workflow by dragging PDFs onto the app's icon.
Take a screen capture of your workflow (shift+command+4 and drag around the Extract PDF Text and Run Shell Script actions), and use the camera icon in this editor to post the captured image here. You have a syntax error somewhere. You are correct that spaces are the problem: in UNIX file names they must be escaped or quoted on the command line.
Show me your code.
In your Run Shell Script action, change the solitary $f to "${f}" in the textutil command line, and it will handle filenames with spaces. Since I grew up on UNIX, I never, ever use filenames with spaces, or filenames that look like sentences.
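The difference that quoting makes can be checked without Automator. Below is a minimal sketch using a throwaway plain-text file whose name contains spaces. The file name and the helper count_file are invented for the example; for an .rtf file, the input to wc would come from textutil instead.

```shell
# Create a scratch directory and a file whose name contains spaces.
tmpdir=$(mktemp -d)
f="$tmpdir/my essay draft.txt"
printf 'five words are in here\n' > "$f"

# count_file: hypothetical helper; prints the word count of the file given as $1.
count_file() {
    wc -w < "$1" | awk '{print $1}'
}

# Quoted: the whole path reaches the helper as a single argument.
quoted_count=$(count_file "${f}")
echo "$quoted_count"

# Unquoted, $f would be split at each space into three separate words,
# and the helper would look for a file ending in "/my", which doesn't exist.

rm -rf "$tmpdir"
```

That splitting is consistent with an error message that shows only part of the file name.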
Something's really going wrong with this, because I still get the error message. 😕
It doesn't accept input as "arguments", only as "stdin". And the AppleScript that should trigger the dialog pop-up simply doesn't work. I guess System Events.app is slow to respond (but that's just my guess).
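When Run Shell Script is set to pass input to stdin, the script receives the file paths as lines of text rather than as "$@", so a loop over "$@" sees nothing. Here is a sketch of a loop that reads them instead, assuming one path per line. The name count_paths is hypothetical, and wc -w on a plain-text file stands in for the textutil pipeline.

```shell
# count_paths: hypothetical helper; reads one file path per line on stdin
# and prints "count<TAB>path" for each.
count_paths() {
    # IFS= and -r keep leading spaces and backslashes in the path intact.
    while IFS= read -r path; do
        count=$(wc -w < "$path" | awk '{print $1}')
        printf '%s\t%s\n' "$count" "$path"
    done
}

# Demo with a scratch file whose name contains a space.
tmpdir=$(mktemp -d)
printf 'just three words\n' > "$tmpdir/sample text.txt"
result=$(printf '%s\n' "$tmpdir/sample text.txt" | count_paths)
echo "$result"
rm -rf "$tmpdir"
```

Because each line is read whole and then quoted, spaces in the path are handled the same way as with "${f}" in the arguments version.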