Pages and Terminal output different results of word count

If I pass my .rtf file to the wc -w command, it reports 6909 words, whereas the word counter in Apple's Pages shows 2617. Why?

Mac OS X (10.7.5), MacBook Pro 15.4 mid-2012

Posted on Apr 29, 2016 3:40 PM

Question marked as Best reply

Posted on Apr 29, 2016 7:45 PM

Because you are passing the raw RTF syntax in that document to wc, whereas Pages counts only the actual words in your document. RTF control words permeate that .rtf document, and wc -w counts every whitespace-separated token, control words included. Here is an example:


{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210

{\fonttbl\f0\fnil\fcharset0 Baskerville;}

{\colortbl;\red255\green255\blue255;\red26\green26\blue26;\red164\green8\blue0;}

\margl1440\margr1440\vieww8820\viewh17160\viewkind0

\deftab720

\pard\pardeftab720\sa200\pardirnatural

{\header \pard\ql\b\f0\fs28 [ zig.rtf ] \par}





\f0\fs38 \cf2 Lorem ipsum dolor sit amet, ligula suspendisse nulla pretium, rhoncus tempor placerat fermentum, enim integer ad vestibulum volutpat. Nisl rhoncus turpis est, vel elit, congue wisi enim nunc ultricies sit, magna tincidunt. Maecenas aliquam maecenas \cf0 ligula\cf2 \cf0 nostra\cf2 , accumsan taciti. Sociis mauris in integer, a dolor netus non dui aliquet, sagittis felis sodales, dolor sociis mauris, vel eu libero cras. Interdum at. Eget habitasse elementum est, ipsum purus pede porttitor class, ut adipiscing, aliquet sed auctor, imperdiet arcu per diam dapibus libero duis. Enim eros in vel, volutpat nec pellentesque leo, {\field{\*\fldinst{HYPERLINK "http://www.hp.com"}}{\fldrslt temporibus}} scelerisque \cf3 nec\cf2 .\


and this is the text that Pages '09 shows you:

User uploaded file
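
The inflation is easy to reproduce on a small scale. Below is a minimal sketch, assuming a made-up three-line RTF file (not the poster's document) whose visible text is five words; the control words are borrowed from the sample above.

```shell
#!/bin/sh
# Write a tiny, made-up RTF file whose visible text is exactly five words.
sample=$(mktemp)
printf '%s\n' \
  '{\rtf1\ansi\deftab720' \
  '{\fonttbl\f0\fnil\fcharset0 Baskerville;}' \
  '\f0\fs38 \cf2 Lorem ipsum dolor sit amet.}' > "$sample"

# wc -w splits on whitespace only, so each run of control words counts as a "word".
raw_count=$(wc -w < "$sample" | awk '{print $1}')
echo "wc -w on raw RTF: $raw_count (visible words: 5)"

rm -f "$sample"
```

Here wc -w reports 10 tokens for five visible words, which is the same effect as 6909 versus 2617 in the original question.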

32 replies

Apr 30, 2016 2:33 AM in response to VikingOSX

Any ideas on how to make it produce a more accurate count? Some UNIX command? I want to make it part of my Automator workflow, which takes a .PDF file as input, extracts the text into a .rtf or .txt file, and passes that text file to the shell script action. All I need is to write that script correctly so that I get accurate figures.


Thanks

Apr 30, 2016 6:20 AM in response to scrutinizer82

Tony's solution will work fine, though since you are interested in word count, I would suggest piping the textutil output to wc -w. This, however, gets you a word count that is left-padded with spaces (seven for a single-digit count). To get your word count with the leading and trailing spaces trimmed, use the following:


$ textutil -stdout -convert txt five_words.rtf | wc -w | awk '{print $1}'

Result: 5


Here is an Automator workflow that prompts for a PDF file, extracts text from it as RTF, passes that RTF through the above command sequence, and then pops up a dialog displaying the word count. In my example, this was a one-page PDF.

User uploaded file

User uploaded file
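
The trimming step can be tried on its own, without textutil or an RTF file. This is a minimal sketch that feeds awk a hand-made padded string standing in for the output of wc -w on macOS; the helper name trim_count is hypothetical. One caveat worth knowing: a pattern-only awk program such as '$1=$1' prints nothing when the count is 0, because the assigned value doubles as the print condition, so the sketch prints the first field explicitly.

```shell
#!/bin/sh
# Simulate the left-padded number that wc -w prints on macOS.
padded='       5'

# Printing the first field drops the padding and still prints a count of 0.
trim_count() { awk '{print $1}'; }

printf '%s\n' "$padded" | trim_count     # prints 5
printf '%s\n' '       0' | trim_count    # prints 0
```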

Apr 30, 2016 8:10 AM in response to scrutinizer82

You could do the math in the 'awk' segment

textutil -stdout -convert txt five_words.rtf | wc -w | awk '{print $1 * 10}'

If you need to pass in additional values for the math, you can do something like

textutil -stdout -convert txt five_words.rtf | wc -w | \
awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}'

You can have as many -v options as you need.


If you need to capture the output you can use

answer=$(textutil -stdout -convert txt five_words.rtf | wc -w | \
awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}')

echo "$answer"
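
To see the arithmetic with concrete numbers, here is a small sketch. The values of var1 and var2 and the helper name scale_count are made up for illustration, and a padded "5" stands in for the output of the textutil and wc -w stages so the example runs anywhere.

```shell
#!/bin/sh
# Hypothetical values, not taken from the thread: a multiplier and an offset.
var1=10
var2=3

# Reads a word count on stdin and prints (count * v1) + v2.
# Inside the function, the shell's "$1" and "$2" are the two arguments;
# inside the single-quoted awk program, $1 is awk's first input field.
scale_count() {
  awk -v v1="$1" -v v2="$2" '{print ($1 * v1) + v2}'
}

# The padded "5" stands in for: textutil -stdout -convert txt file.rtf | wc -w
answer=$(printf '%s\n' '       5' | scale_count "$var1" "$var2")
echo "$answer"    # prints 53
```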

May 4, 2016 2:29 PM in response to VikingOSX

I can't seem to make it work. When I run it from within Automator, I get the error message "The action “Run Shell Script” encountered an error." The workflow failed: "The file [which is the .rtf file converted from the txt that was extracted from the PDF] doesn't exist". The file exists, but the message reporting that "the file doesn't exist" shows only the tail of the filename as if it were the whole filename. I guess it's because the UNIX script can't handle spaces in the file name.


Update. The "Run Shell Script" action still fails. I renamed the file, replacing the spaces with underscores, and got the message "execution error: no user interaction allowed".


Update 1. In your example, "Run Shell Script" receives input as arguments. It failed until I changed that to "stdin". However, it doesn't pop up a Finder window displaying the count result.


What would be my next steps?

May 4, 2016 2:54 PM in response to scrutinizer82

Screen capture your workflow (shift+command+4 and drag around the Extract PDF Text and Run Shell Script actions), and use the camera icon in this editor to post the captured image here. You have a syntax error somewhere. You are correct about the spaces: the shell splits unquoted arguments on whitespace, so spaces in UNIX file names must be escaped or quoted.


Show me your code.
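
For reference, a quoted version of such a script might look like the sketch below. It is not the poster's actual script; the function names are hypothetical, and it assumes the Run Shell Script action is set to pass input as arguments, so each file path arrives in "$@". textutil ships with macOS.

```shell
#!/bin/sh
# Trim wc's padding by printing only the first field.
word_count() { wc -w | awk '{print $1}'; }

# Count the words in every file passed as an argument.
# Quoting "$@" and "$f" keeps a path such as "My Report.rtf" as one argument
# instead of letting the shell split it at the space.
rtf_word_count() {
  for f in "$@"; do
    textutil -stdout -convert txt "$f" | word_count
  done
}

# In the Run Shell Script action (input passed as arguments), the call would be:
# rtf_word_count "$@"
```

With unquoted $f, a path containing a space is split into two arguments, which matches the symptom of an error message that shows only the tail of the filename.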
