scrutinizer82

Q: Pages and Terminal output different results of word count

If I pass my .rtf file to the wc -w command, it returns 6909 words, whereas the word counter in Apple's Pages shows 2617. Why?

Mac OS X (10.7.5), MacBook Pro 15.4 mid-2012

Posted on Apr 29, 2016 3:42 PM

  • by VikingOSX, Apple recommended

    VikingOSX, Apr 29, 2016 7:45 PM, in response to scrutinizer82
    Level 7 (20,591 points)

    Because you are passing the raw RTF markup in that document to wc, whereas Pages counts only the actual words in your document. RTF control words permeate the whole .rtf file. Here is an example:

     

    {\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210

    {\fonttbl\f0\fnil\fcharset0 Baskerville;}

    {\colortbl;\red255\green255\blue255;\red26\green26\blue26;\red164\green8\blue0;}

    \margl1440\margr1440\vieww8820\viewh17160\viewkind0

    \deftab720

    \pard\pardeftab720\sa200\pardirnatural

    {\header \pard\ql\b\f0\fs28 [ zig.rtf ] \par}

     

     

     

     

    \f0\fs38 \cf2 Lorem ipsum dolor sit amet, ligula suspendisse nulla pretium, rhoncus tempor placerat fermentum, enim integer ad vestibulum volutpat. Nisl rhoncus turpis est, vel elit, congue wisi enim nunc ultricies sit, magna tincidunt. Maecenas aliquam maecenas \cf0 ligula\cf2  \cf0 nostra\cf2 , accumsan taciti. Sociis mauris in integer, a dolor netus non dui aliquet, sagittis felis sodales, dolor sociis mauris, vel eu libero cras. Interdum at. Eget habitasse elementum est, ipsum purus pede porttitor class, ut adipiscing, aliquet sed auctor, imperdiet arcu per diam dapibus libero duis. Enim eros in vel, volutpat nec pellentesque leo, {\field{\*\fldinst{HYPERLINK "http://www.hp.com"}}{\fldrslt temporibus}} scelerisque \cf3 nec\cf2 .\

     

    and this is the text that Pages '09 shows you:

    Screen Shot 2016-04-29 at 10.43.56 PM.jpg
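A small shell sketch of the effect described above: wc -w splits on whitespace only, so each run of RTF control words is counted as a "word". The sample fragment and the helper name count_words are made up for illustration, and this says nothing about how Pages itself counts.

```shell
#!/bin/sh
# Count whitespace-separated tokens in a string, the way wc -w does.
# count_words is a hypothetical helper name used only for this illustration.
count_words() {
    printf '%s\n' "$1" | wc -w | tr -d ' '
}

# A made-up RTF fragment: three real words wrapped in control words.
rtf='{\rtf1\ansi {\fonttbl\f0 Baskerville;} \f0\fs38 Lorem ipsum dolor}'
plain='Lorem ipsum dolor'

echo "raw RTF:    $(count_words "$rtf")"
echo "plain text: $(count_words "$plain")"
```

The raw fragment counts as 7 words and the plain text as 3, which is the same kind of inflation as 6909 versus 2617.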

  • by scrutinizer82,

    scrutinizer82, Apr 30, 2016 2:33 AM, in response to VikingOSX
    Level 1 (43 points)

    Any ideas on how to make it produce a more relevant count? Is there some UNIX command? I want to make this part of an Automator workflow that takes a .PDF file as input, extracts the text into an .rtf or .txt file, and passes that text file to the shell script action. All I need is to set up the script correctly so that I get relevant figures.

     

    Thanks

  • by Tony T1,

    Tony T1, Apr 30, 2016 5:24 AM, in response to scrutinizer82
    Level 6 (9,232 points)

    You can use textutil to convert to txt, write the result to stdout, and pipe it to wc:

     

         textutil -stdout -convert txt Untitled.rtf | wc

  • by VikingOSX,

    VikingOSX, Apr 30, 2016 6:20 AM, in response to scrutinizer82
    Level 7 (20,591 points)

    Tony's solution will work fine, though since you are interested in the word count, I would suggest piping the textutil output to wc -w. This, however, gets you a word count that is left-padded with spaces (seven for a single-digit count). To get the word count with leading and trailing spaces trimmed, use the following:

     

    $ textutil -stdout -convert txt five_words.rtf | wc -w | awk '$1=$1'

    Result: 5
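One caveat with awk '$1=$1', as far as I know: the assignment is used as the pattern, so a count of 0 evaluates as false and nothing is printed. A minimal alternative sketch, assuming you want 0 printed for an empty document, is to print the first field explicitly. The helper name trim_count and the padded sample inputs are illustrative stand-ins for real wc -w output.

```shell
#!/bin/sh
# Strip the padding that wc adds around its count.
# trim_count is a hypothetical helper; it reads wc-style output on stdin.
trim_count() {
    awk '{ print $1 }'
}

# Simulated wc -w output: right-aligned counts with leading spaces.
printf '       5\n' | trim_count
printf '       0\n' | trim_count
```

In the real pipeline this would replace the last stage: textutil ... | wc -w | awk '{ print $1 }'.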

     

    Here is an Automator workflow that prompts for a PDF file, extracts its text as RTF, passes that RTF through the above command sequence, and then pops up a dialog displaying the word count. In my example, this was a one-page PDF.

    Screen Shot 2016-04-30 at 9.16.24 AM.jpg

    Screen Shot 2016-04-30 at 9.16.51 AM.jpg

  • by scrutinizer82,

    scrutinizer82, Apr 30, 2016 6:33 AM, in response to VikingOSX
    Level 1 (43 points)

    Intense! Now I'd like to pass that value to a math operation and pop up a dialog displaying the result.

  • by VikingOSX,

    VikingOSX, Apr 30, 2016 7:03 AM, in response to scrutinizer82
    Level 7 (20,591 points)

    UNIX provides the bc calculator; see man bc. The following uses the $words variable from my previous Automator example. You already know how to display the result in an AppleScript dialog from the command line.

    $ math_result=$(echo "$words * 10" | bc)

    math_result = 3610.
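For whole-number math, the shell's own arithmetic expansion is an alternative to bc. A minimal sketch, assuming words holds 361 (inferred from the 3610 result above); bc or awk remains the better fit once decimals are involved.

```shell
#!/bin/sh
# Assumed value: 3610 / 10, inferred from the result shown above.
words=361

# POSIX arithmetic expansion handles integers only (no decimals).
math_result=$((words * 10))
echo "math_result = $math_result"
```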

  • by BobHarris,

    BobHarris, Apr 30, 2016 8:10 AM, in response to scrutinizer82
    Level 6 (19,272 points)

    You could do the math in the 'awk' segment

    textutil -stdout -convert txt five_words.rtf | wc -w | awk '{print $1 * 10}'

    If you need to pass in additional values for the math, you can do something like

    textutil -stdout -convert txt five_words.rtf | wc -w | \
      awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}'

    You can have as many -v options as you need.

    If you need to capture the output you can use

    answer=$(textutil -stdout -convert txt five_words.rtf | wc -w | \
      awk -v v1="${var1}" -v v2="${var2}" '{print ($1 * v1) + v2}')

    echo "$answer"

  • by VikingOSX,

    VikingOSX, Apr 30, 2016 9:40 AM, in response to BobHarris
    Level 7 (20,591 points)

    Hi Bob,

     

    Thanks for adding the awk calculations.

  • by scrutinizer82,

    scrutinizer82, May 4, 2016 2:29 PM, in response to VikingOSX
    Level 1 (43 points)

    I can't get it to work. When I run it from within Automator, I get the error message "The action “Run Shell Script” encountered an error." The workflow log says: "The file [the .rtf file converted from the .txt that holds the text extracted from the PDF] doesn't exist". The file does exist, but the message shows only the tail of the file name as if it were the whole name. I guess that's because the shell script can't handle spaces in the file name.

     

    Update. The "Run Shell Script" action still fails. I renamed the file, replacing the spaces with underscores, and now I get the message "execution error: no user interaction allowed".

     

    Update 1. In your example, "Run Shell Script" receives its input as arguments. It failed until I changed that to "stdin". However, it doesn't pop up a Finder window displaying the count result.

     

    What would be my next steps?

  • by scrutinizer82,

    scrutinizer82, May 4, 2016 2:52 PM, in response to scrutinizer82
    Level 1 (43 points)

    Also, my workflow is saved as an Application, is set to receive files and folders as input, and consists of only two actions: Extract PDF Text --> Run Shell Script. I intend to run it by dragging PDFs onto the app's icon.

  • by VikingOSX,

    VikingOSX, May 4, 2016 2:54 PM, in response to scrutinizer82
    Level 7 (20,591 points)

    Take a screen capture of your workflow (shift+command+4, then drag around the Extract PDF Text and Run Shell Script actions), and use the camera icon in this editor to post the image here. You have a syntax error somewhere. You are correct about the spaces: in a shell script, spaces in file names must be escaped or quoted.

     

    Show me your code.

  • by scrutinizer82,

    scrutinizer82, May 4, 2016 3:02 PM, in response to VikingOSX
    Level 1 (43 points)

    Hi, VikingOSX,

    Thanks for your reply. Here are two screenshots of my workflow: the first is the main window and the second is the workflow's log. I use "Get Specified Finder Items" for test purposes:

     

    Shell.png

    Shell 2.png

  • by VikingOSX,

    VikingOSX, May 4, 2016 3:13 PM, in response to scrutinizer82
    Level 7 (20,591 points)

    In your Run Shell Script action, change the solitary $f to "${f}" in the textutil command line, and it will handle filenames with spaces. Since I grew up on UNIX, I never, ever use filenames with spaces, or filenames that look like sentences.
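A minimal sketch of the quoting point, runnable outside Automator. It assumes the action passes file paths as arguments, so they arrive in "$@". The function name count_each and the temporary file name are made up, and plain wc -w stands in for the textutil pipeline so the example does not depend on an RTF file.

```shell
#!/bin/sh
# Print the word count of each file passed as an argument.
# count_each is a hypothetical name; wc -w stands in for the textutil pipeline.
count_each() {
    for f in "$@"; do
        # Quoting "${f}" keeps a path with spaces as a single argument.
        wc -w < "${f}" | awk '{ print $1 }'
    done
}

# Demo with a file whose name contains spaces.
tmpdir=$(mktemp -d)
printf 'one two three\n' > "${tmpdir}/my test file.txt"
count_each "${tmpdir}/my test file.txt"
rm -r "${tmpdir}"
```

With an unquoted $f, the shell splits the path at each space, which would be consistent with an error message that shows only the tail of the file name.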

  • by scrutinizer82,

    scrutinizer82, May 4, 2016 3:24 PM, in response to VikingOSX
    Level 1 (43 points)

    Something is really going wrong here, because I still get the error message.
