
Pages and Terminal output different results of word count

If I pass my .rtf file to the wc -w command, it reports 6909 words, whereas the word counter in Apple's Pages shows 2617. Why?

Mac OS X (10.7.5), MacBook Pro 15.4 mid-2012

Posted on Apr 29, 2016 3:40 PM

Question marked as Best reply

Posted on Apr 29, 2016 7:45 PM

Because you are passing the raw RTF syntax in that document to wc, whereas Pages counts only the actual words in your document. RTF control words permeate that .rtf document, and wc counts every whitespace-separated chunk of them as a word. Here is an example:


{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210

{\fonttbl\f0\fnil\fcharset0 Baskerville;}

{\colortbl;\red255\green255\blue255;\red26\green26\blue26;\red164\green8\blue0;}

\margl1440\margr1440\vieww8820\viewh17160\viewkind0

\deftab720

\pard\pardeftab720\sa200\pardirnatural

{\header \pard\ql\b\f0\fs28 [ zig.rtf ] \par}





\f0\fs38 \cf2 Lorem ipsum dolor sit amet, ligula suspendisse nulla pretium, rhoncus tempor placerat fermentum, enim integer ad vestibulum volutpat. Nisl rhoncus turpis est, vel elit, congue wisi enim nunc ultricies sit, magna tincidunt. Maecenas aliquam maecenas \cf0 ligula\cf2 \cf0 nostra\cf2 , accumsan taciti. Sociis mauris in integer, a dolor netus non dui aliquet, sagittis felis sodales, dolor sociis mauris, vel eu libero cras. Interdum at. Eget habitasse elementum est, ipsum purus pede porttitor class, ut adipiscing, aliquet sed auctor, imperdiet arcu per diam dapibus libero duis. Enim eros in vel, volutpat nec pellentesque leo, {\field{\*\fldinst{HYPERLINK "http://www.hp.com"}}{\fldrslt temporibus}} scelerisque \cf3 nec\cf2 .\


and this is the text that Pages '09 shows you:

User uploaded file
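
To make the difference concrete, here is a minimal sketch in Python 3. It is not a real RTF parser: crude_rtf_to_text is a hypothetical helper that only drops the font and colour tables, control words, control symbols and braces, which is enough for a simple sample like the one above. The sample string is made up for illustration.

```python
import re

def wc_like_count(text):
    # wc -w counts maximal runs of non-whitespace characters.
    return len(text.split())

def crude_rtf_to_text(rtf):
    # ASSUMPTION: a rough approximation only, not a full RTF parser.
    # Drop header groups such as {\fonttbl ...} and {\colortbl ...}.
    rtf = re.sub(r'\{\\(?:fonttbl|colortbl)[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', ' ', rtf)
    # Drop control words (e.g. \fs38, \cf2) plus their one delimiting space.
    rtf = re.sub(r'\\[a-zA-Z]+-?\d* ?', '', rtf)
    # Drop control symbols (backslash followed by one non-letter).
    rtf = re.sub(r'\\[^a-zA-Z]', ' ', rtf)
    # Drop the remaining group braces.
    return re.sub(r'[{}]', ' ', rtf)

# Made-up sample in the style of the document above.
sample = (r'{\rtf1\ansi{\fonttbl\f0\fnil\fcharset0 Baskerville;}' + '\n' +
          r'\f0\fs38 \cf2 Lorem ipsum \cf0 dolor \cf2 sit amet.}')

print(wc_like_count(sample))                     # raw RTF, as wc -w sees it
print(wc_like_count(crude_rtf_to_text(sample)))  # visible text only
```

For this sample the raw count is 11 and the stripped count is 5, although the visible text is just "Lorem ipsum dolor sit amet." For real documents it is safer to convert the file to plain text first (on OS X, textutil -convert txt -stdout file.rtf | wc -w should do that) rather than rely on regular expressions.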

32 replies

May 8, 2016 5:06 AM in response to scrutinizer82

Hello


You might also try something like the following Python script, which uses PDFKit via the PyObjC bridge. Note that its definition of a word differs from that of wc(1), which simply counts chunks of text separated by white space. If you prefer wc's behaviour, it is easy to amend the code accordingly. DEBUG output is currently enabled by DEBUG = 1 in the code; to disable it, change this to DEBUG = 0.



#!/usr/bin/python
# coding: utf-8
#
#   file:
#       pdf_wc.py
#
#   function:
#       print word count of pdf file(s) to stdout and show dialogue per file
#
#   usage:
#       ./pdf_wc.py file [file ...]
#
#   version:
#       0.20
#           - using PDFSelection -selectionsByLine in order to treat
#             hyphenation character (to be removed) at end of line
#
#   written by Hiroto, 2016-05
#
DEBUG = 1

import sys, os, re, subprocess
from Quartz.PDFKit import PDFDocument
from Foundation import NSURL

r = re.compile(r'\w+', re.U)

for f in [ a.decode('utf-8') for a in sys.argv[1:] if re.search(r'\.pdf$', a, re.I) ]:
    doc = PDFDocument.alloc().initWithURL_(NSURL.fileURLWithPath_(f))
    if not doc:
        sys.stderr.write('%s: not a pdf file\n' % f.encode('utf-8'))
        continue
    fn = os.path.basename(f)
    tt = []
    for line in doc.selectionForEntireDocument().selectionsByLine():
        t = line.attributedString().string()
        t = t.rstrip(' ') + ' '         # normalise line so as to end with space
        t = re.sub(r'- $', '', t)       # remove hyphenation character followed by space at end of line
        tt.append(t)
    s = ''.join(tt)

    if DEBUG:
        print '[DEBUG]'
        print fn.encode('utf-8')                # file name
        for t in tt:
            print '[%s]' % t.encode('utf-8')    # lines in source text (preprocessed)
        print s.encode('utf-8')                 # source text (extracted from pdf)
        for w in r.findall(s):
            print w.encode('utf-8')             # list of words thereof
        print '[/DEBUG]'

    wc = len(r.findall(s))
    sys.stdout.write('%6d\t%s\n' % (wc, fn.encode('utf-8')))
    sys.stdout.flush()

    ascr='''
on run argv
    tell application "System Events"
        activate
        display dialog "Word Count: " & argv's item 1 & return & "File Name: " & argv's item 2
    end tell
    return
end run
'''
    p = subprocess.Popen(['osascript', '-e', ascr, str(wc), fn])
    p.communicate()
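
To illustrate how the two definitions of a word diverge, and what the end-of-line hyphen handling does, here is a small standalone sketch in Python 3 that needs no PDFKit. The function names (join_lines, count_regex_words, count_wc_words) and the two sample lines are made up for this example; only the \w+ pattern and the two preprocessing steps are taken from the script above.

```python
import re

WORD = re.compile(r'\w+')  # str patterns are Unicode-aware by default in Python 3

def join_lines(lines):
    # Same preprocessing as the script: make each line end with one space,
    # then drop a hyphen that sits at the very end of a line.
    parts = []
    for line in lines:
        t = line.rstrip(' ') + ' '
        t = re.sub(r'- $', '', t)
        parts.append(t)
    return ''.join(parts)

def count_regex_words(text):
    # The script's definition: runs of word characters.
    return len(WORD.findall(text))

def count_wc_words(text):
    # wc -w style: chunks separated by white space.
    return len(text.split())

# Hypothetical lines as they might come out of a PDF, with a word
# hyphenated across the line break.
lines = ["A well-known exam-", "ple isn't hard."]
text = join_lines(lines)

print(repr(text))
print(count_regex_words(text))
print(count_wc_words(text))
```

Here the rejoined text is "A well-known example isn't hard. ", which the \w+ pattern counts as 7 words (well, known, isn and t are separate matches) and the wc-style split counts as 5. To get wc's behaviour in the script, replacing len(r.findall(s)) with len(s.split()) should be all that is needed. Note that the hyphen removal also joins a genuinely hyphenated compound if it happens to break at the end of a line.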




To use it in an Automator workflow, add a "Run Shell Script" action configured as follows, preceded by an action that returns a list of (PDF) files. (The Python code above ignores files other than *.pdf.)


Run Shell Script action:

- shell = /usr/bin/python

- pass input = as arguments

- code = as listed above



E.g.,


User uploaded file




The code was briefly tested with PyObjC 2.2b3 and Python 2.6.1 under OS X 10.6.8.


Good luck,

H
