scrutinizer82

Q: Pages and Terminal give different word counts

If I pass my .rtf file to the wc -w command, it reports 6909 words, whereas the word counter in Apple's Pages shows 2617. Why?

Mac OS X (10.7.5), MacBook Pro 15.4 mid-2012

Posted on Apr 29, 2016 3:42 PM

first Previous Page 3 of 3
  • by scrutinizer82,

    scrutinizer82 May 7, 2016 4:53 PM in response to VikingOSX
    Level 1 (43 points)
    Mac OS X

    Shell scripting seems too hard for me for now. I managed to solve the problem by adding a "Run AppleScript" action and executing the shell script from there.

    I passed a very simple script containing a variable:

    set WordCount to do shell script "wc -w /Users/myusername/Desktop/*.txt"
    tell application "Finder"
        display dialog WordCount
    end tell

    That's it, no more torture trying to painstakingly figure out all the subtleties of a very capricious shell (though I'm learning UNIX and would still like to master shell scripting one day, to the level where I can use the full power of OS X, which I need for my daily work).

    P.S. BTW, why does "Run AppleScript" have that strange layout, "on run, parameters blah-blah-blah (*your text goes here*) end run"? It's utter nonsense: when I inserted my script the first time it went nowhere, so I just deleted all that junk and typed it in as I would in AppleScript Editor, and it worked!

  • by Hiroto,

    Hiroto May 8, 2016 5:06 AM in response to scrutinizer82
    Level 5 (7,281 points)

    Hello

    You might also try something like the following Python script, which uses PDFKit via the PyObjC bridge. Note that its definition of a word differs from that of wc(1), which just counts chunks of text separated by white space. If you prefer wc's behaviour, it is easy to amend the code accordingly. DEBUG output is currently enabled by DEBUG = 1 in the code; to disable it, change it to DEBUG = 0.

    #!/usr/bin/python
    # coding: utf-8
    # 
    #   file:
    #       pdf_wc.py
    #   
    #   function:
    #       print word count of pdf file(s) to stdout and show dialogue per file
    #       
    #   usage:
    #       ./pdf_wc.py file [file ...]
    #   
    #   version:
    #       0.20
    #           - using PDFSelection -selectionsByLine in order to treat 
    #               hyphenation character (to be removed) at end of line
    # 
    #   written by Hiroto, 2016-05
    # 
    DEBUG = 1
    
    import sys, os, re, subprocess
    from Quartz.PDFKit import PDFDocument
    from Foundation import NSURL
    
    r = re.compile(r'\w+', re.U)
    
    for f in [ a.decode('utf-8') for a in sys.argv[1:] if re.search(r'\.pdf$', a, re.I) ]:
        doc = PDFDocument.alloc().initWithURL_(NSURL.fileURLWithPath_(f))
        if not doc:
            sys.stderr.write('%s: not a pdf file\n' % f.encode('utf-8'))
            continue
        fn = os.path.basename(f)
        tt = []
        for line in doc.selectionForEntireDocument().selectionsByLine():
            t = line.attributedString().string()
            t = t.rstrip(' ') + ' '                         # normalise line so as to end with space
            t = re.sub(r'- $', '', t)                       # remove hyphenation character followed by space at end of line
            tt.append(t)
        s = ''.join(tt)
        if DEBUG:
            print '[DEBUG]'
            print fn.encode('utf-8')                        # file name
            for t in tt: print '[%s]' % t.encode('utf-8')   # lines in source text (preprocessed)
            print s.encode('utf-8')                         # source text (extracted from pdf)
            for w in r.findall(s): print w.encode('utf-8')  # list of words thereof
            print '[/DEBUG]'
    
        wc = len(r.findall(s))
        sys.stdout.write('%6d\t%s\n' % (wc, fn.encode('utf-8')))
        sys.stdout.flush()
        ascr='''
    on run argv
        tell application "System Events"
            activate
            display dialog "Word Count: " & argv's item 1 & return & "File Name: " & argv's item 2
        end tell
        return
    end run
    '''
        p = subprocess.Popen(['osascript', '-e', ascr, str(wc), fn])
        p.communicate()
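
To make the difference between the two word definitions concrete, here is a minimal sketch, separate from pdf_wc.py and written for Python 3 (an assumption; the script above targets Python 2). The function names are hypothetical. It compares the \w+ pattern used by the script with a whitespace split that approximates wc -w, and repeats the end-of-line hyphen handling on plain strings. One caveat: that handling also removes a genuine hyphen or dash that happens to end a line.

```python
import re

# Same pattern as in pdf_wc.py; in Python 3, \w is Unicode-aware by default.
WORD_RE = re.compile(r'\w+')


def count_regex_words(text):
    """Count runs of word characters, the definition pdf_wc.py uses."""
    return len(WORD_RE.findall(text))


def count_wc_words(text):
    """Count whitespace-separated chunks, approximating wc -w."""
    return len(text.split())


def join_lines(lines):
    """Join extracted lines, dropping a hyphen that ends a line."""
    parts = []
    for line in lines:
        line = line.rstrip(' ') + ' '    # normalise: exactly one trailing space
        line = re.sub(r'- $', '', line)  # remove end-of-line hyphen and the space
        parts.append(line)
    return ''.join(parts)


sample = "It's a well-known fact."
print(count_regex_words(sample))  # 6: It, s, a, well, known, fact
print(count_wc_words(sample))     # 4: It's, a, well-known, fact.
print(repr(join_lines(["hyphen-", "ation works"])))  # 'hyphenation works '
```

To get wc-like totals from the script, replacing r.findall(s) with s.split() in both places where it appears should be the only change needed.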
    

     

     

     

    To use it in an Automator workflow, add a "Run Shell Script" action configured as follows, preceded by an action that returns a list of (PDF) files. (The Python code above ignores files other than *.pdf.)

     

    Run Shell Script action:

    - shell = /usr/bin/python

    - pass input = as arguments

    - code = as listed above
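
With "pass input = as arguments", the items returned by the preceding action reach the script as command-line arguments, i.e. in sys.argv[1:]. The following is only a sketch of that hand-off with the same case-insensitive .pdf filter the script applies. select_pdfs is a hypothetical helper name, and the code is written to run under Python 3 rather than the Python 2 interpreter used above.

```python
import os
import re
import sys


def select_pdfs(args):
    """Keep only the arguments whose name ends in .pdf (case-insensitive)."""
    return [a for a in args if re.search(r'\.pdf$', a, re.I)]


if __name__ == '__main__':
    # Automator supplies one argument per input file.
    for path in select_pdfs(sys.argv[1:]):
        print(os.path.basename(path))
```

Running it from Terminal with a mix of file types is a quick way to check what a workflow would pass through before wiring up the full script.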

     

     

    E.g., [screenshot: a.png]

    The pdf_wc.py script has been briefly tested with PyObjC 2.2b3 and Python 2.6.1 under OS X 10.6.8.

     

    Good luck,

    H

  • by VikingOSX,

    VikingOSX May 8, 2016 5:48 AM in response to scrutinizer82
    Level 7 (20,606 points)
    Mac OS X

    The default code that you encounter in the Run Shell Script or Run AppleScript actions is just courtesy boilerplate and can be replaced with your own code, depending on your overall workflow goal. The on run block is actually foundational for Automator Services that receive text selection input from an application.
