
Count words in a document

1357 Views 16 Replies Latest reply: Nov 16, 2012 5:34 AM by VikingOSX
RL6001
Nov 11, 2012 3:13 AM

Hello all,

 

I want to count words in a document. First I made a program that counts letters in a document:

 

 

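#include <stdio.h>   /* plain C I/O is all this program needs */

int main(int argc, const char * argv[])
{
    int count = 0, n;
    int SearchChar = 'H';   /* the character to count */
    FILE *document;

    document = fopen("/Users/mymac/Desktop/test.txt", "r");
    if (document == NULL) {
        perror("fopen");    /* the file may be missing or unreadable */
        return 1;
    }

    /* fgetc returns an int so that EOF can be told apart from real characters */
    while ((n = fgetc(document)) != EOF) {
        if (n == SearchChar) {
            count += 1;
        }
    }
    printf("%d\n", count);

    fclose(document);
    return 0;
}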

 

 

I have no idea how to count words in a document (as you can see, it is a .txt document). Which function do I need? fgetc can't handle words, but is there a function that can?

 

RL6001

  • Jongware Level 2 (265 points)
    Nov 11, 2012 4:47 AM (in response to RL6001)

    What is a word? If you can describe that, it's as simple as this:

     

    1. set counter to 0

    2. read one single character. If it is (possibly) a part of a word, keep reading until you find a character that 'ends' this word. Increase counter by 1.

    3. If it is *not* a word character, keep reading until you find a character that is not 'not' a word. Go to 1.

     

    So, all that's left is to describe what a 'word character' is and/or what a *word* actually is.

     

    '?' for example is not a word character, and 'R' is. How about '-'? It's not a word character if used as an en-dash -- like this, but is 'computer-generated' one word or two? "and/or", one or two? Is a number a word? If not, is a digit in the middle of a 'word characters' sequence part of that word? (Can't think of a good example :) "born2run" could be 2 words -- arguably even 3. Perhaps "R2-D2"?)

     

    Are you familiar with GREP? Most implementations recognize the code "\w" for 'any word character', but every time I use it I have to allow for numerous exceptions of the kind I list above. On the other hand, it automatically recognizes accented characters (and, in the implementation I'm using, Greek, Cyrillic, Arabic, and Hebrew 'word' characters as well, as it supports the full Unicode set).
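To make the "\w" idea concrete in C, here is a rough sketch using the POSIX regex functions from <regex.h>. It assumes "word character" means a letter, a digit, or an underscore, written as the POSIX class [[:alnum:]_], which is only a stand-in for "\w". The function name count_word_runs is made up for this example. In the default "C" locale this matches ASCII only, so the accented and Unicode behaviour described above depends on the engine and locale and is not reproduced here.

```c
#include <regex.h>
#include <stddef.h>

/* Count maximal runs of "word characters" in a string.
 * Assumption: a word character is a letter, digit or underscore. */
long count_word_runs(const char *text)
{
    regex_t re;
    regmatch_t m;
    long count = 0;

    if (regcomp(&re, "[[:alnum:]_]+", REG_EXTENDED) != 0)
        return -1;                 /* pattern failed to compile */

    /* Find the next run, count it, then continue right after it. */
    while (regexec(&re, text, 1, &m, 0) == 0) {
        count++;
        text += m.rm_eo;
    }

    regfree(&re);
    return count;
}
```

With this definition "born2run" counts as one word while "R2-D2" and "and/or" each count as two, which shows how much the result depends on the answers to the questions above.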

     

    For further consideration: if you are playing Scrabble (or possibly "WordFeud") and someone lays down "hdgxyjr", surely you'll complain "but that's not a word"?

  • Jongware Level 2 (265 points)
    Nov 11, 2012 4:54 AM (in response to Jongware)

    (g) Posted on Apple's own forum from my iPad, and so I can't edit my post!

     

    Anyway, the "go to step 1" in step 3 should actually be "step 2".
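With that correction applied, the three steps could be sketched in C roughly like this. The sketch assumes that a "word character" is simply a letter or a digit (isalnum), which is exactly the decision the questions above are about, so is_word_char is the part to change. Both function names are made up for this example.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: letters and digits are word characters, nothing else is. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Count words on an open stream, reading one character at a time. */
long count_words(FILE *in)
{
    long count = 0;                 /* step 1: set counter to 0 */
    int c = fgetc(in);

    while (c != EOF) {
        if (is_word_char(c)) {
            /* step 2: inside a word, read until a character ends it */
            while (is_word_char(c = fgetc(in)))
                ;
            count++;
        } else {
            /* step 3: between words, read until a word character shows up */
            while ((c = fgetc(in)) != EOF && !is_word_char(c))
                ;
        }
    }
    return count;
}
```

Calling count_words(document) on the FILE * from the original program in place of the letter-counting loop would be enough to try it out.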

  • Jongware Level 2 (265 points)
    Nov 11, 2012 1:24 PM (in response to RL6001)

    I know what words are, and so do you. But can you program that into your computer?

     

    Right under my question "what is a word" I gave you an algorithm to count words, one character at a time.

  • VikingOSX Level 5 (4,785 points)
    Nov 14, 2012 9:21 AM (in response to RL6001)

    $ wc -mw < test.txt

     

    Seriously, here is a word count example from K & R, “The C Programming Language.”

     

    Screen Shot 2012-11-14 at 12.16.18 PM.png

  • Keith Barkley Level 5 (5,140 points)
    Nov 14, 2012 1:22 PM (in response to RL6001)

    Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

     

    If you didn't want to count the words, maybe you should not have said "I want to count the words in a document".

  • Jongware Level 2 (265 points)
    Nov 14, 2012 3:36 PM (in response to Keith Barkley)

    Keith Barkley wrote:

     

    Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

     

    I would suggest a different approach, based on RL6001's initial try:

     

    1. read any character; exit on EOF.

    2. Is it not a 'word character'? then go to 1

    3. Is it a 'word character'? Then it's the first of a word, and possibly even the first one of your search string. So ...

    3a. .. if it's *not* the first character of your search string, skip 'word characters' until you found a not-a-word character; go to 1 (There is no need to store this string.)

    3b. .. if it *is* the first character of your search string, keep on reading while the next character is your search string's next character. If you encounter a mismatch, go to 3a (skip the rest of this mis-matching word). If you encounter the end of your search string, read one more character; this one should be a not-a-word character. If it is a word character instead, go to 3a. This step is necessary because the input string may be "VikingOSXx", i.e., your string followed by more valid word characters. I guess you wouldn't want to include those in your word count.

    3c. If the following character is not a word character, increase your counter by 1 and go to step 1.
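A possible C rendering of steps 1 to 3c is sketched below. It is only an illustration under a few assumptions: a "word character" is a letter or digit (isalnum), the comparison is case-sensitive, and the search string itself contains only word characters. The names is_word_char and count_whole_word are made up for this example.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: letters and digits are word characters, nothing else is. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Count whole-word occurrences of target on a stream without storing words. */
long count_whole_word(FILE *in, const char *target)
{
    long count = 0;
    int c = fgetc(in);                        /* step 1 */

    if (target[0] == '\0')
        return 0;                             /* nothing to search for */

    while (c != EOF) {
        if (!is_word_char(c)) {               /* step 2: not in a word yet */
            c = fgetc(in);
            continue;
        }

        /* step 3/3b: c starts a word; follow target while it keeps matching */
        size_t i = 0;
        while (target[i] != '\0' && c == (unsigned char)target[i]) {
            c = fgetc(in);
            i++;
        }

        /* step 3c: all of target matched and the word ends right here */
        if (target[i] == '\0' && !is_word_char(c))
            count++;

        /* step 3a: skip whatever is left of the current word */
        while (is_word_char(c))
            c = fgetc(in);
    }
    return count;
}
```

Only the current character and a position in the search string are kept, so the length of the words in the file does not matter.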

     

    You do not need to store the actual word you 'are' reading, which is a good thing for several reasons. Most important, you would need to have an inkling of how long a word could get -- or have some active memory allocation scheme, which I presume is out of your reach at this point.

    Second, it would still require you to write a 'single word' scanner, which by itself already has to inspect each separate character. As long as you are doing that anyway, you might as well keep track of what word you are reading.

     

    The above assumes you do not want to count your target words inside another word, e.g., when searching for "the", you would not want to count the occurrences inside "there", "tithe", or "weather". But if you do, all it takes is an adjustment to step 3.

  • VikingOSX Level 5 (4,785 points)
    Nov 14, 2012 8:22 PM (in response to RL6001)

    The following C code will find n occurrences of a specified word in a document.

     

    The program reads a supplied text file, one line at a time until EOF.

    It loops through each line, and creates word tokens based on a custom delim string.

    For each word token, it performs a string compare against a command-line supplied search string.

    When there is a match, it increments the word count. Otherwise, it scarfs another word.

     

    strtok_r is the re-entrant version of strtok().
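The program itself was posted as a screenshot, so the following is not that code, only a sketch of the approach as described: read lines with fgets, split them with strtok_r, and compare each token with strcmp. The delimiter set, the 4096-byte line buffer, and the names count_in_line and count_in_file are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* asks for the strtok_r declaration */
#include <stdio.h>
#include <string.h>

/* Hypothetical delimiter set; anything listed here separates words. */
static const char DELIMS[] = " \t\r\n.,;:!?\"()[]{}";

/* Count tokens in one line that equal target. The line is modified in place. */
long count_in_line(char *line, const char *target)
{
    long count = 0;
    char *save = NULL;

    for (char *tok = strtok_r(line, DELIMS, &save);
         tok != NULL;
         tok = strtok_r(NULL, DELIMS, &save)) {
        if (strcmp(tok, target) == 0)
            count++;
    }
    return count;
}

/* Read the stream line by line until EOF and add up the matches. */
long count_in_file(FILE *in, const char *target)
{
    char line[4096];   /* assumption: a longer line is read in several pieces */
    long count = 0;

    while (fgets(line, sizeof line, in) != NULL)
        count += count_in_line(line, target);
    return count;
}
```

The comparison is case-sensitive, and a word that straddles the buffer boundary of a very long line would be split in two, so this is a starting point rather than a finished tool.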

     

    It compiles cleanly on Mountain Lion with gcc.

     

    Screen Shot 2012-11-14 at 11.21.59 PM.png

  • Keith Barkley Level 5 (5,140 points)
    Nov 15, 2012 9:03 AM (in response to VikingOSX)

    Of course, plurals and such may need to be handled, too.

  • VikingOSX Level 5 (4,785 points)
    Nov 15, 2012 12:47 PM (in response to RL6001)

    You are welcome and thanks for the points.

     

    After I posted the above example and logged out, I noticed that I had not removed a line of unnecessary code:

     

    int len=0;

     

    Compilation:

     

    Mountain Lion:  {gcc, llvm-gcc, clang} -O2 -o prog prog.c

     

    Fedora 17:          same gcc syntax as above.

     

    Ubuntu 12.10:     gcc -Wno-unused-result -O2 -o prog prog.c

                               (override default compiler warning when not checking fgets result status)

  • Jongware Level 2 (265 points)
    Nov 15, 2012 3:58 PM (in response to RL6001)

    >  It helped me verry much.

     

    Copying code helps you? Can you at least identify the problem areas I mentioned? Simple as it may be, VikingOSX's code contains several potential issues.

    ... Alternatively, you could wait for him to address them, of course.

  • VikingOSX Level 5 (4,785 points)
    Nov 15, 2012 8:09 PM (in response to Jongware)

    I made no attempt to code enterprise grade software, or address every potential issue. And, I have no intention of doing so, as there are other demands for my time. No one helped me write that example program from scratch, with the exception of adapting the for statement instructions from strtok_r(3).

     

    The code example that I provided serves multiple purposes:

    1. It solved the OP's original goal (with portable, functional code).
    2. It may be reusable, or extensible, in future development efforts.
    3. It may have instructional value.

     

    Yes, there is value in your contribution as it raises thoughtful questions for requirements decisions and code derivation. If you wanted to truly mentor, then you can take my code, and heavily comment it with accurate, realistic, reasons why it contains several potential issues. Or, provide your own example, similarly commented. Perhaps, it will add even more educational value to the OP as would a map through a minefield.
