16 Replies Latest reply: Nov 16, 2012 5:34 AM by VikingOSX
RL6001 Level 1 (0 points)

Hello all,

 

I want to count words in a document. First I made a program that counts how often one letter occurs in a document:

 

 

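#import <Foundation/Foundation.h>

int main(int argc, const char * argv[])
{
    int count = 0, n, SearchChar = 'H';
    FILE *document;

    document = fopen("/Users/mymac/Desktop/test.txt", "r");
    if (document == NULL) {     // fopen fails if the file is missing or unreadable
        perror("fopen");
        return 1;
    }

    // Read one character at a time until the end of the file.
    while ((n = fgetc(document)) != EOF)
    {
        if (n == SearchChar) {
            count += 1;
        }
    }
    printf("%d\n", count);

    fclose(document);
    return 0;
}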

 

 

I have no idea how to count words in a document (as you can see, it is a .txt document). Which function do I need? fgetc can't handle words, but is there a function that can?

 

RL6001

  • 1. Re: Count words in a document
    Jongware Level 2 (265 points)

    What is a word? If you can describe that, it's as simple as this:

     

    1. set counter to 0

    2. read one single character. If it is (possibly) a part of a word, keep reading until you find a character that 'ends' this word. Increase counter by 1.

    3. If it is *not* a word character, keep reading until you find a character that is not 'not' a word. Go to 1.
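    The steps above can be sketched in C roughly as follows. This is only a minimal sketch: it assumes that a 'word character' is a letter or a digit, and the names is_word_char and count_words are made up for illustration. The counter is set once at the start and the loop then alternates between steps 2 and 3.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: a "word character" is a letter or a digit.
   Change this one function to change what counts as a word. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Counts words in an open stream, one character at a time. */
long count_words(FILE *fp)
{
    long counter = 0;               /* step 1 */
    int c = fgetc(fp);

    while (c != EOF) {
        if (is_word_char(c)) {      /* step 2: inside a word */
            while (is_word_char(c))
                c = fgetc(fp);      /* read until the word ends */
            counter++;
        } else {                    /* step 3: between words */
            while (c != EOF && !is_word_char(c))
                c = fgetc(fp);      /* skip to the next word */
        }
    }
    return counter;
}
```

    With this particular classifier, "born2run R2-D2" counts as three words (born2run, R2, D2), which shows how much the result depends on the definition you pick.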

     

    So, all that's left is to describe what a 'word character' is and/or what a *word* actually is.

     

    '?' for example is not a word character, and 'R' is. How about '-'? It's not a word character if used as an en-dash -- like this, but is 'computer-generated' one word or two? "and/or", one or two? Is a number a word? If not, is a digit in the middle of a 'word characters' sequence part of that word? (Can't think of a good example :) "born2run" could be 2 words -- arguably even 3. Perhaps "R2-D2"?)

     

    Are you familiar with GREP? Most implementations recognize the code "\w" for 'any word character', but every time I use it I have to allow for numerous exceptions of the kind I list above. On the other hand, it automatically recognizes accented characters (and, in the implementation I'm using, Greek, Cyrillic, Arabic, and Hebrew 'word' characters as well, as it supports the full Unicode set).

     

    For further consideration: if you are playing Scrabble (or possibly "WordFeud") and someone lays down "hdgxyjr", surely you'll complain "but that's not a word"?

  • 2. Re: Count words in a document
    Jongware Level 2 (265 points)

    (g) Posted on Apple's own forum from my iPad, and so I can't edit my post!

     

    Anyway, the "go to step 1" in step 3 should actually be "step 2".

  • 3. Re: Count words in a document
    RL6001 Level 1 (0 points)

    What do I mean by a word? Think of the Finder: you type, for example, "Learn", and it finds the PDF file "Learn C on the mac". If you type "Learn C on", it should still find the book "Learn C on the mac" (and of course all the other documents that also have "Learn C on" in the name). The tough part is that fgetc can only read one character at a time. So I can make an NSString containing "Learn", but it will never match a "word" exactly, because fgetc only gives me a single character. I hope I'm clear enough so you can help me.

     

    RL6001

  • 4. Re: Count words in a document
    Jongware Level 2 (265 points)

    I know what words are, and so do you. But can you program that into your computer?

     

    Right under my question "what is a word" I gave you an algorithm to count words, one character at a time.

  • 5. Re: Count words in a document
    VikingOSX Level 5 (5,500 points)

    $ wc -mw < test.txt

     

    Seriously, here is a word count example from K & R, “The C Programming Language.”

     

    [Screenshot attachment: Screen Shot 2012-11-14 at 12.16.18 PM.png]
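    The attachment is an image, so its code is not available as text here. As a rough stand-in, this is a sketch along the lines of the word count program in K & R section 1.5.4, reworked into a function that reads from a FILE pointer. The struct counts type and the count_all name are hypothetical, and the details may differ from what the screenshot shows. A word here is any run of characters without a space, tab, or newline.

```c
#include <stdio.h>

#define IN  1   /* inside a word */
#define OUT 0   /* outside a word */

struct counts { long lines, words, chars; };

/* Counts lines, words, and characters; a word is any run of
   characters that contains no space, tab, or newline. */
struct counts count_all(FILE *fp)
{
    struct counts n = {0, 0, 0};
    int c, state = OUT;

    while ((c = fgetc(fp)) != EOF) {
        n.chars++;
        if (c == '\n')
            n.lines++;
        if (c == ' ' || c == '\n' || c == '\t')
            state = OUT;
        else if (state == OUT) {    /* first character of a new word */
            state = IN;
            n.words++;
        }
    }
    return n;
}
```

    Calling count_all(stdin) and printing the words and chars fields gives roughly the same two numbers as the wc command above for plain ASCII text.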

  • 6. Re: Count words in a document
    RL6001 Level 1 (0 points)

    The last few days I was very busy with everything except programming (I hate days like that, but I'm back now).

    @Jongware, sorry man, I had only read the first sentence and posted a reply about that without reading the rest. Thanks anyway; it is always good to visualize the problem first.

    @VikingOSX It's about one given word. For example, I want to know how many times the word 'VikingOSX' occurs in a document. I don't want to know how many words there are in the text.

     

    RL6001

  • 7. Re: Count words in a document
    Keith Barkley Level 5 (5,260 points)

    Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

     

    If you didn't want to count the words, maybe you should not have said "I want to count the words in a document".

  • 8. Re: Count words in a document
    Jongware Level 2 (265 points)

    Keith Barkley wrote:

     

    Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

     

    I would suggest a different approach, based on RL6001's initial try:

     

    1. read any character; exit on EOF.

    2. Is it not a 'word character'? then go to 1

    3. Is it a 'word character'? Then it's the first of a word, and possibly even the first one of your search string. So ...

    3a. .. if it's *not* the first character of your search string, skip 'word characters' until you find a not-a-word character; go to 1. (There is no need to store this string.)

    3b. .. if it *is* the first character of your search string, keep on reading while the next character is your search string's next character. If you encounter a mismatch, go to 3a (skip the rest of this mismatching word). If you reach the end of your search string, read one more character. This one, in turn, should be a not-a-word character. If it is a word character instead, go to 3a. This check is necessary because the input may be "VikingOSXx", i.e., your string followed by more valid word characters. I guess you wouldn't want to include those in your word count.

    3c. If the following character is not a word character, increase your counter by 1 and go to step 1.
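    A minimal sketch of steps 1 to 3c in C, assuming again that word characters are letters and digits and that the comparison is case-sensitive. The function names is_word_char and count_target are invented for this example, and the search string is expected to be non-empty and to consist of word characters only.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: word characters are letters and digits. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Counts whole-word occurrences of `target` (non-empty, made of word
   characters only) without storing the words being read. */
long count_target(FILE *fp, const char *target)
{
    long counter = 0;
    int c = fgetc(fp);                       /* step 1 */

    while (c != EOF) {
        if (!is_word_char(c)) {              /* step 2 */
            c = fgetc(fp);
            continue;
        }
        /* step 3: c is the first character of a word */
        size_t i = 0;
        while (target[i] != '\0' && c == (unsigned char)target[i]) {
            c = fgetc(fp);                   /* step 3b: still matching */
            i++;
        }
        if (target[i] == '\0' && !is_word_char(c))
            counter++;                       /* step 3c: exact match */
        while (is_word_char(c))              /* step 3a: skip the rest */
            c = fgetc(fp);
    }
    return counter;
}
```

    Only one character and an index into the search string are kept at any time, so the length of the words in the file does not matter.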

     

    You do not need to store the actual word you 'are' reading, which is a good thing for several reasons. Most important, you would need to have an inkling of how long a word could get -- or have some active memory allocation scheme, which I presume is out of your reach at this point.

    Second, it would still require you to write a 'single word' scanner, which by itself already has to inspect each separate character. As long as you are doing that anyway, you might as well keep track of what word you are reading.

     

    The above assumes you do not want to count your target words inside another word, e.g., when searching for "the", you would not want to count the occurrences inside "there", "tithe", or "weather". But if you do, all it takes is an adjustment to step 3.
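    If matches inside other words should count as well, one simple route is to search a buffer that is already in memory (a line, for example) with strstr. Note that this is a different technique from adjusting step 3 of the character-by-character scanner; it is shown only as a sketch, and count_substring is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/* Counts occurrences of `needle` anywhere in `text`, including inside
   longer words. Overlapping matches are counted too. */
size_t count_substring(const char *text, const char *needle)
{
    size_t count = 0;

    if (*needle == '\0')
        return 0;                   /* an empty needle would match everywhere */
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + 1, needle))
        count++;
    return count;
}
```

    For "the there tithe weather" and the search string "the" this returns 4, because every one of those words contains "the".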

  • 9. Re: Count words in a document
    VikingOSX Level 5 (5,500 points)

    The following C code counts the occurrences of a specified word in a document.

     

    The program reads a supplied text file, one line at a time until EOF.

    It loops through each line, and creates word tokens based on a custom delim string.

    For each word token, it performs a string compare against a command-line supplied search string.

    When there is a match, it increments the word count. Otherwise, it scarfs another word.

     

    strtok_r is the re-entrant version of strtok().

     

    It compiles cleanly on Mountain Lion with gcc.

     

    [Screenshot attachment: Screen Shot 2012-11-14 at 11.21.59 PM.png]
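    Because the program was posted as an image, its text is not reproduced here. The approach described above (read a line, split it with strtok_r, compare each token with strcmp) might look roughly like the sketch below. This is not the original program: the delimiter set, the 4096-byte line buffer, and the name count_word_tokens are all assumptions. strtok_r is a POSIX function, hence the feature-test macro at the top.

```c
#define _POSIX_C_SOURCE 200809L     /* for strtok_r on strict compilers */
#include <stdio.h>
#include <string.h>

/* Assumed delimiter set: whitespace plus common punctuation. */
static const char DELIMS[] = " \t\r\n.,;:!?\"'()[]{}<>/-";

/* Counts tokens equal to `word` in an open stream, one line at a time.
   Lines longer than the buffer are split, which can cut a word in two. */
long count_word_tokens(FILE *fp, const char *word)
{
    char line[4096];
    long count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *save = NULL;
        for (char *tok = strtok_r(line, DELIMS, &save);
             tok != NULL;
             tok = strtok_r(NULL, DELIMS, &save)) {
            if (strcmp(tok, word) == 0)     /* exact, case-sensitive match */
                count++;
        }
    }
    return count;
}
```

    A small main could open the file named on the command line, pass the search word to count_word_tokens, and print the result.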

  • 10. Re: Count words in a document
    Keith Barkley Level 5 (5,260 points)

    Of course, plurals and such may need to be handled, too.

  • 11. Re: Count words in a document
    RL6001 Level 1 (0 points)

    Thanks VikingOSX,

     

    It helped me very much.

  • 12. Re: Count words in a document
    VikingOSX Level 5 (5,500 points)

    You are welcome and thanks for the points.

     

    After I posted the above example and logged out, I noticed that I had not removed a line of unnecessary code:

     

    int len=0;

     

    Compilation:

     

    Mountain Lion:  {gcc, llvm-gcc, clang} -O2 -o prog prog.c

     

    Fedora 17:          same gcc syntax as above.

     

    Ubuntu 12.10:     gcc -Wno-unused-result -O2 -o prog prog.c

                               (override the default compiler warning when the fgets result status is not checked)

  • 13. Re: Count words in a document
    Jongware Level 2 (265 points)

    >  It helped me very much.

     

    Copying code helps you? Can you at least identify the problem areas I mentioned? Simple as it may be, VikingOSX's code contains several potential issues.

    ... Alternatively, you could wait for him to address them, of course.

  • 14. Re: Count words in a document
    VikingOSX Level 5 (5,500 points)

    I made no attempt to code enterprise-grade software, or address every potential issue. And, I have no intention of doing so, as there are other demands for my time. No one helped me write that example program from scratch, with the exception of adapting the for statement instructions from strtok_r(3).

     

    The code example that I provided serves multiple purposes:

    1. It solved the OP's original goal (with portable, functional code).
    2. It may be reusable, or extensible, in future development efforts.
    3. It may have instructional value.

     

    Yes, there is value in your contribution as it raises thoughtful questions for requirements decisions and code derivation. If you wanted to truly mentor, then you can take my code, and heavily comment it with accurate, realistic, reasons why it contains several potential issues. Or, provide your own example, similarly commented. Perhaps, it will add even more educational value to the OP as would a map through a minefield.
