Count words in a document

Question

Hello all,

I want to count words in a document. First I made a program that counts letters in a document:

#include <stdio.h>

int main(int argc, const char * argv[])
{
    int count = 0, n = 0, SearchChar = 'H';
    FILE *document = fopen("/Users/mymac/Desktop/test.txt", "r");

    /* fopen returns NULL if the file cannot be opened */
    if (document == NULL) {
        perror("fopen");
        return 1;
    }

    /* read one character at a time until the end of the file */
    while ((n = fgetc(document)) != EOF) {
        if (n == SearchChar) {
            count += 1;
        }
    }

    printf("%d\n", count);
    fclose(document);
    return 0;
}

I have no idea how to count words in a document (as you can see, it is a .txt document). Which function do I need? fgetc can't handle words, but is there a function that can?

RL6001

Posted on Nov 11, 2012 3:09 AM

Answer 1

Jongware

Level 2

265 points

Nov 11, 2012 4:47 AM in response to Community User

What is a word? If you can describe that, it's as simple as this:

1. set counter to 0

2. read one single character. If it is (possibly) a part of a word, keep reading until you find a character that 'ends' this word. Increase counter by 1.

3. If it is *not* a word character, keep reading until you find a character that is not 'not' a word. Go to 1.

So, all that's left is to describe what a 'word character' is and/or what a *word* actually is.

'?' for example is not a word character, and 'R' is. How about '-'? It's not a word character if used as an en-dash -- like this, but is 'computer-generated' one word or two? "and/or", one or two? Is a number a word? If not, is a digit in the middle of a 'word characters' sequence part of that word? (Can't think of a good example :) "born2run" could be 2 words -- arguably even 3. Perhaps "R2-D2"?)

Are you familiar with GREP? Most implementations recognize the code "\w" for 'any word character', but every time I use it I have to allow for numerous exceptions of the kind I list above. On the other hand, it automatically recognizes accented characters (and, in the implementation I'm using, Greek, Cyrillic, Arabic, and Hebrew 'word' characters as well, as it supports the full Unicode set).

For further consideration: if you are playing Scrabble (or possibly "WordFeud") and someone lays down "hdgxyjr", surely you'll complain "but that's not a word"?

Answer 2

Jongware

Nov 11, 2012 4:54 AM in response to Jongware

(g) Posted on Apple's own forum from my iPad, and so I can't edit my post!

Anyway, the "go to step 1" in step 3 should actually be "step 2".
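With that correction, the steps can be sketched in C roughly as follows. This is only a minimal sketch: the function names count_words and is_word_char are made up for illustration, and it assumes that a 'word character' is whatever isalnum() accepts, so "computer-generated" and "and/or" each count as two words. Change is_word_char to change the definition of a word.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: letters and digits are word characters; everything else
 * (spaces, punctuation, '-', '/') ends a word. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Count the words in fp, reading one character at a time. */
static long count_words(FILE *fp)
{
    long count = 0;                     /* step 1: set counter to 0 */
    int c = fgetc(fp);

    while (c != EOF) {
        if (is_word_char(c)) {
            /* step 2: read until a character ends this word */
            while (is_word_char(c))
                c = fgetc(fp);
            count++;
        } else {
            /* step 3: skip until the next word character (or EOF) */
            while (c != EOF && !is_word_char(c))
                c = fgetc(fp);
        }
    }
    return count;
}
```

Call it with a FILE * opened by fopen(), the same way the letter-counting program opens test.txt.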

Answer 3

Nov 11, 2012 12:31 PM in response to Jongware

What do I mean by a word? It's like in the Finder: you type, for example, "Learn", and it finds the PDF file "Learn C on the Mac". And if you type "Learn C on", it should still find the book "Learn C on the Mac" (and of course all the other documents that also have "Learn C on" in the name). The tough part is that fgetc can only read one character at a time. So I can make an NSString containing "Learn", but it will never match a "word" exactly, because fgetc only returns a single character. I hope I'm clear enough so you can help me.

RL6001

Answer 4

Jongware

Nov 11, 2012 1:24 PM in response to Community User

I know what words are, and so do you. But can you program that into your computer?

Right under my question "what is a word" I gave you an algorithm to count words, one character at a time.

Answer 5

Nov 14, 2012 9:21 AM in response to Community User

$ wc -mw < test.txt 😉

Seriously, here is a word count example from K & R, “The C Programming Language.”

Answer 6

Nov 14, 2012 12:21 PM in response to VikingOSX

The last few days I was very busy with everything except programming (I hate days like that, but I'm back now).

@Jongware, sorry, I had only read the first sentence and posted a reply about that without reading the rest. Thanks anyway; it is always good to visualize the problem first.

@VikingOSX It's about one given word. For example, I want to know how many times the word 'VikingOSX' occurs in a document. I don't want to know how many words there are in the text.

RL6001

Answer 7

Nov 14, 2012 1:22 PM in response to Community User

Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

If you didn't want to count the words, maybe you should not have said "I want to count the words in a document".

Answer 8

Jongware

Nov 14, 2012 3:36 PM in response to Keith Barkley

Keith Barkley wrote:

Same problem. Once you isolate the word, you use a string compare to see if they match. If so, count it.

I would suggest a different approach, based on RL6001's initial try:

1. read any character; exit on EOF.

2. Is it not a 'word character'? then go to 1

3. Is it a 'word character'? Then it's the first of a word, and possibly even the first one of your search string. So ...

3a. .. if it's *not* the first character of your search string, skip 'word characters' until you find a not-a-word character; go to 1 (There is no need to store this string.)

3b. .. if it *is* the first character of your search string, keep on reading while the next character is your search string's next character. If you encounter a mismatch, go to 3a (skip the rest of this mis-matching word). If you encounter the end of your search string, read one more character. This one, in turn, should be a not-a-word character. If it is a word character after all, go to 3a. This check is necessary because the input string may be "VikingOSXx", i.e., your string followed by more valid word characters. I guess you wouldn't want to include those in your word count.

3c. If the following character is not a word character, increase your counter by 1 and go to step 1.

You do not need to store the actual word you 'are' reading, which is a good thing for several reasons. Most important, you would need to have an inkling of how long a word could get -- or have some active memory allocation scheme, which I presume is out of your reach at this point 😉

Second, it would still require you to write a 'single word' scanner, which by itself already has to inspect each separate character. As long as you are doing that anyway, you might as well keep track of what word you are reading.

The above assumes you do not want to count your target words inside another word, e.g., when searching for "the", you would not want to count the occurrences inside "there", "tithe", or "weather". But if you do, all it takes is an adjustment to step 3.
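For what it's worth, steps 1 to 3c could be sketched in C along these lines. Treat it as a rough sketch under assumptions, not a finished solution: the names count_word and is_word_char are invented for this example, a 'word character' is assumed to be anything isalnum() accepts, and the comparison is case-sensitive.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: a 'word character' is a letter or a digit. */
static int is_word_char(int c)
{
    return c != EOF && isalnum((unsigned char)c);
}

/* Count whole-word occurrences of target in fp without storing any word. */
static long count_word(FILE *fp, const char *target)
{
    long count = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {            /* step 1 */
        if (!is_word_char(c))                   /* step 2 */
            continue;

        /* step 3: c is the first character of a word */
        size_t i = 0;
        while (target[i] != '\0' && c == (unsigned char)target[i]) {
            i++;                                /* step 3b: still matching */
            c = fgetc(fp);
        }

        if (i > 0 && target[i] == '\0' && !is_word_char(c)) {
            count++;                            /* step 3c: whole word matched */
            continue;
        }

        while (is_word_char(c))                 /* step 3a: skip rest of word */
            c = fgetc(fp);
    }
    return count;
}
```

With the input "the there tithe the." and the target "the", only the first and last words are counted; "there" fails the check after the end of the search string, and "tithe" fails on its second character.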

Answer 9

Nov 14, 2012 8:22 PM in response to Community User

The following C code will find n occurrences of a specified word in a document.

The program reads a supplied text file, one line at a time until EOF.

It loops through each line, and creates word tokens based on a custom delim string.

For each word token, it performs a string compare against a command-line supplied search string.

When there is a match, it increments the word count. Otherwise, it scarfs another word.

strtok_r is the re-entrant version of strtok().

It compiles cleanly on Mountain Lion with gcc.
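A minimal sketch of the approach described above might look like this. It is not the original listing: the function name count_tokens, the 1024-byte buffer, and the delimiter set are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* makes strtok_r visible in strict modes */
#include <stdio.h>
#include <string.h>

#define LINE_BUF_SIZE 1024        /* assumed buffer size; longer lines get split */

/* Count the tokens in fp that are exactly equal to keyword. */
static long count_tokens(FILE *fp, const char *keyword)
{
    /* Assumed delimiter set: whitespace plus common punctuation. */
    static const char delim[] = " \t\r\n.,;:!?\"'()[]{}<>/";
    char line[LINE_BUF_SIZE];
    long count = 0;

    /* Read one line at a time until EOF. */
    while (fgets(line, sizeof line, fp) != NULL) {
        char *save = NULL;        /* strtok_r keeps its position here */
        char *tok;

        /* Split the line into word tokens and compare each one. */
        for (tok = strtok_r(line, delim, &save);
             tok != NULL;
             tok = strtok_r(NULL, delim, &save)) {
            if (strcmp(tok, keyword) == 0)
                count++;
        }
    }
    return count;
}
```

A main() would open the file named on the command line with fopen(), pass it together with the search string to count_tokens(), and print the result.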

Answer 10

Nov 15, 2012 9:03 AM in response to VikingOSX

Of course, plurals and such may need to be handled, too.

Answer 11

Nov 15, 2012 12:31 PM in response to Community User

Thanks VikingOSX,

It helped me very much.

Answer 12

Nov 15, 2012 12:47 PM in response to Community User

You are welcome and thanks for the points.

After I posted the above example and logged out, I noticed that I had not removed a line of unnecessary code:

int len=0;

Compilation:

Mountain Lion: {gcc, llvm-gcc, clang} -O2 -o prog prog.c

Fedora 17: same gcc syntax as above.

Ubuntu 12.10: gcc -Wno-unused-result -O2 -o prog prog.c

(override the default compiler warning about not checking the fgets result status)

Answer 13

Jongware

Nov 15, 2012 3:58 PM in response to Community User

> It helped me very much.

Copying code helps you? Can you at least identify the problem areas I mentioned? Simple as it may be, VikingOSX's code contains several potential issues.

... Alternatively, you could wait for him to address them, of course.

Answer 14

Nov 15, 2012 8:09 PM in response to Jongware

I made no attempt to code enterprise grade software, or address every potential issue. And, I have no intention of doing so, as there are other demands for my time. No one helped me write that example program from scratch, with the exception of adapting the for statement instructions from strtok_r(3).

The code example that I provided serves multiple purposes:

It solved the OP's original goal (with portable, functional code).
It may be reusable, or extensible, in future development efforts.
It may have instructional value.

Yes, there is value in your contribution as it raises thoughtful questions for requirements decisions and code derivation. If you wanted to truly mentor, then you can take my code, and heavily comment it with accurate, realistic, reasons why it contains several potential issues. Or, provide your own example, similarly commented. Perhaps, it will add even more educational value to the OP as would a map through a minefield.

Answer 15

Jongware

Nov 16, 2012 1:49 AM in response to VikingOSX

VikingOSX, rather than writing my own code, good or bad as it may turn out, I was in fact hoping the OP would be able to convert the suggested algorithm into code (or *at least* give it an honest try). That's based upon my personal assessment that the OP is learning how to write code (quite possibly this is even a school assignment), rather than having to perform this task as part of a larger piece of best-selling software that's due to hit the App Store in a few months.

If the OP only needs to count words as a real-world task (unrelated to "programming" as a separate topic), he'd be best off with tried-and-tested code as provided by Apple:

grep -o "the" test.txt | wc -l

There is nothing intrinsically "wrong" with your code, as it demonstrates proper initialization, assignment, looping, and a potentially very useful function I didn't even know existed ("strtok_r"). And it does the job.

As for potential issues, a couple that pop into mind are:

1. buffer length is limited to 80 characters:

Reads characters from stream and stores them as a C string into str until (num-1) characters have been read or either a newline or the end-of-file is reached, whichever happens first.

If a single line of input exceeds 80 characters, it may end in the middle of a word. Simply increasing the buffer length does not solve the problem, it only shifts it forward 😉

2. keyword length is limited to 20. Same sort of problem as above; why a limit of 20 characters? You would be better off with a char *, and use strdup.

In itself there is no reason to create this variable and copy the argument into it; you could either use argv[1] directly, or make keyword a char * and have it point to argv[1].

3. The delim array should consist of *all* not-a-word characters.

4. The code assumes the text file is 8-bit plain ASCII. My suggested algorithm would work with 16- or 32-bit Unicode, and with UTF-8 encoded text, with very minor adjustments. (In case you are wondering: (a) instead of using fgetc(), write a function to read one single character code in your encoding of choice; (b) adjust is-a-word and/or is-not-a-word to work with the full range of available codes in your encoding of choice.)

5. The for loop starting on line 29 could be rewritten for clarity (it took me several readings to guess what happens in there).
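To illustrate point 4(a): a function that reads one character code from UTF-8 text could look roughly like the sketch below. The name read_utf8 is made up for this example, and it deliberately stays simple: it returns -1 both at end of file and on a malformed sequence, and it does not reject overlong forms or surrogates.

```c
#include <stdio.h>

/* Read one UTF-8 encoded code point from fp.
 * Returns the code point, or -1 at EOF or on a malformed sequence.
 * Sketch only: overlong forms and surrogates are not rejected. */
static long read_utf8(FILE *fp)
{
    int c = fgetc(fp);
    long cp;
    int extra;                          /* continuation bytes still expected */

    if (c == EOF)
        return -1;

    if (c < 0x80) {                     /* 0xxxxxxx: plain ASCII */
        cp = c;
        extra = 0;
    } else if ((c & 0xE0) == 0xC0) {    /* 110xxxxx: 2-byte sequence */
        cp = c & 0x1F;
        extra = 1;
    } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx: 3-byte sequence */
        cp = c & 0x0F;
        extra = 2;
    } else if ((c & 0xF8) == 0xF0) {    /* 11110xxx: 4-byte sequence */
        cp = c & 0x07;
        extra = 3;
    } else {
        return -1;                      /* stray continuation or invalid lead byte */
    }

    while (extra-- > 0) {
        c = fgetc(fp);
        if (c == EOF || (c & 0xC0) != 0x80)
            return -1;                  /* truncated or malformed sequence */
        cp = (cp << 6) | (c & 0x3F);    /* append the 6 payload bits */
    }
    return cp;
}
```

The is-a-word test from point 4(b) would then take the returned code point instead of a single byte.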
