I wanted to count the number of words in a string. I thought this would be easy.
So I set out to design an algorithm to simply count the number of spaces in a string. That wasn't a good idea because what if the string starts with a space or ends with a space? What if the string starts with multiple space? what if there are multiple spaces in-betwen words!
Clearly not the way to go.
The cleanest, most concise and optimised approach i could muster was the following:
short STR_WordCount(const char *s) { int wc = 0; for(int i=0;i<strlen(s);i++) { if(isspace(s[i])) continue; wc++; while(!isspace(s[i])) i++; } return wc; }
The basis behind this algorithm is to eat up any spaces(skip) you find as you traverse the string from start to finish and as soon as you find a valid character, increment the word count(wc) and continue traversing the string from that point while you have valid characters. As soon as you come across a space, start eating up (skipping) the characters until the next time you find a valid set of characters - so you can increment the word count again.
So essentially there are two phases the 1st level phase that walks the string and eats up spaces, and as soon as it finds a valid character, the 2nd phase starts which is to traverse the string through the range of valid characters until it finds another space - at which point it repeats the 1st phase. These two phases continue until the entire length of the input string is traversed.
These are my current tests cases to show what this algorithm can deal with:
assert(STR_WordCount(" Terry is a bullterrier with an attitude ") == 7); assert(STR_WordCount("One") == 1); assert(STR_WordCount(" Terry attitude ") == 2); assert(STR_WordCount(" attitude is everything") == 3); assert(STR_WordCount(" ") == 0); assert(STR_WordCount("") == 0); assert(STR_WordCount(" ") == 0);
Its certainly not perfect but it took me a good long while to tune it - more than i thought necessary! There are a few things i can see that could be improved:
- It only deals with whitespace delimiters - that can be easily fixed by passing the desired character as a parameter to STR_WordCount()
- I'm assuming that the word count will be -32,768 to 32,767 and we dont need negative word counts - this can be rectified by using a larger return type such as unsigned int
- I've not tested its performance against other algorithms - which are probably better.
Then I thought about writing a routine to mimic what awk allows you to do - print the nth word in a string(assuming a word is delimited by a space character). It seemed like a natural extension of the word counting problem above which is how I decided to give it a try.
GNU awk allows you to do this:
cat "Francois Pienaar" |awk '{print $2}'
and prints out;
Pienaar
This is also not as easy as I thought, considering the problem with spaces etc. But having solved the problem with spaces with STR_WordCount(), I had a better grasp of string processing.
This will print out the nth word of a string.
void STR_PrintWordN(const char *s, int n) { int wc = 0; for(int i=0;i<strlen(s);i++) { if(isspace(s[i])) continue; wc++; while(!isspace(s[i])&&i<strlen(s)) { if(wc==n) { putchar(s[i]); } i++; } if(wc==n) { return; } }
}
This takes a similar approach but its a bit more tricky. Still have the two phases that 1) eat spaces and 2) traverse the valid portion of the remaining string. The additional logic is to print only that characters that are being traversed as part of the valid portion of the string, while the within the word that the user wants to output(n). I've protected the 2nd phase with an additional check to ensure that we dont traverse past the end of the string - I didn't do that with STR_WordCount().
As an optimisation I can early terminate phase 1 as soon as I've printed all the characters for the word that the user asked for.
I've tested this with the following:
char* names[] = { "Francois Pienaar 6", "James Small 14", "David Campese 11" }; for(int i=0;i<3;i++) { STR_PrintWordN(names[i],2); putchar('\n'); }
Which leads to the following output as you'd expect:
Pienaar
Small
Campese
Its pretty fascinating how tricky this was but it was also actually quite a lot of fun!