Count Number of Occurrences of Each Word (Naive)

This is the exercise to practice using all the “tools” we learned so far in a real life example. I'm calling it "naive", because we will focus on using those tools, and not on the efficiency of the algorithm, which will be somewhat slow in terms of its time complexity.

By the number of various standard functions you need to use, it's probably the most difficult exercise so far in the course, but if you wrote the code for all exercises before this one, you must be able to solve it too!

The task is: given the input that has one word per line, with no punctuation, count how many times each word appears in the input.

Here is the list of everything you will need to remember, with links to the pages where we talk about the specific topic, in case if you need a refresher.

Arrays

We'll need to use arrays to store all the words you know. For now, we'll assume we won't have more than 1000 different words given. In this exercise, we'll use two arrays: one will store the words, and another one the counts.

How do we make an array of words? In the same way as we make an array of integers: a word is a string, that is, a pointer to its first character, so

char *words[1000];

will define an array words of 1000 pointers to chars. Similarly, we define an array for counts.

int counts[1000];

Besides that, we'll need to know how many words we have, so int word_count = 0; will help us remember that.

Reading input word by word

We'll use fgets to read from stdin file by line into a string. Let's assume that each word cannot be longer than 100 characters. So the main loop of your program will look similar to

char word[101]; /* storing a zero character too */

while (fgets(word, 101, stdin)) {
  /* word may contain \n as its last character; you need to remove it */
  /* your last line might be empty, ignore it */
  ...
}

Note the comments I left in the while loop above. Your current word may end with \n, or may not (if it's the last one), so you need to handle that: calculate its length with strlen and check the last character, if it's '\n', change it to 0 to effectively cut the word at that point:

int length = strlen(word);
if (word[length] == '\n') {
  word[length] = 0;
}

If, after this operation, your word is an empty string, which might happen for the very last line of the input, just skip it.

Duplicating a word

Since we know that we have up to 1000 words of length up to 100, we could technically just allocate a huge 2d array to store all those data; we'll need 100,000 bytes which is not a big deal for modern computers. But instead of doing that, we'd rather use strdup to store a copy of each word in the corresponding array cell.

String comparison

Whenever you read a word, you will need to go through all the words you already know, and compare if you word is one of those. You can use strcmp to compare two strings. If you found that your word is equal to words[i], increment counts[i]; otherwise, if it's a new word, save its copy to words[word_count], set counts[word_count] to 1, and increment words_count.

Free the memory when done

Before exiting the program, make sure to call free for all the copies of the words that you strdup'ed, because strdup calls malloc and it's your responsibility to call free at the end.

Ready to write the code?

The input will be one word per line, up to 1000 words, each of them not longer than 100 characters. Your program should output, for each word, the word and its count (separated by a space), one word per line, in any order. For example, for the following input:

test
word
here
is
the
test
word
again
word

the output might be

test 2
word 3
here 1
is 1
the 1
again 1

but the words can come in any order.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  /* Your code here! */
}

Complexity

If we have N words, each of M characters, then the code, as I described above, will:

store the first word;
perform up to 1 string comparison for the second string: it either matches the first string, or not;
perform up to 2 string comparisons for the third string;
...
perform up to N-1 string comparisons for the last string.

The sum of all this in the worst case (all new strings, no matches) will be 1 + 2 + ... + N-1 = N(N-1)/2 as it's an arithmetic sequence. Note that it's quadratic of N.

Each string comparison compares up to M characters, so the resulting time complexity in the worst possible case becomes O(N²M). This is rather slow! You won't see any slowness for 1000 words, but for 100,000 words, you will very much notice it.

On the next page we'll look at another, quite magical, way to make it work faster!