Count Number of Occurrences of Each Word with a Hash Table

The “naive” version of this exercise which asks you to count number of occurrences using an array, is a prerequisite for this exercise, so go solve that one first, if you haven't already.

We discussed hash functions and I showed you a good enough hash function to map strings to numbers:

unsigned int hash(char *s) {
	unsigned int result = 0;
	for ( ; *s; ++s) {
		result = result * 37 + *s;
	}
	return result % N;
}

The idea of the solution which is much faster than the naive one is simple: for each word, we calculate the hash code and save this word in the corresponding element of an array of size N. Then, finding a word is as fast as calculating the hash function: the code will point to the right place immediately.

There is just one problem: collisions. Different words might produce the same hash codes, because there are only N different hash codes available, and chances are high that two or more words will result in the same code. How do we solve this?

Linear probing

The simplest solution of this problem is so straightforward that it feels like magic. If we see a collision, we keep moving forward (resetting to 0 if we reached N–1) until we either find the word we are looking for, or an unused place.

Of course, this can only work if you have enough empty elements in your array, otherwise you will have an infinite loop finding for a place where to put your word. But if your array–we call it hash table–is big enough, more than the number of different words, we'll be fine! And if the hash table is at least twice as large as the number of different words, the search will be so effective that we'll rarely jump more than one word to the right.

Example

Let me show how the linear probing works on a simple example. Let's assume we have 10 cells (N = 10) and ten words, and their hash values, as returned by the hash function defined above, are given in the table.

word     | hash code
---------+----------
    this |         4
      is |         0
      my |         4
sequence |         5
      of |         9
   words |         3
     for |         9
 testing |         2
    hash |         4
   table |         6

Let's imagine that the words are coming in this exact order: "this", "is", "my", "sequence", "of", "words", "for", "testing", "hash", "table"`. This is our initial table, it is empty:

#	value
0
1
2
3
4
5
6
7
8
9

Now, use the buttons to put words into the table, one by one, and see how they are placed according to their hash codes.

Next word:

Its hash code:

Notice how the words spread out, but as the table gets filled, it takes more and more time to find an empty cell for the word. That's why we really want to keep the table not more than 50% full.

How do we look for the word with linear probing? We first assign code = hash(word) and then check if the array element at the index code is empty. If it is, this means that the word does not exist in the table yet, and if we want to store it, we'll store it at this element. But if that element is not empty, we increment the code in a circular fashion: code = (code + 1) % N, and repeat the check for the next element, until we either find an empty element, or our word.

Now, let's redo the number of occurrences exercise, but using a hash table instead of searching the words in array. You will need two arrays; there will be at most 1000 words in this exercise and we want our hash table to be only 50% full, so we'll do

#define N 2000

char* words[N] = { NULL };
int counts[N] = { 0 };

Then, use the hash function defined above, and find the place in the table for each word, using strcmp to see if the given array element contains your word or not, and move to the next element if not. When you are finished, iterate over all elements of the array, and if any element is not NULL, print the word and its count.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  /* Your code here! */
}

Complexity

If we have N words, each of M characters, then the code, for each word, will calculate the value of the hash function for this word (which takes O(M)), and put the word into the array. There might be collisions, but if we always keep the hash table not more than half full, the complexity of the hash search is very close to the constant, which makes the whole complexity O(MN)–quite an improvement from the quadratic O(N²M) that we got for the naive solution!