Hash Functions

We have recently looked at how to calculate number of occurrences of each word in a naive way, using a quadratic algorithm. I want to make a detour and show you a very fun way of doing that much faster. It does not have much to do with the C programming language per se, but it's a useful tool in your toolbox, and also quite a magical thing. After all, being able to find a word in an array of words in a time that (almost) does not depend on a number of words does indeed sound like magic!

We'll start from discussing what a hash function is.

You could've probably heard that term in the context of cryptography: MD5, SHA1, SHA256 and such. Those are, indeed, hash functions, but we are going to look at some easier examples.

A hash function, in general, is any function that converts its input, which in most cases is a string, to a number in the range [0, N). For example, a simple function that adds up values of all the characters in a string, and then returns the result modulo 100. Note that I want to return an unsigned int here:

unsigned int hash(char *s) {
	unsigned int result = 0;
	for ( ; *s; ++s) {  /* do you remember how to iterate a string? */
		result += *s;
	}
	return result % 100;
}

If you are confused with the for loop above, you can go back and refresh what you remember about string iteration.

This hash function will, indeed, map all possible input strings to a number between 0 and 99 inclusive. Try it:

#include <stdio.h>

unsigned int hash(char *s) {
	unsigned int result = 0;
	for ( ; *s; ++s) {  /* do you remember how to iterate a string? */
		result += *s;
	}
	return result % 100;
}

int main() {
	/* array of strings! */
	char *inputs[] = {
		"test",
		"TEST",
		"abcde",
		"abbee",
		"some longer string",
		NULL, /* mark the end to simplify the iteration below */
	};
	int i;

	for (i = 0; inputs[i]; ++i) { /* while inputs[i] is not NULL */
		printf("string: %s, hash value: %u\n", inputs[i], hash(inputs[i]));
  }

  return 0;
}

You can notice that this hash function returned the same result for "abcde" and for "abbee", because the sum of character values in these two strings is equal: it's 97+98+99+100+101 = 495 in the first string and 97+98+98+101+101 = 495 in the second string; after modulo 100, we get the result of 95 in both cases.

Since the number of all possible strings, even if you limit their length, is much larger than 100, it's inevitable that some strings will return the same results. It's completely expected; this situation is called a collision. In general, we want hash functions that distribute their input strings as evenly as possible, and the hash function above is not good at all; but we must understand that for any hash function collisions are possible, and for small values of N, will happen for sure.

Defining a macro

Speaking about N, you could've noticed that I used the literal 100 in the hash function definition. Quite often, you don't want to have such numbers spread out in your code, it makes it harder if you suddenly want to change the limit to a different number.

C is an an old enough language to not have “real” compile time constants, but it allows you to define a macro, asking the compiler to replace all occurrences of some identifier with some value before compilation. If you say

#define N 100

then everywhere in the code N will be replaced with 100 by a macro processor before the compiler even starts reading the code. Note that N is not a variable; it's just being literally replaced with 100.

There must be no semicolon after 100 in the #define line, otherwise N will get replaced with 100; which is definitely not what you want! The macro processor has no idea about your program, it just performs textual substitution.

A better hash function

When I need a simple hash function to convert a string into a number between 0 and N−1, inclusive, I often use the following function:

#define N 100

unsigned int hash(char *s) {
  unsigned int result = 0;
  for ( ; *s; ++s) {
		result = result * 37 + *s;
	}
	return result % N;
}

Here is how this function works on the same set of the test strings:

#include <stdio.h>

#define N 100

unsigned int hash(char *s) {
	unsigned int result = 0;
	for ( ; *s; ++s) {
		result = result * 37 + *s;
	}
	return result % N;
}

int main() {
	/* array of strings! */
	char *inputs[] = {
		"test",
		"TEST",
		"abcde",
		"abbee",
		"some longer string",
		NULL, /* mark the end to simplify the iteration below */
	};
	int i;

	for (i = 0; inputs[i]; ++i) { /* while inputs[i] is not NULL */
		printf("string: %s, hash value: %u\n", inputs[i], hash(inputs[i]));
  }

  return 0;
}

Even though, as we discussed, the collisions will inevitably happen, we can see that the distribution of the results seems a little bit better here.

Note that I don't care about a possibility of result value being overflown, that is, for the long enough string, the calculation might easily exceed 2³²−1. This is one reason why we use the unsigned int type here: it behaves predictably in the case of overflow, simply rolling back to 0.

Okay, now we have this hash function, what do we do with it? You'll see on the next page!