Attila's Blog

About writing your own programming language in C, Go and Swift

The theoretical background of hash maps

Sep 2, 2019 Comments

Categories: aspl
Tags: hash maps

It is time to be able to use variables in ASPL, but at the moment we have a problem that prevents us doing so. Creating and using variables means to store their values somewhere, attach a name to them so we can look them up later. Right now we do not have any means to do this. This is where hash maps can help us. As C does not have hash maps built–in, we will have to implement one by ourselves. Let’s see the theory behind hash maps that we will use later to implement a hash map on our own.

What is a hash map?

Simple, really. It associates a set of keys to a set of values. These key–value pairs are called the entries of the hash map. If you have a key, you can look up the corresponding value. Of course, you can add and remove entries from the table. Adding a new value to an existing key will usually overwrite the previous value.

One thing that makes hash maps incredibly powerful is the fact that regardless of the number of the entries stored in the hash map, looking up a value for a key takes constant time (average case look up, that is).

Okay, but how does it work?

From a very high level you can think of it as something that behaves similarly to a dynamic array. Initially the hash map has a certain number of empty buckets. By adding entries we slowly fill these empty buckets up with values. When we cannot store more (this usually does not mean that we used all available buckets, but that we have reached a predefined limit that is still lower than the capacity of the hash map), we allocate some more space and then we can keep adding new entries.

It’s not that simple though.

Every hash map has a capacity and a count. Capacity is the maximum number of entries that can be stored in the hash map. Count is the number of entries currently stored in the hash map. That means we have up to capacity number of empty buckets in our hash map, and when we add new entries it will be stored in one of those buckets somehow. If we divide the count by the capacity, we get something called the load factor of the hash map. It simply shows us how “full” the hash map is. It is an important number to keep in mind, because the current load factor can and will greatly affect the performance of the hash map. More on this later.

The key of the entry can be anything, given that it supports hashing. Hashing is a process that transforms a piece of hashable data (usually a string, but can be other data as well) into a number. The important rule here is that the same piece of data must be guaranteed to have the exact same hash value if we hash it multiple times. This guarantees that we can uniquely identify, and thus look up our values stored in the hash map. Hashing is done by something called the hashing function, which is an algorithm that will do the mentioned transformation for us.

I mentioned that look ups take constant time. This comes from the fact that (in theory, but not always in practice, as we will see later) every hash value has a well identified unique place in the hash map. That means that a hash value will always be stored in the same empty bucket (in practice it is not always true, and we will make adjustments to this statement). This is achieved by sending the hash value through a special mapping function that will use the capacity of the hash map to calculate the index of the bucket that is used to store the value. This mapping always gives us the same index for the same hash value if we apply it multiple times.

There is one more thing to mention. Given that there is an infinite number of possible keys, you might naturally ask if there is a chance that two or more different keys will have the same hash value?

Unfortunately, the answer is yes most of the time. Although there are so called perfect hash functions, using them is always a tradeoff. Usually they are more complex, and therefore slower than other simpler, but not that secure alternatives. In this context, security means the likelyhood that hash collisions will happen. Some programming languages implement several hashing functions of various speed and security, and it’s up to the developers to decide to use the one that fits best for their needs. When we have the same hash value for multiple different keys we say that these keys collide or that we have a hash collision.

Collision resolution

Okay, how can we prevent hash collisions?

Well, a naive way would be simply to start increasing the capacity. As the index of the bucket selected for a given hash value is in direct relation to the capacity value, it is easy to see that increasing the capacity would make it less likely to have a hash collision. Let’s also remember what we said about the load factor, that it is defined as the number of entries divided by the number of buckets. Combining this with the previous statement means that the smaller the load factor is, the less likely that a collision would occur.

Still, this would decrease the chance of collisions, but it would not prevent them. Also, we cannot just increase the size infinitely. What do we do then?

The simple answer is that it is better to prepare for collisions to happen and try to handle it gracefully when it does.

Remember when I said that the same hash value always gives us the same bucket index? Well this is a big problem in this case because it also means that in case of a hash collision several values are competing for the same bucket. If we let this happen it would mean that one value will replace the other one.

Unless we come up with a clever idea.

There are two main ways to handle collisions. One is called separate chaining, the other is open addressing.

Separate chaining

With this technique we will simply let collisions to happen and also let each bucket to contain multiple values. It is usually implemented by a linked list or similar data structure. The bucket does not contain the value directly, but contains a list of entries. To look an entry up we have to take two steps.

  1. First to find the bucket based on the hash value.
  2. Walk the list until we find the key we were looking for.

By using this technique, our worst case scenario is when every entry collides and stored in the same bucket. In this case the hash map would degrade into a linked list and the look up performance would be O(n). In practice we can prevent this though by carefully managing the load factor.

Maybe the biggest issue with this approach is that it’s not cache friendly. Due to the nature of the linked list the items in the list are not stored continuously in the memory and thus will prevent proper caching in the CPU.

Open addressing (also called closed hashing)

This technique stores all entries inside the available buckets. In case of hash collisions we simply find a different bucket. This is great for keeping the implementation cache friendly and fast. On the other hand, it makes all operations on the hash map more complicated.

Why?

Because when we insert a new entry the target bucket might be already occupied, or the hash map might be full. So there are some extra steps we need to take. The process of finding an available bucket is called probing.

There are lots of probing algorithms, and their significance is huge because it heavily affects the hash map’s performance.

We will pick something that is easy to implement but still gives good performance.

This is called linear probing. We simply start with the original target bucket. If it’s full, we check the next bucket and so on. If we reach the end of the array, we wrap around and continue at the beginning and keep going until we find an empty bucket. This algorithm is cache friendly, but it is prone to clustering. If we have a lots of similar keys with similar hash values we might end up with a lots of colliding buckets in each other’s vicinity. This might increase the look up’s cost dramatically.

As the load factor increases, the performance of linear probing degrades. Around the 80% mark this degradation becomes just too heavy. As a result, a good rule of thumb is to shoot for a 75% maximum load factor. When the hash map is 75% full, we increase the capacity and re–hash the whole map.

Re–hashing & dynamic resizing

When we reached the predefined limit for the hash map (usually 75% of the capacity), we need to increase the capacity. This is done the same way as for dynamic arrays. We allocate a new, bigger continuous memory area, we copy the existing contents over, and finally release the previously allocated area. At the lowest level this is done by the realloc standard library call. This is how we dynamically resize a hash map. In case of a hash map this is not enough though. We also need to re–hash our map.

So what does re–hashing mean, and why is it necessary?

I mentioned earlier that calculating the bucket’s index is directly related to the capacity of the hash map. That means if we change the capacity, we break the internal consistency of the current hash value - bucket index allocation, and we don’t have a choice but to calculate every existing entry’s location in the hash map again to make sure they are properly distributed among the available (and increased number of) buckets.

Hashing functions

So what is a hash function?

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

A good hash function has the following characteristics:

  1. Must be deterministic. The same input always transformed to the same hash value.
  2. Should be uniform. It should produce an evenly–distributed range of outputs. (To minimize clustering, remember?)
  3. Must be fast. As every operation on the hash map needs to calculate the hash value as its first step. If the hash function is slow, the whole hash map is slow. More often than not this means the whole application is slow.

There are lots of different hash functions out there. Every developer is responsible to choose the right one that fits best for their scenario. This means to pick the one that strikes the right balance between being secure and being fast.

For ASPL, we will implement a well–known algorithm called FNV-1a. It’s simple to implement, and provides good performance.

Okay, that’s all for now. It was a lot of theory for one day.

Next time we will put all this theory into practice and implement our own hash map.

Stay tuned. I’ll be back.

Tags: hash maps

← Implementing strings

comments powered by Disqus