A Case Study on Hash Table Implementations
Introduction
In mid-2024, I decided that I wanted to switch career paths away from a "pure" software engineering role and into something more hardware-adjacent. For a long time, I had my sights set on working in the aerospace industry - specifically on flight software. I knew that this was going to be a difficult transition to make, if for no other reason than that it would be difficult to convince recruiters that someone working at a B2B SaaS company was qualified to write software that would run on a satellite. Knowing this, I reasoned that I would need to demonstrate a stellar (pun intended) foundation in computer science fundamentals in my interviews. Part of my preparation was working through chapters and exercises in the famous Introduction to Algorithms textbook by Thomas H. Cormen, Charles E. Leiserson, Ronald Rivest, and Clifford Stein. I also reasoned, however, that a solid understanding of the concepts in this textbook would not be enough, and that I could differentiate myself by reading and comprehending real-world implementations of many of the data structures discussed in the book.
This post documents my notes on various hash table data structures used in practice. In particular, I was interested in learning how data structures such as hash tables are made performant in practice through hardware-aware optimizations. As such, the ultimate goal of this post is to understand and document Swiss Tables, the hash table design used by the Rust standard library.
If you’re curious, after many months my job search was successful and I landed a job at Umbra Space writing flight software for satellites that use Synthetic Aperture Radar (SAR) to provide all-weather imaging services of Earth.
Hash Maps/Sets
Note: for the better part of this article I will combine hash maps and hash sets into a single data structure - the hash map. While the operations the two support are obviously different, the core techniques needed to analyze and optimize them are the same. Thus, for the sake of brevity (and my own sanity), we'll lump their discussion together.
A hash map, or hash table, is a collection which supports the \(INSERT\), \(SEARCH\), and \(DELETE\) operations. The hash in hash map refers to a function which converts arbitrary objects into integers, so that they may be used to index into an array. The term map comes from the mathematical definition of a map, where a function \(f: X \to Y\) associates elements of the domain \(X\) with elements of the codomain \(Y\).
Speaking crudely, a hash map is implemented by using the following four steps:
- Allocate a large array
- Use a hash function which converts objects into numbers
- Store elements in the array at the index given by the output of their hash function
- Have some strategy to resolve collisions
We will dive more into the topic of hash functions in a moment. For now, however, we define an array \(A\) of length \(m\); \(A\) will ultimately store the elements of our hash map. Then, for some set \(U\) which contains all objects that we could potentially store, we define the hash function \(h\) on \(U\) as \(h: U \to \{0, 1, \dots, m-1\}\).
An immediate consequence of this definition of \(h\) is that our hash function cannot be injective when the cardinality of \(U\) is greater than \(m\). By the pigeonhole principle, there exist two distinct elements \(x, y \in U\) such that \(h(x) = h(y)\). Since \(h\) is used to index elements into \(A\), we need some strategy for resolving these collisions.
Collision Resolution by Chaining
A simple first approach to collision resolution is via chaining. The idea is simple - rather than elements of \(A\) being elements in \(U\), elements of \(A\) are actually collections of elements of \(U\). If two elements in \(U\) hash to the same index in \(A\), we store them in the collection associated with that index.
Chaining Using Linked Lists
A common first choice of data structure for chained hash tables is a linked list. Linked lists, which I won't cover in detail here, are a simple data structure in which each node stores a value and a reference to the next node in the list.
Before we dive into our first implementation, let us discuss the runtime analysis of the three hash map operations on a chained hash map backed by linked lists. Assuming our hash function \(h\) is a constant-time operation, \(INSERT\) is also a constant-time operation: we can always insert elements at the beginning of the list. Searching, however, is not performant in the worst case. In the worst case, all elements hash to the same index, and our hash map performs no better than a single linked list. As such, we are left with \(\Theta(n)\) for \(SEARCH\), where \(n\) is the number of elements stored in our hash map. The worst-case runtime for \(DELETE\) depends on the interface we expose. If \(DELETE\) takes the linked list node itself, and the list is doubly linked, then \(DELETE\) becomes a constant-time operation. To visualize this, consider the hash map below:
A
0
1 -> a <-> b <-> c <-> d
..
..
m-1
To perform \(DELETE(c)\), the function is given the node for \(c\) directly. Since \(c\) holds pointers to both \(b\) and \(d\), it can unlink itself from the list in constant time.
In most real-world use cases, however, returning a collection's internal nodes is a poor API, as it breaks encapsulation. If \(DELETE(c)\) instead takes a key and is not preceded by a search, then \(DELETE(c)\) is also \(\Theta(n)\), since we must first find the element in the collection before removing it.
A reference implementation in Rust
Below is a low-tech implementation of a chained hash map in Rust, using the LinkedList from the standard library.
This reference implementation brings a few of the claims made above into focus. First, implementing \(INSERT\) by prepending to the collection offers little real performance benefit, because a linked list cannot by itself guarantee the uniqueness of its keys. Repeated insertions of the same key, for instance, will cause incorrect behavior if the key is later deleted. Consider the following, for example:
let mut map: ChainedHashMap<u64, u64> = ChainedHashMap::new();
map.insert(1, 2);
map.insert(1, 3);
map.delete(1);
map.contains_key(&1); // will yield true if we don't ensure elements of the LL are unique
One option is to make it an invariant that calls to insert guarantee the uniqueness of a key within the associated collection. Another is to iterate over all elements of the collection in calls to delete, so that every element with the given key is removed. For the reference implementation we chose the former, as it also prevents "memory leaks" from repeatedly inserting redundant keys.
Source Code
// Note: requires nightly Rust for LinkedList::remove.
#![feature(linked_list_remove)]

use std::{
    collections::LinkedList,
    fmt::{Debug, Display},
    hash::{DefaultHasher, Hash, Hasher},
};

struct MapElement<K: Hash, V> {
    key: K,
    val: V,
}

struct ChainedHashMap<K: Hash, V> {
    vec: Vec<LinkedList<MapElement<K, V>>>,
}

impl<K: Hash + Eq + Display + Debug, V> ChainedHashMap<K, V> {
    fn new() -> Self {
        // A fixed-size table of 1000 empty chains.
        let mut vec = Vec::with_capacity(1000);
        for _ in 0..1000 {
            vec.push(LinkedList::new());
        }
        Self { vec }
    }

    fn insert(&mut self, key: K, val: V) {
        // Enforce the uniqueness invariant: remove any existing entry first.
        self.delete(&key);
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let hash = hasher.finish();
        let index = hash % self.vec.len() as u64;
        let entry = MapElement { key, val };
        // Constant-time insertion at the front of the chain.
        self.vec[index as usize].push_front(entry);
    }

    fn get(&self, key: K) -> Option<&V> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let hash = hasher.finish();
        let index = hash % self.vec.len() as u64;
        // Linear scan of the chain at this index.
        for element in self.vec[index as usize].iter() {
            if element.key == key {
                return Some(&element.val);
            }
        }
        None
    }

    fn delete(&mut self, key: &K) {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let hash = hasher.finish();
        let index = hash % self.vec.len() as u64;
        let list = &mut self.vec[index as usize];
        if let Some(index) = list.iter().position(|x| x.key == *key) {
            list.remove(index);
        }
    }

    fn contains_key(&self, key: K) -> bool {
        self.get(key).is_some()
    }
}

fn main() {
    let mut map: ChainedHashMap<u64, u64> = ChainedHashMap::new();
    map.insert(1, 2);
    map.insert(2, 3);
    map.insert(1, 10);
    println!("Contains 1: {}", map.contains_key(1));
    println!("Contains 2: {}", map.contains_key(2));
    println!("Contains 3: {}", map.contains_key(3));
    map.delete(&1);
    println!("Contains 1: {}", map.contains_key(1));
}
Case Study: Java’s HashMap
Although many standard libraries don't implement hash tables using chaining, Java's HashMap does. One interesting optimization in Java's implementation is that the bucket collection is not a fixed data structure. Small buckets are plain linked lists. Once a single bucket grows past a threshold number of entries (eight, in OpenJDK), however, the linked list is replaced with a red-black tree, allowing logarithmic-time search within that bucket.
Another interesting aspect of Java's implementation is how it handles insertions. While beginning-of-list insertion is appealing from a complexity standpoint, as described above it is suboptimal when the same key is inserted multiple times. Java solves this problem by always searching the bucket first and replacing the value associated with the key if it is found. See the source code for the implementation.
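To make the strategy concrete, here is a minimal Rust sketch of a self-upgrading bucket. This is my own illustration rather than Java's actual code: OpenJDK uses a red-black tree and applies extra heuristics (such as a minimum table size) before treeifying, whereas this sketch simply upgrades to a BTreeMap once a hypothetical threshold is crossed.

use std::collections::BTreeMap;

// Each bucket starts as a plain vector of entries and is upgraded to an
// ordered tree once it grows past a threshold, bounding worst-case search.
enum Bucket<K: Ord, V> {
    List(Vec<(K, V)>),
    Tree(BTreeMap<K, V>),
}

const TREEIFY_THRESHOLD: usize = 8; // hypothetical; OpenJDK also uses 8

impl<K: Ord, V> Bucket<K, V> {
    fn insert(&mut self, key: K, val: V) {
        match self {
            Bucket::List(entries) => {
                // Search-then-replace, mirroring Java's insertion behavior.
                if let Some(entry) = entries.iter_mut().find(|e| e.0 == key) {
                    entry.1 = val;
                    return;
                }
                entries.push((key, val));
                if entries.len() > TREEIFY_THRESHOLD {
                    // Upgrade: O(n) scans become O(log n) tree lookups.
                    let tree = std::mem::take(entries).into_iter().collect();
                    *self = Bucket::Tree(tree);
                }
            }
            Bucket::Tree(tree) => {
                tree.insert(key, val);
            }
        }
    }
}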
Collision Resolution by Probing
Another approach to implementing a hash map is to use open addressing. Rather than having elements of \(A\) store collections of objects, elements of \(A\) are in fact elements of \(U\). As discussed above, however, we need a mechanism for resolving collisions when they happen.
Suppose that an element \(u \in U\) hashes to \(h(u) = i\), where \(i\) is used to index \(A\). If \(A_i\) is already occupied, then we can simply try to insert \(u\) into \(A_{i+1}\). If \(A_{i+1}\) is occupied we can try again with \(A_{i+2}\), and so on, wrapping around to the start of the array when we run off the end. This strategy is called probing.
One immediate consequence of using open addressing is that our map can now run out of space. Previously, our backing array \(A\) contained \(m\) collections which were unbounded in size. Since \(A\) now stores elements of \(U\) directly, \(A\) cannot store more than \(m\) elements. Formally, we define the ratio of occupied slots to total available slots as the load factor \(\alpha = n / m\), where \(n\) is the number of elements stored in a backing array of length \(m\). For tables using open addressing, \(INSERT\) can no longer succeed once \(\alpha = 1\), i.e. once the table is full. In practice, as we will see, this problem is solved by resizing the underlying array when a predetermined load factor is reached, usually a value around 0.75: a new, larger array is allocated, and we iterate over the elements of the previous array and reinsert them into the new one.
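As a minimal sketch, the resize trigger might look like the following (the function name is my own; the 0.75 threshold mirrors the commonly used value):

// Returns true once the load factor α = n / m reaches 0.75.
// Integer arithmetic sidesteps floating point entirely.
fn should_resize(n: usize, m: usize) -> bool {
    4 * n >= 3 * m
}

fn main() {
    assert!(!should_resize(500, 1000)); // α = 0.50
    assert!(should_resize(750, 1000)); // α = 0.75
}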
Linear Probing
Linear probing implements probing by scanning forward from the hashed index for the first vacant slot. In pseudocode, one might imagine the following:

fn insert(key, val):
    let hash = key.hash()
    let index = hash % table_size
    while table[index] is not empty:
        index = (index + 1) % table_size
    table[index] = (key, val)
Implementation Notes
Again using Rust, we define a new struct called LinearHashMap which implements linear probing to resolve collisions. The first noteworthy difference is that the underlying array must now store map entries directly. A tempting structure to use for this is Rust's Option type. As we'll soon see, however, Option is insufficient because it provides no way to mark an index as deleted. To motivate this, consider the following map.
A
0
1
...
i <C, D>
i+1 <E, F>
i+2 <G, H>
...
m-1
For the sake of this exercise, suppose that \(hash(C) = hash(E) = hash(G) = i\). Then clearly \(E\) and \(G\) have been inserted into \(A\) as a result of linear probing. Now, suppose that a client calls \(DELETE(E)\). This leaves us with the following table:
A
0
1
...
i <C, D>
i+1
i+2 <G, H>
...
m-1
Now, if a subsequent call asks for \(SEARCH(G)\), we will first compute \(hash(G) = i\) and iterate over map elements until we hit an empty slot. That iteration terminates at the now-empty slot \(i+1\), however, implying that the element is not contained when in fact it is stored at \(i+2\). The key observation here is that linear probing creates an implicit collection of elements. Rather than being stored elsewhere - as in a linked list - they are stored at adjacent indices. The presence of an element at an index encodes whether there might be an element at the following index - one might even think of this as a pseudo "next" pointer in a linked list.
As such, we need a way to mark indices as deleted. Deleted slots can be reused by future insertions to avoid leaking space, but they must be treated as occupied by scanning operations. Thus, our final model for \(A\) will look something like the following:
A
0
1
...
i <C, D>
i+1 DELETED
i+2 <G, H>
...
m-1
To implement this, we define a custom enum with the following variants:

enum MapElement<K, V> {
    None,
    Some((K, V)),
    Deleted,
}
Source Code
use std::{
    fmt::{Debug, Display},
    hash::{DefaultHasher, Hash, Hasher},
};

#[derive(PartialEq, Eq)]
enum MapElement<K, V> {
    None,
    Some((K, V)),
    Deleted,
}

struct LinearHashMap<K: Hash, V> {
    vec: Vec<MapElement<K, V>>,
    size: usize,
}

impl<K: Hash + Eq + Display + Debug, V> LinearHashMap<K, V> {
    fn new() -> Self {
        let mut vec = Vec::with_capacity(1000);
        for _ in 0..1000 {
            vec.push(MapElement::None);
        }
        Self { vec, size: 0 }
    }

    fn resize(&mut self) {
        // Double the table and reinsert every live entry; tombstones are dropped.
        let c = self.vec.len() * 2;
        let new_vec = Vec::with_capacity(c);
        let old_vec = std::mem::replace(&mut self.vec, new_vec);
        for _ in 0..c {
            self.vec.push(MapElement::None);
        }
        for e in old_vec.into_iter() {
            if let MapElement::Some((k, v)) = e {
                Self::insert_with_vec(&mut self.vec, k, v);
            }
        }
    }

    // Returns true if a new key was inserted, false if an existing key was replaced.
    fn insert_with_vec(vec: &mut [MapElement<K, V>], key: K, val: V) -> bool {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let start = hasher.finish() as usize;
        // Remember the first tombstone so its slot can be reused.
        let mut first_deleted = None;
        for probe in 0..vec.len() {
            let i = (start + probe) % vec.len();
            match &vec[i] {
                MapElement::None => {
                    // The probe chain ends here, so the key is absent; prefer
                    // reusing a tombstone seen earlier in the chain.
                    vec[first_deleted.unwrap_or(i)] = MapElement::Some((key, val));
                    return true;
                }
                MapElement::Some((k, _)) => {
                    if *k == key {
                        vec[i] = MapElement::Some((key, val));
                        return false;
                    }
                }
                MapElement::Deleted => {
                    if first_deleted.is_none() {
                        first_deleted = Some(i);
                    }
                }
            }
        }
        // No empty slot remains; a tombstone must exist because we resize
        // before the table fills with live entries.
        vec[first_deleted.expect("table unexpectedly full")] = MapElement::Some((key, val));
        true
    }

    fn insert(&mut self, key: K, val: V) {
        // Resize once the load factor reaches 0.75.
        if 4 * self.size >= 3 * self.vec.len() {
            self.resize();
        }
        if Self::insert_with_vec(&mut self.vec, key, val) {
            self.size += 1;
        }
    }

    fn get(&self, key: K) -> Option<&V> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let start = hasher.finish() as usize;
        for probe in 0..self.vec.len() {
            let i = (start + probe) % self.vec.len();
            match &self.vec[i] {
                // An empty slot terminates the probe chain.
                MapElement::None => return None,
                // A tombstone keeps the chain alive; skip over it.
                MapElement::Deleted => {}
                MapElement::Some((k, v)) => {
                    if *k == key {
                        return Some(v);
                    }
                }
            }
        }
        None
    }

    fn delete(&mut self, key: &K) {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let start = hasher.finish() as usize;
        for probe in 0..self.vec.len() {
            let i = (start + probe) % self.vec.len();
            match &self.vec[i] {
                MapElement::None => return,
                MapElement::Deleted => {}
                MapElement::Some((k, _)) => {
                    if *k == *key {
                        // Leave a tombstone so later probe chains stay intact.
                        self.vec[i] = MapElement::Deleted;
                        self.size -= 1;
                        return;
                    }
                }
            }
        }
    }

    fn contains_key(&self, key: K) -> bool {
        self.get(key).is_some()
    }
}

fn main() {
    let mut map: LinearHashMap<u64, u64> = LinearHashMap::new();
    map.insert(1, 2);
    map.insert(2, 3);
    map.insert(1, 10);
    println!("Contains 1: {}", map.contains_key(1));
    println!("Contains 2: {}", map.contains_key(2));
    println!("Contains 3: {}", map.contains_key(3));
    map.delete(&1);
    println!("Contains 1: {}", map.contains_key(1));
}
Double Hashing
Double hashing is an open addressing collision resolution technique where the hash function \(h(k)\) is replaced by \(h(i, k)\), where \(i\) represents the number of times the probe has collided with an existing entry. We build \(h(i, k)\) from two hash functions, \(h_1(k)\) and \(h_2(k)\), as \(h(i, k) = (h_1(k) + i\cdot h_2(k)) \bmod m\).
Astute readers might notice that linear probing is a special case of double hashing where \(h_2(k) = 1\).
One requirement on the secondary hash function, \(h_2(k)\), is that its output should be relatively prime to the underlying table size, \(m\). The reason is that we never want our probe sequence to cycle before visiting every slot; otherwise we might fail to find an empty slot in \(A\) even though one exists. A common way to satisfy this requirement is to let \(m\) be a power of two and force \(h_2(k)\) to be odd. This claim was non-obvious to me at first, so let's write a proof of this fact:
Claim: Let \(x \in \mathbb{N}\) such that \(x = 2^k\) where \(k\in \mathbb{N}\). Then for any odd natural number \(n\), \(n\) and \(x\) are coprime.
Proof: Per the fundamental theorem of arithmetic, every natural number greater than 1 has a unique prime factorization. Since \(x\) is defined as \(2^k\), and 2 is prime, this is the prime factorization of \(x\). Thus, it suffices to show that 2 is not a prime factor of \(n\). For the sake of contradiction, suppose that \(n = 2\cdot \ell\) for some \(\ell \in \mathbb{N}\). This would imply that \(n\) is even, which is false. Therefore, 2 is not a factor of \(n\), and hence \(x\) and \(n\) are coprime. \(\blacksquare\)
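To see the requirement in action, here is a small sketch of a double-hashing probe sequence. The helper is my own (a real implementation would derive h1 and h2 from actual hash functions); it assumes \(m\) is a power of two and forces the step size to be odd, so the sequence visits every slot exactly once.

// Yields the probe sequence h(i, k) = (h1 + i * h2) mod m for i in 0..m.
// Forcing h2 odd makes it coprime with the power-of-two table size m,
// so no slot repeats before all m slots have been tried.
fn probe_sequence(h1: u64, h2: u64, m: u64) -> impl Iterator<Item = u64> {
    let step = h2 | 1; // force the step to be odd
    (0..m).map(move |i| h1.wrapping_add(i.wrapping_mul(step)) % m)
}

fn main() {
    // With m = 8 and step 5, all eight slots appear exactly once.
    let probes: Vec<u64> = probe_sequence(3, 5, 8).collect();
    println!("{:?}", probes); // [3, 0, 5, 2, 7, 4, 1, 6]
}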
Case Study: Swiss Tables
Let’s wrap up our discussion of hash maps with a look at Swiss Tables, a very performant hash map created at Google.
At CppCon in 2017, Google engineer Matt Kulukundis presented the state-of-the-art hash map being rolled out across Google's C++ codebases. Since then, Swiss Tables have become a standard design for modern programming language hash tables. Rust's HashMap, for example, uses a Swiss Table under the hood via the hashbrown crate.
A digression on computer architecture and modern CPU features
To understand what makes Swiss Tables so performant, we need a quick crash course on computer memory architecture and CPU features.
First, a note on computer memory architecture. Generalizing this section is difficult, as different computer architectures have different memory models. In general, however, most systems implement something called a multi-level cache. Lower-numbered caches are physically closer to the CPU and smaller in size. Higher-numbered caches are farther from the CPU, and hence slower, but also larger. Most modern processors, for example, implement L1, L2, and L3 caches on the order of tens of kB, hundreds of kB, and a few MB, respectively. When data is fetched from main memory, it is moved into these caches.
Caching exploits a common property of most computer programs: locality. The most common forms of locality are spatial locality and temporal locality. Described crudely, spatial locality means that once data at some location is accessed, data at nearby locations is likely to be accessed soon. Temporal locality means that data accessed once is likely to be accessed again. Data structures designed around these caching heuristics are likely to be performant in practice.
As a note, poor spatial locality is a commonly cited reason why chained hash structures are avoided in practice. Because entries in the map store collections, the actual underlying data likely lives elsewhere on the heap. Traversing a pointer-based linked list, for instance, is unlikely to produce cache hits. On the contrary, a linear probe over a contiguous array is likely to have excellent cache-hit performance because of spatial locality.
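As a rough illustration of this effect, the toy benchmark below sums the same values stored contiguously in a Vec and behind pointers in a LinkedList. The timings are indicative only (they depend on the allocator, the hardware, and optimization level), but the contiguous traversal typically wins by a wide margin.

use std::collections::LinkedList;
use std::time::Instant;

fn main() {
    let n: u64 = 1_000_000;
    let vec: Vec<u64> = (0..n).collect(); // contiguous storage
    let list: LinkedList<u64> = (0..n).collect(); // one heap node per element

    let t = Instant::now();
    let vec_sum: u64 = vec.iter().sum();
    let vec_time = t.elapsed();

    let t = Instant::now();
    let list_sum: u64 = list.iter().sum();
    let list_time = t.elapsed();

    assert_eq!(vec_sum, list_sum);
    println!("vec: {:?}, linked list: {:?}", vec_time, list_time);
}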
Next, let us briefly discuss a relevant modern CPU feature: SIMD. SIMD stands for Single Instruction, Multiple Data, and it describes a type of parallelism where the same operation is applied to multiple operands at once. Conventionally, computer processors act on scalar values, such as a char, an int, or a float - you can think of a scalar as an object which contains a single piece of information. Consider the following x86 assembly code, for example, which moves 14 and 10 into the eax and ebx registers, respectively, then adds them, storing the result in eax.
mov eax, 14   ; eax = 14
mov ebx, 10   ; ebx = 10
add eax, ebx  ; eax = eax + ebx = 24
SIMD, on the other hand, operates on vectors, which are collections of scalar values. A typical SIMD register is 128 bits wide. This means that if our scalar values are 8 bits long (i.e. a char), we can operate on 16 scalar values in parallel. Suppose, for example, that we had 16 8-bit numbers stored in the 128-bit Rs1 register, and likewise another 16 8-bit numbers stored in the 128-bit Rs2 register. Then we can do a pairwise add of the elements in Rs1 and Rs2 using an ADD8 instruction, like so
ADD8 Rd, Rs1, Rs2
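To make the semantics concrete, here is a plain-Rust scalar emulation of that 16-lane add (the real instruction performs all 16 additions at once; the register and instruction names above are illustrative rather than tied to a specific ISA):

// Emulates a 16-lane, 8-bit SIMD add: each lane is summed independently,
// with no carries crossing lane boundaries (hence wrapping_add).
fn add8(rs1: [u8; 16], rs2: [u8; 16]) -> [u8; 16] {
    let mut rd = [0u8; 16];
    for i in 0..16 {
        rd[i] = rs1[i].wrapping_add(rs2[i]);
    }
    rd
}

fn main() {
    let rs1 = [1u8; 16];
    let rs2 = [2u8; 16];
    assert_eq!(add8(rs1, rs2), [3u8; 16]); // all 16 lanes added pairwise
}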
Swiss Tables: Data Layout
A Swiss Table consists of two primary components - a control table, \(C\), and a slots table, \(S\). \(S\) is like a traditional hash table in that it stores the actual key/value pairs of our map. \(C\) is of particular interest, however, as it contains a compact summary of \(S\) which allows for hardware acceleration during probing.
First, let us define a hash function as usual - \(h(k)\). We note that \(h(k)\) returns a 64-bit integer. Likewise, we denote \(h_7(k)\) as the 7 top-most bits of our \(h(k)\) - more on this later.
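As a small sketch, extracting \(h_7(k)\) under this top-most-bits convention might look like the following (the function name is my own):

// Keep the top 7 bits of a 64-bit hash. The result fits in 0..=127, so its
// high bit is always clear - matching the "occupied" control-byte encoding
// described below.
fn h7(hash: u64) -> u8 {
    (hash >> 57) as u8
}

fn main() {
    let hash: u64 = 0xB4 << 56; // top byte is 0b1011_0100
    assert_eq!(h7(hash), 0b0101_1010); // 0x5A, the tag used in the example below
}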
To understand the layout, let's take a look at a single group. A group is a collection of 16 contiguous elements from \(S\), together with its associated 16 control bytes from \(C\). Elements of \(C\) are 8 bits wide, while elements of \(S\) are sizeof(S) bits. Each element of \(C\) can take on one of three kinds of values:
- 0b1111_1111 - the slot at this index is empty
- 0b1000_0000 - the slot at this index has been deleted
- 0b0XXX_XXXX - the slot at this index is occupied, and the seven X bits contain \(h_7(k)\) for the stored entry
To understand how this table works, let's look at an example group. Note that the indices here are relative to the group, not the table in its entirety.
C S
0 0b1111_1111 NULL
1 0b1111_1111 NULL
2 0b1111_1111 NULL
3 0b1000_0000 XXXX
4 0b1111_1111 NULL
5 0b1111_1111 NULL
6 0b1111_1111 NULL
7 0b1111_1111 NULL
8 0b1000_0000 XXXX
9 0b1111_1111 NULL
10 0b0101_1010 101_1010...1000, "Eggs": "Spam"
11 0b1111_1111 NULL
12 0b0100_0001 100_0001...0001, "Holy": "Hand Grenade"
13 0b1111_1111 NULL
14 0b1111_1111 NULL
15 0b1111_1111 NULL
This group contains two mappings, which we could imagine as being the output of the following calls
map.put("Eggs", "Spam");
map.put("Holy", "Hand Grenade");
Let's say that we wanted to look up the value associated with the key Eggs. In pseudocode, we would do something like the following:
fn get(key):
    let hash = hash(key)
    let group = hash % table_size
    for each element in group.control:
        if element is not deleted or empty:
            if element.h_7 == hash.h_7:
                if group.slots[element.index].hash == hash and group.slots[element.index].key == key:
                    return group.slots[element.index].value
While the code here is relatively straightforward, it seems like a cumbersome way to find elements in our map. The key observation, however, is that much of this searching can be performed in parallel.
First, note that elements of \(C\) are 8-bits long, and that there are 16 of them. It’s no coincidence that 16 8-bit pieces of data perfectly fit into a 128 bit SIMD register. Using just two SIMD instructions, we can extract all indices of \(C\) where the value of \(h_7\) matches that of our given key.
To do this, let's compute two vectors. The first is simply our value of \(h_7(k)\) repeated 16 times. The second is a copy of the group's 16 control bytes. Note that 0x5A (90 in decimal) is the value of \(h_7(k)\) for "Eggs" (see table entry 10), and that the deleted slots at indices 3 and 8 appear as 0x80 (0b1000_0000). Comparing the two vectors element by element gives:
5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A 5A
==
FF FF FF 80 FF FF FF FF 80 FF 5A FF 41 FF FF FF
=
00 00 00 00 00 00 00 00 00 00 FF 00 00 00 00 00
This comparison is performed by the _mm_cmpeq_epi8 intrinsic, which compares the values of the two operands using 8-bit elements, producing 00 in every lane where the elements differ and FF in every lane where they match. We can then collapse this 128-bit vector into a 16-bit mask, one bit per lane, with another intrinsic: _mm_movemask_epi8.
00 00 00 00 00 00 00 00 00 00 FF 00 00 00 00 00
=
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
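Putting the two steps together, here is a scalar sketch of the group match - my own emulation of the _mm_cmpeq_epi8 / _mm_movemask_epi8 sequence, not hashbrown's actual code:

// Compare each control byte against the 7-bit tag, collapsing the results
// into a 16-bit mask with one bit per slot, mirroring the SIMD steps above.
fn group_match(control: [u8; 16], h7: u8) -> u16 {
    let mut mask = 0u16;
    for (i, &byte) in control.iter().enumerate() {
        if byte == h7 {
            mask |= 1 << i; // bit i set => slot i's control byte matches h7
        }
    }
    mask
}

fn main() {
    let mut control = [0xFFu8; 16]; // every slot starts empty
    control[3] = 0x80; // deleted
    control[8] = 0x80; // deleted
    control[10] = 0x5A; // "Eggs"
    control[12] = 0x41; // "Holy"
    assert_eq!(group_match(control, 0x5A), 1 << 10); // only slot 10 matches
}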
Now, we simply iterate over the set bits of our output mask to probe the indices in \(S\) that very likely hold our key.
The success of Swiss Tables is really twofold. First, SIMD intrinsics, as described above, allow 16 slots to be probed in parallel. Second, because the information stored in \(C\) is so dense, it is likely to be resident in the lower-level caches at runtime. In effect, Swiss Tables implement their own form of memory hierarchy by storing the value of \(h_7(k)\) in \(C\): by the time our SIMD instructions produce a set bit, it is very likely that the key stored in \(S\) matches our given key, and we can hint to the compiler that this comparison is likely to evaluate to true (see Rust's hashbrown implementation).