intro
Huffman codes are an intuitive and, in many cases, optimal way of encoding data losslessly. This post covers an implementation of simple Huffman codes in Rust.
Feel free to check out the code.
idea
David Huffman came up with this compression scheme while studying for an exam! It turns out this idea yields the most efficient lossless encoding when symbols are encoded separately. We’ll see an intuitive explanation for that optimality in a bit. But first, how does it work?
Let’s say we have a string of only ASCII characters, “deadbeef”, and we want to compress it so that we use the smallest number of bits without losing any information. Huffman’s idea was to build a tree by repeatedly collapsing symbol (character, in this case) counts. Our initial symbol counts look like:
d | e | a | b | f |
---|---|---|---|---|
2 | 3 | 1 | 1 | 1 |
Then we create a node whose children are the two symbols with the smallest counts. This new node acts as a stand-in symbol with a count equal to the sum of its children’s counts. Our table now looks like this:
d | e | *(a, b) | f |
---|---|---|---|
2 | 3 | 2 | 1 |
*(a, b) represents a node with children corresponding to the characters ‘a’ and ‘b’.
We do this recursively until we end up with a single node. The next step is to combine *(a, b) and f:
d | e | *(*(a, b), f) |
---|---|---|
2 | 3 | 3 |
Now we combine d with e (we could also combine d with the node we just created since it has the same count as e):
*(d, e) | *(*(a, b), f) |
---|---|
5 | 3 |
Lastly we combine the two into a single node:
*(*(d, e), *(*(a, b), f)) |
---|
8 |
From this tree we can create a map from each of the original characters to bits by descending from the root node to leaves. As we descend we will use a 0 when we take a left branch and a 1 when we take a right branch. Let’s build these character codes (bit sequences):
*(d, e) | *(*(a, b), f) |
---|---|
0 | 1 |
Again we do this recursively (I’ll expand both sides of the table simultaneously for brevity):
d | e | *(a, b) | f |
---|---|---|---|
00 | 01 | 10 | 11 |
And finally we arrive at our complete table:
d | e | a | b | f |
---|---|---|---|---|
00 | 01 | 100 | 101 | 11 |
This is our encoding. (Encoding “deadbeef” with this table takes 18 bits, versus 64 bits of raw ASCII.) Inverting the table gives us our decoding:
00 | 01 | 100 | 101 | 11 |
---|---|---|---|---|
d | e | a | b | f |
We decode by taking bigger and bigger initial chunks off of the encoded data and checking if those chunks are in our decoding table. Each of these chunks is called a prefix. Huffman codes are prefix codes because no code is a prefix of another.
optimality and correctness
So intuitively, why is this optimal? We build the tree from the bottom up: the lowest-count symbols are merged first, so they end up deepest in the tree, while the highest-count symbols are merged last and stay near the root. A character’s code length equals its depth in the tree, so frequently used characters get short codes and rarely used characters get long codes, which minimizes the total number of bits.
Correctness for lossless compression means that we don’t make any errors when decoding encoded data (i.e. decoding the encoded data gives back exactly the original data). How could we make errors? Let’s say this was our code table:
d | e | a |
---|---|---|
100 | 10 | 01 |
If we try to encode the string “deed” we end up with the bits 1001010100. As we decode we are using the inverse of this table:
100 | 10 | 01 |
---|---|---|
d | e | a |
So as we look at prefixes of increasing length we end up with a decoding that looks like
- 1001010100
- e 01010100
- ea 010100
- eaa 0100
- eaaa 00
You will notice that we incorrectly decode the first character because the code for ‘e’ is a prefix of the code for ‘d’. As long as no code is a prefix of another and each code is unique (corresponds to only one character), our encoding/decoding will work correctly.
In a Huffman code, every character lives at a leaf (a single-character node), and each code terminates exactly when we reach a leaf; since a leaf has no descendants, no character’s path can extend another’s, so no code can be a prefix of another code. And because we constructed our tree such that each character sits in its own leaf, no two characters correspond to the same code.
code
This started as an exercise in learning Rust, so let’s look at some code. I am going to use the `HashMap` and `BinaryHeap` structs from the standard library. I will also use the `Ordering` enum and the `Ord` trait from `std::cmp`: `BinaryHeap` is a max binary heap, but by reversing the comparison in our `Ord` implementation we can easily get a min heap. Lastly, I am going to use the bit-vec crate to do bitwise operations and concatenations. This crate uses unsigned integers (whose size may be specified) as the underlying storage for the bits.
First let’s get character counts from our data.
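A minimal sketch might look like this (the helper name `char_counts` is just for illustration):

```rust
use std::collections::HashMap;

// Tally how many times each character occurs in the input.
fn char_counts(data: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for c in data.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}
```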
Now let’s build a min heap with all of these characters in it.
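Here’s one way to sketch it. The `Node` and `NodeKind` types are illustrative, not the only way to shape this; the key trick is reversing the comparison in the `Ord` implementation so the standard max-heap pops the smallest count first:

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

// A node in the Huffman tree: a leaf holding a character, or an
// internal node holding two children.
struct Node {
    count: usize,
    kind: NodeKind,
}

enum NodeKind {
    Leaf(char),
    Internal(Box<Node>, Box<Node>),
}

// BinaryHeap pops the greatest element first, so we reverse the
// comparison on `count` to pop the smallest element first instead.
impl Ord for Node {
    fn cmp(&self, other: &Self) -> Ordering {
        other.count.cmp(&self.count)
    }
}

impl PartialOrd for Node {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Node {
    fn eq(&self, other: &Self) -> bool {
        self.count == other.count
    }
}

impl Eq for Node {}

// Seed the heap with one leaf node per distinct character.
fn build_heap(counts: &HashMap<char, usize>) -> BinaryHeap<Node> {
    counts
        .iter()
        .map(|(&c, &count)| Node { count, kind: NodeKind::Leaf(c) })
        .collect()
}
```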
Now let’s build our Huffman binary tree from this heap by popping off the two smallest nodes, attaching them to a parent node, and pushing the parent node back onto the heap.
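Building on the `Node` type sketched above:

```rust
use std::collections::BinaryHeap;

// Merge the two smallest nodes under a new parent until only the
// root of the Huffman tree remains.
fn build_tree(mut heap: BinaryHeap<Node>) -> Node {
    while heap.len() > 1 {
        let left = heap.pop().unwrap();
        let right = heap.pop().unwrap();
        heap.push(Node {
            count: left.count + right.count,
            kind: NodeKind::Internal(Box::new(left), Box::new(right)),
        });
    }
    heap.pop().expect("heap should contain at least one node")
}
```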
Now we can get our codes!
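Descending from the root, we push a `false` (0) for each left branch and a `true` (1) for each right branch, and record the accumulated bits whenever we reach a leaf. Roughly:

```rust
use bit_vec::BitVec;
use std::collections::HashMap;

// Walk the tree, appending 0 for a left branch and 1 for a right
// branch; each leaf's accumulated prefix is its character's code.
fn assign_codes(node: &Node, prefix: BitVec, codes: &mut HashMap<char, BitVec>) {
    match &node.kind {
        NodeKind::Leaf(c) => {
            codes.insert(*c, prefix);
        }
        NodeKind::Internal(left, right) => {
            let mut left_prefix = prefix.clone();
            left_prefix.push(false); // 0: left branch
            assign_codes(left, left_prefix, codes);

            let mut right_prefix = prefix;
            right_prefix.push(true); // 1: right branch
            assign_codes(right, right_prefix, codes);
        }
    }
}
```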
All together in a function this flow is:
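Something like this, reusing the illustrative helpers above (note this sketch doesn’t handle the degenerate input with only one distinct character, which would get an empty code):

```rust
use bit_vec::BitVec;
use std::collections::HashMap;

// counts -> heap -> tree -> code table, end to end.
fn huffman_codes(data: &str) -> HashMap<char, BitVec> {
    let counts = char_counts(data);
    let tree = build_tree(build_heap(&counts));
    let mut codes = HashMap::new();
    assign_codes(&tree, BitVec::new(), &mut codes);
    codes
}
```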
To encode something now all we need to do is map these codes over our data.
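A sketch, assuming every character in the input appears in the code table (which holds when the codes were built from that same input):

```rust
// Encode by concatenating each character's code bits.
fn encode(data: &str, codes: &HashMap<char, BitVec>) -> BitVec {
    let mut bits = BitVec::new();
    for c in data.chars() {
        for bit in codes[&c].iter() {
            bits.push(bit);
        }
    }
    bits
}
```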
To decode we will need to invert our Huffman codes as mentioned.
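bit-vec’s `BitVec` implements `Eq` and `Hash`, so it can serve directly as a `HashMap` key:

```rust
// Flip the (char -> bits) map into (bits -> char) for decoding.
fn invert(codes: &HashMap<char, BitVec>) -> HashMap<BitVec, char> {
    codes.iter().map(|(&c, bits)| (bits.clone(), c)).collect()
}
```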
Then we can decode by scanning over the encoded bits, growing a prefix until it matches an entry in the decoding table.
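A sketch of that loop:

```rust
// Grow a prefix one bit at a time; when it matches a code, emit the
// corresponding character and start a fresh prefix.
fn decode(bits: &BitVec, decoding: &HashMap<BitVec, char>) -> String {
    let mut result = String::new();
    let mut prefix = BitVec::new();
    for bit in bits.iter() {
        prefix.push(bit);
        if let Some(&c) = decoding.get(&prefix) {
            result.push(c);
            prefix = BitVec::new();
        }
    }
    result
}
```

Round-tripping our example through these sketches:

```rust
fn main() {
    let data = "deadbeef";
    let codes = huffman_codes(data);
    let encoded = encode(data, &codes);
    let decoded = decode(&encoded, &invert(&codes));
    assert_eq!(data, decoded);
}
```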
other resources
- Huffman coding is covered well conceptually at GeeksforGeeks.