Huffman Encoding

A method to decode Huffman codes

Huffman coding is a type of optimal prefix encoding that is used for lossless data compression. The Huffman coding algorithm takes as input the frequencies of symbols from a certain alphabet and constructs variable-length prefix codes that minimize the average length of encoded data. It compresses data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output code. To achive compression more common symbols should have shorter codes.
Encoding is a process of converting data written using given alphabet into a binary representation. The process of reconstructing data from binary representation is called decoding.

For introduction to Huffman codes refer to Wikipedia.

Properties of Huffman codes:

  • Prefix free i.e. no code is a prefix of another code
  • Variable-length i.e. different symbols can have codes of different lengths
  • Optimal code length

Obvious decoding

The most obvious method to decode the data encoded using prefix codes is based on traversing the binary tree built from these codes. It is well described for example in this article and in Wikipedia.

At each step the decoding algorithm consumes only one bit of the input by navigating to the left or the right subtree, producing output symbols upon reaching leaf node and jumping back to the root of the tree. This decoding algorithm can be easily implemented:

  var current = root;
  foreach (var bit in GetBitsFrom(encodedMessage))
  {
    current = bit ? current.right : current.left;
    if (IsLeafNode(current))
    {
       yield current.Symbol;
       current = root;
    }
  }
                      

Alternative decoding method

The approach presented here is based on the binary search. On each step, it will decode one input symbol. The algorithm should first preprocess codes. This preprocessing step is done once. Lets assume that each code is less than 32 bits i.e. it will fit in a primitive unsigned integer type. For each code preprocessing step generates two numbers a left-aligned integer code C and a mask M. For example, given the code 11001010111 it should produce the following numbers:

  • 1100 1010 111 - input code, length 11 bits.
  • 1100 1010 1110 0000 0000 0000 0000 0000 - C the left-aligned integer code (32 bits).
  • 1111 1111 1110 0000 0000 0000 0000 0000 - M the mask, 11 significant bits set to 1, the rest are 0.
Next, left-aligned integer codes should be ordered using conventional integer comparison (usigned). After preprocessing we should get the following arrays:
  • Ss - array of symbols, chars.
  • Ls - array of code lengths, usigned integers.
  • Cs - array of left-aligned codes, usigned integers.
  • Ms - array of masks, usigned integers.
Cs[i] is a left-aligned code of a character Ss[i] and Ms[i] its mask.

At the first step, algorithm takes 32 bits from the input and constructs unsigned integer E from them. Next, it performs binary search over Cs using mask for comparison. For a valid input a match should always be found at some position i such that: Cs[i] == (E & Ms[i]). This means a symbol Ss[i] is decoded. Next, Ls[i] bits should be removed from E. Technically this means, E should be right shifted by Ls[i] bits. And finally, the next Ls[i] bits from the input can be added to E. The algorithm then should continue to the binary search step untill all bits from the input are consumed.

The following code implements presented algorithm: VarLenCharEncoding.cs

This methods works for any prefix-free encoding scheme. Instead of usigned integers, other primitive integral types can be used if the longest code fits into it.

Consider the following example. Huffman tree and corresponding symbol codes are given below:

  • a -> 0
  • b -> 10
  • c -> 110
  • d -> 111
Lets see how the first method will proceed while decoding the message baad = [10 0 0 111].
  • Start from the root, node with label 1
  • First bit is one, select right node with label 2
  • Second bit is zero, select left node
  • Selected node is a leaf node, output symbol b
  • jump to the root
  • Next bit is zero, select left node
  • Selected node is a leaf node, output symbol a
  • jump to the root
  • Next bit is again zero, select left node
  • Selected node is a leaf node, output symbol a
  • jump to the root
  • Next bit is one, select right node with label 2
  • Next bit is one, select right node with label 3
  • Next bit is one, select right node
  • Selected node is a leaf node, output symbol d
  • jump to the root
For the second method. Notice that the longest code is 3 bits long. It will fit into an imaginary integral type of 4 bits long (for simplicity). After preprocessing step the following arrays should be computed:
  • Ss = [a, b, c, d]
  • Ls = [1, 2, 3, 3]
  • Cs = [0, 8,12,14] in binary form [0000, 1000, 1100, 1110]
  • Ms = [8,12,14,14] in binary form [1000, 1100, 1110, 1110]
Decoding should work like this:
  • E = 8 (10 0 0), rest = (111)
  • Binary search, start at (left + right)/2 = (0+4)/2 = 2
  • Cs[2] == 12, (E & Ms[2]) == 8, right = 2
    1. Binary search, next (left + right)/2 = (0+2)/2 = 1
      Cs[1] == 8, (E & Ms[1]) == 8, match
      Output b
  • E = (E << Ls[1]) = (8 << 2) = 0
  • E = (E | bit1 << 1 | bit2 ) = 0 | 2 | 1 = 3 (0 0 11), rest = (1)
    1. Binary search, start at (left + right)/2 = (0+4)/2 = 2
      Cs[2] == 12, (E & Ms[2]) == 2, right = 2
      Binary search, next (left + right)/2 = (0+2)/2 = 1
      Cs[1] == 8, (E & Ms[1]) == 0, right = 1
      Binary search, next (left + right)/2 = (0+1)/2 = 0
      Cs[0] == 0, (E & Ms[0]) == 0, match
  • Output a
  • E = (E << Ls[0]) = (3 << 1) = 6
  • E = (E | bit1 ) = 6 | 1 = 7 (0 111), rest = ()
    1. Binary search, start at (left + right)/2 = (0+4)/2 = 2
      Cs[2] == 12, (E & Ms[2]) == 6, right = 2
      Binary search, next (left + right)/2 = (0+2)/2 = 1
      Cs[1] == 8, (E & Ms[1]) == 4, right = 1
      Binary search, next (left + right)/2 = (0+1)/2 = 0
      Cs[0] == 0, (E & Ms[0]) == 0, match
  • Output a
  • E = (E << Ls[0]) = (7 << 1) = 14
  • No more input, E = 14 (111), rest = ()
    1. Binary search, start at (left + right)/2 = (0+4)/2 = 2
      Cs[2] == 12, (E & Ms[2]) == 14, left = 3
      Binary search, next (left + right)/2 = (3+4)/2 = 3
      Cs[3] == 14, (E & Ms[3]) == 14, match
  • Output d
It seems that the second method does more computations, but when carefully implemented it has superior performance.

Last updated: 20.09.2018,
created: 26.12.2017, 18.09.2018.