Stanford EE274: Data Compression | 2023 | Lecture 4 - Huffman Codes
18 Apr 2024
Huffman Coding
- Huffman codes are prefix codes and are the main topic of this lecture.
- Huffman coding is a greedy algorithm that constructs an optimal prefix code for a given set of symbol probabilities.
- The algorithm works by repeatedly merging the two nodes with the smallest probabilities until only one node is left.
- The expected code length of a Huffman code is between the entropy of the source and the entropy plus 1.
- Huffman coding can be implemented efficiently using a heap or priority queue (see the minimal sketch after this list).
- Table-based decoding is a low-level technique used in practical compressors to speed up decoding, especially for large volumes of data such as video and logs.
- Recap of previous lecture:
  - Kraft's inequality: a necessary and sufficient condition on codeword lengths for the existence of a prefix code (Σ_x 2^(-l(x)) ≤ 1).
  - Entropy: the fundamental quantity for lossless compression, H(X) = Σ_x p(x) log2(1/p(x)), which lower-bounds the expected number of bits per symbol.
  - Joint entropy of i.i.d. variables: computed in two steps, using independence (the joint entropy is the sum of the individual entropies) and identical distributions (so the sum equals n·H(X)).
  - KL divergence: a distance-like measure between two probability distributions, greater than or equal to zero, with equality if and only if the distributions are equal.
  - Main theorem of compression: for any prefix code, the expected code length is lower-bounded by the entropy.
  - Shannon codes (codeword lengths ⌈log2(1/p(x))⌉): not optimal, but within one bit of the entropy, and the per-symbol gap can be made arbitrarily small by coding blocks of symbols.
- Quiz review:
  - Entropy of a Bernoulli(p) random variable:
    - H(p) = -p log2(p) - (1-p) log2(1-p)
    - H(0) = H(1) = 0 (a deterministic source needs no bits)
    - H(1/2) = 1 bit (the maximum)
  - KL divergence between two Bernoulli random variables:
    - D(p || q) = p log2(p/q) + (1-p) log2((1-p)/(1-q))
    - The KL divergence between two distributions can be infinite; for Bernoulli sources, D(p || q) = ∞ when p = 1 and q = 0, since q assigns zero probability to an outcome that p does not.
- Shannon codes are not optimal for highly skewed distributions; the expected code length can be much larger than the entropy (a numerical check follows this list).
- Block coding improves the efficiency of Shannon codes by grouping symbols together and assigning shorter codewords to the more probable pairs.
- An optimal prefix code for a given distribution satisfies two conditions:
  - Symbols with higher probabilities receive codewords no longer than those of symbols with lower probabilities.
  - The two longest codewords have the same length (and can be assigned to the two least probable symbols, which motivates Huffman's merging step).
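To make the greedy construction concrete, here is a minimal sketch of the merge loop using Python's heapq module as the priority queue. It is not the lecture's reference implementation; the helper name huffman_code and the sample probabilities are illustrative.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}."""
    # Each heap entry is (probability, tie_breaker, {symbol: partial codeword}).
    # The integer tie_breaker keeps equal-probability entries comparable.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)

    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # smallest probability
        p1, _, right = heapq.heappop(heap)  # second smallest
        # Merge the two subtrees: prefix '0' on one side, '1' on the other.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1

    return heap[0][2]

# Illustrative dyadic distribution, not taken from the lecture.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)     # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len)  # 1.75 bits, equal to the entropy here because the probabilities are dyadic
```

Because each merge is a pop-pop-push on the heap, the whole construction runs in O(n log n) time for n symbols.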
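The quiz formulas and the Shannon-code discussion above can also be checked numerically. The following sketch is illustrative only: the Bernoulli(0.99) source and the helper names binary_entropy and kl_bernoulli are choices made here, not taken from the lecture.

```python
from math import ceil, log2

def binary_entropy(p):
    # H(p) = -p log2(p) - (1-p) log2(1-p), with the convention 0 * log 0 = 0.
    return sum(x * log2(1 / x) for x in (p, 1 - p) if x > 0)

def kl_bernoulli(p, q):
    # D(p || q) = p log2(p/q) + (1-p) log2((1-p)/(1-q)); infinite when q assigns
    # zero probability to an outcome that p does not.
    d = 0.0
    for px, qx in ((p, q), (1 - p, 1 - q)):
        if px == 0:
            continue
        if qx == 0:
            return float("inf")
        d += px * log2(px / qx)
    return d

print(binary_entropy(0.5))     # 1.0 bit, the maximum
print(binary_entropy(0.99))    # ~0.08 bits: a highly skewed source
print(kl_bernoulli(1.0, 0.0))  # inf, the p = 1, q = 0 case from the quiz

# Shannon code lengths l(x) = ceil(log2(1/p(x))) for a Bernoulli(0.99) source:
p = {0: 0.99, 1: 0.01}
per_symbol = sum(p[x] * ceil(log2(1 / p[x])) for x in p)
print(per_symbol)              # ~1.06 bits/symbol, far above the ~0.08-bit entropy

# Coding blocks of two i.i.d. symbols shrinks the per-symbol overhead:
pairs = {(x, y): p[x] * p[y] for x in p for y in p}
per_symbol_blocked = sum(q * ceil(log2(1 / q)) for q in pairs.values()) / 2
print(per_symbol_blocked)      # ~0.56 bits/symbol, closer to the entropy
```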
Practical Considerations and Examples
- Probabilities following the Fibonacci sequence are an adversarial case for Huffman coding, producing long codewords and a maximum tree depth close to the number of symbols (a small demonstration follows this list).
- A practical example of Huffman coding uses the frequencies of letters typed on a keyboard, with commonly used letters receiving shorter codewords.
- Huffman coding was used to compress various files, including a book, a Java library, and an encrypted file.
- The compression rate was about 44% for the book and around 40% for the Java library.
- The encrypted file, however, could not be compressed because encryption algorithms typically output uniformly distributed data, making it appear random to compressors.
- Compressing already-encrypted data is therefore not effective: encryption destroys the data's statistical structure, so the compression rate is essentially one (no size reduction).
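As a rough check of the Fibonacci observation, the huffman_code sketch from earlier in these notes can be run on probabilities proportional to Fibonacci numbers; the alphabet size of 10 is an illustrative choice, and the snippet assumes that sketch is already defined.

```python
# Probabilities proportional to the first 10 Fibonacci numbers.
fib = [1, 1]
while len(fib) < 10:
    fib.append(fib[-1] + fib[-2])
total = sum(fib)
probs = {f"s{i}": f / total for i, f in enumerate(fib)}

code = huffman_code(probs)  # huffman_code: the sketch shown earlier in these notes
print(max(len(c) for c in code.values()))  # 9: the tree degenerates into a chain,
                                           # so the maximum depth is close to the
                                           # number of symbols
```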
Next Lecture
- The next lecture will cover theoretical concepts such as entropy and block coding, exploring their importance and their connection to probability and the law of large numbers.
- Professor Tsachy will lead the next lecture, followed by discussions of practical block and stream codes used in state-of-the-art compressors.