Stanford EE274: Data Compression I 2023 I Lecture 4 - Huffman Codes
18 Apr 2024 (9 months ago)
Huffman Coding
- Huffman codes are prefix codes that will be discussed in the lecture.
- Huffman coding is a greedy algorithm for constructing a prefix code for a given set of probabilities.
- The algorithm works by repeatedly merging the two nodes with the smallest probabilities until only one node is left.
- The expected code length of a Huffman code is between the entropy of the source and the entropy plus 1.
- Huffman coding can be implemented efficiently using a heap or priority queue.
- Table-based decoding is a low-level technique used in compression algorithms, especially for large volumes of data such as videos and logs.
- Recap of previous lecture:
- Kraft's inequality: A necessary and sufficient condition for a prefix code.
- Entropy: A fundamental quantity for lossless compression, defined as the expected number of bits per symbol.
- Joint entropy for IID variables: Can be calculated in two steps using independence and identical distribution.
- K Divergence: A distance measure between two probability distributions, greater than or equal to zero with equality if and only if the distributions are equal.
- Main theorem of compression: For any prefix code, the expected code length is lower bounded by the entropy.
- Shannon codes: Not optimal, but can achieve entropy or get arbitrarily close to it when working with blocks.
- Quiz review:
- Entropy of a Bernoulli random variable:
- H(X) = -p log p - (1-p) log (1-p)
- H(0) = H(1) = 0
- H(1/2) = 1
- K Divergence between two Bernoulli random variables:
- D(p || q) = p log (p/q) + (1-p) log ((1-p)/(1-q))
- The maximum Kullback-Leibler (KL) Divergence between two probability distributions, (P) and (Q), can be infinite when (P=1) and (Q=0).
- Shannon codes are not optimal for highly skewed distributions, resulting in an expected code length that is much larger than the entropy.
- Block coding can improve the efficiency of Shannon codes by grouping symbols together and assigning shorter codewords to more probable pairs.
- The optimal prefix code for a given distribution can be found by satisfying two conditions:
- Symbols with higher probabilities should have shorter or equal length codewords compared to symbols with lower probabilities.
- The two longest codewords should have the same length.
Practical Considerations and Examples
- The Fibonacci sequence is an adverse serial case for Huffman coding, resulting in long code words and a maximum depth close to the number of symbols.
- A practical example of Huffman coding is shown using the frequency of letters on a keyboard, where commonly used letters have shorter code words.
- Huffman coding was used to compress various files, including a book, a Java library, and an encrypted file.
- The compression rate for the book was 44%, and for the Java library, it was also around 40%.
- The encrypted file, however, could not be compressed because encryption algorithms typically output uniformly distributed data, making it appear random to compressors.
- Compressing encrypted data is not effective as encryption destroys the data's structure, resulting in a compression rate of one (no compression).
Next Lecture
- The next lecture will cover theoretical concepts such as entropy and block coding, exploring their importance and interaction with probability and large laws of numbers.
- Professor Saki will lead the next lecture, followed by discussions on practical block and stream codes used in state-of-the-art compressors.