Stanford EE274: Data Compression I 2023 I Lecture 2 - Prefix Free Codes

18 Apr 2024 (12 months ago)

Introduction

The video reviews a quiz from the previous lecture, covering topics such as fixed-length codes, bits per symbol, and the optimality of using larger block sizes for encoding.
The concept of lossless compression is introduced, and two specific categories of codes are mentioned.
An exercise from the previous lecture involving decoding an encoded sequence is reviewed, and the decoding process is explained.

Uniquely decodable codes are defined as codes where no two input sequences of length greater than or equal to 1 are encoded to the same output.
Morse code is not uniquely decodable in the context of modern compression algorithms because it does not have a way to denote the separation between words without using spaces or gaps.
Commonly used letters in a language tend to have shorter code words in uniquely decodable codes, such as Morse code, to optimize the overall message length.

Prefix codes are uniquely decodable, meaning that no two sequences map to the same code.
Prefix codes can be represented as binary trees where each leaf represents a code word.
A decoding algorithm for prefix codes involves starting at the root of the tree and following the branches corresponding to the input symbols until a leaf is reached, at which point the corresponding code word is decoded.
Prefix codes are equivalent to binary trees where all code words are leaves.
Prefix-free codes are good enough for most practical purposes, so uniquely decodable codes that are not prefix-free will not be studied in detail.

A good prefix-free code minimizes the expected length of the codeword while satisfying the prefix condition.
The first property states that if the probability of symbol S1 is greater than the probability of symbol S2, then the length of the codeword for S1 should be smaller than the length of the codeword for S2.
The second property states that the length of the codeword for a symbol X should be roughly equal to the logarithm base 2 of the inverse of the probability of X.

Shannon code is a prefix code that achieves the second property, i.e., the length of the codeword for any symbol X is equal to the logarithm base 2 of the inverse of the probability of X.
To construct a Shannon code, compute the length of the codeword for each symbol, sort the symbols in increasing order of their lengths, and then assign codewords greedily starting from the leaves of the tree.

The video explains how to construct a prefix-free code for a given distribution of symbols.
The construction algorithm involves sorting the symbols in increasing order of length and then assigning codewords to each symbol starting from the shortest symbol.
The codewords are assigned in such a way that no codeword is a prefix of another codeword.
The resulting code is a prefix-free code, which means that it can be uniquely decoded.

The optimality of the constructed code is discussed, and it is shown that the code is not necessarily the best possible code in terms of expected code length.
A proof is provided to show that the construction algorithm always works, i.e., a leaf at any depth is always available without violating the prefix condition.
The proof involves mathematical manipulations and properties of logarithms.

The prefix-free property states that no node in a tree can have a prefix that is also a prefix of another node.
The proof shows that even after removing all nodes that violate the prefix-free property, there is still at least one valid node left.
The total number of leaves at depth LX M +1 minus the summation from 1 to M of the number of leaves not allowed because they are descended from existing code words is greater than or equal to one.
This means that there is at least one leaf left at every step which can be assigned to the next symbol to XM + 1.

The last two topics, Kraft's inequality and the proof of the construction algorithm, will be covered in the next lecture.
Students are encouraged to make examples and trees to help them understand the concepts.