Stanford EE274: Data Compression I 2023 I Lecture 10 - LZ and Universal Compression

18 Apr 2024 (12 months ago)

LZ77 Compression Algorithm

LZ77 is a lossless data compression algorithm that achieves optimal results for any stationary source without prior knowledge of the source distribution.
LZ77 works by dividing an input sequence into three streams: unmatched literals, match length, and match offset.
LZ77 can efficiently encode short matches that occur far apart, saving bits compared to encoding the individual characters.
LZ77 uses a sliding window to find matches within a limited scope, rather than keeping the entire past in memory.
Match finding is the core of an LZ77 encoder, and it involves indexing past occurrences of sequences in a data structure like a hash table.
LZ77 parsing aims to find the best match to reduce literals and improve compression.
LZ77 converts a complex stationary distribution into a simpler IID distribution, allowing the use of entropy coders for efficient compression.

LZ77 Compression Techniques

The greedy approach takes the first match it finds, while the lazy approach skips short matches in the hope of finding longer ones in the future.
Increasing the minimum match length leads to more literals and fewer matches, as short matches are disallowed.
The lazy strategy outperforms the greedy strategy for any minimum match length.

LZ77 Compressor Comparison

Modern LZ77 compressors like Zstandard outperform older ones like gzip due to better algorithms, memory usage, and implementation techniques.
LZ4 is the fastest compressor for both compression and decompression, but it provides less compression compared to other compressors.
Zstandard is a modern alternative to gzip with better compression and decompression speed, but it uses more memory.
For long-term archival purposes, compressors like XZ and bzip2 offer better compression but are slower.
Advanced techniques like context-based coding and dictionary compression can achieve even higher compression ratios but are more resource-intensive.

LZ77 Compression Considerations

Before compressing data, consider if all the data is necessary and if there are compressible parts.
Think about the data access pattern and store data accordingly, such as using cheaper cold storage for less frequently accessed data.
Lossy compression may be necessary for practical purposes, especially for data like videos.
Avoid implementing your own compressor from scratch or buying an expensive license without first understanding the application and requirements.
Understand the application, speed, memory usage, and location of compression and decompression before choosing a compressor.
Consider the command-line interface and library availability of different compressors.
The choice of compressor depends on the homogeneity of the data and the desired compression speed and ratio.
When using Zstandard, it's important to choose the right compression level and parameters for optimal performance.
Reusing the Zstandard context when compressing multiple files can speed up the process.
Dictionary compression is useful for very small files where finding good matches within the file is unlikely.
Zstandard has better compression and speed compared to gzip due to its improved Huffman coding, advanced match-finding strategies, and efficient implementation.
Before creating a custom compressor, consider if the compression gap is significant, if the data has specific characteristics that can be exploited, if it's a closed ecosystem, and if there's enough data or a strong need for it.
Creating a custom compressor involves designing, implementing, testing, and maintaining the compression format, ensuring backward compatibility, and supporting various platforms and architectures.
Domain-specific compressors can be valuable in certain scenarios, such as genomics, but they may limit interoperability.
If considering a domain-specific compressor, it's recommended to collaborate on a project to benefit from expertise and avoid fragmentation.