Mastering Superfast Data Planes: Boosting Cloud Performance for Millions of Packets per Second
23 Sep 2024
- 100 Gbit/s network interfaces are now common and must handle approximately 10 million packets per second. (30s)
- With a 100-nanosecond time budget per packet, a ~3 GHz core has about 300 CPU clock cycles available to process each packet. (38s)
Packet Processing Device
- A simplified packet processing device is presented that matches packet headers, applies rewrite policies, and rewrites specific header sections. (4m19s)
- The process packet function will be called approximately 10 million times per second. (6m15s)
Optimization Techniques
- Using the `inline` keyword, and specifically the `always_inline` attribute in C, can improve performance by eliminating function-call overhead. (7m30s)
- Utilizing vector instructions, such as VPAND for logical AND on 256-bit vectors, can significantly enhance performance by processing multiple data elements simultaneously. (11m1s)
- Intel Intrinsics are functions provided by Intel that make it easier to use low-level vector instructions. (12m30s)
- AVX-512, the next iteration of vector instructions, introduces ternary logic operations (VPTERNLOG), which compute an arbitrary boolean function of three arguments in a single instruction. (13m42s)
Swiss Table Data Structure
- A Swiss table, a data structure developed by Google, splits a hash into two parts: H1 identifies the group and bucket location, while H2, stored in a metadata array, enables direct entry access. (15m53s)
- Packed metadata arrays can be efficiently compared using vector instructions, minimizing entry probing time. (18m21s)
- Using a Swiss table implementation with a similar hash function results in a performance improvement from 400 clock cycles per packet to 300. (19m5s)
Interleaving and Prefetching
- Interleaving involves prefetching memory required for packet processing, minimizing memory stall time by overlapping memory access with the processing of other packets. (21m20s)
- The program writer's understanding of code execution allows for more efficient interleaving compared to relying solely on the execution unit's optimization techniques. (22m49s)
- Instead of processing individual packets, a burst of 20 packets is processed at a time. (23m14s)
- To improve the performance of processing network packets, a technique called prefetching is used to load necessary data into the cache before it's needed. (23m54s)
- Prefetching the metadata array for the Swiss table, which is used for packet lookup, significantly reduces processing time from 300 clock cycles per packet to 80. (23m57s)
Loop Unrolling
- Loop unrolling is another technique that can be used to further optimize packet processing by reducing loop overheads and enabling parallel instruction execution. (26m12s)
- Unrolling reduces the cost from 80 to 65 clock cycles per packet, which is a significant gain at this scale. (29m15s)
Optimization Trade-offs
- While techniques like inlining and loop unrolling can enhance performance, they can also increase code size, potentially leading to more instruction cache misses and reduced performance. (30m30s)
- Excessive prefetching of memory, especially into the small L1 cache, can result in cache eviction, where prefetched data is replaced before being used, negatively impacting performance. (31m11s)
Rust Programming Language
- The Rust programming language's default hashmap implementation utilizes a Swiss table data structure. (35m12s)
Optimization Considerations
- When optimizing code, it is important to consider the trade-off between impact and complexity, with techniques falling into quadrants of easy/low impact, easy/high impact, hard/low impact, and hard/high impact. (35m26s)
- Intel VTune is a powerful profiling tool for identifying memory stalls during performance benchmarking. (39m0s)
- Developers should use both micro-benchmarks for rapid iteration and large-scale performance tests for end-to-end validation to mitigate performance issues. (41m24s)
Programming Language Selection
- When selecting a programming language for a performance-critical project, it's essential to choose a language that provides fine-grained control over optimization, such as Rust, which allows direct access to Intel intrinsics. (42m30s)
Premature Optimization
- Premature optimization should be avoided, and benchmarking should be performed early and continuously throughout the development process to identify actual bottlenecks and prevent wasted effort on unnecessary optimizations. (44m34s)