How Netflix Ensures Highly-Reliable Online Stateful Systems

12 Mar 2024 (over 1 year ago)

Reliable Stateful Systems

Reliable stateful systems ensure successful read and write operations with expected consistency models, within designated latency service level objectives, and always in a reliable state.
Different failure modes in stateful services require different solutions.
Netflix focuses on building reliable stateful servers, pairing them with reliable stateful clients, and designing APIs to use reliability techniques.

Netflix's Techniques for Reliable Stateful Services

Netflix uses a combination of techniques to achieve reliability for its stateful services, including:
- Decoupling the stateful process from the OS kernel.
- Using snapshot restoration to move state between instances quickly.
- Monitoring drives and JVMs for potential problems.
- Limiting the frequency of maintenance operations.
- Using in-place imaging to roll out software changes faster.
Netflix also uses caching to improve the reliability of its stateful services.
- Caches are treated as materialized view engines and are highly reliable.
- Caches are placed in front of services to protect them from load.
- Netflix has developed a technique called total near caching where the source of truth data store is not involved in the read path.
Netflix uses a number of techniques to make its stateful clients reliable, including:
- Signaling service level objectives per namespace and access pattern.
- Hedging requests.
- Using exponential backoff.
- Load unbalancing.
- Setting concurrency limits.
Compression reduces bite scent, adds useful properties like checksumming, and improves reliability by reducing the chances of SLO (Service Level Objective) busters.
SLOs define target and maximum latency objectives for services, and communicate concurrency limits to clients.
SLOs can be tuned based on namespace, client, and observed average latency.
Hedging involves sending multiple requests to different servers to improve reliability and meet SLOs.
Dynamic hedging adjusts the hedging strategy based on whether the client is likely to get a positive result.
Concurrency limiting prevents too much load from going to backend services.
GC-tolerant timers are used to prevent incorrect timeouts caused by garbage collection pauses.
Load balancing strategies like choice of two and weighted choice of N are used to avoid slow servers and improve latency.
Weighted choice of N exploits a priori knowledge about networks in the cloud to route requests to the closest replica.

Resilient Stateful APIs at Netflix

The video discusses techniques for building resilient stateful APIs at Netflix.
The key techniques mentioned are hedging, retrying, and breaking down work into smaller units.
Item potency tokens are used to ensure that mutable APIs are safe to retry.
Different types of item potency tokens are discussed, including client monotonic clocks, regional isolated tokens, and global isolated tokens.
The reliability and consistency trade-offs of different item potency token types are explained.
Real-world examples of how these concepts are applied in Netflix's key-value and time-series services are provided.
The importance of measuring and understanding the behavior of clocks in distributed systems is emphasized.
The video concludes by discussing how these techniques are implemented in Netflix's stateful APIs, including the use of paginated APIs and dynamic concurrency control for scans.
Item potency tokens are used to ensure idempotent writes and avoid duplicate operations.
Large responses are broken down into multiple pages to improve throughput and meet service level objectives (SLOs).

Additional Techniques

Decommissioned instances are detached but not terminated immediately to prevent them from re-entering the fleet.
Pre-flight checks help reject degraded hardware from re-entering the fleet.
Netflix targets an 80/20 split, with 80% of services using abstraction layers and 20% accessing storage engines directly.
Netflix provides a 25-page memo to users who access storage engines directly, outlining best practices for item potency, data store backups, and avoiding data loss.
Netflix offers data store client libraries in all major supported languages.
Netflix encourages users to use APIs with built-in item potency and resiliency techniques.
Netflix sometimes contributes resiliency techniques back to the open-source community.