How Netflix Ensures Highly-Reliable Online Stateful Systems

12 Mar 2024 (9 months ago)
How Netflix Ensures Highly-Reliable Online Stateful Systems

Reliable Stateful Systems

  • Reliable stateful systems ensure successful read and write operations with expected consistency models, within designated latency service level objectives, and always in a reliable state.
  • Different failure modes in stateful services require different solutions.
  • Netflix focuses on building reliable stateful servers, pairing them with reliable stateful clients, and designing APIs to use reliability techniques.

Netflix's Techniques for Reliable Stateful Services

  • Netflix uses a combination of techniques to achieve reliability for its stateful services, including:
    • Decoupling the stateful process from the OS kernel.
    • Using snapshot restoration to move state between instances quickly.
    • Monitoring drives and JVMs for potential problems.
    • Limiting the frequency of maintenance operations.
    • Using in-place imaging to roll out software changes faster.
  • Netflix also uses caching to improve the reliability of its stateful services.
    • Caches are treated as materialized view engines and are highly reliable.
    • Caches are placed in front of services to protect them from load.
    • Netflix has developed a technique called total near caching where the source of truth data store is not involved in the read path.
  • Netflix uses a number of techniques to make its stateful clients reliable, including:
    • Signaling service level objectives per namespace and access pattern.
    • Hedging requests.
    • Using exponential backoff.
    • Load unbalancing.
    • Setting concurrency limits.
  • Compression reduces bite scent, adds useful properties like checksumming, and improves reliability by reducing the chances of SLO (Service Level Objective) busters.
  • SLOs define target and maximum latency objectives for services, and communicate concurrency limits to clients.
  • SLOs can be tuned based on namespace, client, and observed average latency.
  • Hedging involves sending multiple requests to different servers to improve reliability and meet SLOs.
  • Dynamic hedging adjusts the hedging strategy based on whether the client is likely to get a positive result.
  • Concurrency limiting prevents too much load from going to backend services.
  • GC-tolerant timers are used to prevent incorrect timeouts caused by garbage collection pauses.
  • Load balancing strategies like choice of two and weighted choice of N are used to avoid slow servers and improve latency.
  • Weighted choice of N exploits a priori knowledge about networks in the cloud to route requests to the closest replica.

Resilient Stateful APIs at Netflix

  • The video discusses techniques for building resilient stateful APIs at Netflix.
  • The key techniques mentioned are hedging, retrying, and breaking down work into smaller units.
  • Item potency tokens are used to ensure that mutable APIs are safe to retry.
  • Different types of item potency tokens are discussed, including client monotonic clocks, regional isolated tokens, and global isolated tokens.
  • The reliability and consistency trade-offs of different item potency token types are explained.
  • Real-world examples of how these concepts are applied in Netflix's key-value and time-series services are provided.
  • The importance of measuring and understanding the behavior of clocks in distributed systems is emphasized.
  • The video concludes by discussing how these techniques are implemented in Netflix's stateful APIs, including the use of paginated APIs and dynamic concurrency control for scans.
  • Item potency tokens are used to ensure idempotent writes and avoid duplicate operations.
  • Large responses are broken down into multiple pages to improve throughput and meet service level objectives (SLOs).

Additional Techniques

  • Decommissioned instances are detached but not terminated immediately to prevent them from re-entering the fleet.
  • Pre-flight checks help reject degraded hardware from re-entering the fleet.
  • Netflix targets an 80/20 split, with 80% of services using abstraction layers and 20% accessing storage engines directly.
  • Netflix provides a 25-page memo to users who access storage engines directly, outlining best practices for item potency, data store backups, and avoiding data loss.
  • Netflix offers data store client libraries in all major supported languages.
  • Netflix encourages users to use APIs with built-in item potency and resiliency techniques.
  • Netflix sometimes contributes resiliency techniques back to the open-source community.

Overwhelmed by Endless Content?