How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency

10 Feb 2024

System Overview

  • The company sends around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
  • The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
  • Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.

Solution Implementation

  • Introduced a layer of queuing using Apache Kafka, making the system asynchronous.
  • Kafka is a distributed streaming platform that uses topics to logically group messages.
  • Each message in a topic has a numerical ID called an offset that starts at zero and increases over time.
  • Consumers pull messages from Kafka topics and process them.
  • Partitions are numbered logs of messages within a topic that can be consumed independently by multiple instances of a consumer.
  • Subpartition processing is a technique used to process Kafka messages concurrently within each partition in memory, allowing for increased concurrency and flexibility.
  • Created more in-memory queues (subpartitions), keyed by row ID, so updates to the same row are processed in order while updates to different rows proceed concurrently.
  • Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
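The subpartition scheme above can be sketched in Python. This is an illustrative simulation (the names, queue counts, and cap are assumptions, not from the article): messages from one Kafka partition are fanned out to N in-memory subpartition queues, hashing on the row key so same-row updates stay ordered, with a semaphore capping how many messages a consumer instance holds in memory.

```python
import queue
import threading
from collections import defaultdict

NUM_SUBPARTITIONS = 4   # illustrative values, not from the article
MAX_IN_FLIGHT = 8       # cap on messages held in memory per consumer

in_flight = threading.Semaphore(MAX_IN_FLIGHT)
subqueues = [queue.Queue() for _ in range(NUM_SUBPARTITIONS)]
processed = defaultdict(list)
lock = threading.Lock()

def worker(q):
    # Each subpartition is drained by exactly one worker, so messages
    # routed to the same queue (same row) are applied in order.
    while True:
        msg = q.get()
        if msg is None:                      # shutdown sentinel
            return
        row_id, value = msg
        with lock:
            processed[row_id].append(value)  # stand-in for the DB write
        in_flight.release()                  # free a slot once done

threads = [threading.Thread(target=worker, args=(q,)) for q in subqueues]
for t in threads:
    t.start()

# Dispatcher: what a single Kafka consumer loop would do per message.
messages = [(f"row-{i % 3}", i) for i in range(12)]
for row_id, value in messages:
    in_flight.acquire()                      # blocks at the in-memory cap
    subqueues[hash(row_id) % NUM_SUBPARTITIONS].put((row_id, value))

for q in subqueues:
    q.put(None)
for t in threads:
    t.join()

# Per-row order is preserved even though rows ran concurrently.
assert all(vals == sorted(vals) for vals in processed.values())
```

Routing by a hash of the row key is what trades contention for order: the same row never races with itself, while unrelated rows keep all workers busy.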

Issue Identification and Resolution

  • Observed high lag and low CPU usage, contradicting expectations.
  • Implemented centralized logging to gain more observability.
  • Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving constant incompatible updates.
  • Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
  • Updates to the Closely app admin record were skipped, and limits were implemented to prevent customers from linking too many records together.
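The mitigation described above can be sketched as a simple per-row update limit. This is a hypothetical illustration (the counter, threshold, and function names are assumptions): once a single row has absorbed too many updates, further ones are skipped so one hot record cannot dominate the pipeline.

```python
from collections import Counter

LINKED_UPDATE_LIMIT = 5  # illustrative threshold, not from the article

update_counts = Counter()

def should_process(row_id):
    """Return False once a row has exceeded its update limit."""
    update_counts[row_id] += 1
    return update_counts[row_id] <= LINKED_UPDATE_LIMIT

# A burst where one "admin" record dominates, as in the incident:
stream = ["admin"] * 8 + ["user-1", "user-2"]
applied = [r for r in stream if should_process(r)]
# "admin" is capped at the limit; normal rows pass through untouched.
```

A production version would reset counts per time window and alert rather than silently drop, but the core idea is the same: bound the work any single row can generate.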

Lessons Learned

  • Shifting intensive API workloads to asynchronous workers reduces operational burden.
  • Subpartition queuing increases consumer concurrency.
  • Centralized observability is crucial in tracking down issues.
  • Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.
