Delivering Millions of Notifications within Seconds During the Super Bowl

21 Aug 2024

Duolingo's Super Bowl Notification Project

  • Duolingo is the most downloaded language learning app worldwide, with over 31.4 million daily active users and more than 100 courses offered in over 40 languages. (46s)
  • Duolingo's marketing team planned to run a 5-second ad during the Super Bowl (3m51s) and asked for 4 million push notifications to be sent to users within 5 seconds of the ad airing. (5m49s)
  • The engineering goal was to send approximately 4 million notifications in under 5 seconds. The hard deadline was the Super Bowl itself, on February 11th of the following year. (9m28s)

Project Challenges and Solutions

  • The requirements for the project changed multiple times, including the number of notifications, the number of markets, and the target number of devices. (11m9s)
  • Research showed that sending 4 million notifications in 5 seconds (800,000 per second) would be 80 times faster than their typical rate of 10,000 notifications per second. (6m1s)
  • The team decided to focus on what wouldn't change, which was their operating principle of testing first. They decided not to ship anything that couldn't be tested. (11m26s)

Technical Implementation Details

  • Sending four million notifications in five seconds requires a rate of 800,000 messages per second. This was achieved by batching users, with 500 iOS users and 250 Android users per batch. (19m46s)
  • To prevent duplicate messages, a FIFO (first-in-first-out) queue from AWS SQS was used. This queue deduplicates messages based on identifiers and has a five-minute deduplication window. (21m13s)
  • The Super Bowl notification service was given its own dedicated ECS cluster. (20m55s)
  • To ensure sufficient cloud resources were available, an infrastructure event management document was created with AWS. This document included details on scaling resources, cache connection limits, and DynamoDB limits. (20m16s)
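
The rate arithmetic, per-platform batching, and identifier-based deduplication described above can be sketched as follows. This is a minimal illustration, not Duolingo's implementation: the function names, the hash-based identifier scheme, and the in-memory `DedupWindow` class are all assumptions made for the example. `DedupWindow` only imitates the five-minute deduplication behaviour locally and does not call the SQS API.

```python
import hashlib
from itertools import islice

TARGET_NOTIFICATIONS = 4_000_000
WINDOW_SECONDS = 5
# Per-batch user counts quoted in the summary.
BATCH_SIZE = {"ios": 500, "android": 250}


def required_rate(total, seconds):
    """Messages per second needed to send `total` messages in `seconds`."""
    return total / seconds


def make_batches(user_ids, platform):
    """Yield lists of user ids, sized according to the platform's batch size."""
    size = BATCH_SIZE[platform]
    it = iter(user_ids)
    while chunk := list(islice(it, size)):
        yield chunk


def dedup_id(campaign, platform, batch):
    """Hypothetical deterministic identifier: same batch, same id."""
    digest = hashlib.sha256(f"{campaign}:{platform}:".encode())
    for uid in batch:
        digest.update(f"{uid},".encode())
    return digest.hexdigest()


class DedupWindow:
    """In-memory stand-in for a FIFO queue's deduplication window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # message id -> time it was first accepted

    def accept(self, message_id, now):
        """Return False if the id was already accepted within the window."""
        first_seen = self.seen.get(message_id)
        if first_seen is not None and now - first_seen < self.window:
            return False
        self.seen[message_id] = now
        return True
```

With these definitions, `required_rate(TARGET_NOTIFICATIONS, WINDOW_SECONDS)` gives the 800,000 messages per second quoted above, and 1,200 iOS users split into batches of 500, 500, and 200. A retried batch produces the same identifier, so a second enqueue inside the window is rejected rather than delivered twice.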

Testing and Optimization

  • Initial testing of the notification system, using silent notifications (empty payloads), revealed a bottleneck tied to the thread count, caused by contention on Python's Global Interpreter Lock (GIL). Reducing the number of threads mitigated this bottleneck. (23m37s)
  • To address scaling challenges, specifically the time it took to scale both the Super Bowl service and the backend, a dedicated ECS cluster was implemented for the Super Bowl service. This separation eliminated the virtual queue and allowed for faster scaling of both services. (27m15s)
  • In October, a Halloween-themed notification was tested on 1 million users. (28m0s)
  • In November, a "year in review" themed notification was tested. (28m5s)
  • In January, a "welcome back from New Year" message was tested on 4 million users. (28m11s)
  • To minimize risk and cost, testing was conducted with millions of real users during actual campaigns, supplemented by cloud tests to assess infrastructure scalability. (43m29s)
  • Optimization efforts focused on achieving fast notification delivery, with performance monitoring and bottleneck analysis conducted using CloudWatch logs and process time analysis. (45m49s)
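
One way to explore the thread-count bottleneck and the process-time analysis mentioned above is to treat the worker count as a tunable parameter and time each run. The sketch below assumes a hypothetical `send_batch` function that stands in for the real provider call; the summary does not state which thread counts Duolingo tried, so the candidate values are left to the caller.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_batch(batch):
    """Hypothetical stand-in for a push-provider call; returns messages 'sent'."""
    return len(batch)


def run_campaign(batches, max_workers):
    """Send all batches with a bounded thread pool; return (sent, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sent = sum(pool.map(send_batch, batches))
    elapsed = time.perf_counter() - start
    return sent, elapsed


def sweep_thread_counts(batches, candidates):
    """Run the same workload once per candidate thread count."""
    return {workers: run_campaign(batches, workers) for workers in candidates}
```

Calling `sweep_thread_counts(batches, [2, 8, 32])` returns a mapping from thread count to `(sent, elapsed)`, which makes it easy to see whether adding threads still helps. With a real, partly CPU-bound `send_batch`, more threads can run slower because only one thread executes Python bytecode at a time under the GIL.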

Cost and Risk Mitigation

  • Duolingo's system used 7 GB of memory for each task to send out push notifications, leading to high cloud costs. (34m43s)
  • The team discovered that clicking on the notification triggered multiple requests to the backend, multiplying the load. (35m31s)
  • While serverless computing with AWS Lambda was suggested, the team felt more comfortable using their existing container and ECS services. (40m21s)
  • The team was initially concerned about rate limiting from FCM and APNs but reached out to Google and Apple representatives early in the project to avoid issues. (41m48s)
