Thriving through change: A decade of engineering evolution at Spotify
15 Nov 2024
Introduction
- Spotify operates on a large scale with over 600 million users in 180 countries, making it a truly global platform, and has over 2,800 engineers working across 500 teams to deliver exciting and innovative features (1m4s).
- The engineering team makes almost 3,000 production deployments every day, which translates to two changes every minute, 365 days a year, without disrupting the music experience (1m41s).
- Spotify has grown from music to podcasts and audiobooks, catering to unique tastes of its users, and has released audiobooks in multiple countries (2m6s).
- The company has over 190,000 GitHub repositories spread across 2,800 GitHub organizations, and bots making automated changes manage 8.8 million lines of code (2m37s).
- Spotify's path to its current scale involved a huge technological and cultural transformation over the last 10 years, which will be discussed in the presentation (2m49s).
- The presentation will cover the early days at Spotify, cultural shifts, technological shifts, fragmentation, and AI investments, as well as exciting prospects for the future (3m22s).
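The deployment-rate figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes deployments are spread evenly over the day, and it treats "almost 3,000" as exactly 3,000, so the result is an upper bound.

```python
# Figures quoted in the summary above; 3,000 is an upper bound ("almost 3,000").
DEPLOYS_PER_DAY = 3_000
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def deploys_per_minute(deploys_per_day: float) -> float:
    """Average deployment rate, assuming an even spread across the day."""
    return deploys_per_day / MINUTES_PER_DAY


rate = deploys_per_minute(DEPLOYS_PER_DAY)
seconds_between = 60 / rate

print(f"{rate:.2f} deployments per minute")                 # 2.08 deployments per minute
print(f"one deployment every {seconds_between:.1f} seconds")  # one deployment every 28.8 seconds
```

So "two changes every minute" is a fair rounding of roughly one production deployment every 29 seconds.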
Spotify's Culture and Collaboration with GitHub
- Spotify's culture is built around five core values: innovative, playful, collaborative, passionate, and sincere, with a focus on collaboration and adaptability to change (3m56s).
- The company's long-standing collaboration with GitHub is an example of its adaptability to technological shifts, and its technical strategy is defined by its ability to react to these changes (4m26s).
- Software engineering is a team sport that involves people, tools, and processes, and building large-scale applications requires collaboration and defining processes to work together effectively (4m45s).
Early Days and Server DB
- In the early 2010s, Spotify was a small company: 23 SREs managed its data centers, and the product was available in only 28 markets. Servers were named after figures from Greek mythology, and the provisioning of new hardware was coordinated through spreadsheets (5m9s).
- To replace the spreadsheets, a system called Server DB was built to automate importing and exporting server information, give servers human-readable names and purposes, and track hardware (5m40s).
- At that time, there were no fancy interfaces, and everything was driven by a command-line interface (6m11s).
Migration to GitHub
- In 2014, Spotify was using Gerrit to manage its Git repositories, but the industry was shifting towards GitHub, which was becoming the industry standard for version control systems (6m26s).
- Spotify's primary goals for evolving its version control system were to increase code quality through collaborative code review and ensure that every commit was backed by continuous delivery (6m56s).
- A kickoff pilot was done in late 2014, and user surveys showed that engineers wanted to use GitHub, which they were already familiar with from open-source projects (7m36s).
- In May 2015, Spotify successfully shut down its Gerrit instance and fully migrated to GitHub Enterprise Server, allowing engineers to release more often and create more features (8m1s).
Service DB and Growth
- As the company grew, the number of components and services increased, and Spotify needed a better way to keep track of them, leading to the creation of a new service called Service DB (8m32s).
- Service DB was linked to Server DB, allowing engineers to browse services and figure out where they were deployed, and providing a way for engineers to work together more effectively (8m44s).
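The link between Service DB and Server DB described above can be pictured as two small registries joined on hostname. The sketch below is purely illustrative: the class names, fields, and sample data are assumptions made for this example, not Spotify's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Server:
    hostname: str  # e.g. a name taken from Greek mythology
    purpose: str   # human-readable purpose (hypothetical field)


@dataclass
class Service:
    name: str
    owner: str
    hostnames: list[str] = field(default_factory=list)


class Inventory:
    """Toy stand-in for Server DB plus Service DB (names are hypothetical)."""

    def __init__(self) -> None:
        self.servers: dict[str, Server] = {}
        self.services: dict[str, Service] = {}

    def add_server(self, server: Server) -> None:
        self.servers[server.hostname] = server

    def add_service(self, service: Service) -> None:
        # Refuse services that point at servers the inventory does not know.
        missing = [h for h in service.hostnames if h not in self.servers]
        if missing:
            raise KeyError(f"unknown servers: {missing}")
        self.services[service.name] = service

    def where_deployed(self, service_name: str) -> list[Server]:
        """Answer 'where is this service running?'."""
        service = self.services[service_name]
        return [self.servers[h] for h in service.hostnames]

    def services_on(self, hostname: str) -> list[str]:
        """Answer 'what runs on this machine?'."""
        return sorted(
            s.name for s in self.services.values() if hostname in s.hostnames
        )


inv = Inventory()
inv.add_server(Server("hermes", "backend host"))
inv.add_server(Server("athena", "backend host"))
inv.add_service(Service("playlist", "team-a", ["hermes", "athena"]))
inv.add_service(Service("search", "team-b", ["athena"]))

print([s.hostname for s in inv.where_deployed("playlist")])
print(inv.services_on("athena"))
```

Keeping the join key in one place (the hostname) is what lets both questions, "where is this service?" and "what is on this server?", be answered from the same data.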
Transition to the Cloud and Backstage
- As Spotify grew, its engineers transitioned from physical hardware to the cloud, opening up opportunities for automation, and a single click of a button could handle tasks that previously took weeks or months with multiple tickets between developers and operations (9m13s).
- The platform organizations built a new set of tools for developers, but the tools were scattered and difficult to find. At the same time, hypergrowth brought an explosion of new services, with the number of components growing exponentially over time (9m54s).
- The discoverability problem led to the creation of Backstage, a unified experience that started as a small pilot project and became the most widely used developer tool inside Spotify (10m27s).
- Backstage aimed to provide more information about services, such as whether they were running in production, who owned them, and who to page when something went wrong (10m47s).
Backstage Features and Development
- The software catalog, a searchable and browsable list of all software inside Spotify, was created to keep metadata about every piece of software, including ownership and relationships to other software (11m3s).
- The catalog was stored alongside the source code in GitHub, distributing the responsibility of keeping the metadata up to date to the teams managing the source code (11m21s).
- Technical documentation was also added, built on top of markdown files stored in the git repository and linked to Backstage, providing a dedicated space for each service in the catalog (11m49s).
- Each service had an overview page with general information, browsing documentation, and additional fields for build history, capacity management, deploy status, and traffic routing (12m0s).
- The system was linked together with GitHub at the heart, containing all the data in source repositories, and an API for configurability using pull requests (12m39s).
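The catalog idea above, where each repository carries its own metadata and the catalog is assembled from those records, can be sketched as follows. The record layout, field names, and sample components are assumptions for illustration only and do not reproduce Backstage's real descriptor format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    owner: str
    lifecycle: str
    depends_on: tuple[str, ...] = ()


# Hypothetical: one metadata record per repository, kept next to the code
# and maintained by the team that owns that repository.
REPO_METADATA = {
    "repo-playlist": {
        "name": "playlist-api",
        "owner": "team-playlists",
        "lifecycle": "production",
        "depends_on": ["user-store"],
    },
    "repo-radio": {
        "name": "radio-api",
        "owner": "team-radio",
        "lifecycle": "experimental",
        "depends_on": ["playlist-api", "user-store"],
    },
    "repo-users": {
        "name": "user-store",
        "owner": "team-playlists",
        "lifecycle": "production",
    },
}


def build_catalog(repo_metadata: dict[str, dict]) -> dict[str, Component]:
    """Assemble a searchable catalog from per-repository records."""
    catalog: dict[str, Component] = {}
    for meta in repo_metadata.values():
        component = Component(
            name=meta["name"],
            owner=meta["owner"],
            lifecycle=meta.get("lifecycle", "unknown"),
            depends_on=tuple(meta.get("depends_on", [])),
        )
        catalog[component.name] = component
    return catalog


def owned_by(catalog: dict[str, Component], owner: str) -> list[str]:
    return sorted(c.name for c in catalog.values() if c.owner == owner)


def dependents_of(catalog: dict[str, Component], name: str) -> list[str]:
    """Which components would be affected if `name` broke?"""
    return sorted(c.name for c in catalog.values() if name in c.depends_on)


catalog = build_catalog(REPO_METADATA)
print(owned_by(catalog, "team-playlists"))
print(dependents_of(catalog, "user-store"))
```

Because the records live with the source code, a change to ownership or dependencies goes through the same pull-request review as any other change.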
Addressing Developer Experience Challenges
- The focus on solving discoverability problems over the years led to the creation of a comprehensive system, and the discussion will now be handed over to Sanjana to talk about other developer experience problems, such as fragmentation (12m53s).
- Spotify started investing in developer experience about eight years ago due to troubling trends, including slower development and shipping of new features, and harder onboarding of new developers (13m4s).
- Spotify experienced years of hyper-growth, with the number of developers quadrupling every year. The result was a huge and chaotic organization in which it was hard to keep track of who owned what, and developer productivity suffered (13m24s).
- To address this issue, an intense look was taken at the day-to-day life of developers to understand what was making their lives difficult and find a solution to unblock faster development (13m59s).
- Two major problems were identified: a scattered software ecosystem made it hard to do things the right way quickly, and hyper-growth brought a heavy burden of maintenance toil, including managing bills, deployments, security audits, and breaking changes in upstream dependencies (14m16s).
- A developer portal was created to make it easier to find things, but it was clear that more needed to be done to centralize and manage the software life cycle and take away infrastructure pain from developers (15m28s).
Tech Health Strategy and Golden Paths
- The software ecosystem was extremely fragmented, with every team building and deploying services differently, making it hard to automate (15m48s).
- A Tech Health strategy was cultivated to enable developers to become productive and efficient, including the creation of "Golden paths," a gold standard of how to develop software at Spotify, which documented best practices and were used as tutorials for new hires (16m13s).
- The first Golden path was created for backend development and covered how to set up an environment and deploy a backend application. It was initially 21 pages long and was later condensed while covering more ground (16m37s).
- "Golden Technologies" were introduced, a standard set of languages, infrastructure, and frameworks expected to be used for all production components at Spotify, allowing developers to focus on end-user features (17m23s).
- The concept of "Golden state" was also introduced, the desired state of where Spotify wants all its software components to be, a target state in which the tech stack can be easily managed and kept up to date (17m48s).
- To keep up with the ever-evolving technology landscape, Spotify adopted a simple yet effective tech strategy: follow the "Golden Path", using "Golden Technologies", to reach the "Golden State" (18m12s).
- To make it easier to adhere to best practices, software templates were created, allowing developers to spin up their own services with just a click of a button in Backstage, setting up their entire service with all configurations and features already set up and ready to go (18m30s).
- This standardization allowed for the automation of some infrastructure toil, making it easier to create new software and ensuring that it adhered to current standards by default (19m19s).
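The software-template idea above can be sketched as a function that renders a standard set of files for a new service. Everything here is hypothetical: the file names, the template contents, and the list of approved languages are placeholders, and the sketch returns the files in memory rather than creating a repository as a real scaffolder would.

```python
from __future__ import annotations

from string import Template

# Hypothetical stand-in for a set of "Golden Technologies".
APPROVED_LANGUAGES = {"java", "python"}

# Hypothetical template: path -> file body with $placeholders.
SERVICE_TEMPLATE = {
    "README.md": "# $name\n\nOwned by $owner.\n",
    "service.cfg": "name=$name\nowner=$owner\nlanguage=$language\n",
    "ci.cfg": "pipeline=standard-$language\n",
}


def scaffold(name: str, owner: str, language: str) -> dict[str, str]:
    """Render every template file for a new service.

    Rejecting non-approved languages is how the template keeps new
    services on the standard stack by default.
    """
    if language not in APPROVED_LANGUAGES:
        raise ValueError(f"{language!r} is not an approved language")
    values = {"name": name, "owner": owner, "language": language}
    return {
        path: Template(body).substitute(values)
        for path, body in SERVICE_TEMPLATE.items()
    }


files = scaffold("radio-api", "team-radio", "java")
for path, body in files.items():
    print(f"--- {path}\n{body}")
```

Since every new service starts from the same rendered files, later automation can rely on those files existing and having a known shape.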
Fleet Management and Automation
- A dependency graph of all services in production at Spotify was used to visualize the complexity of the system, with every edge being a dependency and every node being a service (19m50s).
- To automate away infrastructure toil, a new program called Fleet Management was introduced, which is split into three pieces: making sure everything is versioned and codified, making it easy to make changes at the click of a button, and doing this at scale across hundreds of thousands of repositories (20m20s).
- Fleet Management was able to address all three parts of the problem, resulting in huge savings in infrastructure toil and allowing developers to focus on building end-user features (21m12s).
- An example of the impact of Fleet Management is the resolution of the Log4j vulnerability, where 80% of the fleet was secured in just 9 hours, and the entire software ecosystem was completely secure within 8 days (21m56s).
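A fleet-wide change like the Log4j fix above has three parts: find the affected repositories, apply the same change to each, and track how much of the fleet is covered. The sketch below illustrates that shape only. The repository names, pinned versions, and helper functions are invented for the example, and updating an in-memory dictionary stands in for the automated pull requests a real system would open.

```python
from __future__ import annotations

# Hypothetical fleet: repository -> pinned dependency versions.
FLEET = {
    "playlist-api": {"log4j-core": "2.14.1", "guava": "31.0"},
    "search-api": {"log4j-core": "2.17.1"},
    "radio-api": {"log4j-core": "2.14.0"},
    "web-player": {"react": "17.0.2"},
}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '2.14.1' into (2, 14, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def plan_bumps(fleet: dict, dependency: str, minimum: str) -> dict:
    """Return {repo: (old, new)} for repos pinned below `minimum`."""
    floor = parse_version(minimum)
    return {
        repo: (deps[dependency], minimum)
        for repo, deps in fleet.items()
        if dependency in deps and parse_version(deps[dependency]) < floor
    }


def apply_bumps(fleet: dict, dependency: str, plan: dict) -> dict:
    """Return an updated copy of the fleet; the input is left untouched."""
    updated = {repo: dict(deps) for repo, deps in fleet.items()}
    for repo, (_old, new) in plan.items():
        updated[repo][dependency] = new
    return updated


def coverage(fleet: dict, dependency: str, minimum: str) -> float:
    """Share of repos using `dependency` that are at or above `minimum`."""
    floor = parse_version(minimum)
    pinned = [deps[dependency] for deps in fleet.values() if dependency in deps]
    if not pinned:
        return 1.0
    return sum(parse_version(v) >= floor for v in pinned) / len(pinned)


plan = plan_bumps(FLEET, "log4j-core", "2.17.1")
patched = apply_bumps(FLEET, "log4j-core", plan)

print(plan)
print(f"coverage before: {coverage(FLEET, 'log4j-core', '2.17.1'):.0%}")
print(f"coverage after:  {coverage(patched, 'log4j-core', '2.17.1'):.0%}")
```

Separating the plan from its application makes the change reviewable before anything is modified, and the coverage figure is the kind of number behind a statement such as "80% of the fleet in 9 hours".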
Measuring the Impact and Results
- The effects of these systems were measured, and the change was astounding, with a significant reduction in the time for a new hire to create their 10th pull request (22m33s).
- New hires at Spotify can now contribute to the code in 20 days, down from 60 days initially, due to improvements in finding information, documentation, and standardizing technology, resulting in a significant return on investment that compounds every year as new developers are hired (22m56s).
- Existing engineers have seen a 25% reduction in high-urgency incidents and a 56% reduction in local build times, while Fleet Management has reduced the time needed to ship a change across the fleet from 200 days to less than a week (23m46s).
- The reliance on bots and tight integration with GitHub has enabled zero-touch code changes, with bots creating and committing three times more code changes than humans, leading to faster feature delivery to end-users (24m29s).
- Developers who frequently use golden paths, Fleet Management, and other preset tools built into Backstage are 2.3 times more active in GitHub, make twice as many code changes with 17% less cycle time, and deploy twice as often, with their deployments staying deployed three times as long (24m59s).
AI Adoption and Feature Launches
- Adoption of AI tools such as GitHub Copilot has risen from 13% to 56% of engineers, an increase of 43 percentage points, with Copilot being used in daily workflows (27m25s).
- Spotify's investment in developer experience has enabled the launch of features such as Audiobooks and AI DJ, with GitHub being at the center of these changes (26m42s).
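The adoption figures above are easy to misread, because a percentage-point change and a relative change are different quantities. A minimal sketch of both calculations, using only the 13% and 56% quoted in the summary:

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Growth relative to the starting value (1.0 means +100%)."""
    return (after_pct - before_pct) / before_pct


before, after = 13.0, 56.0
points = percentage_point_change(before, after)
relative = relative_change(before, after)

print(f"{points:.0f} percentage points")     # 43 percentage points
print(f"{relative:.0%} relative increase")   # 331% relative increase
print(f"{after / before:.1f}x the original share")  # 4.3x the original share
```

The share of engineers using the tools rose by 43 percentage points, which is the same as saying it grew to about 4.3 times its starting level.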
Streamlining GitHub Copilot Access
- Spotify operates at a large scale with only a small team managing GitHub, so it needed to automate the process of granting users access to GitHub Copilot, a tool that assists with coding tasks, rather than handle requests manually (27m42s).
- To achieve this, Spotify worked with GitHub to design a streamlined process that allows developers to opt in to internal tools using Backstage with a single click, supported by their underlying identity provider system (28m5s).
- GitHub makes it easy to link the identity provider system to GitHub organizations, so users automatically get access to Copilot when they are added to a GitHub organization (28m31s).
- With these two systems combined, Spotify sent an announcement to its mailing group inviting engineers to self-onboard at their own pace by following documentation written in TechDocs, which granted them automatic access to Copilot (28m50s).
- Over a thousand engineers signed up and gained access to Copilot within the first two weeks, with almost no handholding or marketing required (29m9s).
- Spotify thanks GitHub Professional Services for helping them build a streamlined process that was easy to follow (29m25s).
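The self-service flow above chains three steps: a one-click opt-in adds the user to an identity-provider group, that group is synced to a GitHub organization, and organization membership grants access. The sketch below simulates that chain in memory. It calls no real GitHub or identity-provider API, and the group name, class, and functions are all hypothetical.

```python
from __future__ import annotations


class IdentityProvider:
    """In-memory stand-in for an identity provider's group directory."""

    def __init__(self) -> None:
        self._groups: dict[str, set[str]] = {}

    def add_to_group(self, group: str, user: str) -> None:
        self._groups.setdefault(group, set()).add(user)

    def members(self, group: str) -> set[str]:
        return set(self._groups.get(group, set()))


OPT_IN_GROUP = "ai-tools-opt-in"  # hypothetical group name


def opt_in(idp: IdentityProvider, user: str) -> None:
    """What the single click would do: join the opt-in group."""
    idp.add_to_group(OPT_IN_GROUP, user)


def sync_organization(idp: IdentityProvider) -> set[str]:
    """Stand-in for the group-to-organization sync: membership mirrors the group."""
    return idp.members(OPT_IN_GROUP)


def has_access(org_members: set[str], user: str) -> bool:
    """In this sketch, organization membership is the only access check."""
    return user in org_members


idp = IdentityProvider()
opt_in(idp, "alice")
opt_in(idp, "bob")
opt_in(idp, "alice")  # opting in twice has no further effect

org_members = sync_organization(idp)
print(sorted(org_members))
print(has_access(org_members, "alice"), has_access(org_members, "carol"))
```

Because access is derived from group membership rather than granted by hand, the small team managing GitHub never has to process individual requests.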
Spotify Portal for Backstage and Future Outlook
- The company has been working towards improving their developer experience over the last 10 years, continually evolving and innovating (29m49s).
- Spotify has recently launched Spotify Portal for Backstage, a cloud offering that packages up their learning and experience, allowing users to sign up and get started easily (30m13s).
- This cloud offering comes with integrations to GitHub APIs built-in, ensuring developers can focus on delivering end-user features without worrying about configurations, updates, or maintenance (30m34s).
- The portal is accessible in the cloud, and users can learn more by visiting backstage.spotify.com (30m59s).
- Looking to the future, Spotify is excited to evaluate GitHub Enterprise Cloud with data residency to bring new features to developers in real-time, improving their productivity and happiness (31m19s).
- Some benefits Spotify is interested in include real-time access to newly released features, zero downtime for upgrades, and isolated company space compared to shared space on github.com (31m39s).
Final Thoughts and Conclusion
- Spotify has undergone significant engineering evolution over the past decade, with a focus on proper planning, testing, and security measures before implementing changes (32m10s).
- The company encourages developers to use GitHub APIs to build, integrate, and customize workflows to improve productivity and efficiency (32m36s).
- Spotify's experience with GitHub has been positive, with flexible APIs and workflows helping the company's developers to be more productive and efficient (33m6s).
- The company's biggest takeaway is to encourage developers to think outside the box and focus on end-user features to improve productivity and efficiency (32m58s).
- Spotify's 10-year developer experience was condensed into a 40-minute talk, providing a brief glimpse into the company's evolution and experience with GitHub (33m18s).
- The company invites attendees to chat with them in the speaker section and wishes everyone a great rest of GitHub Universe (33m40s).