Evolving Trainline Architecture for Scale, Reliability and Productivity
26 Nov 2024
Introduction and Overview of Trainline
- The presentation will cover lessons learned from scaling Trainline's architecture, including handling more traffic, enabling multiple engineers to work on the architecture simultaneously, and scaling the efficiency of the platform for cost-effectiveness and growth (38s).
- Trainline is Europe's number one rail digital platform, retailing rail tickets for users worldwide and providing services throughout the rail journey, including platform information, disruption assistance, and compensation for delays (1m42s).
- The company provides its services through its B2C brand, Trainline.com, as well as a white-label solution to partners in the carrier space and the wider travel ecosystem (2m24s).
- Trainline is a public company, well-established, and profitable, with a significant size of business, although specific numbers will be discussed later in the presentation (3m35s).
- The presentation will also cover the business lens of productivity, team impact, cost efficiency, and financial business impact, in addition to actual traffic growth and handling more business (1m16s).
- The speaker will discuss how Trainline has made its architecture possible for more engineers to work on it simultaneously, allowing for a faster pace of growth and innovation (56s).
- The presentation will last around 25-30 minutes, with time for questions at the end, and attendees are encouraged to note down any questions that arise during the presentation (3m19s).
- Trainline operates at significant technical scale, with around £5 billion in net ticket sales and roughly 350 journey searches per second across 3.8 million unique origin-destination routes per month (3m45s).
- The company has around 500 people in its tech and product organization, with the majority being tech professionals (4m28s).
- Trainline has real-time information on the location of each live train in Europe, which presents a large problem space in terms of data and actions required when trains are delayed, canceled, or changed (4m46s).
- The company has over 270 API integrations with individual rail and bus carriers, with zero standardization in this space, resulting in high maintenance costs and non-trivial integration challenges (6m27s).
- The lack of standardization in rail APIs contrasts with the airline industry, which has standardized APIs through global distribution systems like Amadeus (6m48s).
- The complexity of the problem space was not immediately apparent, but it became clear over time that certain problems, such as the aggregation of supply, were harder than initially thought (6m13s).
- The rail industry faces a problem with inconsistent and disintegrated APIs, which have different access patterns and limitations, such as look-to-book ratios and rate limits, making it challenging to aggregate supply from various rail companies (7m17s).
- Europe has 100 times more train stations than airports, adding to the complexity and scale of the problem, particularly when it comes to handling journey searches across multiple APIs (8m11s).
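To make the aggregation problem concrete, here is a minimal sketch (not Trainline's code) of fanning a single journey search out to several carrier APIs, each behind its own concurrency limit and timeout. The carrier names, limits, and timings are invented stand-ins for the per-carrier rate limits and look-to-book constraints described above, and the HTTP call is replaced by a sleep.

```python
import asyncio
import random
from dataclasses import dataclass

@dataclass
class Carrier:
    name: str
    max_concurrent: int   # stand-in for per-carrier rate limits / look-to-book caps
    timeout_s: float

CARRIERS = [  # hypothetical carriers and limits
    Carrier("carrier_a", max_concurrent=5, timeout_s=2.0),
    Carrier("carrier_b", max_concurrent=2, timeout_s=4.0),
    Carrier("carrier_c", max_concurrent=10, timeout_s=1.5),
]

# One semaphore per carrier, so a single slow or tightly limited API
# cannot absorb the whole search capacity.
_limits = {c.name: asyncio.Semaphore(c.max_concurrent) for c in CARRIERS}

async def _call_carrier_api(carrier: Carrier, origin: str, destination: str) -> list[dict]:
    # Placeholder for the real HTTP call to the carrier's (non-standard) API.
    await asyncio.sleep(random.uniform(0.05, 0.3))
    return [{"carrier": carrier.name, "origin": origin, "destination": destination}]

async def search_carrier(carrier: Carrier, origin: str, destination: str) -> list[dict]:
    """Query one carrier under its own limit; on timeout, return no offers."""
    async with _limits[carrier.name]:
        try:
            return await asyncio.wait_for(
                _call_carrier_api(carrier, origin, destination),
                timeout=carrier.timeout_s,
            )
        except asyncio.TimeoutError:
            return []  # one missing carrier is better than one failed search

async def search_journeys(origin: str, destination: str) -> list[dict]:
    """Fan the search out to every carrier and merge whatever comes back."""
    batches = await asyncio.gather(
        *(search_carrier(c, origin, destination) for c in CARRIERS)
    )
    return [offer for batch in batches for offer in batch]

if __name__ == "__main__":
    print(asyncio.run(search_journeys("London", "Paris")))
```

Per-carrier limits matter because look-to-book ratios and rate limits differ per supplier; a single shared limit would let the slowest or strictest API throttle every search.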
Challenges in the Rail Industry
- The aggregation of supply in the rail industry is unique, but the problem of handling transactions over a finite inventory is not; it is similar to the classic "Ticketmaster problem" (8m34s).
- Selling seats on unique trains with limited inventory is more complex than selling digital products, as it requires checking inventory and handling transactions reliably and quickly; see the inventory sketch at the end of this section (9m3s).
- The company currently handles around 1300 transactions per minute at peak times, which has grown from 800 in the past three years, partly due to the recovery of rail travel and the company's growth in Europe (9m33s).
- The speed at which people expect to receive their tickets, literally within a second, adds to the complexity of the problem, requiring instant fulfillment (10m33s).
- Most rail tickets are bought weeks or a couple of months in advance, but about 60% of tickets bought on Trainline are purchased on the day of travel, often just before boarding the train, so they must be processed instantly (10m48s).
- Expectations for processing and validating tickets are high, involving complex interactions with industry-standard processes to ensure the ticket is recognized as valid at barrier gates (11m32s).
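The finite-inventory point above can be illustrated with a toy model: a conditional decrement guarded by an idempotency key, so a retried purchase neither oversells a train nor issues two tickets. This is an assumption-laden sketch with invented names (SeatInventory, train IDs), not Trainline's implementation; a real system would use a database transaction or conditional write rather than an in-process lock.

```python
import threading
import uuid

class SeatInventory:
    """Toy inventory store; the lock stands in for a database transaction."""
    def __init__(self, seats_by_train: dict[str, int]):
        self._seats = dict(seats_by_train)
        self._fulfilled: dict[str, str] = {}   # idempotency key -> ticket id
        self._lock = threading.Lock()

    def purchase(self, train_id: str, idempotency_key: str) -> str | None:
        with self._lock:
            # A replayed request (e.g. a client retry) gets the same ticket back.
            if idempotency_key in self._fulfilled:
                return self._fulfilled[idempotency_key]
            remaining = self._seats.get(train_id, 0)
            if remaining <= 0:
                return None                    # sold out: fail fast, never oversell
            self._seats[train_id] = remaining - 1
            ticket_id = str(uuid.uuid4())
            self._fulfilled[idempotency_key] = ticket_id
            return ticket_id

inventory = SeatInventory({"LDN-EDI-0930": 2})
first = inventory.purchase("LDN-EDI-0930", idempotency_key="order-123")
retry = inventory.purchase("LDN-EDI-0930", idempotency_key="order-123")
assert first == retry                          # the retry did not consume a second seat
```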
Lessons Learned from Scaling
- The talk will cover three lessons learned from scaling: teams and productivity, cost efficiency, and scaling with traffic growth while achieving higher reliability (11m45s).
- The first lesson will focus on the impact of architecture on team productivity, highlighting how it can both enable and hinder progress as team sizes change (11m51s).
- The second lesson will discuss cost efficiency and scaling the efficiency of the platform (12m5s).
- The third lesson will cover scaling with growth in traffic, achieving higher reliability, and dealing with availability issues (12m10s).
Team Productivity and Organizational Structure
- When the speaker joined Trainline in July 2021, the company had around 350 engineers organized in a cluster model, with teams focused on specific parts of the technical stack, such as Android, iOS, web, and backend (12m49s).
- However, this organizational structure led to low team productivity, as most projects required collaboration between at least five and often up to 10 different teams, resulting in complex project management and delays (13m43s).
- The previous team structure was slow and not suited to the scale, leading to a massive reorganization in January 2022 into a platform-and-verticals model: the platform owned the technical stack, while verticals combined people with different skill sets and were shaped around clear ownership of product and business goals (14m26s).
- The platform and verticals model improved alignment of the team to goals, but came with challenges like tension between platform and vertical teams, where vertical teams wanted to deliver quickly and platform teams wanted to ensure proper implementation and refactoring (15m14s).
- This tension is an embedded feature of the model and largely a good thing, but it can be frustrating for people on both sides and sometimes needs to be managed (15m46s).
- Recently, the team reorganized again, reducing platform ownership to only the core services and moving 50% of the tech surface to the verticals, which were renamed diagonals, to get the best parts of both models and streamline delivery (15m58s).
- The new model aims to remove some of the tension between platform and verticals, and the team is trying to find the right balance between the two (16m33s).
- The key question in all three models is who owns each part of the technical surface, who is in charge of sustain work, mandatory technical upgrades, and driving the technical strategy and vision for that part of the codebase (16m57s).
- The team needs to determine who effectively does the core work to sustain and drive the technical roadmap, in addition to who builds all the features (17m26s).
- The architectural implications of these changes are relevant to the audience, and the team is trying to find the right balance between different models to achieve their goals (16m46s).
- The team's goals and alignment can be defined by answering two questions: which product or business or tech goal is the team on the hook for, and what are the key performance indicators (KPIs) for that goal, with different teams having different answers to these questions (17m37s).
- The three key performance indicators (KPIs) are A (alignment of engineering investment to business goals), P (productivity), and Q (quality of technical work, i.e. whether changes improve the platform or add technical debt) (18m16s).
- In the past, clusters were poor for alignment: engineers focused only on their part of the code rather than on product or business goals, so end-to-end productivity was poor, although quality was good because people worked within a constrained, small part of the technology surface (18m35s).
- When the team moved to the platform-and-verticals model, alignment became super crisp and clear, with engineers on the hook for specific business goals, resulting in perfect alignment, better productivity, and good quality, but with some tension between the verticals and the platform team (19m7s).
- The current model removes the platform team's policing of contributions, which slightly dilutes the clarity of alignment but improves productivity; different models work better at different times for a company, and shifting the model brings a different balance and a different kind of productivity (20m5s).
Architectural Implications and Ownership
- Customer and business needs do not respect architectural boundaries, and it is essential to acknowledge this fact when designing a platform (20m54s).
- Even with a well-designed platform, business priorities and needs can change, requiring the architecture to evolve accordingly (21m29s).
- Business strategy, technology ownership, and organizational structure will also change over time, and it is crucial to adapt to these changes (21m39s).
- Conway's Law states that technology ends up taking the shape of the organization that builds it, so how teams are organized affects the technology that gets built (22m18s).
- There is no perfect reverse Conway maneuver, making it challenging to change the technical architecture designed by a certain organizational structure (22m39s).
- It is essential to build technology and architectures with the fact of ownership transfers and external contributions in mind (23m7s).
- Enforcing consistency is crucial, and leaders should establish a company-wide approach to technology, rather than allowing individual preferences to dictate the way things are done (23m34s).
- This approach is necessary because team members and structures are likely to change over time, and a consistent approach ensures that technology can be maintained and updated efficiently (23m46s).
- To achieve scale, reliability, and productivity, it is essential to enforce consistency by using as few languages and technologies as possible, even if it means sacrificing some individual autonomy, because it makes it easier to transfer ownership and accept external contributions (24m8s).
- Consistency is key, as it allows for the transfer of "Lego blocks" rather than custom-built items, making it easier to reassemble them in different ways (24m19s).
- Trainline has been around for 20 years and still has technology in production that was written 15 years ago, highlighting the importance of keeping things consistent with few languages and technologies (25m3s).
Cost Efficiency and Optimization
- Production costs are a significant concern, with Trainline's AWS bill accounting for about 25% of their overall software engineer compensation bill (26m11s).
- The cost of the platform was growing faster than traffic, prompting a goal to run things more efficiently and keep the organization disciplined, running a tight ship (27m14s).
- The goal is not to make drastic cuts but to ensure the organization is running efficiently and making the most of its resources (27m20s).
- The goal was to drive down the annual run rate of production costs by 10% (27m26s).
- The team has a massive surface area, with over 700 microservices and more than 100 databases, making it challenging to identify areas for cost reduction (27m43s).
- To achieve the goal, the team considered various levers, including cleaning up unneeded data, consolidating non-production environments, reviewing old low-value services, and right-sizing the platform (28m25s).
- They also looked at data retention policies and reviewed architectural choices, such as the use of cloud functions and lambdas, to determine whether they were efficient (28m55s).
- The team decided to delegate the problem to the individual teams that own parts of the technology stack, tasking each with driving down 10% of the bill for their area (29m58s).
- The plan was to use cost attribution to track what was driving the costs and hold each team accountable for reducing theirs; see the attribution sketch at the end of this section (30m6s).
- The 10% target led some teams to take risks that ultimately caused problems, such as outages due to underprovisioned services (30m39s).
- Instead of reviewing architecture choices and making changes, most teams simply scaled down their services, which sometimes worked but also led to outages (31m27s).
- In a three-month period, there were four outages caused by underprovisioning, which occurred when the platform hit peak traffic, usually on Tuesdays (31m42s).
- The goal of reducing costs by 10% was achieved, but it was not the best value for money, as engineers spent more time than necessary, and there were unintended consequences, such as outages (32m28s).
- Cost management is important for the long-term efficiency of the platform, but understanding where to make cuts in a large, fragmented microservice-based system requires centralized thinking and cannot be delegated to individual teams (32m55s).
- Predicting which cost-reducing efforts are worth it can be tricky, especially for those without a full understanding of the system, and blindly pushing down cost-saving goals to individual teams can lead to more problems than solutions (33m21s).
- A centralized task force that works with individual teams to evaluate where investments to save costs are worth it would be a better approach than delegating cost-saving goals to teams (33m42s).
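One way a centralized task force can ground its decisions is plain cost attribution: tag every resource with an owning team, roll the bill up by tag, and compute what a 10% cut means per team before deciding whether the engineering effort is worth it. The sketch below is a toy illustration with hypothetical teams, resources, and figures, not Trainline's tooling or numbers.

```python
from collections import defaultdict

line_items = [  # e.g. rows from a cloud cost-and-usage report (hypothetical)
    {"resource": "orders-db",      "team": "payments",  "monthly_cost": 42_000},
    {"resource": "search-cluster", "team": "discovery", "monthly_cost": 65_000},
    {"resource": "untagged-ec2",   "team": None,        "monthly_cost": 9_000},
]

def attribute_costs(items):
    """Roll monthly spend up by owning-team tag."""
    by_team = defaultdict(float)
    for item in items:
        # Untagged spend is the first thing to chase: nobody owns its reduction.
        by_team[item["team"] or "UNATTRIBUTED"] += item["monthly_cost"]
    return dict(by_team)

def reduction_targets(by_team, target_pct=0.10):
    """Translate an overall 10% goal into a per-team figure."""
    return {team: round(cost * target_pct, 2) for team, cost in by_team.items()}

costs = attribute_costs(line_items)
print(costs)
print(reduction_targets(costs))
```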
Scaling for Growth and Reliability
- The third lesson is about scaling for growth in traffic and reliability; the takeaway from the cost lesson is to manage system cost-saving efforts centrally and avoid fully delegating them, as that can backfire (33m55s).
- The speaker briefly covers three big bouts of outages, which could each be a talk on their own, and asks the audience to keep the information confidential (34m26s).
- The first outage occurred in October 2021, when Trainline went down for four hours one day and two hours the next day, due to the platform struggling to handle the sudden increase in traffic after the COVID-19 pandemic (34m53s).
- The cause of the outage was contention over database connections: many new microservices had been added, each maintaining its own connections to the database, creating a bottleneck in the relational databases; the connection arithmetic is sketched at the end of this section (36m16s).
- The relational databases were hosted on a single machine that couldn't keep up with so many connections; the setup was eventually tweaked and tuned, and the platform survived the period (36m43s).
- A year later, in October, another outage occurred, also related to the database, but this time involving old Oracle databases (37m14s).
- The company experienced a significant outage due to a gradual increase in load on the orders database, driven by journey-experience features added over the course of a year, such as following a journey and receiving notifications about platform changes (37m46s).
- The company's observability was primarily focused on transactional flows, and as a result, the increase in load on the orders database went unnoticed until it caused an outage (38m24s).
- The company had to implement database-related fixes to resolve the issue and prevent similar outages in the future (38m55s).
- The company recently experienced a series of DDoS attacks, possibly attributable to nation-state actors, and had to tighten its DDoS protections and make other changes to mitigate them (39m4s).
- Despite initial concerns that the platform was being DDoS attacked again, it was discovered that the issue was actually caused by sloppy retry strategies throughout the stack, which allowed small issues to snowball into larger problems and eventually bring the platform down (39m56s).
- The company did not have a coordinated retry strategy, which contributed to the problem: everything from client-side retries to backend service retries combined to create a 10x load that brought the platform down; a sketch of a coordinated retry policy is at the end of this section (40m25s).
- The architectural lesson learned from past experiences is that none of the issues were caused by a single team, change, or regression, but rather by a buildup of small problems over time, making it difficult to predict and detect bottlenecks in a large microservice-based system (40m51s).
- Predicting bottlenecks in such systems is challenging due to the complexity and spread of microservices, with each team chasing their own goals and contributing to the overall problem, often resulting in a tragedy of the commons (41m40s).
- The best approach to handling this issue is to regularly review longer-term traffic mix or load changes, such as reviewing changes in critical databases or services over a period of six months, to identify potential bottlenecks and guide teams accordingly (42m15s).
- Coordinating the service fleet is critical: a strong architecture or principal-engineering function is needed to guide teams, keep sight of the big picture, and prevent issues that arise from purely local team ownership and decisions (42m53s).
- Observing over longer time horizons and coordinating the microservice fleet are key lessons from these experiences, and consistency is key to productivity in the long run, especially for startups or seed-stage companies (43m30s).
- When building a business where the technology should survive for five to 15 years, it is essential to insist on consistency, to manage system cost-saving efforts centrally (even if engineers may not love it) so the wider context is not lost, and to coordinate the microservice fleet to avoid outages (43m52s).
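The 2021 connection-contention outage is, at bottom, multiplication: every instance of every microservice that talks to the database keeps its own connection pool open. A back-of-the-envelope sketch with entirely hypothetical numbers (only the shape of the problem comes from the talk):

```python
# Why per-service connection pools stop scaling as the microservice fleet grows:
# fleet-wide demand on one database is services x instances x pool size.
services_using_db = 300        # microservices holding connections to the same database
instances_per_service = 3      # replicas per service
pool_size = 10                 # connections kept open per instance
db_max_connections = 5_000     # what a single database host will accept

demand = services_using_db * instances_per_service * pool_size
print(f"fleet-wide connection demand: {demand:,}")
print(f"database will accept:         {db_max_connections:,}")
print(f"over budget by:               {demand - db_max_connections:,}")
# Typical remedies: shrink per-instance pools, multiplex through a central
# connection pooler, or split hot tables onto separate databases.
```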
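For the retry-storm outage, the missing piece was a coordinated retry policy. Below is a minimal sketch of one common shape for such a policy: capped attempts, exponential backoff with jitter, and a retry budget so a degraded dependency sees load shed rather than multiplied. Names and numbers are assumptions for illustration, not Trainline's actual policy.

```python
import random
import time

class RetryBudget:
    """Allow retries only while they remain a small fraction of total requests."""
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def can_retry(self) -> bool:
        return self.retries < max(1, int(self.requests * self.ratio))

def call_with_retries(op, budget: RetryBudget, max_attempts: int = 3):
    """Run op(); retry transient failures with backoff while the budget allows it."""
    budget.requests += 1
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1 or not budget.can_retry():
                raise                      # shed load instead of amplifying it
            budget.retries += 1
            # Exponential backoff with full jitter spreads retries out in time.
            time.sleep(random.uniform(0, 0.2 * (2 ** attempt)))

budget = RetryBudget(ratio=0.1)

def flaky():
    if random.random() < 0.3:              # simulate a transient dependency failure
        raise RuntimeError("transient failure")
    return "ok"

try:
    print(call_with_retries(flaky, budget))
except RuntimeError:
    print("gave up quickly: the budget protects the struggling dependency")
```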
Knowledge Transfer and Service Ownership
- Building an architecture that anticipates changes in team structure and people can be challenging, as it works against optimizing for knowledge and change management in the near term, but it is crucial to balance short-, medium-, and long-term goals (45m3s).
- The "build it, you own it" strategy is effective for the first six to 12 months of a service, but after that, the person who built it is likely to move on, and the service needs to be handed over to others, requiring a transition period to ensure adoption and knowledge transfer (45m39s).
- The platform-and-verticals model can be used to facilitate this transition: a vertical builds a new service and is on the hook for it for six to nine months, with the relevant parts of the platform advising through that period (46m17s).
- The goal is to create "Lego blocks" that anyone can pick up and take care of, even if they are not an expert, to enable an agile organization that can focus on the most important things (47m10s).
- The transition period can be challenging, and it is essential to pull teams back from the mindset of having a single person who knows the service, as this can cause contention and make it difficult to touch the service (46m49s).
- The "bus factor of one" is a common problem, where only one person knows the service, and it is essential to avoid this by creating a culture of knowledge sharing and transfer (46m56s).
Microservices, Consistency, and Technology Adoption
- Microservices were initially adopted for team autonomy, including the freedom to use different languages, but with consistency now considered key, questions arise about the future of that autonomy and its potential downsides (47m41s).
- Having multiple languages for front-end development, such as Android and iOS, and different back-end languages, like .NET and Ruby, can lead to fragmentation of skill sets within an organization (48m14s).
- This fragmentation can make it challenging to assemble cross-functional teams to deliver simple features, requiring a large number of people with different skill sets (48m44s).
- The complexity of native platforms like Android and iOS, each with their own languages, contributes to this challenge (49m11s).
- To address this, it's essential to strike a balance between allowing innovation and trying new things, while also having a path for making successful new technologies official and widely adopted (49m30s).
- Allowing everyone to choose their own technologies without a clear path for adoption can lead to unmanageability, especially as the business grows (49m44s).
- A successful business needs a clear strategy for managing technology adoption to avoid complications in the long run (49m55s).