Mastering Long-Running Processes in Modern Architecture: Real-Life Examples & Tools for Engineers

14 Oct 2024

Introduction to Long-Running Processes

Long-running processes can be compared to ordering food, such as pizza, where there are different ways to place an order, including phone calls and emails (17s).
Phone calls represent synchronous blocking communication, where the caller is blocked until the other person answers, and a direct feedback loop is established once the call is answered (42s).
However, this method has limitations: the caller is temporally coupled to the availability of the other side, and if the person is not available, the caller must try again or wait (1m13s).
An alternative to phone calls is sending an email, which represents asynchronous non-blocking communication, allowing the sender to send the message even if the recipient is not available (1m40s).
Emails lack a direct feedback loop, but the recipient can still respond to the email, providing a feedback loop, albeit asynchronously (2m6s).
The key difference between synchronous and asynchronous communication is not the technology used, but the interaction pattern, which can be decoupled from the technology (2m36s).
In the case of ordering pizza, the feedback loop is not the same as the result, as the customer is still hungry after receiving confirmation of their order, and the actual result is the pizza being delivered (2m51s).
The task of making pizza is a long-running process that takes time, involving multiple steps, such as baking and delivery, and this pattern is seen in many other interactions beyond just ordering food (3m10s).
Synchronous blocking behavior is not suitable for long-running processes, as it would require the customer to wait for an extended period, and asynchronous results are more appropriate (3m39s).
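The difference between the two interaction patterns can be sketched in a few lines of Python. This is only an illustration: `bake_pizza`, `order_blocking`, and `order_async` are hypothetical names, and the in-process thread pool stands in for whatever really does the work. A thread pool does not survive a restart, so it shows the interaction pattern only, not a durable long-running process.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def bake_pizza(order_id: str) -> str:
    """Stand-in for the long-running work (baking, delivery)."""
    time.sleep(0.05)  # pretend this takes a long time
    return f"pizza for order {order_id}"


def order_blocking(order_id: str) -> str:
    """Phone-call style: the caller is blocked until the pizza exists."""
    return bake_pizza(order_id)


def order_async(pool: ThreadPoolExecutor, order_id: str) -> tuple[str, Future]:
    """Email style: confirm immediately, deliver the result later."""
    pending_result = pool.submit(bake_pizza, order_id)
    confirmation = f"order {order_id} accepted"
    return confirmation, pending_result


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        confirmation, pending = order_async(pool, "A1")
        print(confirmation)      # feedback loop: available right away
        print(pending.result())  # actual result: arrives once the work is done
```

The confirmation and the result are separate values, which mirrors the point that the feedback loop is not the same as the pizza.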
The process of making coffee at a machine is described as synchronous and blocking, meaning that while waiting for the coffee, no other tasks can be performed, leading to inefficiencies and poor user experience, especially when there is a queue. (4m12s)
An article by Gregor Hohpe discusses how Starbucks scales its coffee-making process by separating ordering and payment from the actual coffee preparation, allowing baristas to work independently and improving scalability and user experience. (4m48s)
Fast food chains are increasingly using apps for ordering to streamline the initial steps of the process, although the actual preparation, such as coffee making, often still involves human workers like baristas. (5m41s)

Challenges of Long-Running Processes

Long-running processes are defined as those that involve waiting, which can be due to human tasks such as approvals or decisions, or simply waiting for a response from a customer. These processes can take from hours to weeks. (6m14s)
One example is a startup that automated a service but intentionally added a delay to simulate human processing time, which highlights the importance of managing waiting times in processes. (7m15s)
Waiting is challenging because it requires remembering the state of the process over potentially long periods, necessitating persistent state management to ensure continuity when the process resumes. (7m44s)
Persistent state can be a problem, despite the existence of databases, due to subsequent requirements such as understanding what is being waited for, escalating if waiting for too long, versioning problems, and running at scale (8m9s).
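A minimal sketch of what "remembering what is being waited for" and "escalating if waiting for too long" can look like when built by hand. The table layout and the function names (`open_store`, `start_waiting`, `overdue`) are assumptions made up for this example, with SQLite standing in for any database. It deliberately ignores versioning and scale, which are exactly the follow-on requirements that make homegrown solutions grow.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) table that remembers every open wait."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS waits ("
        " instance_id TEXT PRIMARY KEY,"
        " waiting_for TEXT NOT NULL,"
        " since REAL NOT NULL,"
        " escalate_at REAL NOT NULL)"
    )
    return db


def start_waiting(db, instance_id, waiting_for, now, escalate_after_seconds):
    """Persist what the instance waits for and when to escalate."""
    db.execute(
        "INSERT OR REPLACE INTO waits VALUES (?, ?, ?, ?)",
        (instance_id, waiting_for, now, now + escalate_after_seconds),
    )
    db.commit()


def overdue(db, now):
    """Return (instance_id, waiting_for) for every wait past its deadline."""
    rows = db.execute(
        "SELECT instance_id, waiting_for FROM waits"
        " WHERE escalate_at <= ? ORDER BY instance_id",
        (now,),
    )
    return rows.fetchall()
```

A periodic job would call `overdue` with the current time and notify someone for each row it returns.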
These technical challenges can be difficult to solve without adding accidental complexity, and homegrown workflow engines are often not a good solution (9m1s).

Workflow Engines as a Solution

The speaker has experience working on workflow engines, process engines, and orchestration engines, having co-founded Camunda, a workflow orchestration company, and worked on open-source workflow engines (9m42s).
A workflow engine, also known as an orchestration engine or process engine, can solve the problems of long-running processes by defining workflows and running instances of them, and it addresses requirements such as versioning and escalation (10m23s).
A demo of a workflow engine was given to illustrate its capabilities and provide a common understanding of what a workflow engine is (10m45s).
The demo used an example of an onboarding process, which is a common process in many companies, such as opening a new bank account or mobile phone contract (11m32s).
The workflow engine used in the demo is available on GitHub, allowing others to run it themselves and experiment with workflow engines (11m2s).
BPMN (Business Process Model and Notation) is an ISO standard for defining processes graphically; it is not proprietary, which allows for standardized process modeling (11m51s).
BPMN models can be used to define manual tasks, such as scoring a customer and approving an order, and can also include automated tasks and escalations (12m9s).
A timer in the BPMN model can define a duration, such as 10 seconds, after which a task is considered to be taking too long and is escalated (12m48s).
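BPMN timer durations are written as ISO 8601 durations, so 10 seconds is `PT10S`. The sketch below is not part of the demo; it is a minimal parser for the day/hour/minute/second subset of that notation plus a hypothetical `should_escalate` check, to show what the engine evaluates on the modeler's behalf.

```python
import re
from datetime import datetime, timedelta

# Only the PnDTnHnMnS subset of ISO 8601 durations is handled here.
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def parse_duration(text: str) -> timedelta:
    """Turn a duration such as 'PT10S' into a timedelta."""
    match = _DURATION.match(text)
    if not match or text.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def should_escalate(started_at: datetime, now: datetime, limit: str) -> bool:
    """True once the task has been open for at least the given duration."""
    return now - started_at >= parse_duration(limit)
```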
A Java application, in this case a Spring Boot application, can be used to connect to a workflow engine, deploy the process, and provide a small web UI (13m2s).
The application can trigger a REST call to start a process instance within the workflow engine, and tools like Operate can be used to look into what's going on and see the versioning and instances running (13m45s).
The workflow engine can be integrated with other systems, such as a CRM system, using custom Java code or pre-built connectors (15m1s).
Automated tasks such as sending an email can be configured with a pre-built connector, such as the one for SendGrid (15m12s).
The workflow engine is running in the background, and the workflow model has instances running through it, with code or UI attached to connect to systems or humans (15m36s).
The workflow engine used in this example is Camunda as a Service, and it's integrated with a Spring Boot application (15m49s).
Workflow engines are not limited to small-scale processes: they can run at a huge scale, with thousands of process instances per second, and can be distributed across multiple data centers in different geographic locations, such as the US and UK, which adds latency but does not reduce throughput (16m10s).

Technical Reasons for Waiting in Long-Running Processes

There are technical reasons why processes may need to wait, including asynchronous communication, where a message may not be received immediately, and failure scenarios where a message is not received at all (17m1s).
In distributed systems, peer services may not always be available, requiring processes to wait for them to become available before proceeding (17m33s).
A common example of a long-running process is checking in for a flight: a user receives an email invitation to check in, but the check-in fails and has to be retried. The retry can be stateful, meaning that it is scheduled for a later time (18m23s).
In the case of the flight check-in example, the user may need to wait for a few hours before retrying, and can use a calendar entry to remind them to try again, illustrating a stateful retry in a long-running process (19m24s).
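The calendar entry is a stateful retry done by hand. A service can do the same thing by persisting when the next attempt is due instead of sleeping in a loop. The following is a minimal sketch under assumptions: `RetryState`, `run_if_due`, the fixed three-hour delay, and the JSON serialization are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class RetryState:
    attempts: int = 0
    next_attempt_at: float = 0.0  # epoch seconds; 0 means "due now"
    done: bool = False


def save(state: RetryState) -> str:
    """Serialize the state so it survives restarts (store it anywhere)."""
    return json.dumps(asdict(state))


def load(raw: str) -> RetryState:
    return RetryState(**json.loads(raw))


def run_if_due(
    state: RetryState,
    action: Callable[[], None],
    now: float,
    delay_seconds: float = 3 * 3600.0,
    max_attempts: int = 5,
) -> RetryState:
    """Try the action if its retry time has come; otherwise do nothing."""
    if state.done or now < state.next_attempt_at:
        return state
    state.attempts += 1
    try:
        action()
        state.done = True
    except Exception:
        if state.attempts >= max_attempts:
            raise  # give up and let a human or an escalation path take over
        state.next_attempt_at = now + delay_seconds  # the "calendar entry"
    return state
```

A scheduler would load the state, call `run_if_due` with the current time, and save the result again, so nothing has to stay in memory between attempts.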
The situation can be envisioned as having a web interface, a check-in microservice, and a background process that handles the check-in, which may need to wait for certain conditions to be met before proceeding (19m51s).
A personal experience of a failed check-in caused by a barcode generation issue illustrates the importance of resiliency in distributed systems, where at any given time some component is broken or some network connection is down (19m59s).
The failure of a single component, such as barcode generation, should not bring down the entire system, and a well-designed system should be able to handle such failures locally without affecting the overall user experience (20m51s).
A chain reaction of failures can occur when a problem is passed on to the user, making them responsible for resolving the issue, which is a bad design (21m12s).

Resilience in Long-Running Processes

A better design would be for the check-in service to handle the issue locally, for example, by checking in the user and sending the boarding pass later, which requires long-running capabilities within the service (22m44s).
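Handling the failure locally could look roughly like this. The function names, the in-memory `pending` list, and the response fields are hypothetical. In a real service the pending work would have to be persisted, which is where the long-running capability comes in.

```python
from typing import Callable


def check_in(
    passenger: str,
    make_barcode: Callable[[str], str],
    pending: list[str],
) -> dict:
    """Check the passenger in even if the barcode cannot be generated now."""
    try:
        boarding_pass = make_barcode(passenger)
    except Exception:
        # Do not pass the problem on to the user: remember the open work.
        pending.append(passenger)
        return {
            "checked_in": True,
            "boarding_pass": None,
            "note": "boarding pass will be sent later",
        }
    return {"checked_in": True, "boarding_pass": boarding_pass, "note": ""}


def deliver_pending(
    pending: list[str],
    make_barcode: Callable[[str], str],
    send: Callable[[str, str], None],
) -> None:
    """Background job: retry barcode generation and send what succeeds."""
    still_pending = []
    for passenger in pending:
        try:
            send(passenger, make_barcode(passenger))
        except Exception:
            still_pending.append(passenger)
    pending[:] = still_pending
```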
Many teams prefer to be stateless and avoid keeping state, which can lead to rethrowing errors instead of handling them locally (23m23s).
Customers often expect a synchronous response, such as seeing a confirmation message and receiving a boarding pass immediately, which can make it challenging to implement long-running processes (23m37s).
A more resilient design would prioritize handling errors locally and providing a better user experience, even if it means not providing an immediate synchronous response (23m2s).

Handling Long-Running Processes in Payments

The discussion extends the example of flight bookings to include payment collection, specifically handling credit card payments, which typically involves using an external API service like Stripe. (24m15s)
There is a challenge with service availability when charging credit cards, as the service might not be available at the time of the transaction, necessitating alternative solutions to avoid disappointing customers. (25m10s)
In distributed systems, remote call exceptions can arise from various issues, such as network problems or service provider failures, making it difficult to determine the exact cause of the failure. (26m0s)
Handling exceptions in distributed systems is complex because it is unclear whether a transaction was completed, which can lead to issues like double charging if not managed properly. (26m29s)
Solutions to these issues include using workflows or running periodic reconciliation jobs to ensure transactions are correctly processed and any discrepancies are addressed. (26m49s)
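A periodic reconciliation job can be sketched as a comparison between what the service believes and what the provider reports. The example assumes that every charge attempt carries a reference that the provider can report back; the status names and the returned actions are invented for illustration and do not correspond to any real provider API.

```python
def reconcile(
    local_attempts: dict[str, str],
    provider_charged: set[str],
) -> dict[str, str]:
    """Compare local charge attempts with what the provider reports.

    local_attempts maps a charge reference to "unknown", "charged" or
    "failed"; provider_charged holds the references the provider confirms.
    Returns the follow-up action per reference that needs one.
    """
    actions = {}
    for reference, status in local_attempts.items():
        charged = reference in provider_charged
        if status == "unknown":
            # The remote call failed and the outcome was unclear.
            actions[reference] = "mark_charged" if charged else "safe_to_retry"
        elif status == "failed" and charged:
            actions[reference] = "refund"  # the customer was charged after all
        elif status == "charged" and not charged:
            actions[reference] = "investigate"
    return actions
```

Only an attempt the provider does not know about is safe to retry, which is how the job avoids double charging.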
Embracing asynchronous thinking is recommended, where APIs are designed to acknowledge requests and provide results later, using HTTP codes to communicate the status of the request. (27m16s)
Long-running processes can extend the options of what an API can do, and making APIs asynchronous allows for better handling of long-running tasks within services, giving more freedom to implement requirements as desired (27m47s).
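One way to read "acknowledge requests and provide results later" in HTTP terms is to answer 200 when the result is already available and 202 Accepted when it will follow. The sketch below models that without a web framework; `PaymentApi`, its method names, and the response fields are hypothetical, and the injected `charge` function stands in for the external provider.

```python
import uuid
from typing import Callable


class PaymentApi:
    """Framework-free model of an API that may answer now or later."""

    def __init__(self, charge: Callable[[int], None]):
        self._charge = charge  # stand-in for the external provider call
        self._payments: dict[str, dict] = {}

    def post_payment(self, amount_cents: int) -> tuple[int, dict]:
        payment_id = str(uuid.uuid4())
        record = {"paymentId": payment_id, "amountCents": amount_cents}
        try:
            self._charge(amount_cents)
            record["status"] = "completed"
            code = 200  # result is available right away
        except ConnectionError:
            record["status"] = "pending"
            code = 202  # accepted; the result will follow
        self._payments[payment_id] = record
        return code, dict(record)

    def get_payment(self, payment_id: str) -> tuple[int, dict]:
        record = self._payments.get(payment_id)
        if record is None:
            return 404, {}
        return 200, dict(record)

    def retry_pending(self) -> None:
        """Background job. A real service must first find out whether the
        earlier attempt went through, otherwise it risks a double charge."""
        for record in self._payments.values():
            if record["status"] == "pending":
                try:
                    self._charge(record["amountCents"])
                    record["status"] = "completed"
                except ConnectionError:
                    pass  # stays pending until the next run
```

The caller handles both codes the same way: a 200 carries the final status, a 202 means polling `get_payment` (or waiting for a callback) until the status changes.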
Payment can be extended to support customer credits on the account, similar to companies that offer credits for returned goods or to PayPal's system of holding funds before deducting from a bank account, which gives more options for handling payments (28m11s).
Implementing long-running processes raises new consistency problems, because steps that span different services, such as credit handling and credit card charging, are not covered by a single technical transaction (29m0s).
In distributed systems, failing a payment process can require compensating actions, such as rebooking customer credits, to maintain consistency, and this complexity can arise quickly when considering all implications (29m33s).
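The idea of compensating actions can be sketched as a list of steps, each paired with an undo. `run_with_compensation` is a hypothetical helper for illustration; a workflow engine would additionally persist which steps have completed, so the compensation still runs after a crash.

```python
from typing import Callable

# Each step is a pair: (action, compensation that undoes the action).
Step = tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: list[Step]) -> bool:
    """Run the steps in order; on failure, undo the completed ones.

    Compensations run in reverse order. Returns True if every step
    succeeded and False if the process was rolled back.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For the payment example, the first step would deduct customer credits (undo: rebook them) and the second would charge the credit card; if the charge fails, the credits are rebooked.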

Service Boundaries and Long-Running Processes

Having long-running capabilities is necessary for designing good services and service boundaries, and this technical capability should be present in the architecture (30m17s).
A booking service can tell a payment service to retrieve payment via a message or REST call, and if the credit card is rejected, the next step would be to ask the customer to provide new details, allowing them to still book their flight (30m41s).
Long-running processes can be used to handle scenarios where a customer needs to provide new credit card details after the initial rejection, and this can be achieved through a workflow that includes compensating actions (31m40s).
GitHub subscriptions have a fully automated process for renewal, but if the credit card is invalid, an email is sent to update the card, introducing a long-running process that requires handling (31m59s).
A common reaction to this requirement is to pass it to a component that already handles long-running processes, such as booking, but this can lead to domain concept leakage and added complexity (32m43s).
Booking should not know about credit card details, as it only cares about receiving payment, and handling payment methods should be separate (33m21s).
Domain-driven design (DDD) also emphasizes the importance of separating domain language and concepts, and in this case, the booking service should not care about credit card rejection (33m43s).
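The boundary can be made concrete with a small state machine that lives inside the payment service. The state and event names below are assumptions for illustration. The point is that card rejection and waiting for new card details are internal states, while the booking service only ever sees the two outcomes returned by `event_for_booking`.

```python
from enum import Enum
from typing import Optional


class PaymentState(Enum):
    CHARGING = "charging"
    WAITING_FOR_NEW_CARD = "waiting_for_new_card"  # may last days
    PAID = "paid"
    FAILED = "failed"


# Internal transitions; none of these events ever reach the booking service.
TRANSITIONS = {
    (PaymentState.CHARGING, "card_charged"): PaymentState.PAID,
    (PaymentState.CHARGING, "card_rejected"): PaymentState.WAITING_FOR_NEW_CARD,
    (PaymentState.WAITING_FOR_NEW_CARD, "new_card_provided"): PaymentState.CHARGING,
    (PaymentState.WAITING_FOR_NEW_CARD, "customer_timeout"): PaymentState.FAILED,
}


def advance(state: PaymentState, event: str) -> PaymentState:
    """Apply an internal event to the payment's state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def event_for_booking(state: PaymentState) -> Optional[str]:
    """The only vocabulary the booking service needs to understand."""
    if state is PaymentState.PAID:
        return "payment_received"
    if state is PaymentState.FAILED:
        return "payment_failed"
    return None  # still in progress; nothing to tell booking yet
```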
To handle long-running requirements within payment, it must be easy for the team to implement them, and workflows or orchestration can help with that (34m12s).
Payment might be fast and synchronous in most cases, but handling edge cases where it's not is crucial, and designing an API that can handle both cases is necessary (34m33s).
Using workflows or orchestration can help implement long-running processes, and having these capabilities available in different services can make it easier to distribute the process correctly among microservices (35m10s).
Not having long-running capabilities in a service like payment can lead to monolithic design if the logic is moved to another service, such as booking, just because it has the capability (35m47s).
Having long-running capabilities at the disposal of every service avoids the creation of monolithic "God Services" and makes it easier to distribute responsibilities correctly, as well as embracing long-running, asynchronous, and non-blocking processes (36m1s).

Organizational Strategies for Long-Running Processes

A good architecture requires a process orchestration capability, which can be obtained as a service, either internally or externally, and can be easily implemented by a team (37m17s).
Organizations that successfully use process orchestration often have a Center of Excellence, a dedicated team that cares about process orchestration, process automation, and related topics (37m58s).
A Center of Excellence should focus on enablement and providing a platform, rather than building solutions, and should enable others to build things by consulting, helping, and providing technology (38m50s).
The traditional model of central teams being involved in solution creation has been replaced by a model where central teams focus on enabling others, and this shift is driven by the need for autonomy and freedom in decision-making (39m32s).
The creation of a Center of Excellence is not a step backward towards centralization, but rather a way to enable teams to make their own decisions while still providing guidance and support (40m10s).
The concept of team topologies is discussed, emphasizing different types of teams to enhance development efficiency. These include stream-aligned teams focused on business logic, enabling teams with a consulting function, and platform teams providing necessary technology. (40m17s)
Stream-aligned teams are designed to maximize productivity and reduce friction, allowing them to deliver business value effectively. (40m50s)
Enabling teams assist by consulting across projects, while platform teams supply the technology needed, preventing teams from having to figure out everything independently. (41m11s)
The complicated subsystem team is mentioned but not emphasized, as it deals with specialized tasks like fraud checks or AI services. (41m24s)
Organizations can map these team structures effectively, often using a center of excellence for process orchestration and automation, with tools like Camunda or RPA tools. (41m36s)
This approach prevents teams from spending excessive time in evaluation mode without delivering business value. (42m20s)
Spotify's "Golden Path" concept from 2020 is highlighted, where defined solution templates are provided for building specific types of applications, making it easy and desirable for teams to use them without being forced. (42m39s)
The "Golden Path" approach helps avoid "rumor-driven development," which is not scalable and can lead to inefficient technology use. (43m40s)
Spotify also developed an open-source tool called Backstage.io to support this approach. (44m9s)
Spotify's approach to development emphasizes autonomy for teams, but as the company grows, the software ecosystem becomes more complex and fragmented, leading to slower development speeds (44m28s).
Standardization of services and tooling can help free engineers from infrastructure complexity, rather than restricting autonomy, as seen in the concept of the "standards paradox" (44m47s).
Companies like Twilio offer pre-built services, known as the "paved path," that allow teams to get up and running quickly, and creating an incentive structure can encourage teams to take this path (45m10s).

Graphical Models and Long-Running Processes

Graphical models, such as BPMN, can be used to express complex processes in a simple and powerful way, and can be used for living documentation, test cases, and operations (45m55s).
Graphical models can also be used to discuss complex processes with different stakeholders, including non-developers, and can help elevate decisions about long-running behavior to the business level (46m50s).
Visualizing complex processes is important for making decisions about long-running behavior (47m41s).
Redesigning the customer journey is necessary to fully leverage the new architecture, and graphical models can be a powerful tool in this process (47m59s).

Real-World Examples and Conclusion

The airline industry has seen significant changes in customer experience over the last five years, with automation playing a key role in improving services, such as automatic check-in for flights (48m28s).
A personal experience with a delayed and canceled flight to London demonstrated the use of automation in rebooking and providing updates through email and a mobile app, although some issues still required human intervention (48m38s).
The use of long-running capabilities, process orchestration platforms, and workflow engines can help design better service boundaries, reduce complexity, and provide a better customer experience (50m22s).
Embracing asynchronicity and using these technologies can also increase operational efficiency, automation, and compliance, while reducing risk and documenting processes (50m46s).
To successfully adopt these technologies across an organization, central enablement is necessary, and resources such as books, websites, and conferences can provide more information on the topic (51m8s).