Understanding Architectures for Multi-Region Data Residency

25 Jul 2024

Data Residency and Compliance

The speaker emphasizes the importance of understanding the customer's needs and motivations for implementing multi-region data residency.
The speaker clarifies that, although data residency is often associated with GDPR, it is not a direct requirement of the regulation.
The speaker highlights that companies in the EU may request data residency due to concerns about data privacy and regulatory exposure to the US.
The speaker emphasizes the need for clear communication with legal teams, sales, and customers to define the specific requirements and promises regarding data residency.
The speaker mentions Rippling's experience with German customers who specifically requested data residency, leading to restrictions on data sharing between regions.
The speaker acknowledges that while not directly for GDPR compliance, multi-region data residency can contribute to a broader compliance strategy, including meeting requirements for specific jurisdictions like the UK.
Keeping data in separate regions can also help meet compliance requirements, especially when governments have incompatible regulations, because specific data can be processed in the region whose rules apply to it.

Defining Data Residency and Trust

Defining the Source of Truth: In a multi-region deployment, it's crucial to establish a clear "source of truth" for data. This means defining an "atom" – a container where all data related to a specific entity (like a company, team, or organization) resides within a single region. Any cross-region data access would then be considered a cross-atom call.
Trust and Inter-Region Communication: When different regions hold data, you need to determine the level of trust between them. This is especially important when dealing with sensitive data or when regions have different regulatory environments. For example, a US company might need to decide how much trust it can place in an EU-issued authentication token.
Threat Modeling and Security: Before implementing a multi-region architecture, it's essential to conduct a threat model. This involves understanding the specific threats and regulatory requirements in each region and determining what level of trust is acceptable between them.
Key Considerations for Trust: Two key areas to consider when establishing trust between regions are access tokens and database access. These are fundamental aspects that every application using a multi-region architecture needs to address.
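
The atom and trust ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Atom`, `TRUSTED_ISSUERS`, `is_cross_atom_call`, and `token_is_trusted`, the region labels, and the policy contents are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """All data for one entity (e.g. a company), held in a single region."""
    atom_id: str  # hypothetical identifier, e.g. a company ID
    region: str   # the one region that is the source of truth


# Hypothetical trust policy, the kind of output a threat model might produce:
# for each serving region, the regions whose tokens it accepts.
TRUSTED_ISSUERS = {
    "us1": {"us1"},
    "eu1": {"eu1", "us1"},
}


def is_cross_atom_call(caller: Atom, target: Atom) -> bool:
    """Any access to data outside the caller's own atom is a cross-atom call."""
    return caller.atom_id != target.atom_id


def token_is_trusted(serving_region: str, issuing_region: str) -> bool:
    """Return True if serving_region accepts tokens issued in issuing_region."""
    return issuing_region in TRUSTED_ISSUERS.get(serving_region, set())
```

With this example policy, `token_is_trusted("us1", "eu1")` is false: the US region does not accept an EU-issued token. Whether that is the right answer for a given product is exactly the decision the threat model has to make.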

Multi-Region Design Principles

When designing systems for multi-region data residency, it's crucial to consider the implications of token revocation delays and replication strategies. These considerations often lead to discussions about other business-relevant factors.
A core principle in multi-region design, and software engineering in general, is to avoid unnecessary forks in the road. This principle is particularly important in multi-region scenarios due to the potential for increased complexity.
An example of this principle is illustrated through a network routing scenario. A seemingly reasonable design that prioritizes speed with a queue for overflow can lead to significant performance degradation and potential failure under high load.
This principle applies to multi-region data residency as well. Instead of routing users to the nearest data center most of the time, it's better to have a consistent routing strategy that determines the appropriate data center for each user.
To further enhance the reliability of multi-region systems, it's important to minimize code branches. This helps to reduce the potential for errors and inconsistencies across different regions.
As a rule of thumb, the probability that a code path works is proportional to the fraction of users who exercise it and inversely proportional to the complexity of the feature.
A low percentage of users routing between regions, combined with the high complexity of cross-region operations, leads to a low probability of success.
Using the geographic nearest region for data residency can create rarely used code paths that are difficult to test and prone to breaking.
Replicating data from a primary region (e.g., US) to other regions (e.g., EU) can lead to inconsistencies in data access and behavior, as the EU region might not have synchronous consistency.
Cross-region data access can introduce performance issues due to increased latency, potentially causing applications to break.
To avoid these issues, it is recommended to maintain symmetry between regions, ensuring that all regions have the same code paths and data access patterns.
This symmetrical approach reduces complexity and increases the probability of success, as any issues will be detected quickly due to the high usage of the code paths.
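
The rule of thumb about usage and complexity can be made concrete with a little arithmetic. The sketch below treats it as a relative score only, not a real probability. The function name `reliability_score` and all the numbers are hypothetical and chosen purely for illustration.

```python
def reliability_score(usage_fraction: float, complexity: float) -> float:
    """Relative score: higher usage and lower complexity both raise it."""
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    if complexity <= 0:
        raise ValueError("complexity must be positive")
    return usage_fraction / complexity


# Hypothetical numbers: a cross-region fallback used by 2% of requests,
# versus a symmetric design where every request takes the same path.
rare_path = reliability_score(usage_fraction=0.02, complexity=5.0)
symmetric_path = reliability_score(usage_fraction=1.0, complexity=5.0)
```

At equal complexity, the path that every request exercises scores about fifty times higher than the path that only 2% of requests reach, which is the argument for symmetry between regions.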

Routing Strategies for Multi-Region Architectures

Client Routing: This approach uses subdomains (e.g., EU1.yourapp.com, US1.yourapp.com) to direct clients to specific regions. While effective, it requires clients to handle routing logic and can break integrations if a client's region changes.
Gateway Routing: This method uses an atom ID (representing an indivisible part of the application) in requests to route traffic to the correct region. It works well for traffic within the atom boundary but struggles with cross-atom traffic.
Region-to-Region Communication: This approach allows regions to communicate with each other, enabling cross-region data access. It's useful for queries involving multiple companies in different regions but can introduce latency and complexity.
Batch APIs: These APIs can be used to gather data from multiple regions by leveraging atom IDs. This approach can optimize data retrieval but requires careful consideration of latency and data requirements.
Cross-Region Audits: For tasks requiring access to data in all regions, a dedicated service can be set up to bypass atom-based routing. However, this can lead to significant latency due to cross-region requests.
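
Gateway routing and batch APIs can be sketched together, since both depend on resolving an atom ID to a region. This is a minimal sketch, assuming an in-memory directory. `ATOM_REGION`, `route`, and `plan_batch` are hypothetical names, and a real gateway would read the mapping from a shared data store.

```python
from collections import defaultdict

# Hypothetical directory mapping each atom ID to its home region.
ATOM_REGION = {"acme": "us1", "globex": "eu1", "initech": "us1"}


def route(atom_id: str) -> str:
    """Gateway routing: the atom ID alone determines the target region."""
    try:
        return ATOM_REGION[atom_id]
    except KeyError:
        raise LookupError(f"unknown atom: {atom_id}") from None


def plan_batch(atom_ids: list[str]) -> dict[str, list[str]]:
    """Group atom IDs by region so a batch API sends one request per region."""
    plan: defaultdict[str, list[str]] = defaultdict(list)
    for atom_id in atom_ids:
        plan[route(atom_id)].append(atom_id)
    return dict(plan)
```

For example, `plan_batch(["acme", "globex", "initech"])` groups `acme` and `initech` under `us1` and `globex` under `eu1`, so the caller makes two regional requests instead of three.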

Database Architectures for Multi-Region Data Residency

The speaker discusses different database architectures and their suitability for multi-region data residency.
Databases marketed as "global" may offer replication and region-specific data storage, but their suitability depends on the specific use case.
Accelerators, like CDNs, provide eventually consistent copies of data across regions, which can be useful for data that is consistent across regions and doesn't change frequently, like tax codes.
True global databases offer active-active topologies, allowing writes and reads from any region with eventual consistency. However, this can lead to conflicts if multiple regions attempt to modify the same data simultaneously.
The speaker emphasizes that database proxies are not a universal solution for data residency and require careful consideration of specific requirements.
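
The conflict risk in active-active topologies can be illustrated with a small example. The sketch below uses last-writer-wins, one common resolution strategy. The talk summary does not say which strategy any particular database uses, so treat the approach and the names `Write` and `merge_last_writer_wins` as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float  # hypothetical clock value attached to the write
    region: str       # region that accepted the write


def merge_last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the later write; break timestamp ties by region name."""
    return max(a, b, key=lambda w: (w.timestamp, w.region))


# Two regions modify the same record at nearly the same time.
us_write = Write(value="Berlin", timestamp=100.0, region="us1")
eu_write = Write(value="Munich", timestamp=100.5, region="eu1")
winner = merge_last_writer_wins(us_write, eu_write)
```

Here the EU write wins and the US write is silently discarded, even though the US region acknowledged it. This kind of lost update is why active-active needs careful evaluation against the specific use case.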

Key Considerations for Multi-Region Data Residency

The speaker stresses the importance of defining clear data residency requirements and building an architecture that enforces these requirements.
The speaker recommends avoiding forks in the road related to region resolution in code to maintain consistency and simplify testing and future changes.
The key differentiator between using data residency for faster user experience and data sovereignty is the ability to replicate data.
If data replication is not an option, techniques like database accelerators and audit jobs can be used for faster user experience, especially if eventual consistency is acceptable.
For data sovereignty, where cross-atom calls are not allowed, gateway routing can be a suitable design pattern.
Trust is a fundamental principle in data residency, and authentication to locality can be a policy decision at runtime.
Rippling uses a policy where users need a company-specific token to access their employment information, regardless of region. This ensures that users are authenticated to the specific company and region.
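
A company-scoped token check of the kind described above might look like the following. This is a hypothetical sketch, not Rippling's actual implementation. The `Token` type and `can_read_employment_data` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    user_id: str
    company_id: str  # the single company this token is scoped to


def can_read_employment_data(token: Token, company_id: str) -> bool:
    """Allow access only when the token is scoped to the requested company.

    The check is identical in every region, so no region-specific branch
    is needed.
    """
    return token.company_id == company_id
```

Because each company lives in exactly one region, a token scoped to a company is implicitly scoped to that company's region as well.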

Managing Latency and Client Expectations

The time bound for client requests can vary depending on the distance between regions.
There are no specific rules of thumb for setting time bounds, but it is recommended to have a generous time bound for all regions to avoid per-region configuration.
Using per-region configuration can violate the principle of doing the same thing every time.
The speaker suggests using application layer routing to hide cross-regionality and latency from clients. This ensures a consistent experience for developers during development and production.
The speaker acknowledges that achieving a truly seamless experience across regions is challenging due to varying network latencies.
The speaker recommends using atom-based routing within the scatter-gather pattern so that requests reach very remote regions only when there is a specific use case.
The speaker emphasizes the importance of managing client expectations regarding latency, especially when dealing with highly remote data access.
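
The advice to use one generous time bound everywhere, combined with scatter-gather, can be sketched as follows. The constant `REQUEST_TIMEOUT_SECONDS`, its value, and the `fetch` callback signature are assumptions for illustration. A real client would pass the timeout on to its network library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical value: one generous bound applied to every region.
REQUEST_TIMEOUT_SECONDS = 10.0

# fetch(region, atom_ids, timeout) -> results keyed by atom ID
FetchFn = Callable[[str, list[str], float], dict[str, object]]


def scatter_gather(plan: dict[str, list[str]], fetch: FetchFn) -> dict[str, object]:
    """Fetch each region's atoms in parallel and merge the results.

    `plan` maps region -> atom IDs; `fetch` is a caller-supplied function
    that performs the regional request and honors the timeout it is given.
    """
    merged: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = [
            # Every region gets the same timeout: no per-region configuration.
            pool.submit(fetch, region, atom_ids, REQUEST_TIMEOUT_SECONDS)
            for region, atom_ids in plan.items()
        ]
        for future in futures:
            merged.update(future.result())
    return merged
```

Since the plan only contains regions that actually hold the requested atoms, a request never touches a remote region unless the data requires it.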

Handling Changes in Data Residency Requirements

The speaker raises the question of how to handle changes in data residency requirements over time, particularly when new customers have different needs.
The speaker highlights the importance of enabling customer movement between regions, even if it's not a launch blocker for multi-region functionality.
The speaker argues that failing to provide customer movement functionality could lead to challenges during renewal cycles, as customers may demand it.
The speaker suggests that enabling customer movement should be prioritized within the first year of multi-region deployment.
When deciding on data residency, it's important to balance the complexity of breaking up data with the potential pain of moving the line later.
The speaker recommends drawing the line relatively high, including all customer data within the same atom.
Rippling, for example, considers most data as company data, even personal data related to employment.
The speaker suggests that drawing the atom boundary too narrowly can lead to significant pain later on.
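
One consequence of keeping all customer data in a single atom is that moving a customer between regions becomes a move of one unit. The sketch below is a deliberately simplified illustration using in-memory dictionaries in place of per-region databases. The function `move_atom` and its steps are assumptions, and a real migration would also need to handle in-flight writes, verification, and rollback.

```python
def move_atom(
    atom_id: str,
    target_region: str,
    directory: dict[str, str],
    stores: dict[str, dict[str, dict]],
) -> None:
    """Move one atom's data to target_region and repoint routing.

    `directory` maps atom ID -> region; `stores` maps region -> {atom ID: data}.
    """
    source_region = directory[atom_id]
    if source_region == target_region:
        return  # already in the right place
    # 1. Copy the atom's data into the target region.
    stores[target_region][atom_id] = stores[source_region][atom_id]
    # 2. Repoint routing so new requests go to the target region.
    directory[atom_id] = target_region
    # 3. Remove the source copy so the data resides in one region only.
    del stores[source_region][atom_id]
```

If the atom boundary had been drawn more narrowly, the same move would have to track down and relocate several separately stored pieces of data.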

Managing Dynamic Regions

The speaker also discusses the challenges of dealing with dynamic regions, especially in the context of ephemeral instances.
The speaker recommends following the gateway routing pattern for a smoother experience with ephemeral regions.
Updating the global data store that maps regions to instances is crucial for managing dynamic regions.
The speaker acknowledges that dynamically replicating databases to different regions is a complex and uncommon practice.
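
The global data store that maps regions to instances can be pictured as a small registry that is updated whenever an ephemeral instance appears or disappears. This is an in-memory stand-in only. `RegionRegistry`, its methods, and the example endpoint are hypothetical.

```python
class RegionRegistry:
    """In-memory stand-in for a global store mapping regions to instances."""

    def __init__(self) -> None:
        self._endpoints: dict[str, str] = {}

    def register(self, region: str, endpoint: str) -> None:
        """Add a region, or replace the instance that serves it."""
        self._endpoints[region] = endpoint

    def deregister(self, region: str) -> None:
        """Remove a region when its ephemeral instance is torn down."""
        self._endpoints.pop(region, None)

    def endpoint_for(self, region: str) -> str:
        """Look up the instance currently serving a region."""
        try:
            return self._endpoints[region]
        except KeyError:
            raise LookupError(f"no instance registered for region {region!r}") from None
```

A gateway would first resolve the atom ID to a region and then call something like `endpoint_for(region)`, so adding or removing a region only requires updating this one mapping.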

Conclusion

The speaker encourages attendees to consider the implications of multi-region architectures, even if they are not currently implementing them.
He emphasizes the importance of aligning architectures with the needs of multi-region deployments.
The speaker concludes by thanking the audience and encouraging them to discuss the topic with their leadership teams.