Monorepos: beyond the Technicalities

16 Dec 2024

Introduction to Monorepos and Polyrepos

Monorepos and polyrepos are two approaches to organizing code, with monorepos being a single repository that produces multiple artifacts, and polyrepos being multiple repositories that produce a single artifact or multiple artifacts with complex dependency relationships (10s).
A polyrepo setting can be identified by multiple repositories with clearly defined dependency relationships that work together to produce a single artifact, or multiple repositories with complex dependency relationships and code sharing that produce multiple artifacts (1m18s).
A monorepo setting can be identified by a single repository that produces multiple artifacts, with code sharing between internal modules (2m18s).
The key factor in determining whether a setting is a polyrepo or monorepo is the way code is shared between repositories, with polyrepos having code sharing between multiple repositories and monorepos having code sharing within a single repository (3m36s).

Characteristics of Monorepos and Polyrepos

Monolithic applications are not always the result of monorepos, and simply putting code together in the same repository does not necessarily make it a monorepo (4m20s).
A monorepo requires well-defined relationships between internal modules, allowing for a clear diagram of dependency relationships to be drawn (4m45s).
There are also cases where it is not clear whether a setting is a polyrepo or monorepo, such as when multiple repositories produce a single artifact without code sharing, or when a single repository has internal modules that do not share code (2m33s).
Most real-world setups are a mix of polyrepo and monorepo characteristics, making it difficult to categorize them as one or the other (3m11s).
Modules in a monorepo are part of the same build system, similar to multi-module builds in Maven (Java) or Go, and produce a single artifact or multiple smaller artifacts that may matter for publication or consumption (5m4s).
A monorepo is not necessarily big or messy, but rather a repository that contains code producing more than one interesting artifact after the build, such as publishing multiple Docker images from one single repo (5m57s).
Code sharing is also a key aspect of monorepos, allowing a single module to be reused by final artifacts (6m18s).
It's rare to see a company with everything inside the same repository, but many companies have monorepos in the sense that they have a single repository publishing multiple artifacts (6m44s).
Using the given definition, it can be said that everyone has a monorepo, as many repositories have multiple libraries that get published (7m21s).
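The requirement that relationships between internal modules be clear enough to draw as a diagram can be made concrete. The following is a minimal sketch, assuming hypothetical module names that do not come from the talk: the graph is declared as a mapping from each module to the modules it depends on, and the standard library's topological sort yields a valid build order or fails when the relationships are circular.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical modules: each key depends on the modules in its set.
MODULES = {
    "common-lib": set(),
    "service-a": {"common-lib"},
    "service-b": {"common-lib"},
    "e2e-tests": {"service-a", "service-b"},
}


def build_order(modules):
    """Return modules ordered so that dependencies come before dependents."""
    try:
        return list(TopologicalSorter(modules).static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


order = build_order(MODULES)
print(order)
```

If a graph like this cannot be written down, or the sort reports a cycle, the repository is closer to "code put together" than to a monorepo in the sense used here.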

Discussion on Monorepo vs Polyrepo

This session aims to answer whether one should have a monorepo by discussing codebase structure, team operations, software development, and code reuse (7m52s).
The discussion will be based on general terms and oversimplified examples, as it's hard to capture the uniqueness of each company or organization (8m39s).
The conversation will also touch on how people interact to produce software, which can easily become about management and separation (9m31s).
The structure of a company's codebase may closely mirror the structure of its teams and management, so it's essential to consider this when discussing monorepos or polyrepos (9m42s).
A typical example of a codebase consists of multiple microservices, backend-for-frontend repos, frontend applications, common code repos, and an end-to-end tests repository, which many people can relate to, although it's oversimplified (9m53s).
When presenting a simple example, people often have different opinions on what should be done, which can hinder the conversation about whether to pursue a monorepo or polyrepo setting (10m41s).
To have a productive conversation, it's crucial to get everyone moving in the same direction, exposing them to the same problems and understanding the implications of library building, Ops Team deployment, and other factors (11m20s).
The conversation about monorepos or polyrepos is slow, lengthy, and complex, and cannot be decided in a single meeting with stakeholders or architects (11m53s).
The uniqueness of a company's operating dynamics and team alignment come before choosing a monorepo or polyrepo approach, and it's essential to document the decision-making process and ensure everyone understands why a particular approach is chosen (12m26s).

Team Dynamics and Codebase Structure

Breaking down a codebase into teams can reveal affinities between repos and help understand how people should work together, but it's also important to consider the unique operating dynamics of each organization (13m7s).
Different teams may have different names, operating dynamics, and collaboration styles, which can affect how they work on repos and tests (13m34s).
It's common for multiple teams to collaborate on the same repo or end-to-end tests, and everyone may own the tests, but it's essential to consider the unique aspects of each organization's operations (14m3s).
Different teams within an organization may have varying needs and priorities, such as a focused team for security and another for tests, which can inform decisions about monorepos and polyrepos (14m35s).
As decision-makers, individuals have a better understanding of their team's operations and can evaluate the benefits of monorepos and polyrepos for specific parts of their organization (14m50s).
Teams can choose to use monorepos for certain parts of their organization and polyrepos for others, rather than having to choose one approach exclusively (15m20s).

Dependency Management and Complexity

The dependency relationships between repositories can be complex and may involve multiple layers of dependencies, including common libraries and microservices (15m39s).
There are different ways to understand and visualize these dependency relationships, and teams may have varying approaches to managing them (15m51s).
In addition to internal dependencies, teams must also consider third-party dependencies and versioning, which can add complexity to the development process (16m42s).
Some teams may choose to version and publish their own libraries, which can be consumed by their services, but this can also create challenges in terms of dependency management (16m55s).
To address these challenges, teams may adopt strategies such as updating all services to use the same version of a library whenever a new release is published, or automating the process of publishing changes to services (17m22s).
Ultimately, the complexities of software development and dependency management can be addressed in various ways, and teams must find the approach that works best for their specific needs (17m55s).
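One of the strategies above, keeping every service on the same version of a shared library, needs a way to see which services have fallen behind. A minimal sketch, assuming hypothetical service names and plain MAJOR.MINOR.PATCH version strings (no pre-release tags):

```python
# Hypothetical data: the version of a shared library pinned by each service.
PINNED = {
    "service-a": "2.3.0",
    "service-b": "2.1.4",
    "service-c": "2.3.0",
}


def parse(version):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints so versions compare correctly."""
    return tuple(int(part) for part in version.split("."))


def lagging_services(pinned, latest):
    """Return the services whose pinned version is older than the latest release."""
    return sorted(
        name for name, version in pinned.items()
        if parse(version) < parse(latest)
    )


lagging = lagging_services(PINNED, "2.3.0")
print(lagging)
```

Comparing parsed tuples rather than raw strings matters: as strings, "2.10.0" sorts before "2.9.0".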

Managing Dependencies in Polyrepos

Managing dependencies in a polyrepo setting can be difficult, especially when dealing with third-party dependencies, as it requires managing complexity, upgrading, tracing, and ensuring reliability (18m58s).
Making software reusable by others adds another layer of complexity, as it requires considering who is using the software and managing the impact of changes (19m7s).
A repository in a polyrepo setting is typically understood as a small repository containing software for a specific purpose and producing a single artifact or a few; the responsibility for upgrading dependencies lies with its users (19m37s).

Advantages of Monorepos

A monorepo, on the other hand, contains multiple modules that may or may not be related, and code is reused internally by pointing to local artifacts (20m3s).
Monorepos can have fast builds if set up correctly, as only the necessary parts need to be built, without requiring a full rebuild of the entire repository (20m22s).
The concepts of upstream and downstream are important in understanding the relationships between repositories and artifacts, with upstream referring to libraries and services used, and downstream referring to libraries, Docker images, and services that depend on the repository (21m11s).
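The upstream and downstream relationships, and the idea of building only the necessary parts, can be sketched on a small graph. This is an illustration under assumptions: the module names are hypothetical and the graph is assumed to be acyclic.

```python
# Hypothetical graph: each key uses (is downstream of) the modules listed for it.
DEPENDS_ON = {
    "common-lib": [],
    "service-a": ["common-lib"],
    "bff": ["service-a"],
    "frontend": ["bff"],
    "docs": [],
}


def upstream(module, depends_on):
    """Everything the module uses, directly or transitively."""
    seen = set()
    stack = list(depends_on[module])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen


def downstream(module, depends_on):
    """Everything that uses the module, directly or transitively."""
    return {name for name in depends_on if module in upstream(name, depends_on)}


# A change to service-a only requires rebuilding it and its downstream modules.
to_rebuild = {"service-a"} | downstream("service-a", DEPENDS_ON)
print(sorted(to_rebuild))
```

Here "common-lib" and "docs" are left alone, which is the kind of saving a correctly configured monorepo build relies on.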

Upstream and Downstream Dependencies

In a polyrepo setting, changes to a repository can have implications for multiple downstream artifacts, making it essential to manage these dependencies carefully (21m57s).
Publishing libraries using semantic versioning is a common setup for sharing code, but it can lead to a chain of updates and version publications, which can be wasteful if not managed efficiently (22m11s).
Using a monorepo can potentially simplify this process by eliminating the need for multiple version publications and allowing for more efficient management of dependencies (23m16s).
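The chain of publications that semantic versioning can trigger in a polyrepo setting can be counted. The sketch below assumes hypothetical library names and a simplified policy in which every library that depends on a republished library gets a patch release; real policies vary.

```python
# Hypothetical polyrepo: each repository publishes one versioned library.
VERSIONS = {"core": "1.4.2", "http-client": "3.0.1", "sdk": "0.9.7"}
DEPENDS_ON = {"core": [], "http-client": ["core"], "sdk": ["http-client"]}


def bump(version, part):
    """Apply a semantic-versioning bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(p) for p in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


def release_cascade(changed, part, versions, depends_on):
    """Return every publication needed to propagate one release downstream."""
    published = {changed: bump(versions[changed], part)}
    progress = True
    while progress:
        progress = False
        for name, deps in depends_on.items():
            if name not in published and any(dep in published for dep in deps):
                # Assumed policy: dependents republish with a patch bump.
                published[name] = bump(versions[name], "patch")
                progress = True
    return published


cascade = release_cascade("core", "minor", VERSIONS, DEPENDS_ON)
print(cascade)
```

One change to "core" results in three publications in this example. In a monorepo the same change would be a single commit with no intermediate versions.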

Making Changes in a Monorepo

In a monorepo, every component is in a module inside the same repository under the same build system, allowing for changes to be made and released together (23m19s).
This approach can be beneficial, but it also means that making big changes can take weeks, especially for major version bumps, as all changes must be made at once (23m52s).
The responsibility for updating downstream code in a monorepo depends on the team's dynamics and preferences, and this decision has an impact on whether a monorepo should be used (24m26s).
In a monorepo, introducing changes can be more complex, as the team making the changes may need to ask for help updating downstream code, but once done, the new artifacts are deployed and aligned for everyone (24m57s).

Monorepos and Polyrepos as Complementary Techniques

Monorepos and polyrepos are not mutually exclusive, but rather complementary techniques that can coexist in the same codebase, depending on the organization's needs and structure (25m21s).
The decision to use a monorepo or polyrepo depends on the specific needs of each part of the organization, with some teams requiring independence and others needing alignment and synchronicity (25m58s).
Some parts of an organization may be better suited to a monorepo, where alignment and synchronicity are crucial, while others may be better suited to a polyrepo, where independence is necessary (26m28s).
The choice between monorepo and polyrepo is not a one-size-fits-all solution, but rather a decision that depends on the specific context and needs of each organization (26m43s).

Example: Apache KIE Tools Project

The Apache KIE Tools project is an example of a large monorepo, with 200 packages, a custom build system, and a package.json file for each package, which defines how to build it and its relationships with other modules (27m11s).
The build can target a section of the monorepo, selecting the exact part of the dependency tree to build; almost 50 artifacts come out of the monorepo, including Docker images, VS Code extensions, and Maven modules and applications (27m56s).
Script names are standardized: each package has "build:dev" and "build:prod" commands, and packages that can be developed standalone also have a "start" command, which puts everyone under the same build system (28m22s).
Configuration is done through environment variables, borrowing from the Twelve-Factor App methodology, and an internal tool manages the large number of environment variables that configure things like logo pass, optimizer, minifier, and tests (28m40s).
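Configuration through environment variables can be sketched as follows. The variable names, defaults, and the `load_build_config` helper are hypothetical and are not the ones used by the project's internal tool; the point is only that every setting has a default suited to local development and can be overridden from the environment.

```python
import os


def env_flag(name, default, environ=os.environ):
    """Read a boolean setting from the environment, falling back to a default."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_build_config(environ=os.environ):
    """Collect build settings; names and defaults here are illustrative only."""
    return {
        "run_tests": env_flag("BUILD_RUN_TESTS", True, environ),
        "minify": env_flag("BUILD_MINIFY", False, environ),
        "optimize": env_flag("BUILD_OPTIMIZE", False, environ),
        "base_url": environ.get("APP_BASE_URL", "http://localhost:8080"),
    }


dev_config = load_build_config({})
prod_config = load_build_config({"BUILD_MINIFY": "true", "BUILD_OPTIMIZE": "1"})
print(dev_config)
print(prod_config)
```

Passing the environment as a parameter keeps the function easy to test without mutating the real process environment.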
Environment variables are used to make references to other packages, and symbolic links and definitions in package.json are used to safely reference dependencies (29m7s).
A system is in place to prevent mistakes during builds by only referencing declared dependencies, and partial builds of the monorepo are possible in PR checks, depending on the files changed (29m34s).
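A check that only declared dependencies are referenced could look like the sketch below. The package names and the `REFERENCES` data are hypothetical; how a real build collects the references a package actually makes is outside the scope of this example.

```python
# Hypothetical manifests, shaped like the "dependencies" field of a package.json.
MANIFESTS = {
    "@example/editor": {"dependencies": {"@example/i18n": "workspace:*"}},
    "@example/i18n": {"dependencies": {}},
    "@example/webapp": {"dependencies": {"@example/editor": "workspace:*"}},
}

# Hypothetical: the packages each package actually references during its build.
REFERENCES = {
    "@example/editor": {"@example/i18n"},
    "@example/i18n": set(),
    "@example/webapp": {"@example/editor", "@example/i18n"},
}


def undeclared_references(manifests, references):
    """Return {package: referenced packages missing from its declared dependencies}."""
    problems = {}
    for name, used in references.items():
        declared = set(manifests[name]["dependencies"])
        missing = used - declared
        if missing:
            problems[name] = sorted(missing)
    return problems


problems = undeclared_references(MANIFESTS, REFERENCES)
print(problems)
```

In this data, "@example/webapp" reaches "@example/i18n" only through "@example/editor" and never declares it, so the check flags it.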
The ability to partially build the monorepo in PR checks is helpful, with a script figuring out what packages need to be rebuilt and retested, and the slowest part of a run taking 16 minutes (29m59s).
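The step of figuring out which packages need to be rebuilt and retested from the files changed in a PR can be sketched as below. This is an assumption-laden illustration, not the project's script: the directory layout and package names are hypothetical, and any file outside a package directory is treated as a root-level change that triggers a full build.

```python
from pathlib import PurePosixPath

# Hypothetical layout: package directory -> package name.
PACKAGE_DIRS = {
    "packages/i18n": "i18n",
    "packages/editor": "editor",
    "packages/webapp": "webapp",
}
DEPENDS_ON = {"i18n": [], "editor": ["i18n"], "webapp": ["editor"]}


def owning_package(path, package_dirs):
    """Return the package whose directory contains the file, or None."""
    parts = PurePosixPath(path).parts
    for end in range(len(parts) - 1, 0, -1):
        candidate = "/".join(parts[:end])
        if candidate in package_dirs:
            return package_dirs[candidate]
    return None


def packages_to_rebuild(changed_files, package_dirs, depends_on):
    """Return the changed packages plus everything downstream of them."""
    changed = set()
    for path in changed_files:
        package = owning_package(path, package_dirs)
        if package is None:
            return set(depends_on)  # root-level change: rebuild everything
        changed.add(package)
    affected = set(changed)
    grew = True
    while grew:
        grew = False
        for name, deps in depends_on.items():
            if name not in affected and any(dep in affected for dep in deps):
                affected.add(name)
                grew = True
    return affected


print(sorted(packages_to_rebuild(
    ["packages/editor/src/index.ts"], PACKAGE_DIRS, DEPENDS_ON
)))
```

A change inside "editor" rebuilds "editor" and "webapp" but leaves "i18n" untouched.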
The monorepo can scale well, with the ability to split builds into partitions and sections of the tree, and it supports multiple languages, including Java, TypeScript, and Go, as well as container images (30m36s).
Sparse checkout ability is available, allowing users to clone the repo and select only a portion of it, even if the repo gets very big (31m20s).

Challenges and Improvements in the Apache KIE Tools Project

Challenges exist, including the lack of a user manual, with knowledge currently residing in people's heads and private messages, but a user manual is being written (31m46s).
Improvements are being made to the development experience for Maven-based packages, including cleaner imports into IDEs and accurate resolution of references, so that incorrect references are highlighted (32m12s).
A problem exists where changes to the top-level lock file are not understood by the partitioning system, which affects its view of which modules are impacted, but a solution is being researched and implemented (32m30s).
A full build will not be required for every code change, unless it's a root-level file, and a merge queue is being considered to simulate merges and prevent semantic conflicts (32m50s).
The merge queue will allow code to be merged automatically after passing checks, preventing breaks to the main branch and reducing conflicts (33m21s).
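The way a merge queue prevents semantic conflicts can be illustrated with a toy simulation. Everything here is hypothetical: changes are plain strings, and `passes_checks` stands in for a full CI run. Each queued change is tested on top of the main branch as it will be at merge time, not as it was when the PR was opened.

```python
def run_merge_queue(main, queue, passes_checks):
    """Merge queued changes one at a time, testing each on top of the latest main."""
    merged, rejected = [], []
    for change in queue:
        candidate = main + [change]  # simulate the merge without touching main
        if passes_checks(candidate):
            main = candidate
            merged.append(change)
        else:
            rejected.append(change)
    return main, merged, rejected


def passes_checks(changes):
    """Hypothetical semantic conflict: one change renames a helper, another
    adds a call to the old name. Each passes alone; together they break."""
    return not {"rename-helper", "call-old-helper"} <= set(changes)


final_main, merged, rejected = run_merge_queue(
    ["base"], ["rename-helper", "call-old-helper", "fix-typo"], passes_checks
)
print(final_main, merged, rejected)
```

Both conflicting changes would pass their own PR checks against "base", yet the queue rejects the second one because it is checked against a main branch that already contains the first.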
Multiple cores will be made available for each package to build, enabling parallel builds, with environment variables likely being used to control core allocation (33m42s).
Research is being conducted on using environment variables to configure parameters that distinguish between production and development builds, reducing duplication in build commands (34m10s).
The use of Turborepo is being explored; it includes a task runner that understands packages and files, and has caching capabilities that can speed up development and onboarding (34m26s).

Recommendations for Building a Monorepo

When building a monorepo, it's recommended to start small, choosing a few languages or one, and a single build tool, rather than trying to incorporate the entire codebase at once (35m15s).
It's also recommended to establish defaults and conventions from the beginning, even if they may not be perfect, to ensure everyone is working in the same environment and can provide feedback (35m52s).
When implementing a monorepo, it's essential to make the relationships between modules easy to visualize to avoid confusion about dependencies, as a tree of many small modules can be challenging to navigate (36m20s).
Be prepared to write custom tools for unique build necessities, such as dealing with network issues, old platforms, or special build requirements (36m42s).
Be prepared to discuss the monorepo extensively, as it's a controversial topic that may require explaining the reasoning behind it multiple times (37m18s).
Optimize the monorepo for development by making it easy for people to clone and start working immediately, with minimal configuration steps (37m35s).
A monorepo should have everything turned off by default, targeting development and localhost, with no production references or dependencies (37m52s).
When organizing a monorepo, do not group by technology, but instead group by operating dynamics, team affinity, and how teams interact with each other (38m12s).
Do not compromise on quality, be thorough about decision-making, and avoid adding low-quality code to the monorepo (39m3s).
Avoid doing too much at once when implementing a monorepo, and prioritize the most essential features (39m43s).

Re-evaluating and Adjusting the Monorepo

Be open to reevaluating and adjusting the monorepo if it's not working out, incorporating feedback, learning from mistakes, and prioritizing the well-being of the development team (40m14s).

Q&A and Discussion

The talk closes, and the speaker takes questions from the audience, with a preference for topics related to build tools and coding styles in monorepos (40m50s).
The choice of build tool, such as Maven or Gradle, depends on the team's preference and skill set, and there is no one-size-fits-all solution (41m16s).
The speaker's team has struggled to transition from a Maven-based approach to a less structured one using tools like npm and JavaScript (41m42s).
In a polyglot monorepo, the speaker recommends using a flat structure to organize code, with a prefix for package names and minimal nesting to facilitate visualization of relationships between internal modules (42m44s).
The speaker suggests that coding styles should be specific to each language and can be found in the user manual for each language (43m38s).
A member of the audience shares their experience with using a monorepo for their entire codebase and warns about the potential drawbacks of polyglot repos, including dependency hell (43m50s).
The speaker acknowledges the concerns about polyglot monorepos but notes that their use case requires cross-language dependencies, and their structure allows them to navigate complex dependency trees (44m40s).
The speaker clarifies that their previous statement about polyglot monorepos was referring to individual repositories having multiple languages, rather than a single monorepo with multiple languages (45m31s).
The company has a coarse-grained dependency structure, with a large codebase, and this structure is not working out for the organization, leading to a decision to move out of it, but some modules will remain in the same repository (45m38s).
The problem was that any change to the root lock file (the package lock file or the pnpm lock file) caused everything to be rebuilt, because the scripting system only understood that a file in the root folder had changed (46m15s).
The goal is to build only the downstream packages of a dependency when that dependency changes (46m35s).
The solution uses Turborepo's diffing algorithm to work out which packages are affected by the dependencies that changed inside the lock file, allowing for more targeted builds (46m54s).
The new solution enables building only downstream things when a dependency change occurs, which is a desirable outcome (47m4s).
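The idea of diffing a lock file to find affected packages can be sketched in a few lines. This is not Turborepo's algorithm, only an illustration of the principle under a simplifying assumption: the lock file has already been parsed into a hypothetical mapping from each package to the resolved versions of its dependencies.

```python
def changed_dependencies(old_lock, new_lock):
    """Return {package: dependency names whose resolved version changed}."""
    changes = {}
    for package in sorted(set(old_lock) | set(new_lock)):
        before = old_lock.get(package, {})
        after = new_lock.get(package, {})
        differing = sorted(
            dep for dep in set(before) | set(after)
            if before.get(dep) != after.get(dep)
        )
        if differing:
            changes[package] = differing
    return changes


# Hypothetical parsed lock files before and after a dependency upgrade.
OLD = {
    "editor": {"react": "18.2.0"},
    "webapp": {"react": "18.2.0", "lodash": "4.17.20"},
}
NEW = {
    "editor": {"react": "18.2.0"},
    "webapp": {"react": "18.2.0", "lodash": "4.17.21"},
}

directly_affected = changed_dependencies(OLD, NEW)
print(directly_affected)
```

Only "webapp" is directly affected in this example, so only it and its downstream packages would need to be rebuilt, instead of the whole repository.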