How Wix Builds a Platform as a Runtime (PaaR) for Faster, High-Quality Development

21 Oct 2024

Monolith vs. Microservice vs. Serverless

  • Whether to build the next service as a monolith, a microservice, or a serverless function is a complex question that depends on the technology and stack chosen, as each option has its own strengths and weaknesses (27s).
  • Monoliths are efficient and performant because everything runs in-process, while microservices allow for ownership and scalability, and serverless provides easy deployment and scalability (1m29s).
  • The growing complexity in modern software systems is a major challenge for software developers, and finding a way to regain control without losing the benefits of different technologies is crucial (1m33s).
  • When choosing a technology, developers must trade off between three pillars: how to code, how to deploy code, and how to run or operate code (2m6s).

Wix's Scale and Engineering Challenges

  • Aviran Mordo is the VP of Engineering at Wix, a leading website builder platform that provides a range of solutions, including e-commerce, events, and booking, and allows users to write code and use serverless technology (3m10s).
  • More than 250 million users build websites on Wix, 7% of the internet's websites run on its platform, and about a billion people visit those websites; behind this sit 4,000 microservice clusters across three data centers (4m5s).
  • Engineers are responsible for delivering business value, and successful companies need to ship high-quality code fast to beat their competition (4m49s).

Monolithic Architecture

  • Building a monolith can be complicated, as it involves building a single, large application (5m21s).
  • A monolithic service is simple to start with, as everything runs in the same process, making life easier for developers, especially for startups, with a single service to manage (5m23s).
  • The pros of a monolithic service include easy coding, accessibility, and testing, as well as simple topology, but as it scales, it can lead to mixing domains, spaghetti code, and difficulties in synchronizing between teams (5m45s).

Microservices and Serverless Architectures

  • Microservices and serverless systems are more complex, with distributed systems and indirect dependencies, making it harder to manage and test, but allowing for single responsibility and easier deployment (6m26s).
  • The trade-offs of microservices and serverless systems include complexity, cross-cutting concerns, and difficulties in refactoring and breaking APIs, but they also offer clear ownership and scalability (7m38s).

The Ideal System and the Task Management Example

  • The goal is to find a system that solves most of the issues across the three pillars (code, deploy, and run) by tackling each pillar in turn (8m39s).
  • Dana, a new developer, is tasked with writing a simple task management system, which requires domain modeling, API design, request flow, authentication, authorization, input validation, and object mapping (9m12s).
  • The task management system is a simple example, but it still requires consideration of various aspects, such as request flow, authentication, and input validation, making it a complex task (9m35s).
  • Even for this simple task management system, Dana faces numerous challenges, including figuring out APIs, database connections, RPC calls, secret fetching, data access, domain events, error handling, GDPR, PII, caching, logging, and testing (10m49s).
  • Dana becomes frustrated with the complexity of the task and the numerous considerations required to complete it (11m27s).

Frameworks and Best Practices

  • The first answer to Dana's problem is to follow best practices for each of the challenges, which are documented in a framework (12m19s).
  • The framework provides recommendations and documentation for using identity, understanding GDPR laws, sending webhooks, and communicating with databases (12m41s).
  • Despite the framework's help, Dana is still overwhelmed, leading to the realization that increasing programmer productivity is not about writing more lines of code but about eliminating unnecessary code (13m39s).

The Journey to Code Faster

  • The goal is to eliminate 80% of the code that needs to be written for an app, and the fastest way to write code is to not write it at all (14m2s).
  • A journey to code faster in a complex environment began four years ago, involving weekly meetings with the CEO to review thousands of lines of code and identify unnecessary lines (14m29s).
  • The focus is on removing lines of code that do not contribute to business logic, as that is what developers are paid to build (15m5s).

Platform as a Runtime (PaaR)

  • The solution to removing unnecessary code is to build a platform as a runtime (PaaR) (15m14s).
  • A platform called Nile was built to codify guidelines and best practices, allowing developers to work within the platform and avoid cross-cutting concerns, with features such as a framework and search integration (15m17s).
  • The Nile platform automates tasks such as indexing and updating documents in Elasticsearch, eliminating the need for developers to annotate domain objects (15m55s).
  • The platform was built to help with coding and is one of the pillars of the development process, with the goal of making development faster and more efficient (16m25s).

Serverless Platform Development

  • In parallel with building the Nile platform, a serverless platform was also developed to provide a customized solution for the company's needs (16m36s).
  • The serverless platform was built to address the issue of having a large footprint in production environments, where most of the code running is not written by the company, but rather consists of frameworks and libraries (18m21s).
  • The company's software stack forms a pyramid of layers: virtual machines, containers, microservice frameworks, internal frameworks, and, at the top, business logic (17m17s).
  • Microservices are packaged with frameworks, libraries, and business logic, but most of the code running is not written by the company, which can become a problem when running thousands of microservices (18m19s).
  • The life cycle of frameworks and libraries is tied to the product life cycle, making it difficult to update and manage multiple versions in production (19m7s).
  • The company faces challenges in updating and patching common libraries and frameworks, which can be a long and complicated process, especially when dealing with thousands of microservices (19m38s).
  • Many companies have services without owners, and supporting multiple languages can be an issue, requiring duplication of platforms and frameworks, which is costly and time-consuming (20m10s).

The Ideal Platform and Wix's Approach

  • The ideal platform would have easy and fast code, minimal integration tests, and no boilerplate, with fast deployment, scalability, and low cost (20m49s).
  • Every system and stack has pros and cons, and the goal is to take the best of microservices, serverless, monoliths, and managed platforms (21m27s).
  • Wix is a managed platform for users, allowing them to write code on top of their website, with Wix handling deployment, Kubernetes, and database provisioning (21m45s).

First Attempt at PaaR with Node.js

  • The concept of "Platform as a Runtime" (PaaR) was developed to address these issues, with the goal of building a platform that can run multiple languages and frameworks (21m55s).
  • The first attempt at PaaR used Node.js, with an application framework, service integration layer, and data services layer, allowing users to build their business logic on top (23m3s).
  • Node.js was chosen for its lightweight nature, dynamic code loading, and ease of learning, and the application framework handles HTTP headers, authentication, and monitoring (23m39s).
  • The service integration layer provides RPC clients and libraries, allowing users to integrate their applications without doing lookups or integration (24m5s).
  • The data services layer uses DynamoDB as a key-value store, and the platform is packaged and put on the cloud, with user code deployed on top (24m33s).
  • Code can be loaded dynamically into a platform without the need for the entire platform, just the interfaces, allowing for a trusted environment where multiple functions or small services can be added to the same container (25m5s).
  • This approach differs from AWS Lambda, which is a non-trusted environment requiring everything to be packaged together, making it impossible to share the platform between different services (25m19s).
  • The platform has no integrations as they are provided by the platform itself, resulting in less testing, faster deployment, and zero boilerplate code (26m16s).
  • Developers can focus on small functions without the overhead of large packages and frameworks, making deployment very fast (26m27s).
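
The loading model described above can be sketched as a host that owns the shared services and accepts small functions written only against an interface. This is a minimal illustration under stated assumptions, not Wix's implementation: `PlatformHost`, `PlatformContext`, and the function names are hypothetical, and the `deploy` call stands in for whatever dynamic loading mechanism (for example a dynamic `import()`) the real platform uses.

```typescript
// Hypothetical sketch; none of these names come from Wix's platform.

// Services the host owns once and shares with every function it loads.
interface PlatformContext {
  kv: Map<string, string>; // stand-in for the data services layer
  log(message: string): void; // stand-in for platform-provided logging
}

// A deployable contains business logic only, written against the interface.
type PlatformFunction = (ctx: PlatformContext, input: string) => string;

class PlatformHost {
  readonly logs: string[] = [];
  private readonly functions = new Map<string, PlatformFunction>();
  private readonly ctx: PlatformContext;

  constructor() {
    const logs = this.logs;
    this.ctx = {
      kv: new Map<string, string>(),
      log: (message) => {
        logs.push(message);
      },
    };
  }

  // "Deploying" adds a function to the running host; the platform is not repackaged.
  deploy(fnName: string, fn: PlatformFunction): void {
    this.functions.set(fnName, fn);
  }

  invoke(fnName: string, input: string): string {
    const fn = this.functions.get(fnName);
    if (!fn) throw new Error(`unknown function: ${fnName}`);
    this.ctx.log(`invoke ${fnName}`);
    return fn(this.ctx, input);
  }
}

// Two small functions sharing one trusted host and its services.
const demoHost = new PlatformHost();
demoHost.deploy("saveTask", (ctx, taskTitle) => {
  ctx.kv.set("task:1", taskTitle);
  return "task:1";
});
demoHost.deploy("readTask", (ctx, id) => ctx.kv.get(id) ?? "not found");

const savedId = demoHost.invoke("saveTask", "Write docs");
const savedTitle = demoHost.invoke("readTask", savedId);
console.log(savedId, savedTitle);
```

Because both functions receive the same context, neither ships its own storage or logging code, which is the property the talk credits for small deployables and fast deployment.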

Improved Developer Experience

  • The developer experience is improved by taking concepts from a previous project, Nile, and re-evaluating every line of code to determine if it should be provided by the platform (26m49s).
  • A developer, Dana, has a change request to add an API to retrieve task details and assigned persons, and write an audit log to a database, which can be done as a separate service in a serverless function world (27m9s).
  • Dana only needs to write a small amount of code to import the task server, expose a new API endpoint, call the task server, extract the contact ID, call the contact server, and return the task and contact details (28m35s).
  • Adding a Kafka consumer is also simplified, as Dana can get the Kafka consumer and data source from the context without writing any connection handling code (29m20s).
  • A developer, Dana, can write code without having to worry about boilerplate code, and once she pushes the code, it's running on production within one to two minutes, thanks to the platform's minimal code and small deployable size (29m52s).
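
Dana's change request might look roughly like the sketch below. The context shape, the `taskServer` and `contactServer` clients, and the `auditLog` writer are all assumptions made for illustration; the source only says that such clients and data sources come from the platform's context without connection-handling code.

```typescript
// Hypothetical types and context; the platform's real interfaces are not shown in the source.
interface Task {
  id: string;
  title: string;
  contactId: string;
}
interface Contact {
  id: string;
  displayName: string;
}

interface FunctionContext {
  taskServer: { getTask(taskId: string): Promise<Task> };
  contactServer: { getContact(contactId: string): Promise<Contact> };
  auditLog: { write(entry: string): Promise<void> };
}

// The only code Dana writes: business logic against injected clients.
async function getTaskDetails(
  ctx: FunctionContext,
  taskId: string,
): Promise<{ task: Task; contact: Contact }> {
  const task = await ctx.taskServer.getTask(taskId);
  const contact = await ctx.contactServer.getContact(task.contactId);
  await ctx.auditLog.write(`read task ${task.id}`);
  return { task, contact };
}

// In-memory fakes standing in for what the platform would inject.
const auditEntries: string[] = [];
const fakeContext: FunctionContext = {
  taskServer: {
    getTask: async (taskId) => ({ id: taskId, title: "Write docs", contactId: "c-7" }),
  },
  contactServer: {
    getContact: async (contactId) => ({ id: contactId, displayName: "Dana" }),
  },
  auditLog: {
    write: async (entry) => {
      auditEntries.push(entry);
    },
  },
};

getTaskDetails(fakeContext, "t-1").then(({ task, contact }) => {
  console.log(`${task.title} is assigned to ${contact.displayName}`);
});
```

Passing fakes through the same context is also one plausible way to test such a function without the whole environment.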

Optimizing the Run Pillar

  • The deployment pillar is built as a platform as a service, and the concept is being applied to the Run pillar to optimize it (30m32s).
  • The Run pillar uses a serverless Node.js runtime, allowing for ownership and control, even with a serverless strategy, by giving containers to each team or business unit (31m3s).
  • The platform can handle functions running in-process and dynamically loaded into the platform, making it easy for developers like Dana to write code without worrying about the underlying infrastructure (31m57s).
  • The platform can scale by putting functions on another container and load balancing between them (32m19s).
  • It can also optimize function affinity by deploying frequently called functions in-process, replacing network calls with in-process calls and reducing latency (32m43s).
  • The platform's deployment strategy allows small functions to be deployed without the framework, keeps a single version of frameworks and libraries in production, and decouples the life cycle of frameworks and libraries from the life cycle of products (33m28s).
  • The platform's deployment strategy also allows for easy deployment of the platform, without requiring a cross-company effort, and enables all teams to get the updated platform at once (33m55s).
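
The function-affinity idea can be modeled with a small router that decides, per call, whether the callee lives in the caller's container. This is a toy model under assumptions: the `FunctionRouter` class, the container names, and the call counters are hypothetical and only illustrate how moving a hot function next to its caller turns a network hop into a plain function call.

```typescript
// Hypothetical model; the router and container names are illustrative only.
type Handler = (input: number) => number;

class FunctionRouter {
  inProcessCalls = 0;
  networkCalls = 0;
  private readonly placements = new Map<string, { container: string; handler: Handler }>();

  // Deploy (or move) a function to a container.
  place(fnName: string, container: string, handler: Handler): void {
    this.placements.set(fnName, { container, handler });
  }

  // Invoke `fnName` on behalf of code running in `callerContainer`.
  call(callerContainer: string, fnName: string, input: number): number {
    const target = this.placements.get(fnName);
    if (!target) throw new Error(`unknown function: ${fnName}`);
    if (target.container === callerContainer) {
      this.inProcessCalls += 1; // same process: a plain function call
    } else {
      this.networkCalls += 1; // different container: would be an RPC over the network
    }
    return target.handler(input);
  }
}

const demoRouter = new FunctionRouter();
const doubleAmount: Handler = (amount) => amount * 2;

// Initially the callee runs in another container, so the call crosses the network.
demoRouter.place("doubleAmount", "container-b", doubleAmount);
const viaNetwork = demoRouter.call("container-a", "doubleAmount", 10);

// Affinity optimization: co-locate the frequently called function with its caller.
demoRouter.place("doubleAmount", "container-a", doubleAmount);
const viaProcess = demoRouter.call("container-a", "doubleAmount", 10);

console.log(viaNetwork, viaProcess, demoRouter.networkCalls, demoRouter.inProcessCalls);
```

The caller's code is the same in both cases; only the placement changes, which is why the platform can make this optimization without involving the developer.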

Adding Multi-Language Support

  • What the platform still lacks is support for additional languages; version two is being built to support languages other than TypeScript, such as Scala (34m27s).
  • A lot of progress has been made on the platform, with thousands of functions running for two years, and developers love it, with the next step being the Wix single runtime (34m51s).

Wix Single Runtime

  • The Wix single runtime involves tradeoffs, including switching from in-process to out-of-process calls, with the product teams building business logic using an SDK that communicates with the host platform (35m11s).
  • The architecture consists of a DaemonSet with one container for incoming and outgoing calls, and every function or microservice runs in its own pod on the same machine, with localhost communication being faster than network communication (35m56s).
  • The tradeoff of using out-of-process calls is offset by gaining the Kubernetes ecosystem for deployment and autoscaling, which would have had to be reinvented otherwise (36m51s).
  • The solution has a larger footprint than the previous TypeScript solution but is still 50% lower than packaging the entire framework inside a microservice, with a JVM footprint (37m12s).
  • Benchmarks showed a performance loss of about 2 milliseconds compared to an in-process microservice, which is tolerable, and it's believed that this loss can be gained back in a distributed system (37m39s).
  • The deployment strategy relies on localhost communication, which can potentially offset the 2-millisecond overhead; performance tests showed that, up to 15,000 requests per minute (RPM), the overhead stays at about 2 milliseconds (38m40s).
  • Services with millions of RPMs may not be suitable for this solution and can be packaged as standalone microservices instead (39m3s).
  • The work setup has been changed to allow packaging the platform inside a regular microservice with just a flag on the build system, and developers still don't need to change their code (39m26s).
  • A high-scale system can be packaged together with the standard framework, while the single runtime suits systems with relatively low traffic, up to about 30k-50k RPM, which is still a lot; it provides cost savings and a single framework for fast deployment of security changes or legal constraints (39m41s).
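
One way to picture why developers do not need to change their code between the two packaging modes is an SDK interface with two interchangeable transports. Everything in this sketch is an assumption: the `HostTransport` interface, the `put`/`get` operations, and the JSON round trip that stands in for a localhost call to the host platform are invented for illustration.

```typescript
// Hypothetical SDK shape; the source does not show Wix's actual SDK.
interface HostTransport {
  send(operation: string, payload: unknown): unknown;
}

// Stand-in for the host platform's capabilities (here, a tiny key-value store).
class HostPlatform {
  private readonly store = new Map<string, string>();

  handle(operation: string, payload: unknown): unknown {
    const { key, value } = payload as { key: string; value?: string };
    if (operation === "put") {
      this.store.set(key, value ?? "");
      return true;
    }
    if (operation === "get") return this.store.get(key) ?? null;
    throw new Error(`unsupported operation: ${operation}`);
  }
}

// Platform packaged inside the service: a direct call.
class InProcessTransport implements HostTransport {
  private readonly host: HostPlatform;
  constructor(host: HostPlatform) {
    this.host = host;
  }
  send(operation: string, payload: unknown): unknown {
    return this.host.handle(operation, payload);
  }
}

// Platform in a separate process: messages cross a boundary, simulated
// here by a JSON round trip in place of a real localhost call.
class OutOfProcessTransport implements HostTransport {
  private readonly host: HostPlatform;
  constructor(host: HostPlatform) {
    this.host = host;
  }
  send(operation: string, payload: unknown): unknown {
    const wire = JSON.stringify({ operation, payload });
    const request = JSON.parse(wire) as { operation: string; payload: unknown };
    const reply = JSON.stringify(this.host.handle(request.operation, request.payload));
    return JSON.parse(reply) as unknown;
  }
}

// Business logic depends only on the SDK interface, so it is identical in both modes.
function renameTask(sdk: HostTransport, taskId: string, newTitle: string): string {
  sdk.send("put", { key: taskId, value: newTitle });
  return String(sdk.send("get", { key: taskId }));
}

const packagedResult = renameTask(new InProcessTransport(new HostPlatform()), "t-1", "Ship it");
const singleRuntimeResult = renameTask(new OutOfProcessTransport(new HostPlatform()), "t-1", "Ship it");
console.log(packagedResult, singleRuntimeResult);
```

Under this reading, the build flag would only select which transport is wired in, and a new language would need only its own thin client for the same message format.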

Benefits of the Platform

  • The platform takes care of GDPR for users, and it can support multiple languages because calls moved from in-process to out-of-process: supporting a different language requires only the investment of writing an SDK, which is much cheaper than keeping feature parity between frameworks (40m17s).
  • The platform is coded, deployed as a service, and run as a PaaR or as a virtual monolith, taking the best of all worlds and creating a platform as a runtime, allowing for business value to be brought fast (40m55s).
  • Developer velocity improved by 50 to 80% because less code is written, and compute costs can be reduced by about 50% because more services fit on a single node, with the footprint of a single service being about 50% smaller (41m31s).
  • The platform allows for faster deployment of security changes or legal constraints, such as GDPR, and provides support for multiple languages, making it a cost-effective solution (40m11s).

Backward Compatibility and Platform Evolution

  • When a breaking change occurs, backward compatibility is kept, and proxies are created to change the previous API to the new API, ensuring a smooth transition (43m41s).
  • The platform's framework is designed to keep up with architectural requirements, and changes can be made to accommodate different kinds of requirements, such as dedicated types of services (44m25s).
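
The proxy approach to breaking changes can be illustrated with an adapter that keeps the old contract alive on top of the new API. The `TaskApiV1` and `TaskApiV2` shapes and the nature of the breaking change are hypothetical, chosen only to show the translation step.

```typescript
// Hypothetical API shapes, invented for illustration.
interface TaskV1 {
  id: string;
  assignee: string;
}
interface TaskV2 {
  id: string;
  assignees: string[];
}

interface TaskApiV1 {
  getTask(id: string): TaskV1;
}
interface TaskApiV2 {
  getTask(request: { taskId: string }): TaskV2;
}

// New implementation: the breaking change wrapped the argument and made assignees a list.
const taskApiV2: TaskApiV2 = {
  getTask: ({ taskId }) => ({ id: taskId, assignees: ["dana", "alex"] }),
};

// Proxy: exposes the old contract and translates each call to the new API.
function proxyV1(next: TaskApiV2): TaskApiV1 {
  return {
    getTask(id: string): TaskV1 {
      const task = next.getTask({ taskId: id });
      // The old shape had a single assignee, so the first one is used.
      return { id: task.id, assignee: task.assignees[0] ?? "" };
    },
  };
}

const legacyClient = proxyV1(taskApiV2);
const legacyTask = legacyClient.getTask("t-1");
console.log(legacyTask);
```

Existing callers keep using the old signature while new callers adopt the new one, so the change can roll out without a coordinated migration.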

Opinionated System and Developer Freedom

  • A highly opinionated system is necessary for scaling, which means limiting the freedoms of developers to ensure consistency across different stacks of technologies (44m46s).
  • The platform provides a set of tools and technologies that developers are expected to use, similar to how AWS or Google Cloud provide their own set of tools and technologies (45m7s).
  • While the platform provides a lot of value to developers, there are cases where developers need to opt out of the platform, and in such cases, they should first try to solve their issues within the platform (45m34s).
  • The platform is built in layers, allowing developers to take the lower levels of the platform and use the same core libraries, but without the syntactic sugar and automations (45m41s).
  • In rare cases, developers may need to use a different technology stack, such as the media platform at Wix, which requires video encoding and image manipulations and uses a different platform (46m1s).
  • For most cases (80-90%), developers should stick with the platform, as it provides a lot of value and makes development easier (46m41s).
  • Developers tend to try to stay within the platform and avoid opting out because they realize the value it provides (46m54s).

Testing and Local Runtime

  • The concept of using a boring architecture is similar to the idea of using a platform, as it provides a set of proven and reliable technologies (47m6s).
  • Testing functions and APIs can be done without having the whole environment, and Wix provides a local runtime for testing, as well as the option to test on production using a test tenant (48m3s).
  • Wix's multi-tenant system allows developers to create their own test tenant and test end-to-end flows without corrupting other tenants (48m30s).
  • The deploy preview feature, part of Wix's CI/CD system, allows developers to deploy and test their code in a production-like environment (48m43s).
  • In a deploy preview, a specific artifact is deployed but receives no traffic unless a test request carries a special header, which routes the test calls to the newly deployed artifact, allowing it to interact with other artifacts and APIs without corrupting data (48m52s).
  • The platform prevents users from making tenancy mistakes by injecting the tenant ID into queries, which are not allowed to be tampered with, and the platform injects the tenant ID from the authentication headers (49m27s).
  • The system is designed to be fairly safe, preventing users from corrupting any data, thanks to the platform's handling of tenancy and injection of tenant IDs into queries (49m44s).
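
The tenancy safeguard can be sketched as a store wrapper that always overwrites the tenant filter with the value taken from the authenticated request. This is an illustration under assumptions: the `x-tenant-id` header name, the `TenantScopedStore` class, and the row shape are hypothetical, and a real platform would derive the tenant from verified authentication data rather than a raw header.

```typescript
// Hypothetical sketch; the header name and store API are assumptions.
interface Row {
  tenantId: string;
  [field: string]: string;
}

class TenantScopedStore {
  private readonly rows: Row[];
  private readonly tenantId: string;

  constructor(rows: Row[], tenantId: string) {
    this.rows = rows;
    this.tenantId = tenantId;
  }

  find(filter: Record<string, string>): Row[] {
    // The platform's tenant ID is applied last, so a caller-supplied
    // tenantId in the filter is overwritten rather than trusted.
    const scoped: Record<string, string> = { ...filter, tenantId: this.tenantId };
    return this.rows.filter((row) =>
      Object.keys(scoped).every((field) => row[field] === scoped[field]),
    );
  }
}

// The platform, not the developer, builds the store from the request's auth data.
function storeForRequest(rows: Row[], headers: Record<string, string>): TenantScopedStore {
  const tenantId = headers["x-tenant-id"];
  if (!tenantId) throw new Error("unauthenticated request");
  return new TenantScopedStore(rows, tenantId);
}

const allRows: Row[] = [
  { tenantId: "tenant-a", title: "Write docs" },
  { tenantId: "tenant-b", title: "Write docs" },
  { tenantId: "tenant-a", title: "Fix bug" },
];

const tenantStore = storeForRequest(allRows, { "x-tenant-id": "tenant-a" });
const honestQuery = tenantStore.find({ title: "Write docs" });
const tamperedQuery = tenantStore.find({ tenantId: "tenant-b" });
console.log(honestQuery.length, tamperedQuery.length);
```

Even the tampered query returns only rows belonging to the authenticated tenant, which is the kind of guarantee that makes a test tenant on production reasonably safe.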
