From Open Source to SaaS: the Journey of ClickHouse
16 Feb 2024
ClickHouse
- ClickHouse is a fast, open-source OLAP database that supports replication, sharding, multi-master setups, and cross-region deployments.
- ClickHouse is column-oriented, making it faster than row-oriented databases for aggregation queries.
- ClickHouse has many bottom-up optimizations that make it even faster than other columnar databases.
- ClickHouse is used for real-time data processing, business intelligence, logging, metrics, and machine learning.
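To illustrate why a column-oriented layout helps aggregation queries, here is a toy Python sketch. It is not ClickHouse code: the sample table, the field names, and the helper functions are all invented for illustration. Summing one field in the row layout visits every full record, while the column layout reads a single list.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and data here are hypothetical, chosen only for illustration.

rows = [
    {"user_id": 1, "country": "DE", "bytes": 120},
    {"user_id": 2, "country": "US", "bytes": 300},
    {"user_id": 3, "country": "DE", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_oriented(records, field):
    """Aggregate by walking every record, even though one field is needed."""
    return sum(record[field] for record in records)


def sum_column_oriented(columns, field):
    """Aggregate by reading only the one column the query touches."""
    return sum(columns[field])


columns = to_columns(rows)
total_from_rows = sum_row_oriented(rows, "bytes")
total_from_columns = sum_column_oriented(columns, "bytes")
```

Both functions return the same total. The difference is in how much data each has to touch, which is what matters once the table has many wide rows.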
ClickHouse Cloud
- ClickHouse Cloud is a service that offers ClickHouse as a managed database in the cloud.
- The guiding principles for ClickHouse Cloud were:
  - Serverless experience
  - Performance
  - Separation of compute and storage
  - Tenant isolation and security
  - Multicloud support
- Kubernetes was chosen as the compute platform because it supports a serverless experience, separation of compute and storage, and multicloud deployment.
- The ClickHouse Cloud architecture consists of a control plane and a data plane.
- The control plane handles customer-facing tasks such as cluster and user management, authentication, user communication, and billing.
- The data plane hosts the ClickHouse clusters and provides features such as auto-scaling, metrics, and a Kubernetes operator.
- Users connect to their clusters through a shared load balancer per region, which hands off requests to Istio based on routing rules.
- ClickHouse interacts with data stored in S3, which serves as persistent and durable storage for all customer clusters' data.
- On AWS, keeping data on S3 rather than on local disks introduced network latency.
- EBS volumes and SSDs are used for caching to achieve similar performance to self-hosted ClickHouse.
- A shared load balancer is used instead of dedicated load balancers per cluster to improve user experience and reduce costs.
- Cilium, a Kubernetes network plugin, is used for network policies and logical isolation between clusters.
- Vertical auto-scaling adjusts the size of individual replicas but can be disruptive and cause cache loss.
- Horizontal auto-scaling adjusts the number of replicas in the cluster but can lead to data integrity issues and communication problems.
- The beta launch included only vertical auto-scaling; horizontal auto-scaling was still in development.
- Vertical auto-scaling is automated by publishing usage metrics to a central metric store and using those metrics to make scaling decisions.
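The metrics-driven vertical auto-scaling described above can be sketched roughly as follows. This is a minimal illustration, not the ClickHouse Cloud implementation: the replica sizes, the thresholds, and the names `UsageSample` and `recommend_size` are all assumptions made for the example.

```python
# Minimal sketch of a vertical auto-scaling decision driven by usage metrics.
# Sizes, thresholds, and names are hypothetical, not taken from ClickHouse Cloud.
from dataclasses import dataclass


@dataclass
class UsageSample:
    cpu_fraction: float     # share of allocated CPU in use, 0.0 to 1.0
    memory_fraction: float  # share of allocated memory in use, 0.0 to 1.0


# Assumed ladder of replica sizes, in GiB of memory.
REPLICA_SIZES_GIB = [8, 16, 32, 64, 128]


def recommend_size(current_gib, samples, scale_up_at=0.8, scale_down_at=0.3):
    """Return the replica size suggested by recent usage samples.

    Uses the peak of CPU and memory usage over the window, moves at most
    one step per decision, and stays put when there is no data.
    """
    if not samples:
        return current_gib
    peak = max(max(s.cpu_fraction, s.memory_fraction) for s in samples)
    index = REPLICA_SIZES_GIB.index(current_gib)  # ValueError if size unknown
    if peak >= scale_up_at and index < len(REPLICA_SIZES_GIB) - 1:
        return REPLICA_SIZES_GIB[index + 1]
    if peak <= scale_down_at and index > 0:
        return REPLICA_SIZES_GIB[index - 1]
    return current_gib
```

Moving one step at a time and basing the decision on peak usage are deliberately conservative choices in this sketch, since the notes above point out that resizing a replica can be disruptive and can lose its cache.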
Development and Milestones
- Development milestones included a private preview in May 2022, public beta in October, and GA in December.
- Private preview focused on basic cloud offerings, security, and self-service capabilities.
- Public beta introduced autoscaling, metering, enhanced security features, and rigorous testing.
- GA addressed customer feedback, enhanced the cloud console, introduced developer-friendly features, and prioritized reliability and security for uptime SLA and compliance.
Success Factors
- Success factors included milestone-driven development, respecting timelines while adjusting priorities, and emphasizing reliability and security as core features from the start.
- Gathering user feedback early and often is crucial for building the right product, one that addresses customer pain points.
- Customer feedback during the private preview and public beta phases led to enhancements in security features, console, and Developer Edition, demonstrating the team's ability to respond quickly and delight customers.
- The introduction of SSDs resulted in a significant performance increase, eliminating network latency and matching the performance of self-hosted ClickHouse.
- The team uses Istio not only for load balancing and traffic management but also for idling instances during periods of inactivity to optimize costs for both the company and customers.
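As a rough illustration of idling on inactivity, the sketch below picks out services that have received no traffic for a while. It does not use any Istio API: the 15-minute timeout, the service names, and the `services_to_idle` function are assumptions for the example, and how the last-request times are collected is left out.

```python
# Sketch of an idle check based on the time of each service's last request.
# The timeout and all names are hypothetical; no Istio API is used here.
from datetime import datetime, timedelta

IDLE_AFTER = timedelta(minutes=15)  # assumed inactivity threshold


def services_to_idle(last_request_at, now, idle_after=IDLE_AFTER):
    """Return the names of services with no requests for at least idle_after."""
    return sorted(
        name
        for name, seen in last_request_at.items()
        if now - seen >= idle_after
    )


now = datetime(2024, 2, 16, 12, 0, 0)
last_request_at = {
    "service-a": now - timedelta(minutes=2),
    "service-b": now - timedelta(minutes=45),
    "service-c": now - timedelta(minutes=15),
}
idle = services_to_idle(last_request_at, now)
```

In this example `service-b` and `service-c` qualify for idling, while `service-a` was used recently and stays up.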