Reliable Architectures through Observability
12 Feb 2024
Architecture, Reliability, and Observability
- Reliable architectures handle change well and stay resilient to shifts in the environment and in customer behavior.
- Telemetry and observability are necessary to understand systems in the cloud.
- Metrics provide pre-aggregated answers but are limited for debugging.
- Logs require clever queries and correlation between systems and events.
- Tracing provides more information about the system, including hierarchy, ordering, sequencing, and timing.
- Tracing is crucial for understanding cross-service interactions and requires propagating trace information.
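
To make "propagating trace information" concrete, here is a minimal sketch of passing trace context between two services in an HTTP header. It assumes the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`), which OpenTelemetry uses by default. The helper names `inject` and `extract` are hypothetical and written only for illustration; a real service would normally let an OTel propagator do this.

```python
import re
import secrets
from typing import Optional

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    """Random 16-byte trace ID, rendered as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte span ID, rendered as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Caller side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: dict) -> Optional[dict]:
    """Callee side: read the caller's context, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }
```

The receiving service keeps the extracted `trace_id`, creates its own span ID, and records `parent_span_id` as the parent. That is what lets a backend rebuild the hierarchy and ordering across services.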
OpenTelemetry (OTel)
- OpenTelemetry (OTel) is an open-source vendor-independent standard for telemetry data formats and protocols.
- OTel includes standards for traces, metrics, and logs, with libraries available in various programming languages.
- The OpenTelemetry Collector is a processing proxy that receives data from various sources, processes it, and exports it to different destinations.
- Auto-instrumentation and manual instrumentation are two approaches for collecting telemetry data.
- Auto-instrumentation works well for low-level components like operating systems and containers.
- Manual instrumentation is valuable at the application level to capture specific details.
- OpenTelemetry provides a way to trace user behavior and monitor applications, potentially replacing traditional RUM (Real User Monitoring) tools.
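
As an illustration of what manual instrumentation adds at the application level, the sketch below wraps business logic in spans that carry domain-specific attributes. It is a toy tracer built on the standard library, not the OpenTelemetry SDK: the `span` helper, the `FINISHED_SPANS` list, and the `checkout` function are all hypothetical names chosen for this example.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Tracks the span that is currently open, so nested spans can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)

# A real SDK would hand finished spans to an exporter; here they are just collected.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Record the failure on the span, then let the error propagate.
        record["attributes"]["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def checkout(cart_items):
    """Hypothetical application code, instrumented by hand."""
    with span("checkout", item_count=len(cart_items)) as current:
        with span("charge_card", currency="EUR"):
            total = sum(cart_items)
        # Details only the application knows about go onto the span as attributes.
        current["attributes"]["order_total"] = total
        return total
```

Auto-instrumentation could report that a request happened and how long it took. Attributes such as `item_count` and `order_total` have to come from code that understands the domain.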
Tracing
- Tracing helps understand cause-and-effect relationships in asynchronous architectures.
- Span links and secondary traces connect traces across different processes or services.
- Baggage allows data propagation between services but should be filtered to prevent unintended propagation.
- Planning for observability and integrating tracing from the beginning of a project is valuable for debugging and understanding application behavior.
- OpenTelemetry is a cost-effective way to add tracing. Record more data than you think you need at first, then trim it down as needed.
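
The point about filtering baggage can be sketched as an allow-list applied before a request leaves your own services. The example assumes the W3C Baggage header shape (comma-separated `key=value` members, each optionally followed by `;property` metadata). The function name `filter_baggage` and the keys in the usage example are hypothetical.

```python
def filter_baggage(header_value: str, allowed_keys: set) -> str:
    """Keep only baggage members whose key is on the allow-list."""
    kept = []
    for member in header_value.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # skip empty or malformed members
        key = member.split("=", 1)[0].strip()
        if key in allowed_keys:
            kept.append(member)  # keep the member, including any ;properties
    return ",".join(kept)


if __name__ == "__main__":
    incoming = "tenant=acme,user_email=a%40b.com,region=eu;ttl=5"
    print(filter_baggage(incoming, {"tenant", "region"}))
```

An allow-list is used rather than a deny-list so that a newly added baggage key is not forwarded to a third party until someone opts it in.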
OpenTelemetry Collector and Processors
- The OTel Collector takes in telemetry data from sources through receivers and passes it on to exporters.
- Processors transform and filter telemetry data before it is exported.
- Exporters send telemetry data to various destinations, such as Prometheus, third-party vendors, or custom systems.
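
The receive, process, export flow described above can be modeled in a few lines. This is a conceptual sketch only: the real Collector is a standalone service configured declaratively, and the names here (`run_pipeline`, `drop_health_checks`, `redact_email`, the `GET /healthz` span name, the `user.email` attribute) are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional

Span = dict
Processor = Callable[[Span], Optional[Span]]  # return None to drop the span
Exporter = Callable[[List[Span]], None]


def drop_health_checks(span: Span) -> Optional[Span]:
    """Filtering processor: discard noisy health-check spans."""
    return None if span.get("name") == "GET /healthz" else span


def redact_email(span: Span) -> Optional[Span]:
    """Transforming processor: mask a sensitive attribute before export."""
    attributes = {
        key: ("[redacted]" if key == "user.email" else value)
        for key, value in span.get("attributes", {}).items()
    }
    return {**span, "attributes": attributes}


def run_pipeline(
    received: Iterable[Span],
    processors: List[Processor],
    exporters: List[Exporter],
) -> List[Span]:
    processed = []
    for span in received:
        for process in processors:
            span = process(span)
            if span is None:
                break  # a processor dropped this span
        if span is not None:
            processed.append(span)
    # Every exporter gets its own copy of the same processed list.
    for export in exporters:
        export(list(processed))
    return processed
```

Because processing happens once and exporters fan out afterwards, the same cleaned-up data can go to several destinations without each application knowing about any of them.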
Sampling and Service Level Objectives (SLOs)
- Sampling reduces the amount of telemetry data collected and processed.
- Service Level Objectives (SLOs) define the expected performance of a system and track how well it meets those expectations.
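
Both ideas reduce to small pieces of arithmetic, sketched below. The sampler makes a deterministic keep-or-drop decision from the trace ID, in the spirit of ratio-based head sampling, so that every service seeing the same trace decides the same way. The SLO helper reports how much of the error budget is left. The function names and the numbers in the usage example are hypothetical, and this is not presented as OpenTelemetry's exact sampling algorithm.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, decided from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Interpret the low 64 bits of the trace ID as a number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * (1 << 64))


def error_budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    With a 99.9% target, 0.1% of requests are allowed to fail.
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.25))
    print(error_budget_remaining(total_requests=10_000, failed_requests=5, slo_target=0.999))
```

One caveat when combining the two: if the SLO is computed from sampled traces, the sampling ratio has to be accounted for, or the counts will understate real traffic.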
Observability in Legacy Systems and OpenTelemetry Community
- Observability can be added incrementally to legacy systems by starting with a collector and then adding instrumentation as needed.
- OpenTelemetry is an open-source project supported by a large community of contributors.
Batching and OpenTelemetry Maturity
- To keep large batches traceable, set limits on batch size and duration.
- Decorate all traces with a batch ID to easily query and identify related traces within a batch.
- Tracing is relatively mature and works well in most languages.
- Metrics are still stabilizing, with some languages better supported than others.
- Logs are still in development, especially in areas like sampling.
- Despite these challenges, investing in OpenTelemetry now is worthwhile, starting with tracing and gradually adopting other features as they mature.
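
The batching advice earlier in this list (cap the batch size and tag every trace with a batch ID) might look like the sketch below. It covers only the size limit; a duration limit would be an analogous check against a clock and is left out to keep the example short. The names `batched_with_ids` and `process_batches` and the `batch.id` and `batch.size` attribute keys are hypothetical.

```python
import uuid
from itertools import islice


def batched_with_ids(items, max_batch_size):
    """Yield (batch_id, chunk) pairs, each chunk holding at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, max_batch_size))
        if not chunk:
            return
        yield uuid.uuid4().hex, chunk


def process_batches(items, max_batch_size, handle):
    """Process items in capped batches, recording one span-like dict per item."""
    spans = []
    for batch_id, chunk in batched_with_ids(items, max_batch_size):
        for item in chunk:
            spans.append({
                "name": "process_item",
                # The shared batch ID is what makes "show me everything in
                # this batch" a single query later on.
                "attributes": {"batch.id": batch_id, "batch.size": len(chunk)},
                "result": handle(item),
            })
    return spans
```

With the batch ID on every record, one slow or failed item can be traced back to its siblings by filtering on `batch.id`, without one giant trace covering the whole batch.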