Reliable Architectures through Observability

12 Feb 2024

Architecture, Reliability, and Observability

  • A reliable architecture handles change and stays resilient to shifts in its environment and in customer behavior.
  • Telemetry and observability are necessary to understand systems in the cloud.
  • Metrics provide pre-aggregated answers but are limited for debugging.
  • Logs require clever queries and correlation between systems and events.
  • Tracing provides more information about the system, including hierarchy, ordering, sequencing, and timing.
  • Tracing is crucial for understanding cross-service interactions and requires propagating trace information.
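
To make the last point concrete, here is a minimal sketch of how trace information can be propagated between services over HTTP. It assumes the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags); the helper names `inject`, `extract`, and `new_span_id` are hypothetical and are not part of any library.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the calling span's identity into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid.

    Simplification: real HTTP header names are case-insensitive; this sketch
    only looks up the lowercase key.
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are reserved as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 1)
```

The calling service injects its current span into the outgoing request; the receiving service extracts it and starts a child span under the same trace ID, which is what lets a backend reassemble hierarchy and ordering across services.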

OpenTelemetry (OTel)

  • OpenTelemetry (OTel) is an open-source vendor-independent standard for telemetry data formats and protocols.
  • OTel includes standards for traces, metrics, and logs, with libraries available in various programming languages.
  • The OpenTelemetry Collector is a processing proxy that receives data from various sources, processes it, and exports it to different destinations.
  • Auto-instrumentation and manual instrumentation are two approaches for collecting telemetry data.
  • Auto-instrumentation works well for low-level components like operating systems and containers.
  • Manual instrumentation is valuable at the application level to capture specific details.
  • OpenTelemetry provides a way to trace user behavior and monitor applications, potentially replacing traditional RUM (Real User Monitoring) tools.
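
As an illustration of what manual instrumentation captures at the application level, the sketch below implements a toy span recorder with the standard library only. It is not the OpenTelemetry API: `span`, `FINISHED`, and `checkout` are hypothetical names, and the attribute keys are invented for the example. The idea it shows is the same, though: wrap a unit of work, attach business-specific attributes, and record parent/child relationships and timing.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# The span that is currently active in this execution context, if any.
_current_span = contextvars.ContextVar("current_span", default=None)

# Stand-in for an exporter: finished spans are appended here.
FINISHED = []


@contextmanager
def span(name, **attributes):
    """Record one unit of work as a span nested under the active span."""
    parent = _current_span.get()
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
        "start": time.monotonic(),
    }
    token = _current_span.set(record)
    try:
        yield record
    except Exception as exc:
        # Keep the failure on the span so it can be queried later.
        record["attributes"]["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        _current_span.reset(token)
        FINISHED.append(record)


def checkout(cart_items):
    """Hypothetical application code instrumented by hand."""
    with span("checkout", cart_size=len(cart_items)) as root:
        with span("charge_card", amount=sum(cart_items)):
            pass  # the payment call would go here
        root["attributes"]["outcome"] = "paid"
```

Calling `checkout([5, 7])` leaves two records in `FINISHED`: the `charge_card` span, whose `parent_id` points at the `checkout` span, followed by the `checkout` span itself carrying `cart_size` and `outcome`. Details like these are what auto-instrumentation cannot know about.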

Tracing

  • Tracing helps understand cause-and-effect relationships in asynchronous architectures.
  • Span links and secondary traces connect traces across different processes or services.
  • Baggage allows data propagation between services but should be filtered to prevent unintended propagation.
  • Planning for observability and integrating tracing from the beginning of a project is valuable for debugging and understanding application behavior.
  • OpenTelemetry is a cost-effective way to add tracing. The recommendation is to record more data than you expect to need at first, then trim it down once you know what is useful.
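
The baggage point above can be sketched as an allowlist applied before each outbound call. This assumes the W3C Baggage header format (comma-separated `key=value` members with percent-encoded values); the allowed keys and the function names are hypothetical, and optional `;property` metadata is dropped for simplicity.

```python
from urllib.parse import quote, unquote

# Hypothetical allowlist: only these keys may leave the current service.
ALLOWED_BAGGAGE_KEYS = {"tenant.id", "request.priority"}


def parse_baggage(header: str) -> dict:
    """Parse a baggage header value into a dict of decoded entries."""
    entries = {}
    for member in header.split(","):
        key, sep, value = member.strip().partition("=")
        if sep and key:
            # Ignore any ";property" metadata after the value.
            entries[key.strip()] = unquote(value.split(";", 1)[0].strip())
    return entries


def outbound_baggage(entries: dict, destination_is_internal: bool) -> str:
    """Build the baggage header for an outgoing call.

    Nothing is sent to external destinations, and only allowlisted keys
    are sent to internal ones.
    """
    if not destination_is_internal:
        return ""
    kept = {k: v for k, v in entries.items() if k in ALLOWED_BAGGAGE_KEYS}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in sorted(kept.items()))
```

Filtering at the boundary matters because baggage travels with every downstream request by default, so an entry added for one internal hop can otherwise reach third-party services.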

OpenTelemetry Collector and Processors

  • The OTel Collector receives telemetry data from sources and forwards it to exporters.
  • Processors transform and filter telemetry data before it is exported.
  • Exporters send telemetry data to various destinations, such as Prometheus, third-party vendors, or custom systems.
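
The receive, process, export flow described above can be modelled in a few lines. This is a toy model of the idea, not the Collector's implementation or configuration format; `run_pipeline`, `drop_health_checks`, and `redact_email` are hypothetical names, and spans are represented as plain dictionaries.

```python
from typing import Callable, Iterable, List, Optional

Span = dict
Processor = Callable[[Span], Optional[Span]]  # return None to drop the span
Exporter = Callable[[List[Span]], None]


def drop_health_checks(span: Span) -> Optional[Span]:
    """Filtering processor: discard noisy health-check spans."""
    return None if span.get("http.route") == "/healthz" else span


def redact_email(span: Span) -> Optional[Span]:
    """Transforming processor: mask a sensitive attribute."""
    cleaned = dict(span)
    if "user.email" in cleaned:
        cleaned["user.email"] = "REDACTED"
    return cleaned


def run_pipeline(received: Iterable[Span],
                 processors: List[Processor],
                 exporters: List[Exporter]) -> List[Span]:
    """Run each received span through the processors in order, then fan out."""
    processed = []
    for span in received:
        for process in processors:
            span = process(span)
            if span is None:
                break  # a processor dropped this span
        if span is not None:
            processed.append(span)
    # Every exporter receives its own copy of the processed batch.
    for export in exporters:
        export(list(processed))
    return processed
```

Because the processors sit between receipt and export, filtering and redaction happen once, in one place, regardless of how many destinations the data is sent to.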

Sampling and Service Level Objectives (SLOs)

  • Sampling reduces the amount of telemetry data collected and processed.
  • Service Level Objectives (SLOs) define the expected performance of a system and track how well it meets those expectations.
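
Both ideas reduce to small calculations. The sketch below shows one possible deterministic sampling rule (keep a trace when the low 64 bits of its ID fall under a threshold, so every service makes the same decision for the same trace) and a basic error-budget calculation for an SLO. The function names are hypothetical, and the sampling rule is an illustrative assumption rather than the algorithm of any particular SDK.

```python
def keep_trace(trace_id: str, ratio: float) -> bool:
    """Decide whether to keep a trace, given a 32-hex-character trace ID.

    The decision depends only on the trace ID, so all services that see
    the same trace agree on whether to keep it.
    """
    bucket = int(trace_id[-16:], 16)  # low 64 bits of the trace ID
    return bucket < int(ratio * 2**64)


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the required success ratio, e.g. 0.999 for "99.9%".
    A negative result means the objective has been missed.
    """
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else 0.0
    return 1 - failed / allowed_failures
```

For example, a 99.9% objective over 100,000 requests allows about 100 failures, so 25 failures leave roughly three quarters of the budget. When traces are sampled, counts derived from them need to be scaled back up by the sampling ratio before being compared against an SLO.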

Observability in Legacy Systems and OpenTelemetry Community

  • Observability can be added incrementally to legacy systems by starting with a collector and then adding instrumentation as needed.
  • OpenTelemetry is an open-source project supported by a large community of contributors.

Batching and OpenTelemetry Maturity

  • To avoid tracking issues with large batches, set limits on batch size and duration.
  • Decorate all traces with a batch ID to easily query and identify related traces within a batch.
  • Tracing is relatively mature and works well in most languages.
  • Metrics are still stabilizing, with better support in some languages than in others.
  • Logs are still in development, especially in areas like sampling.
  • Despite these challenges, investing in OpenTelemetry now is worthwhile, starting with tracing and gradually adopting other features as they mature.
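
The two batching recommendations at the top of this list can be sketched together: cap each batch by size and by age, and stamp every item with the ID of the batch it belongs to. `BoundedBatcher` and the `batch.id` attribute name are hypothetical. As a simplification, the age limit is only checked when an item is added; a production version would also flush on a timer.

```python
import time
import uuid
from typing import Callable, List, Optional


class BoundedBatcher:
    """Collect items into batches limited by size and by age."""

    def __init__(self, max_size: int, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.clock = clock
        self._open()

    def _open(self) -> None:
        """Start a fresh, empty batch with a new ID."""
        self.batch_id = uuid.uuid4().hex
        self.items: List[dict] = []
        self.opened_at: Optional[float] = None

    def add(self, item: dict) -> Optional[dict]:
        """Add an item; return the closed batch if a limit was hit, else None."""
        if self.opened_at is None:
            self.opened_at = self.clock()  # age counts from the first item
        # Decorate the item so related work can be queried by batch ID.
        self.items.append({**item, "batch.id": self.batch_id})
        too_big = len(self.items) >= self.max_size
        too_old = self.clock() - self.opened_at >= self.max_age_s
        return self.flush() if too_big or too_old else None

    def flush(self) -> Optional[dict]:
        """Close the current batch and open a new one."""
        if not self.items:
            return None
        closed = {"batch.id": self.batch_id, "items": self.items}
        self._open()
        return closed
```

With the ID attached to every item, a single query on `batch.id` returns all the traces that were processed together, while the size and age limits keep any one batch small enough to inspect.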
