Yuri Shkuro on Tracing Distributed Systems Using Jaeger
01 Oct 2024
Tracing and Observability
- OpenTracing is a standardized API for instrumenting code to emit tracing data, similar to a logging API. (1m5s)
- Jaeger is a tracing backend whose client libraries implement the OpenTracing API, so applications instrumented with OpenTracing can send their data to Jaeger. (1m31s)
Jaeger vs. Zipkin
- Jaeger and Zipkin are similar in functionality, but Jaeger has additional features like adaptive sampling and advanced visualization tools. (4m23s)
- Jaeger was chosen over Zipkin because Zipkin's instrumentation API is specific to Zipkin and would lock users into that system, whereas Jaeger uses a standardized API that leaves room for flexibility. (5m49s)
Tracing in Complex Systems
- Companies with large and complex systems, such as Uber, need tracing to understand their systems as a whole. Metrics and logs alone are not sufficient for understanding and troubleshooting issues in a system composed of many microservices. (8m6s)
- Simply implementing tracing does not mean that a company's observability problems are solved. Companies need to understand what issues they are trying to solve and how tracing can be used to solve them. (11m8s)
Benefits of Tracing
- Uber runs a proprietary system that continuously aggregates traces; it is instrumental in root cause analysis and outage mitigation. (11m20s)
- Jaeger can be used as a documentation tool for new engineers by providing an up-to-date view of the system architecture. (12m12s)
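One way to picture how traces can double as architecture documentation is to derive service-to-service edges from parent/child span links. The following is a minimal sketch under stated assumptions: the span dictionaries and their `span_id`, `parent_id`, and `service` keys are hypothetical and are not Jaeger's data model or API.

```python
from collections import Counter

def service_dependencies(spans):
    """Count caller -> callee service edges from parent/child span links.

    Each span is a hypothetical dict with 'span_id', 'service', and an
    optional 'parent_id'. This is an illustration, not Jaeger's data model.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s.get("parent_id"))
        # Skip root spans and calls that stay inside the same service.
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges

spans = [
    {"span_id": "1", "service": "frontend"},
    {"span_id": "2", "service": "auth", "parent_id": "1"},
    {"span_id": "3", "service": "db", "parent_id": "2"},
    {"span_id": "4", "service": "frontend", "parent_id": "1"},
]
edges = service_dependencies(spans)
```

Because the edges come from observed requests rather than hand-drawn diagrams, a graph built this way stays current as services change.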
Challenges of Implementing Tracing
- A significant challenge in implementing tracing is the organizational effort required, as the benefits of tracing are realized at an organizational level rather than at an individual service level. (15m59s)
- The biggest challenge of implementing tracing in distributed systems is getting instrumentation in all the right places across an organization. (16m37s)
- A continuous challenge with tracing is achieving 100% coverage for all services, especially in large organizations with microservices often written in a variety of languages and frameworks. (18m10s)
Context Propagation
- Service meshes can help with tracing, but they do not eliminate the need for context propagation inside applications: to correlate inbound and outbound requests in real time, the application itself must pass the tracing context along. (18m34s)
- Context propagation, which involves carrying a correlation ID throughout a request's journey across microservices, is highly dependent on the programming language used. (20m20s)
- Thread locals do not work out of the box: concurrency primitives need special instrumentation so that the context is transferred properly from one thread to another. (21m56s)
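The thread-local problem described above can be sketched in Python using only the standard library. The names `_local`, `current_trace_id`, and `with_trace_context` are hypothetical and do not come from Jaeger or OpenTracing. The wrapper stands in for the kind of instrumentation a tracing library would apply to a thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # hypothetical thread-local holder for the trace ID

def current_trace_id():
    # Returns the trace ID visible on the calling thread, or None.
    return getattr(_local, "trace_id", None)

def with_trace_context(fn):
    """Wrap fn so that it runs with the submitting thread's trace ID."""
    captured = current_trace_id()  # read on the caller's thread
    def wrapper(*args, **kwargs):
        previous = current_trace_id()
        _local.trace_id = captured  # install on the worker thread
        try:
            return fn(*args, **kwargs)
        finally:
            _local.trace_id = previous  # do not leak into the next task
    return wrapper

def demo():
    _local.trace_id = "abc123"
    with ThreadPoolExecutor(max_workers=1) as pool:
        lost = pool.submit(current_trace_id).result()
        kept = pool.submit(with_trace_context(current_trace_id)).result()
        after = pool.submit(current_trace_id).result()
    return lost, kept, after
```

In this sketch, the plain submission loses the trace ID because the worker thread has its own thread-local storage, while the wrapped submission carries it across. Restoring the previous value afterwards keeps one request's context from leaking into the next task on a reused worker.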
Jaeger's Capabilities
- Jaeger collects billions of traces a day even with aggressive sampling. (23m2s)
- Uber has built a closed-source internal tool that helps isolate problems in the system architecture during outages by comparing traces that contain errors against an aggregated model of successful traces. (25m15s)
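The tool mentioned above is closed source, so its design is not known from this summary. As a purely illustrative sketch of the general idea, the code below builds a baseline from successful traces and diffs a failed trace against it. It assumes a trace can be reduced to a list of hypothetical `(caller, callee)` service edges; the function names and the `min_share` threshold are assumptions.

```python
from collections import Counter

def build_baseline(successful_traces, min_share=0.9):
    """Return edges that appear in at least min_share of successful traces."""
    counts = Counter()
    for edges in successful_traces:
        counts.update(set(edges))  # count each edge once per trace
    threshold = min_share * len(successful_traces)
    return {edge for edge, n in counts.items() if n >= threshold}

def diff_against_baseline(failed_trace, baseline):
    """List edges the failed trace is missing and edges the baseline does not expect."""
    seen = set(failed_trace)
    return {
        "missing": sorted(baseline - seen),
        "unexpected": sorted(seen - baseline),
    }

ok_traces = [
    [("api", "auth"), ("api", "db")],
    [("api", "auth"), ("api", "db")],
    [("api", "auth"), ("api", "db"), ("api", "cache")],
]
baseline = build_baseline(ok_traces)
report = diff_against_baseline([("api", "auth"), ("api", "retry")], baseline)
```

Here the failed trace never reached `db` and made a call to `retry` that successful traces do not usually make, which points to where an investigation might start.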
Serverless Tracing
- Serverless tracing with Jaeger is conceptually similar to microservice tracing. The practical differences lie in how service identity is defined and how spans are reported, since there are no sidecar deployments. (27m3s)
Jaeger's Roadmap
- Jaeger's roadmap includes a new visualization tool called Deep Dependency Graphs, which provides a more comprehensive view of service dependencies by analyzing the entire request path. (28m31s)
- An adaptive sampling feature is being developed for Jaeger, aiming to dynamically adjust sampling rates based on service and operation traffic volume to optimize data collection. (29m40s)
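The idea of adjusting sampling rates to traffic volume can be sketched as follows. This is not Jaeger's implementation: the function name, the `(service, operation)` keys, and the target of one sampled trace per second are all assumptions chosen for illustration.

```python
def adjust_sampling(observed_qps, target_traces_per_sec=1.0,
                    min_p=0.00001, max_p=1.0):
    """Return a sampling probability for each (service, operation) key.

    High-traffic operations get a low probability and low-traffic ones a
    high probability, so each yields roughly target_traces_per_sec traces.
    """
    probabilities = {}
    for key, qps in observed_qps.items():
        if qps <= 0:
            probabilities[key] = max_p  # no traffic seen: sample everything
        else:
            wanted = target_traces_per_sec / qps
            probabilities[key] = min(max_p, max(min_p, wanted))
    return probabilities

rates = adjust_sampling({
    ("frontend", "GET /"): 1000.0,   # busy endpoint
    ("billing", "charge"): 0.5,      # rare operation
})
```

With these inputs, the busy endpoint is sampled at about one request in a thousand, while the rare operation is sampled every time. A single fixed rate would either flood the backend with traces from the busy endpoint or miss the rare operation almost entirely.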