Distributed tracing
Links
- The OpenTracing Semantic Specification
- OpenTelemetry Specification
- Pinterest - Analyzing distributed trace data
- ⭐ Distributed Tracing — we’ve been doing it wrong
  - Spans are too low-level a construct for effective root cause analysis (RCA)
  - Higher-level visualizations would serve RCA better, for instance dynamically generated service topology graphs or aggregations of trace data that surface anomalous flows.
- Twitter thread on how distributed tracing products don’t provide enough value
- Lessons from Building Observability Tools at Netflix
  - “In summary, the key learnings from our effort are that tying multiple request traces into a logical concept, a playback session in this case, and providing additional context based on constituent traces enables our users to quickly determine the root cause of a streaming issue that may involve multiple systems.”
- Distributed Tracing: Impact on Engineering Organizations
- Salesforce - Anomaly Detection in Zipkin Trace Data
  - Using machine learning
    - 1. Calculating Completeness Metrics on Trace Data (sum of durations for spans within a trace compared to that trace’s total duration)
    - 2. Identifying High Traffic Areas in the Network
    - 3. Identifying Services with Exponential Latency Growth
- Uber - Distributed Tracing
- ⭐ Dan Luu - A simple way to get more value from tracing
- Netflix - Building Netflix’s Distributed Tracing Infrastructure
- Timescale - Promscale and tracing
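
Sketch of the completeness metric from the Salesforce item above. This is a literal reading of the one-line description (sum of span durations divided by the trace's total duration), not necessarily the exact formula used in the article. The `Span` record, its field names, and the microsecond units are assumptions made for illustration; real Zipkin spans carry more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical minimal span record; timestamps in microseconds.
    trace_id: str
    start_us: int
    end_us: int


def completeness_by_trace(spans):
    """Return {trace_id: sum of span durations / trace wall-clock duration}."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.trace_id].append(span)

    result = {}
    for trace_id, trace_spans in grouped.items():
        # Total duration: earliest start to latest end across the trace.
        total = max(s.end_us for s in trace_spans) - min(s.start_us for s in trace_spans)
        span_sum = sum(s.end_us - s.start_us for s in trace_spans)
        # A zero-length trace has no meaningful ratio; report 0.0.
        result[trace_id] = span_sum / total if total > 0 else 0.0
    return result


spans = [
    Span("a", 0, 100),   # root span
    Span("a", 10, 40),   # child
    Span("a", 50, 70),   # child
    Span("b", 0, 10),    # single-span trace
]
ratios = completeness_by_trace(spans)
```

Under this reading the root span counts toward the sum, so a fully instrumented trace scores above 1.0 and a trace with only a root span scores exactly 1.0. Traces that score unusually low compared with their peers are the ones worth checking for missing instrumentation.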
Jaeger
- Jaeger and OpenTelemetry
- Jaeger GitHub Issue - Discuss post-trace (tail-based) sampling
- Jaeger GitHub Issue - Adaptive Sampling
- SURVEY: Who is using Jaeger
- Jaeger using Kubernetes - various deployment configurations
- ⭐ Distributed Tracing Infrastructure with Jaeger on Kubernetes
OpenTelemetry
- Semantic conventions
- Migrating from OpenTracing
- Assigning custom trace IDs using an IDGenerator – Useful if your application generates a unique request ID and you want to use it for the corresponding trace as well.
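
Sketch of the custom trace ID idea, in plain Python with no OpenTelemetry dependency. The class mirrors the two-method shape of the Python SDK's `IdGenerator` interface (`generate_trace_id` and `generate_span_id`, both returning integers), but the names `current_request_id`, `trace_id_from_request_id`, and `RequestIdGenerator` are hypothetical, and hashing the request ID with SHA-256 is just one assumed way to fit an arbitrary string into a 128-bit trace ID.

```python
import contextvars
import hashlib
import random

# Hypothetical holder for the request ID the application already generates.
current_request_id = contextvars.ContextVar("current_request_id", default=None)


def trace_id_from_request_id(request_id: str) -> int:
    """Map a request ID string to a non-zero 128-bit integer trace ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    trace_id = int.from_bytes(digest[:16], "big")
    return trace_id or 1  # an all-zero trace ID is invalid


class RequestIdGenerator:
    """Derives trace IDs from the current request ID, random otherwise."""

    def generate_trace_id(self) -> int:
        request_id = current_request_id.get()
        if request_id is None:
            # No request in scope: fall back to a random 128-bit ID.
            return random.getrandbits(128) or 1
        return trace_id_from_request_id(request_id)

    def generate_span_id(self) -> int:
        # Span IDs stay random 64-bit values.
        return random.getrandbits(64) or 1


generator = RequestIdGenerator()
current_request_id.set("req-123")
trace_id = generator.generate_trace_id()
trace_id_hex = format(trace_id, "032x")  # the form usually shown in tracing UIs
```

With the real SDK, the equivalent class would subclass `IdGenerator` from `opentelemetry.sdk.trace.id_generator` and be passed to `TracerProvider(id_generator=...)`. Because the mapping is deterministic, the trace for a given request can be looked up by recomputing the ID from the request ID found in the logs.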