Distributed tracing

Links

The OpenTracing Semantic Specification
OpenTelemetry Specification
Pinterest - Analyzing distributed trace data
⭐ Distributed Tracing — we’ve been doing it wrong
- Spans are too low-level a construct for effective root cause analysis
- Higher-level visualizations would serve RCA better, for instance dynamically generated service topology graphs or aggregations of trace data that surface anomalous flows.
Twitter thread on how distributed tracing products don’t provide enough value
Lessons from Building Observability Tools at Netflix
- “In summary, the key learnings from our effort are that tying multiple request traces into a logical concept, a playback session in this case, and providing additional context based on constituent traces enables our users to quickly determine the root cause of a streaming issue that may involve multiple systems.”
Distributed Tracing: Impact on Engineering Organizations
Salesforce - Anomaly Detection in Zipkin Trace Data
- Using machine learning
  1. Calculating completeness metrics on trace data (sum of durations for spans within a trace compared to that trace’s total duration)
  2. Identifying high-traffic areas in the network
  3. Identifying services with exponential latency growth
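
As a rough illustration of the completeness metric noted above, here is a minimal Python sketch. It takes the description literally: it sums the durations of every span in a trace (root included) and divides by the trace's wall-clock duration. The `Span` type, its field names, and the `completeness` function are hypothetical and not taken from the Salesforce article, which may define the metric differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Span:
    """Hypothetical, minimal span record; timestamps are in microseconds."""
    trace_id: str
    start_us: int
    end_us: int


def completeness(spans: Iterable[Span]) -> float:
    """Sum of span durations divided by the trace's total duration.

    The trace duration is taken as latest end minus earliest start.
    Because the root span is included and spans may overlap, the
    ratio can exceed 1.0.
    """
    spans = list(spans)
    if not spans:
        raise ValueError("a trace needs at least one span")
    trace_duration = max(s.end_us for s in spans) - min(s.start_us for s in spans)
    if trace_duration <= 0:
        return 0.0
    span_total = sum(s.end_us - s.start_us for s in spans)
    return span_total / trace_duration


# Example: a 100 us root span with two children of 30 us and 40 us.
example_trace = [
    Span("t1", 0, 100),
    Span("t1", 10, 40),
    Span("t1", 50, 90),
]
ratio = completeness(example_trace)  # (100 + 30 + 40) / 100
```

Under this reading, an unusually low ratio for a given endpoint could hint at missing instrumentation or dropped spans, but what counts as "low" would have to be calibrated per service.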
Uber - Distributed Tracing
⭐ Dan Luu - A simple way to get more value from tracing
Netflix - Building Netflix’s Distributed Tracing Infrastructure
Timescale - Promscale and tracing

Semantic conventions
Migrating from OpenTracing
Assigning custom trace IDs using an IDGenerator – Useful if your application generates a unique request ID and you want to use it for the corresponding trace as well.
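
A minimal sketch of the derivation step such an ID generator might perform, assuming the request ID is an arbitrary string. It hashes the request ID down to the 128 bits a W3C Trace Context trace ID requires and avoids the all-zero value, which is invalid. The function names are hypothetical; how the result is plugged into the SDK (the ID generator hook) differs by language and is not shown here.

```python
import hashlib


def trace_id_from_request_id(request_id: str) -> int:
    """Derive a deterministic 128-bit trace ID from a request ID.

    Hypothetical helper: takes the first 16 bytes of the SHA-256 digest.
    An all-zero trace ID is invalid, so that case is mapped to 1.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    trace_id = int.from_bytes(digest[:16], "big")
    return trace_id or 1


def trace_id_hex(request_id: str) -> str:
    """Render the trace ID as 32 lowercase hex characters."""
    return format(trace_id_from_request_id(request_id), "032x")


# The same request ID always maps to the same trace ID.
hex_id = trace_id_hex("req-12345")
```

Hashing makes the mapping one-way, so the original request ID should still be recorded as a span attribute if it needs to be searchable. If the request ID is already a 128-bit value such as a UUID, it could be used directly instead of hashed.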