Getting started with distributed tracing can be a daunting task. There are many new terms, frameworks, and tools with apparently overlapping capabilities, and it's easy to get lost or sidetracked. This guide will help you navigate the open source distributed tracing landscape by describing and classifying the most popular tools.
Although tracing and profiling are closely related disciplines, distributed tracing is typically understood as the technique used to tie together information about different units of work, usually executed in different processes or on different hosts, in order to understand a whole chain of events. In a modern application, this means that distributed tracing can be used to tell the story of an HTTP request as it traverses a myriad of microservices.
Most of the tools listed here can be classified as an instrumentation library, a tracer, an analysis tool (backend + UI), or any combination thereof. The article "The difference between tracing, tracing, and tracing" is a great resource for describing these three faces of distributed tracing.
For the purposes of this guide, we'll define instrumentation as the library that is used to tell what to record, tracer as the library that knows how to record and submit this data, and analysis tool as the back end that receives the trace information. In the real world, these categories are fluid, and the distinction between instrumentation and tracer is not always clear. Similarly, the term analysis tool might be too broad, as some tools focus on exploring traces while others are complete observability platforms.
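To make the three roles more concrete, here is a minimal single-process sketch in Python. It is not taken from any of the tools in this guide: the names `Span`, `Tracer`, `InMemoryBackend`, and `traced` are hypothetical, and a real tracer would submit spans over the network rather than to an in-memory list.

```python
import time
import uuid
from dataclasses import dataclass
from functools import wraps
from typing import Optional


@dataclass
class Span:
    """One unit of work; all names here are illustrative assumptions."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    operation: str
    start: float
    end: float = 0.0


class InMemoryBackend:
    """Stand-in for an analysis tool: it receives finished spans."""

    def __init__(self):
        self.spans = []

    def submit(self, span):
        self.spans.append(span)

    def trace(self, trace_id):
        # All spans that belong to the same chain of events.
        return [s for s in self.spans if s.trace_id == trace_id]


class Tracer:
    """Knows how to record a span and where to submit it."""

    def __init__(self, backend):
        self.backend = backend
        self._stack = []  # active spans, innermost last

    def start_span(self, operation):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            operation=operation,
            start=time.time(),
        )
        self._stack.append(span)
        return span

    def finish_span(self, span):
        span.end = time.time()
        self._stack.remove(span)
        self.backend.submit(span)


def traced(tracer, operation):
    """Instrumentation: decides what to record (one span per call)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            span = tracer.start_span(operation)
            try:
                return func(*args, **kwargs)
            finally:
                tracer.finish_span(span)
        return wrapper
    return decorator


backend = InMemoryBackend()
tracer = Tracer(backend)


@traced(tracer, "load-user")
def load_user(user_id):
    return {"id": user_id}


@traced(tracer, "handle-request")
def handle_request(user_id):
    return load_user(user_id)


result = handle_request(42)
```

After the call, `backend.spans` holds two spans that share a trace ID, with the `load-user` span pointing at the `handle-request` span as its parent. In a distributed setting, the same parent/child link is carried between processes by a propagation format.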
This guide lists only open source projects, but there are several other vendors and solutions worth checking out, such as AWS X-Ray, Datadog, Google Stackdriver, Instana, and LightStep.
Apache SkyWalking
Instrumentation | Tracer | Analysis tool |
---|---|---|
✗ | ✗ | ✓ |
Apache SkyWalking was initially developed in 2015 as a training project to understand distributed systems. It has since become prevalent in China and aims to be a complete application performance monitoring (APM) platform, focusing heavily on automatic instrumentation via agents and integration with existing tracers, such as Zipkin's and Jaeger's, or with infrastructure components like service meshes. SkyWalking was recently promoted to a top-level project at the Apache Software Foundation.
Apache (Incubating) Zipkin
Instrumentation | Tracer | Analysis tool |
---|---|---|
✓ | ✓ | ✓ |
Apache (Incubating) Zipkin was initially developed at Twitter and open sourced in 2012. It's one of the most mature open source tracing systems and has inspired pretty much all of the modern distributed tracing tools. It's a complete tracing solution, including the instrumentation libraries, the tracer, and the analysis tool. Its propagation format, called B3, is the current lingua franca in distributed tracing. The same is true of its data format, which is natively supported by other tools, such as Envoy Proxy on the producing side and other tracing solutions on the consuming side. One of Zipkin's strengths is the number of high-quality framework instrumentation libraries.
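As a rough illustration of what a propagation format does, the sketch below reads and writes the multi-header form of B3 (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). The helper names (`inject_b3`, `extract_b3`, `child_headers`) are hypothetical and not part of any Zipkin library. As a simplification, a missing `X-B3-Sampled` header is treated as "not sampled" here, whereas a real tracer would normally make its own sampling decision in that case.

```python
import secrets


def new_id(num_bytes=8):
    """Random lower-hex ID: 8 bytes gives 16 characters, 16 bytes gives 32."""
    return secrets.token_hex(num_bytes)


def inject_b3(trace_id, span_id, parent_span_id=None, sampled=True):
    """Build the B3 headers to attach to an outgoing request."""
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if sampled else "0",
    }
    if parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def extract_b3(headers):
    """Read the trace context from incoming headers, or None if absent."""
    # HTTP header names are case-insensitive.
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id is None or span_id is None:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": lowered.get("x-b3-parentspanid"),
        # Simplification: an absent header is treated as "not sampled".
        "sampled": lowered.get("x-b3-sampled") == "1",
    }


def child_headers(incoming):
    """Headers for a downstream call made while handling `incoming`."""
    ctx = extract_b3(incoming)
    if ctx is None:
        # No context received: start a new trace with a 128-bit trace ID.
        return inject_b3(new_id(16), new_id())
    return inject_b3(
        ctx["trace_id"],
        new_id(),
        parent_span_id=ctx["span_id"],
        sampled=ctx["sampled"],
    )
```

A service that receives B3 headers and calls another service would pass the incoming headers to `child_headers` and attach the result to the outgoing request. The trace ID stays the same, and the new span records the caller's span as its parent.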
Haystack
Instrumentation | Tracer | Analysis tool |
---|---|---|
✗ | ✓ | ✓ |
Haystack is a tracing system with APM-like capabilities, such as anomaly detection and trend visualization. Originally developed at Expedia, the architecture has a clear focus on high availability. Haystack leverages OpenTracing as its main instrumentation library, and add-on components like Pitchfork can be used to ingest data in other formats.
Jaeger
Instrumentation | Tracer | Analysis tool |
---|---|---|
✗ | ✓ | ✓ |
Jaeger was initially developed at Uber, open sourced in 2017, and moved to the Cloud Native Computing Foundation (CNCF) soon after. The inspiration from Dapper and Zipkin can be seen in Jaeger's original architecture, data model, and nomenclature, but it has evolved beyond that. For the instrumentation part, Jaeger leverages the OpenTracing API, which has been a first-class citizen since the beginning. The analysis tool is very lightweight, making it ideal for development purposes and for highly elastic environments (e.g., multi-tenant Kubernetes clusters). Jaeger is also the default tracer for tools like Istio.
OpenCensus
Instrumentation | Tracer | Analysis tool |
---|---|---|
✓ | ✓ | ✗ |
Initially developed at Google based on its internal tracing platform, OpenCensus is both a tracer and an instrumentation library. Its tracer can be connected to "exporters," which send data to open source analysis tools such as Jaeger, Zipkin, and Haystack, as well as to vendors in the area, such as Instana and Google Stackdriver. In addition to the tracer, an OpenCensus Agent is available that can be used as an out-of-process exporter, allowing the instrumented applications to be completely agnostic of the analysis tool where the data ends up. Tracing is one side of OpenCensus, with metrics completing the picture. It's not as rich in terms of framework instrumentation libraries yet, but that will probably change once the merge with the OpenTracing project is completed.
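The exporter idea can be sketched generically as follows. This is not the OpenCensus API: the `Exporter`, `ConsoleExporter`, `BufferingExporter`, and `Tracer` classes are hypothetical and only show how a tracer can hand finished spans to interchangeable destinations without the instrumented code knowing which one is in use.

```python
from abc import ABC, abstractmethod


class Exporter(ABC):
    """Hypothetical exporter interface: receives batches of finished spans."""

    @abstractmethod
    def export(self, spans):
        ...


class ConsoleExporter(Exporter):
    """Writes spans to standard output, useful during development."""

    def export(self, spans):
        for span in spans:
            print(f"{span['trace_id']} {span['name']}")


class BufferingExporter(Exporter):
    """Keeps spans in memory; a real one would send them to a backend."""

    def __init__(self):
        self.received = []

    def export(self, spans):
        self.received.extend(spans)


class Tracer:
    """Hands each finished span to every configured exporter."""

    def __init__(self, exporters):
        self.exporters = list(exporters)

    def record(self, span):
        for exporter in self.exporters:
            exporter.export([span])


buffer = BufferingExporter()
tracer = Tracer([ConsoleExporter(), buffer])
tracer.record({"trace_id": "abc123", "name": "GET /orders"})
```

Swapping the destination then only means changing the list passed to `Tracer`. An out-of-process agent takes this one step further by moving that choice out of the application entirely.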
OpenTracing
Instrumentation | Tracer | Analysis tool |
---|---|---|
✓ | ✗ | ✗ |
If there's something close to a standard on the instrumentation side of distributed tracing, it's OpenTracing. This project, hosted at the CNCF, was started by people implementing distributed tracing systems in a variety of scenarios: as vendors, as users, or as developers of in-house implementations. On one side of the project, there are many framework instrumentation libraries, such as those for JAX-RS, Spring Boot, or JDBC. On the other side, several tracers fully support the OpenTracing API, including Jaeger and Haystack, as well as those from well-known vendors in the area, such as Instana, LightStep, Datadog, and New Relic. Compatible implementations also exist for Zipkin.
OpenTracing + OpenCensus
Instrumentation | Tracer | Analysis tool |
---|---|---|
?/✓ | ?/✓ | ✗ |
It was recently announced that OpenTracing will be merging efforts with OpenCensus. While it's still not clear what the future tool will look like, or even what it will be called, this is certainly something to keep on the radar. A tentative roadmap has been published along with some concrete proposals in terms of code, showing the direction this new tool will follow.
Pinpoint
Instrumentation | Tracer | Analysis tool |
---|---|---|
✓ | ✓ | ✓ |
Pinpoint was initially developed at Naver in 2012 and open sourced in 2015. It contains APM capabilities, featuring network topology, JVM telemetry graphs, and trace views. Instrumentation is done exclusively via agents and can be extended via plugins. The upside of this approach is that instrumentation does not require code changes; the downside is that it lacks support for explicit instrumentation. Pinpoint works with PHP and JVM-based applications, where it has broad support for frameworks and libraries.
Veneur
Instrumentation | Tracer | Analysis tool |
---|---|---|
✓ | ✓ | ✗ |
The Veneur project was started by Stripe, and it is described as a pipeline for observability data. It deviates from almost all other tools in this guide in that it's very opinionated about what observability should be about: spans. It comes with a set of local agents (called "sinks") that receive spans, extract or aggregate data from them, and send the outcome to external systems like Kafka. To better achieve that, Veneur comes with its own data format, SSF. Metrics can either be embedded into the spans or synthesized/aggregated based on "regular" span data.
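The idea of synthesizing metrics from "regular" span data can be sketched as below. This is not Veneur code and does not use SSF: the span dictionaries and their field names (`service`, `operation`, `start_ms`, `end_ms`, `error`) are assumptions made for the example.

```python
from collections import defaultdict


def synthesize_metrics(spans):
    """Aggregate per-(service, operation) metrics from plain span records."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[(span["service"], span["operation"])].append(span)

    metrics = {}
    for key, group in grouped.items():
        durations = [s["end_ms"] - s["start_ms"] for s in group]
        metrics[key] = {
            "count": len(group),
            "errors": sum(1 for s in group if s.get("error", False)),
            "mean_ms": sum(durations) / len(durations),
            "max_ms": max(durations),
        }
    return metrics


# Hypothetical spans, as a local agent might receive them.
spans = [
    {"service": "checkout", "operation": "charge", "start_ms": 0, "end_ms": 120},
    {"service": "checkout", "operation": "charge", "start_ms": 50, "end_ms": 130, "error": True},
    {"service": "cart", "operation": "load", "start_ms": 10, "end_ms": 25},
]
metrics = synthesize_metrics(spans)
```

Here `metrics[("checkout", "charge")]` reports two calls, one error, a mean duration of 100 ms, and a maximum of 120 ms, all derived from spans that carried no explicit metric.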
Dapper
The Dapper distributed tracing solution originated at Google and is described in a paper from 2010. It is a common ancestor to most of the tools listed here, including Zipkin, Jaeger, Haystack, OpenTracing, and OpenCensus. Although Dapper doesn't exist as a solution you can download and install, the paper is still a good reference for the primitives used in modern distributed tracing solutions, as well as the reasoning behind some of the design decisions.
W3C Trace Context
One of the big problems with the current distributed tracing ecosystem is interoperability between applications instrumented using different tracers. To solve this problem, the Distributed Tracing Working Group was formed at the World Wide Web Consortium (W3C) to work on the Trace Context recommendation for the propagation format.
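For a sense of what the Trace Context propagation format looks like on the wire, here is a small sketch that parses and formats the `traceparent` header in its version `00` layout (`version-traceid-parentid-flags`, all lowercase hex). The function names are hypothetical, the sketch deliberately rejects any version other than `00`, and it ignores the companion `tracestate` header.

```python
import re

# version "00", 32-hex trace ID, 16-hex parent (span) ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # The lowest bit of the flags byte marks the trace as sampled.
        "sampled": bool(int(flags, 16) & 0x01),
    }


def format_traceparent(trace_id, parent_id, sampled):
    """Build a version-00 traceparent value for an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"
```

Because every tracer that follows the format reads and writes the same header, a request can cross services instrumented with different tracers without losing its trace ID.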
Overview of all projects
Project | Instrumentation | Tracer | Analysis tool |
---|---|---|---|
Apache SkyWalking | ✗ | ✗ | ✓ |
Apache (Incubating) Zipkin | ✓ | ✓ | ✓ |
Haystack | ✗ | ✓ | ✓ |
Jaeger | ✗ | ✓ | ✓ |
OpenCensus | ✓ | ✓ | ✗ |
OpenTracing | ✓ | ✗ | ✗ |
OpenTracing + OpenCensus | ?/✓ | ?/✓ | ✗ |
Pinpoint | ✓ | ✓ | ✓ |
Veneur | ✓ | ✓ | ✗ |
Juraci Paixão Kröhling will present "What are my microservices doing?" at Red Hat Summit, Thursday, May 9, 11:00 a.m.-11:45 a.m. This talk will look at some of the challenges presented by microservices architecture, including the observability problem, where it is hard to know which services exist, how they interrelate, and the importance of each one.
Last updated: June 4, 2024