OpenTracing: An Open Standard for Distributed Tracing

Originally published at: OpenTracing: An Open Standard for Distributed Tracing | InfluxData


Logs help you understand what your application is doing. Almost every application generates its own logs on the server that hosts it. But in a modern distributed system or a microservices environment, logs alone are not enough.

We can use services like Elasticsearch to gather all of an application’s logs together, which is simple enough when you’re dealing with just one application. But nowadays, an application is more like a collection of services, each of which generates its own logs. Since each log records only the actions and events of its own service, how do we span these services to gain insight into the application as a whole?

Questions like this one are a testament to why distributed tracing is becoming a requirement for effective monitoring, prompting the need for a tracing standard.

OpenTracing in Theory

Although tracing offers visibility into an application as processes grow in number, instrumenting a system for tracing has thus far been labor-intensive and complex. The OpenTracing standard changes that, enabling the instrumentation of applications for distributed tracing with minimal effort.

In October 2016, OpenTracing became a project under the guidance of the Cloud Native Computing Foundation. Under the CNCF’s stewardship, OpenTracing aims to be an open, vendor-neutral standard for distributed systems instrumentation. It offers a way for developers to follow the thread — to trace requests from beginning to end across touchpoints and understand distributed systems at scale.

In OpenTracing, a trace tells the story of a transaction or workflow as it propagates through a distributed system. The concept of the trace borrows a structure from graph theory called a directed acyclic graph (DAG), in which every edge has a direction and no path ever leads back to where it began. A DAG therefore stages the parts of a process from a clear start to a clear end. Certain groups of steps, or spans, in between may be repeated, but never indefinitely, the way a “do loop” without an exit condition would.

So a trace in this context is a DAG made up of spans — named, timed operations representing contiguous segments of work in that trace. Each component in a distributed trace will contribute its own spans.

Now that you have a bit of background, the following definitions should make more sense: A trace is a set of spans that share a common root. For OpenTracing, a trace is built by collecting all spans that share a TraceId.
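To make the definition concrete, here is a minimal sketch of how a collector might assemble traces from a flat list of spans. The Span class and its field names (trace_id, span_id, parent_id) are assumptions made for this illustration; they are not the OpenTracing API or any particular tracer's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTracing API)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    operation: str
    start_us: int             # start time in microseconds
    finish_us: int            # finish time in microseconds


def group_by_trace(spans):
    """Collect spans into traces, keyed by the trace_id they share."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def find_root(trace):
    """Return the one span without a parent; raise if there is not exactly one."""
    roots = [s for s in trace if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


# Spans as they might arrive from several services, in no particular order.
spans = [
    Span("t1", "b", "a", "auth.verify", 50, 200),
    Span("t2", "d", None, "GET /health", 10, 20),
    Span("t1", "a", None, "GET /checkout", 0, 900),
    Span("t1", "c", "a", "db.query", 250, 800),
]

traces = group_by_trace(spans)
root = find_root(traces["t1"])
print(root.operation, len(traces["t1"]))  # GET /checkout 3
```

Because each span points only at its parent, following the parent_id links from any span leads back to the root, which is what keeps the trace acyclic.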

In this instance, a span is a set of annotations that corresponds to a particular remote procedure call. Each span covers an interval of time and has its own log. Key/value pairs can be attached to a specific span, as tags and as timestamped log entries, to help you better understand the events to which the span refers. The span context, in turn, carries the identifiers (and any key/value baggage) that must travel with a request across process boundaries. Basically, tracing is about spans, inter-process propagation, and active span management.
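Inter-process propagation is easier to picture with a small example. The sketch below passes a span context from one service to another inside a dictionary of HTTP-style headers. The names inject and extract echo the corresponding OpenTracing concepts, but the signatures, the SpanContext class, and the header names are all simplified assumptions made for this illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

# Header names are hypothetical, chosen for this sketch only.
TRACE_HEADER = "x-sketch-trace-id"
SPAN_HEADER = "x-sketch-span-id"
BAGGAGE_PREFIX = "x-sketch-baggage-"


@dataclass
class SpanContext:
    """The part of a span that must cross process boundaries."""
    trace_id: str
    span_id: str
    baggage: Dict[str, str] = field(default_factory=dict)


def new_id() -> str:
    """Generate a random 16-character hexadecimal identifier."""
    return uuid.uuid4().hex[:16]


def inject(ctx: SpanContext, carrier: Dict[str, str]) -> None:
    """Write the context into an outgoing carrier, such as request headers."""
    carrier[TRACE_HEADER] = ctx.trace_id
    carrier[SPAN_HEADER] = ctx.span_id
    for key, value in ctx.baggage.items():
        carrier[BAGGAGE_PREFIX + key] = value


def extract(carrier: Dict[str, str]) -> Optional[SpanContext]:
    """Rebuild the context from an incoming carrier, or return None if absent."""
    if TRACE_HEADER not in carrier or SPAN_HEADER not in carrier:
        return None
    baggage = {
        key[len(BAGGAGE_PREFIX):]: value
        for key, value in carrier.items()
        if key.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(carrier[TRACE_HEADER], carrier[SPAN_HEADER], baggage)


def start_child(parent: SpanContext) -> SpanContext:
    """A child keeps the trace_id and baggage but gets a fresh span_id."""
    return SpanContext(parent.trace_id, new_id(), dict(parent.baggage))


# Service A: start a trace and attach its context to the outgoing request.
client_ctx = SpanContext(new_id(), new_id(), {"customer": "42"})
headers: Dict[str, str] = {}
inject(client_ctx, headers)

# Service B: recover the context and continue the same trace.
parent_ctx = extract(headers)
server_ctx = start_child(parent_ctx)
print(server_ctx.trace_id == client_ctx.trace_id)  # True
```

The point of the design is that only a few small identifiers travel with the request; the timing data and logs stay with each service until a collector stitches them together by trace ID.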

Why OpenTracing Adoption Is Growing

In microservices architectures, there are more applications communicating with each other than ever before. While application performance monitoring is great for debugging inside a single app, as a system expands into multiple services, how can you understand how much time each service is taking, where an exception happens, and the overall health of your system? In particular, how do you measure network latency between services, such as how long a request takes to travel from one app to another?
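As a rough illustration of the latency question, suppose the calling service records a span around its outgoing request and the called service records a span around its handler. One simple estimate of the time spent on the network is the difference between the two durations. The TimedSpan class and the sample numbers below are assumptions made for this sketch, not output from a real tracer.

```python
from dataclasses import dataclass


@dataclass
class TimedSpan:
    """Hypothetical span holding only what this estimate needs."""
    service: str
    operation: str
    start_us: int   # microseconds on the recording host's own clock
    finish_us: int

    @property
    def duration_us(self) -> int:
        return self.finish_us - self.start_us


def network_overhead_us(client: TimedSpan, server: TimedSpan) -> int:
    """Estimate round-trip time on the wire: client duration minus server duration."""
    overhead = client.duration_us - server.duration_us
    if overhead < 0:
        raise ValueError("server span is longer than the client span that wraps it")
    return overhead


# The two hosts' clocks disagree wildly, but only durations are compared.
client = TimedSpan("frontend", "call inventory", 1_000, 9_000)
server = TimedSpan("inventory", "handle request", 500_003_000, 500_008_500)

print(network_overhead_us(client, server))  # 8000 - 5500 = 2500
```

Comparing durations rather than absolute timestamps sidesteps clock skew between hosts, at the cost of not knowing how the overhead splits between the request and the response.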

Enter distributed tracing instrumentation. With the higher-level distribution of services that takes place in a cloud-based environment, tracing will become a key part of the cloud infrastructure supporting those services.

If you’ve ever used the Firefox browser in development, you know that when you open its Browser Console (Ctrl + Shift + J), you can see all the components currently being executed in the cache, and their current operating status. In a sense, that’s a kind of trace.

Hopefully, the need for a tracing standard for server-side services is as obvious as the need for one on the client side. We need OpenTracing specifically because there are different languages and different libraries, each of which may use its own instrumentation, may send different data and may access its own database. So you rarely have a single trace from any single component. This fact is what gives rise to the need for a common language for the instrumentation of application code, library code, and all kinds of systems.

OpenTracing Use Cases

The OpenTracing documentation offers this candidate for a common definition for tracing: “a thin standardization layer that sits between application/library code and various systems that consume tracing and causality data.” As a standard mechanism for describing system behavior, OpenTracing would thereby serve as a way for applications, libraries, services, and frameworks to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.” Here’s where its value resides.

As discussed on GitHub, common use cases of OpenTracing include:

  • Microservices — for instance, reconstructing the journey that transactions take through a microservices architecture.
  • Caching — troubleshooting to determine whether a request is hitting the cache.
  • Arbitration — for example, tracing the full history of a single process and determining its behavior when multiple services contact it in parallel rather than sequentially.
  • Message bus monitoring — determining the proper spans and distributions of messages in a queue, to ensure they’re triggering the proper series of events, and also to make certain messages are brief, discrete, and never the sources of data leaks.
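The caching use case, for example, can be as simple as tagging each lookup span with its outcome and summarizing the tags afterward. In this sketch, the TaggedSpan class and the "cache.hit" tag name are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaggedSpan:
    """Hypothetical span carrying an operation name and key/value tags."""
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)


def cache_hit_ratio(spans: List[TaggedSpan], tag: str = "cache.hit") -> float:
    """Fraction of cache lookups that were hits; spans without the tag are ignored."""
    lookups = [s for s in spans if tag in s.tags]
    if not lookups:
        return 0.0
    hits = sum(1 for s in lookups if s.tags[tag] is True)
    return hits / len(lookups)


spans = [
    TaggedSpan("get user:1", {"cache.hit": True}),
    TaggedSpan("get user:2", {"cache.hit": False}),
    TaggedSpan("get user:1", {"cache.hit": True}),
    TaggedSpan("render page"),  # not a cache lookup, so it carries no tag
]

ratio = cache_hit_ratio(spans)
print(f"{ratio:.2f}")  # 0.67
```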

What InfluxData Is Doing with OpenTracing

Recognizing the need to simplify troubleshooting in microservice platforms, InfluxData decided to add tracing functionality to its Zipkin Telegraf plugin. Zipkin is a distributed tracing system that helps gather the timing data needed to troubleshoot latency problems common with microservices.

Zipkin commonly uses Cassandra as the backing store for its traces. We discovered it would be useful for Telegraf to collect the traces, then store their data in InfluxDB, a native Time Series Database. Since these traces are all timestamped, InfluxDB is a better choice for storing them. It’s optimized for time series data and built from the ground up for metrics and events. If you are already storing metrics in InfluxDB, it makes sense to store your traces there too, especially because you can then manipulate and cross-analyze traces with other metrics using Kapacitor, InfluxData’s native processing engine.
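To give a feel for what a span looks like as time series data, the sketch below formats one span as a point in InfluxDB line protocol (measurement, tags, fields, then a nanosecond timestamp). The measurement name, tag names, and field name are assumptions made for this illustration and are not necessarily the schema the Telegraf Zipkin plugin writes.

```python
from typing import Dict


def escape_tag(value: str) -> str:
    """Backslash-escape commas, equals signs, and spaces in a tag key or value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def span_to_line(measurement: str, tags: Dict[str, str],
                 duration_ns: int, timestamp_ns: int) -> str:
    """Format one span as a line protocol point with an integer duration field."""
    tag_part = ",".join(
        f"{escape_tag(key)}={escape_tag(value)}"
        for key, value in sorted(tags.items())  # tags sorted by key
    )
    # The trailing "i" marks the field value as an integer.
    return f"{measurement},{tag_part} duration_ns={duration_ns}i {timestamp_ns}"


line = span_to_line(
    "trace_span",  # hypothetical measurement name
    {"trace_id": "t1", "span_id": "b", "service": "auth", "name": "auth verify"},
    duration_ns=150_000,
    timestamp_ns=1_500_000_000_000_000_000,
)
print(line)
```

This prints trace_span,name=auth\ verify,service=auth,span_id=b,trace_id=t1 duration_ns=150000i 1500000000000000000. Once spans are stored this way, their durations can be queried alongside any other metric that shares the same time range.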

At InfluxData, we use what we build. To validate our own theories, we’re implementing OpenTracing in our InfluxCloud service. We’ll soon be sharing some of the details on the implementations, as well as how it has helped us in troubleshooting.

OpenTracing is getting a lot of attention from companies because developers want to know what’s happening in their applications. Through OpenTracing, developers are able to understand where each request started, where it is going, and what’s happening along its journey. Having more knowledge lets them take the appropriate action.

You can learn more by watching this recent InfluxData OpenTracing webinar.