Diagnosing errors in production systems can be a challenge. Diagnosing errors in distributed systems can quickly become a nightmare. With a single application, a developer can add logging to capture exception messages and stack traces, and even publish those logs to a central repository such as Azure Monitor. To piece together log entries from multiple communicating systems and applications, however, we need to understand how those entries relate to each other: how did calling system A result in a failure in system B?

This is the main challenge solved by distributed tracing. Any time a system communicates with another, an additional breadcrumb piggybacks on the communication so that the receiving system knows exactly which source request the current request correlates to.

Distributed tracing systems typically include this extra information automatically for every request in the environment, either through host agents that modify requests or through libraries used directly in the running applications. This introduces a challenge, however: every new tracing system invents its own way of linking requests together to form a chain or graph. If you want to change your tracing system, you will need to adopt a new way of linking calls together.

With the new W3C Trace Context standard, we can now adopt a standardized way of identifying actions/operations, then propagating that identifying information to downstream systems for eventual consumption in distributed tracing applications.
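As a rough illustration of what that identifying information looks like on the wire, the W3C Trace Context standard carries it in a traceparent header made of four dash-separated fields: version, trace ID, parent (span) ID, and flags. The sketch below parses the sample value from the specification using ActivityContext.TryParse from System.Diagnostics in .NET 5. The header value is example data only, not something taken from a real system.

```csharp
using System;
using System.Diagnostics;

// Sample header value from the W3C Trace Context specification:
// version - trace-id (32 hex chars) - parent-id (16 hex chars) - trace-flags
const string traceparent = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";

// Parse the header the way a receiving service could.
// The second argument (tracestate) is optional and omitted here.
bool parsed = ActivityContext.TryParse(traceparent, null, out ActivityContext context);

Console.WriteLine($"Parsed:   {parsed}");
Console.WriteLine($"Trace ID: {context.TraceId}");
Console.WriteLine($"Span ID:  {context.SpanId}");
Console.WriteLine($"Flags:    {context.TraceFlags}");
```

The trace ID stays the same for every hop in a request, while each hop contributes its own span ID, which is what lets a tracing tool rebuild the chain of calls.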

With .NET 5, the System.Diagnostics.DiagnosticSource API automatically turns on W3C support:

[Figure: Zipkin W3C support]

Each “Activity” receives a unique identifier and that identifier propagates through communication headers to the downstream services/applications/processes. With this in place, we still need to *report* on that information, and that’s where OpenTelemetry and Zipkin come into the picture.
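To make the propagation step concrete, here is a minimal sketch that simulates two services in a single process. The activity names (gateway-request, backend-request) are made up for illustration, and in a real application the HTTP client and ASP.NET Core would move the header for us rather than a local variable.

```csharp
using System;
using System.Diagnostics;

// .NET 5 already defaults to the W3C format; set it explicitly so the
// sketch behaves the same way on older runtimes.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

// Upstream service: start an activity and capture its ID, which is the
// value that would travel in the outgoing traceparent header.
var upstream = new Activity("gateway-request").Start();
string outgoingHeader = upstream.Id
    ?? throw new InvalidOperationException("Activity has no ID");
upstream.Stop();

// Downstream service: a new activity adopts the incoming header as its parent.
var downstream = new Activity("backend-request");
downstream.SetParentId(outgoingHeader);
downstream.Start();

Console.WriteLine($"Upstream ID:   {outgoingHeader}");
Console.WriteLine($"Downstream ID: {downstream.Id}");
Console.WriteLine($"Same trace:    {downstream.TraceId.ToHexString() == upstream.TraceId.ToHexString()}");

downstream.Stop();
```

Both activities share one trace ID, and the downstream activity records the upstream span ID as its parent. That parent link is the information a tracing tool later uses to draw the call graph.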

Exporting traces to Zipkin using OpenTelemetry

While the W3C Trace Context standard describes how to identify and correlate distributed activities, OpenTelemetry standardizes how to report on those activities. To enable OpenTelemetry in a .NET 5 application, we need three things:

  • Listeners for telemetry activities
  • Host extension to gather activities for background exporting
  • Exporters for individual tracing tools
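The first item, a listener, can be sketched with the ActivitySource and ActivityListener types built into .NET 5. This is only an illustration of the mechanism, not how OpenTelemetry itself is implemented: the source name Sample.Gateway, the operation name, and the in-memory list standing in for an exporter are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical source name; instrumented libraries publish activities
// under names like this one.
using var source = new ActivitySource("Sample.Gateway");

// Stand-in for an exporter: collect a line of text per finished activity.
var collected = new List<string>();

using var listener = new ActivityListener
{
    // Subscribe only to the sources we care about.
    ShouldListenTo = s => s.Name == "Sample.Gateway",
    // Ask the runtime to create and fully populate every activity.
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData,
    // Called once an activity completes; a real exporter would ship it off.
    ActivityStopped = activity =>
        collected.Add($"{activity.OperationName} {activity.Id}")
};
ActivitySource.AddActivityListener(listener);

// StartActivity returns null when nobody is listening, hence the "?.".
using (var activity = source.StartActivity("GET /orders"))
{
    activity?.SetTag("http.method", "GET");
}

foreach (var line in collected)
{
    Console.WriteLine(line);
}
```

Separating the code that produces activities from the code that listens for them is what allows the reporting side to be swapped without touching the application code.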

OpenTelemetry defines APIs that tracing tools are starting to adopt, but today we still need a bridge between OpenTelemetry and Zipkin. In the future, our applications should not need to know anything about our tracing tools, which makes it much easier to switch to a different tool if our choice ever changes.

For a web application, we’ll need to pull in packages for each of the three pieces above.

Our registration for hosting is in our normal Startup file for ASP.NET Core:

services.AddOpenTelemetryTracing(config => config
    .AddZipkinExporter(o =>
    {
        // Send spans to a local Zipkin collector
        o.Endpoint = new Uri("http://localhost:9411/api/v2/spans");
        o.ServiceName = "Api Gateway";
    }));

In a complex microservices environment, we might have an API gateway that aggregates calls to many backend services:

[Figure: Zipkin API Gateway]

Ideally, we can trace this call end-to-end. With the registration above, we can now view this individual trace in our Zipkin UI:

[Figure: Zipkin UI DivergentGateway]

Within each activity (or “span,” as Zipkin and OpenTelemetry refer to them), we can see more details about the request:

[Figure: Zipkin span activity]

More complicated, however, are requests that involve some form of durable messaging such as RabbitMQ or Azure Service Bus. With these, it’s less likely that our client will have the appropriate instrumentation hooks for whatever tracing tool we want to use. This is where OpenTelemetry shines. For example, using NServiceBus and an extension for OpenTelemetry, we can register an OpenTelemetry listener that will work with any messaging transport:

services.AddOpenTelemetryTracing(config => config
    .AddZipkinExporter(o =>
    {
        // Same local Zipkin collector; each endpoint reports under its own name
        o.Endpoint = new Uri("http://localhost:9411/api/v2/spans");
        o.ServiceName = EndpointName;
    }));
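
To show the general idea behind tracing across a message broker, the following sketch carries the trace context in a message header instead of an HTTP header. It is an assumption-laden illustration, not the NServiceBus extension's actual implementation: the Sample.Messaging source name, the operation names, and the dictionary standing in for a message's headers are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

using var source = new ActivitySource("Sample.Messaging"); // hypothetical name
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(listener);

// Stand-in for the headers of a message sitting in a queue.
var headers = new Dictionary<string, string>();
string sendTraceId = "";
string sendSpanId = "";

// Sender: stamp the outgoing message with the current activity's ID.
using (var send = source.StartActivity("send OrderPlaced", ActivityKind.Producer))
{
    if (send?.Id is string id)
    {
        headers["traceparent"] = id;
        sendTraceId = send.TraceId.ToHexString();
        sendSpanId = send.SpanId.ToHexString();
    }
}

// Receiver (possibly much later, in another process): restore the context
// from the header and use it as the parent of the processing activity.
ActivityContext.TryParse(headers["traceparent"], null, out ActivityContext parent);
string receiveTraceId = "";
string receiveParentSpanId = "";
using (var receive = source.StartActivity("process OrderPlaced", ActivityKind.Consumer, parent))
{
    if (receive != null)
    {
        receiveTraceId = receive.TraceId.ToHexString();
        receiveParentSpanId = receive.ParentSpanId.ToHexString();
    }
}

Console.WriteLine($"Same trace: {sendTraceId == receiveTraceId}");
Console.WriteLine($"Linked to sender span: {sendSpanId == receiveParentSpanId}");
```

Because the context rides along with the message itself, the link survives however long the message waits in the queue and whichever transport delivers it.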

We can now instrument a complex distributed interaction that mixes many different communication types and protocols:

[Figure: complex interaction]

Even though our interactions may be complex, with W3C Trace Context carrying the links between interactions and OpenTelemetry surfacing them, we can view a single trace that links everything together:

[Figure: single trace link]

Although distributed systems are complex, distributed tracing gives us a coherent view of individual actions across every system and application involved. With .NET 5 and the W3C Trace Context and OpenTelemetry standards, we can take advantage of distributed tracing without vendor lock-in.

Post originally published on jimmybogard.com
