
How to Set Up Distributed Tracing in Python Microservices with OpenTelemetry and Jaeger

Learn how to implement distributed tracing in Python microservices using OpenTelemetry and Jaeger to debug and optimize performance.


I was debugging a slow order process in our microservices system last week, and it felt like searching for a needle in a haystack. One user request jumped through five different services, and when something went wrong, I had no clear path to follow. That’s when I decided to get serious about distributed tracing. If you’re building or managing Python microservices, this is for you. Let’s build a way to see exactly where your requests go and how long they take.

Have you ever watched a single API call vanish into your architecture, only to reappear seconds later with an error? Without proper tracing, you’re left guessing. Distributed tracing fixes that by giving you a complete map of every service a request touches. I’ll show you how to set this up with OpenTelemetry and Jaeger, step by step.

First, let’s talk about why this matters. In a microservices setup, a simple action like placing an order might involve an API gateway, an authentication service, an inventory check, a payment processor, and a notification system. If the payment fails, which service caused it? Was it a slow database query or a network timeout? Traditional logging won’t cut it because each service’s logs live in isolation. Tracing connects the dots.

I started with OpenTelemetry because it’s become the go-to standard for observability. It works across different languages and tools, which is perfect for mixed environments. Jaeger is my preferred backend because it’s open-source, easy to run, and gives a clear visual timeline. Together, they let you capture spans—which are individual units of work—and combine them into traces that show the full journey.

Setting up the infrastructure is straightforward. I use Docker Compose to run Jaeger locally. Here’s a basic setup I often use:

version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.51
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver, where our services send spans
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Save this as docker-compose.yml and run docker-compose up -d. Now, open your browser to http://localhost:16686—you’ll see the Jaeger UI waiting for data. It’s that simple to get started.

Next, we need to prepare our Python services. I install OpenTelemetry packages using pip. Create a requirements.txt file:

opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-exporter-otlp==1.21.0
opentelemetry-instrumentation-flask==0.42b0
flask==3.0.0

Run pip install -r requirements.txt. These packages give us the tools to instrument our code. But what does instrumentation mean? It’s the process of adding tracing calls to your application, either automatically or manually.

Let’s instrument a Flask microservice. Imagine you have an authentication service. Here’s how I add tracing to it:

from flask import Flask, jsonify
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Set up tracing; the service name is what Jaeger displays, so pick one per service
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "auth-service"}))
)
tracer = trace.get_tracer(__name__)

# Export spans to the OTLP gRPC endpoint we exposed on port 4317
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask so every request gets a span with timing and status
FlaskInstrumentor().instrument_app(app)

@app.route('/auth', methods=['POST'])
def authenticate():
    # The instrumentation creates the request span automatically
    return jsonify({"status": "authenticated"}), 200

if __name__ == '__main__':
    app.run(port=5000)

With this code, every HTTP request to the /auth endpoint generates a span. OpenTelemetry handles the details, like timing and context propagation. When you run this service and send a request, traces will appear in Jaeger. Try it out—see how long the authentication takes.
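
To sanity-check the setup, a short client script like this one (it just hits the endpoint a few times) is enough to generate traces you can inspect in the Jaeger UI:

import requests

# Hit the auth endpoint a few times, then search for the service in Jaeger
for _ in range(3):
    response = requests.post("http://localhost:5000/auth")
    print(response.status_code, response.json())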

But what if you use FastAPI for async services? The process is similar. I add opentelemetry-instrumentation-fastapi to the requirements and instrument it. Here’s a snippet:

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

# Assumes the tracer provider and OTLP exporter are configured as in the Flask example
FastAPIInstrumentor.instrument_app(app)

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    return {"order_id": order_id, "status": "processing"}

Auto-instrumentation covers common frameworks, but sometimes you need more control. That’s where manual spans come in. Suppose you have a complex business logic function. You can wrap it in a span to see exactly how it performs.

import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_payment(amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", amount)
        # Your payment logic here
        time.sleep(0.1)  # Simulate work
        span.add_event("payment_processed", {"timestamp": time.time()})
        return True

In this example, I create a span named “process_payment”, add an attribute for the amount, and log an event when it’s done. Attributes are key-value pairs that add context, like user IDs or request types. Events are like timestamps within the span, useful for marking specific moments.

How do spans connect across services? Through context propagation. When one service calls another over HTTP, the caller sends the trace context in headers. OpenTelemetry handles this automatically for libraries like requests once you install opentelemetry-instrumentation-requests and enable it. Here’s how I make a call from one service to another:

import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Patch the requests library so outgoing calls carry the trace context headers
RequestsInstrumentor().instrument()

response = requests.get("http://localhost:5001/orders/123")

The instrumentation injects headers so the receiving service can link the spans. This way, a trace from the auth service to the order service appears as a single timeline in Jaeger. Have you considered how this works with message queues like RabbitMQ? It’s possible by adding context to messages, but I’ll keep it simple for now.
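
That said, the core idea is simple enough to sketch. You inject the current trace context into the message headers on the producer side and extract it on the consumer side; the queue client below (queue.publish and the handler) is just a placeholder, not a real RabbitMQ API:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def publish_order_event(queue, payload):
    headers = {}
    inject(headers)  # copies the current trace context into the dict
    queue.publish(payload, headers=headers)  # placeholder publish call

def handle_order_event(headers, payload):
    ctx = extract(headers)  # rebuilds the trace context on the consumer side
    with tracer.start_as_current_span("handle_order_event", context=ctx):
        pass  # process the message here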

Database queries are another critical area. I use SQLAlchemy, and OpenTelemetry can trace those too. Add opentelemetry-instrumentation-sqlalchemy and instrument your engine:

from sqlalchemy import create_engine
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

engine = create_engine("sqlite:///mydatabase.db")
SQLAlchemyInstrumentor().instrument(engine=engine)

Now, every query generates a span, showing you slow operations. This is invaluable for spotting bottlenecks—imagine finding that one JOIN query taking seconds.
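
To see it in action, run any query through the instrumented engine; each one shows up as its own span (the orders table here is just an example):

from sqlalchemy import text

with engine.connect() as conn:
    # This SELECT appears in Jaeger as a span with the statement attached
    rows = conn.execute(text("SELECT id, status FROM orders LIMIT 10")).fetchall()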

As your system grows, you might worry about performance. Tracing adds overhead, but you can manage it with sampling. Sampling decides which traces to collect. I often use probability sampling to capture only a fraction of requests. Here’s how to set it up:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 50% of traces
sampler = TraceIdRatioBased(0.5)
tracer_provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(tracer_provider)

This balances visibility with resource use. In production, you might sample more for critical paths and less for others.
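
One detail worth noting: in a multi-service system you usually want downstream services to honor the sampling decision made where the trace started, instead of re-rolling the dice at every hop. Wrapping the ratio sampler in ParentBased does exactly that; here’s a minimal sketch:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new root traces, but always follow the parent's decision
# when a request arrives with trace context already attached
sampler = ParentBased(root=TraceIdRatioBased(0.1))
tracer_provider = TracerProvider(sampler=sampler)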

Tracing works best when combined with logs. I correlate them by adding trace IDs to log messages. OpenTelemetry provides tools for this. For example:

import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def some_function():
    current_span = trace.get_current_span()
    # Format the trace ID as 32 hex characters so it matches the IDs shown in Jaeger
    trace_id = format(current_span.get_span_context().trace_id, "032x")
    logger.info("Processing request, trace_id: %s", trace_id)

Now, when you see a log, you can jump to the trace in Jaeger using that ID. It’s a game-changer for debugging.
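
If you’d rather not touch every log call, the opentelemetry-instrumentation-logging package (an extra dependency beyond the ones listed earlier) can inject the trace and span IDs into the standard logging format for you. A minimal sketch, assuming that package is installed:

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Adds otelTraceID and otelSpanID fields to the default logging format
LoggingInstrumentor().instrument(set_logging_format=True)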

When deploying to production, security is key. Use TLS for exporters, restrict access to Jaeger, and set up resource limits. I run Jaeger in a Kubernetes cluster with persistent storage for long-term trace retention. Start small, monitor the impact, and scale as needed.
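
For the TLS piece, the OTLP gRPC exporter can take channel credentials instead of insecure=True. Here’s a rough sketch; the collector hostname and certificate path are placeholders for whatever your environment uses:

import grpc
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder CA certificate path and collector hostname
with open("/etc/ssl/certs/otel-ca.pem", "rb") as f:
    credentials = grpc.ssl_channel_credentials(root_certificates=f.read())

otlp_exporter = OTLPSpanExporter(
    endpoint="jaeger-collector.example.com:4317",
    credentials=credentials,
)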

Throughout this journey, I’ve learned that tracing isn’t just for debugging—it helps optimize performance and understand user behavior. By implementing this, you’ll gain insights that save hours of guesswork.

I hope this guide helps you tackle those tricky distributed systems issues. Give it a try, and you’ll wonder how you managed without it. If you found this useful, please like, share, or comment below with your experiences. I’d love to hear how tracing has helped your projects.




