How to Build a Production-Ready LLM Streaming API with FastAPI, SSE, Backpressure, and Cost Tracking

Learn to build a production-ready LLM streaming API with FastAPI, SSE, backpressure, rate limits, and cost tracking for reliable real-time UX.

I’ve been there—watching a chatbot cursor blink, waiting for a full response while the server crunches through an entire LLM completion. It feels slow, even when it’s technically fast. Lately, I’ve been building systems where that delay isn’t just annoying; it breaks the user’s sense of a real-time conversation. That’s why I became fixated on building a proper streaming API. But I quickly found that setting stream=True is just the starting line. The real challenge is building something that won’t collapse under load, go over budget, or deliver a broken experience when a user closes their tab mid-stream. Today, I’ll walk you through how to build that robust system.

Why does streaming feel so much more responsive? It comes down to a metric called Time to First Token (TTFT). Instead of waiting for the entire response, your client can start processing the very first piece of data almost immediately. This creates a perception of speed and interactivity that batch processing can never match. But have you ever wondered what happens to all those tokens if the client can’t keep up?
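To make TTFT concrete, here's a minimal client-side sketch that measures it. I'm assuming httpx as the HTTP client here; the URL is a placeholder for your streaming endpoint:

import time

import httpx

async def measure_ttft(url: str) -> float:
    # Time from sending the request until the first streamed chunk lands
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            async for _ in response.aiter_bytes():
                return time.perf_counter() - start  # first chunk arrived
    return time.perf_counter() - start  # stream ended with no data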

Let’s start with the foundation: Server-Sent Events (SSE). It’s a simple HTTP protocol for sending a stream of text events from server to client. FastAPI, with a little help from the sse-starlette package, makes serving an SSE endpoint straightforward. The magic happens when we connect this to an asynchronous generator.

import asyncio

from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

from app.core.streaming import TokenStreamHandler

app = FastAPI()

@app.get("/stream")
async def stream_response(prompt: str):
    handler = TokenStreamHandler()
    # Start the LLM generation in a background task; keep a reference
    # so the task isn't garbage-collected mid-stream
    task = asyncio.create_task(generate_llm_response(prompt, handler))
    # Return the event stream connected to the handler's generator
    return EventSourceResponse(handler.token_generator())

This code creates a pathway. Tokens from the LLM flow into the handler’s queue, and the token_generator pulls them out, yielding each one to the SSE response. But here’s a critical question: what if the LLM generates tokens faster than the network can send them?

This is where backpressure comes in. Without it, a fast LLM and a slow client connection cause tokens to pile up in memory. In a production setting with many users, this is a recipe for disaster. The solution is a bounded queue. When the queue is full, the act of trying to add a new token will make the producer pause, naturally slowing down the LLM’s consumption until the client catches up.

import asyncio

class BackpressureTokenStream:
    def __init__(self, maxsize: int = 64):  # Small, intentional limit
        self._queue = asyncio.Queue(maxsize=maxsize)

    async def put_token(self, token: str | None):
        # This will wait if the queue is full
        await self._queue.put(token)

    async def stream_tokens(self):
        while True:
            token = await self._queue.get()
            if token is None:  # Our sentinel for 'done'
                break
            yield token

By limiting the queue size, we create a smooth, self-regulating flow of data. Because the producer awaits the queue's put, it stops reading from the LLM provider's HTTP stream, and ordinary TCP flow control pauses the provider in turn, preventing our server from drowning in unsent tokens. It's a simple mechanism with a huge impact on stability.
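To see the two halves meet, here's a hedged sketch of a producer feeding that queue; generate_tokens is a stand-in for whatever async iterator yields tokens from your LLM client:

async def produce(stream: BackpressureTokenStream, prompt: str):
    # put_token blocks whenever the queue is full, so a slow client
    # automatically slows this loop down
    async for token in generate_tokens(prompt):  # hypothetical token source
        await stream.put_token(token)
    await stream.put_token(None)  # signal completion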

Now, let’s talk about control. In a shared API, you need to govern usage. A sliding-window rate limiter backed by Redis is perfect for this: it tracks each user’s recent requests in real time, enforcing fair usage without complex infrastructure. But rate limiting is just one side of the coin. What about tracking costs as they happen?
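Here's a minimal sketch of that limiter using redis-py's asyncio client and a sorted set of request timestamps per user; the key prefix, limit, and window are my own placeholders:

import time
import uuid

from redis.asyncio import Redis

redis = Redis()  # assumes a local Redis instance

async def allow_request(user_id: str, limit: int = 60, window_s: int = 60) -> bool:
    # Sliding window: at most `limit` requests in the last `window_s` seconds
    key = f"ratelimit:{user_id}"
    now = time.time()
    async with redis.pipeline(transaction=True) as pipe:
        pipe.zremrangebyscore(key, 0, now - window_s)  # drop expired entries
        pipe.zadd(key, {uuid.uuid4().hex: now})        # record this request
        pipe.zcard(key)                                # count what remains
        pipe.expire(key, window_s)                     # let idle keys expire
        _, _, count, _ = await pipe.execute()
    return count <= limit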

Real-time cost tracking is non-negotiable in production. You need to know the expense of each request as it completes, not hours later from a log dump. That means counting tokens and applying a pricing table immediately. We can do this with a Pydantic model and a simple middleware that calculates cost after a stream finishes.

from pydantic import BaseModel
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # Count tokens locally with the model's own tokenizer
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

class TokenUsage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    model: str

    @property
    def cost(self) -> float:
        # Example USD pricing per 1K tokens for gpt-4o
        price_map = {"gpt-4o": {"input": 0.0025, "output": 0.0100}}
        rates = price_map.get(self.model, {})
        prompt_cost = (self.prompt_tokens / 1000) * rates.get("input", 0)
        completion_cost = (self.completion_tokens / 1000) * rates.get("output", 0)
        return prompt_cost + completion_cost

The middleware would intercept the request, run the stream, and then use a callback from the LLM provider to tally the final token counts and log the cost to our ledger. This gives you an immediate, accurate view of spend. Can you see how this changes how you might monitor your application’s health?
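As a sketch of that final step, assuming the provider's callback hands us final token counts, the tally-and-log piece might look like this (the logger stands in for whatever billing store you use):

import logging

logger = logging.getLogger("cost_ledger")

async def record_usage(prompt_tokens: int, completion_tokens: int, model: str):
    # Called once the stream finishes, with counts from the provider callback
    usage = TokenUsage(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        model=model,
    )
    # Replace this log line with a write to your billing store
    logger.info("model=%s tokens=%d/%d cost=$%.6f",
                model, usage.prompt_tokens, usage.completion_tokens, usage.cost)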

Integrating LangChain cleanly is the next step. We use its AsyncCallbackHandler to capture each new token and send it to our backpressure-managed queue. The key is to keep the LangChain chain inside an asynchronous task, so it doesn’t block the main event loop and our SSE endpoint stays responsive.
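Here's a minimal sketch of such a handler, assuming langchain-core's AsyncCallbackHandler interface. The done-sentinel is deliberately left to the task that runs the chain (shown below), so the stream terminates even when the provider errors out mid-generation:

from langchain_core.callbacks import AsyncCallbackHandler

class QueueStreamingHandler(AsyncCallbackHandler):
    def __init__(self, stream: BackpressureTokenStream):
        self._stream = stream

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Awaiting the bounded queue is what propagates backpressure
        # all the way back into the chain
        await self._stream.put_token(token)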

Handling failures gracefully is what separates a prototype from production. If a client disconnects, we must cancel the LLM generation task to stop wasting resources. If the LLM provider times out, we need to send a clear error event to the client before closing the stream. This requires careful try/except blocks and listening for client disconnect events.
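Here's one sketch of those safeguards; run_chain is a hypothetical wrapper (shown next) that always pushes the None sentinel onto the queue, even when the provider call fails:

from fastapi import Request

@app.get("/stream-safe")  # hypothetical endpoint name
async def stream_safe(request: Request, prompt: str):
    stream = BackpressureTokenStream(maxsize=64)
    task = asyncio.create_task(run_chain(prompt, stream))

    async def event_source():
        try:
            async for token in stream.stream_tokens():
                if await request.is_disconnected():
                    break  # client is gone; stop generating
                yield {"data": token}
            # Surface a provider failure to the client before closing
            if task.done() and task.exception():
                yield {"event": "error", "data": "generation failed"}
        finally:
            task.cancel()  # no-op if already finished

    return EventSourceResponse(event_source())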

So, how do we bring all this together? The architecture flows from client to SSE endpoint, through a backpressure queue, to a LangChain chain with a custom handler, out to the LLM provider, and finally, token usage is fed back to our cost tracker. Each piece is decoupled, testable, and resilient.
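To close the loop, here's the hypothetical run_chain task that stream_safe starts. The try/finally is what guarantees the sentinel reaches the queue, and the allow_request check from earlier would sit at the top of the endpoint, returning a 429 before any tokens are spent:

async def run_chain(prompt: str, stream: BackpressureTokenStream):
    # `chain` is whatever LangChain runnable you've assembled; the
    # callback handler receives every token as it's generated
    try:
        await chain.ainvoke(
            {"input": prompt},
            config={"callbacks": [QueueStreamingHandler(stream)]},
        )
    finally:
        await stream.put_token(None)  # always terminate the stream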

Building this has taught me that the best APIs are invisible. They feel fast and reliable because immense effort went into handling the edge cases—the slow networks, the sudden disconnects, the budget overruns. It’s not just about making words appear on a screen one by one; it’s about creating a trustworthy pipeline for thought itself.

Was there a point in this process that surprised you? Perhaps the idea that a simple queue size could prevent a server outage? I find these small, deliberate choices are what define robust systems.

I hope this guide helps you build something truly solid. If you’ve faced similar challenges or have questions on tweaking this setup for your own use case, I’d love to hear about it. Please share your thoughts in the comments—let’s learn from each other. If you found this useful, consider sharing it with another developer who’s wrestling with turning a streaming demo into a production-ready service.

