Production-Ready LLM Streaming with FastAPI, Asyncio, and SSE

Learn production-ready LLM streaming with FastAPI, asyncio, and SSE: efficient token delivery, graceful disconnect handling, and reliable scaling under load.

I’ve spent too many late nights watching a spinning cursor while waiting for an LLM to finish its entire response before showing me a single word. That’s not how humans read — we want to see the sentence form, letter by letter, like a human typing in real time. That instant feedback is what makes an AI feel alive, not like a sluggish database query. The moment I started building production chat applications, I realized that most tutorials skip the hard part: how to stream tokens efficiently under load, handle disconnecting users, and keep your server from dropping requests when a hundred people start chatting at once. This article is my attempt to fix that — to walk you through the real mechanics of streaming LLM pipelines using Python’s asyncio, FastAPI, and Server-Sent Events (SSE), with code you can actually deploy.

Why does most streaming advice fail in production? Because most tutorials put a synchronous loop inside an async server, blocking the event loop for every user, or reach for WebSockets when SSE is simpler and more reliable for one‑way token delivery. Let me show you the difference. Here’s a common mistake:

# Don't do this
@app.get("/chat")
def sync_stream():
    # Blocking client and a synchronous loop: nothing else runs while this yields
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a short story."}],
        stream=True,
    )
    for chunk in response:
        yield chunk["choices"][0]["delta"].get("content", "")

That for loop runs synchronously — every other request has to wait until this one finishes yielding. Under high concurrency, your server becomes a bottleneck. Instead, we need an asynchronous generator that never blocks.

But before we code, imagine you’re talking to a friend. They don’t finish their whole story before you hear a word. They send you the first syllable, pause, then the next. That’s the streaming mindset. In technical terms, we want to transform the LLM’s token‑by‑token output into a steady HTTP stream that the browser can display as it arrives. The hero of this story is Server‑Sent Events — a protocol where the server pushes data over a single long‑lived HTTP connection. The browser uses a simple EventSource API to receive messages. No WebSocket‑upgrade dance, no binary framing, just plain text with a data: prefix. Perfect for LLM tokens.
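
Concretely, the bytes on the wire are nothing exotic. Each event is one or more data: lines followed by a blank line, so a streamed story might start like this (payloads illustrative):

data: Once

data:  upon

data:  a

data:  time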

Let’s build the foundation. Start with FastAPI and an async endpoint that returns a StreamingResponse. I’ll use the openai library’s async client because it natively yields async streams.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import asyncio

app = FastAPI()
client = AsyncOpenAI()

async def generate_stream(messages):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        content = chunk.choices[0].delta.content or ""
        yield f"data: {content}\n\n"
        await asyncio.sleep(0)  # yield control to event loop

@app.get("/chat")
async def chat_endpoint():
    return StreamingResponse(
        generate_stream([{"role": "user", "content": "Tell me a short story."}]),
        media_type="text/event-stream"
    )

Notice the await asyncio.sleep(0): that tiny pause explicitly hands control back to the event loop between tokens. Without it, a burst of chunks from the upstream API can keep this generator busy, because yielding a value from an async generator does not by itself give other tasks a turn on the event loop. This detail saved me from a production outage where one slow stream starved all the others.
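
To sanity-check the endpoint without a browser, a small async client can consume the stream much like EventSource does. This is only a test sketch and assumes the app is running locally on port 8000:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(timeout=None) as http:
        # Open the SSE stream and print each "data:" payload as it arrives
        async with http.stream("GET", "http://localhost:8000/chat") as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: "):
                    print(line[len("data: "):], end="", flush=True)

asyncio.run(main())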

Now, what about backpressure? Imagine a user with a slow mobile connection. Your server keeps sending tokens faster than the network can accept them. The TCP send buffer fills up, the kernel blocks the write, and eventually the entire event loop stalls. To handle this, we can use an asyncio.Queue to decouple the LLM’s token production from the HTTP response consumption. Here’s a more robust pattern:

async def safe_stream(messages, request):
    # request is the FastAPI Request for this connection, passed in by the endpoint
    queue = asyncio.Queue(maxsize=10)  # limit in-flight tokens (backpressure)

    async def producer():
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True,
            )
            async for chunk in stream:
                await queue.put(chunk)  # blocks when the consumer falls behind
        finally:
            await queue.put(None)  # sentinel: always tell the consumer we're done

    producer_task = asyncio.create_task(producer())
    try:
        while True:
            # Stop pulling tokens if the client has gone away
            if await request.is_disconnected():
                break
            chunk = await queue.get()
            if chunk is None:
                break
            content = chunk.choices[0].delta.content or ""
            yield f"data: {content}\n\n"
    finally:
        # Cleanly cancel the producer on disconnect, error, or normal completion
        producer_task.cancel()

Have you ever wondered what happens when the client closes the browser mid‑stream? Without disconnect detection, your producer keeps pulling tokens from the API, wasting money and compute. FastAPI exposes is_disconnected() on the Request object you receive in the endpoint; pass that request into the generator and use it to stop consuming and cancel the background task, as shown above.
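
Wiring that together in the endpoint is a small sketch under the assumptions above (the /chat/safe path and the message query parameter are illustrative names):

from fastapi import Request

@app.get("/chat/safe")
async def safe_chat(request: Request, message: str):
    return StreamingResponse(
        safe_stream([{"role": "user", "content": message}], request),
        media_type="text/event-stream",
    )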

Now for the orchestration part: a single endpoint for multiple models. I like to use a simple router that reads a model query parameter and delegates to the right backend. Here’s a minimal example:

from fastapi import HTTPException

from app.backends.openai_backend import OpenAIStreamingBackend
from app.backends.ollama_backend import OllamaStreamingBackend

backends = {
    "gpt-4o": OpenAIStreamingBackend(api_key=...),
    "llama3": OllamaStreamingBackend(base_url="http://localhost:11434"),
}

@app.get("/chat/{model}")
async def chat(model: str, message: str):
    backend = backends.get(model)
    if not backend:
        raise HTTPException(400, "unknown model")
    return StreamingResponse(
        backend.stream(messages=[{"role": "user", "content": message}]),
        media_type="text/event-stream"
    )

The OllamaStreamingBackend would look similar but use httpx to send a POST to Ollama’s /api/generate with stream: true. The beauty of the abstraction is that both emit the same StreamChunk dataclass. In production, you can add a fallback: if OpenAI returns a 429 rate limit, automatically switch to Ollama. I’ve used this pattern for a demo that never goes down, even when my OpenAI quota runs out.
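
For illustration, here is a sketch of what that Ollama backend might look like. The StreamChunk fields and the stream method signature are my guesses at the abstraction described above; the request shape follows Ollama's documented /api/generate streaming behavior, which returns newline-delimited JSON objects with a response field:

import json
from dataclasses import dataclass

import httpx

@dataclass
class StreamChunk:
    content: str

class OllamaStreamingBackend:
    def __init__(self, base_url: str):
        self.base_url = base_url

    async def stream(self, messages):
        prompt = messages[-1]["content"]  # naive: use the last message as the prompt
        async with httpx.AsyncClient(base_url=self.base_url, timeout=None) as http:
            async with http.stream(
                "POST",
                "/api/generate",
                json={"model": "llama3", "prompt": prompt, "stream": True},
            ) as resp:
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    chunk = StreamChunk(content=json.loads(line).get("response", ""))
                    yield f"data: {chunk.content}\n\n"

Here the backend formats each chunk into an SSE frame itself, so its output can be handed straight to StreamingResponse in the router above.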

Let’s talk observability. Every streaming endpoint should expose metrics: tokens per second, time to first token, number of active connections. I instrument with prometheus-fastapi-instrumentator and add a custom middleware that counts tokens as they pass through.

from starlette.middleware.base import BaseHTTPMiddleware
from prometheus_client import Counter, Histogram

TOKENS_GENERATED = Counter("llm_tokens_total", "Total tokens streamed to clients")
LATENCY_PER_TOKEN = Histogram("llm_token_latency_seconds", "Time between token yields")

class StreamingMetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        response = await call_next(request)
        if request.url.path.startswith("/chat"):
            # Wrap the response's body iterator to count tokens (see the sketch below)
            ...
        return response

The full code would wrap the generator to increment TOKENS_GENERATED for each token and record the time since the previous token.
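
Here is one way the complete middleware might look, as a sketch. The instrument_sse helper name is mine, and the approach relies on the response returned by call_next exposing a replaceable body_iterator attribute, which Starlette's BaseHTTPMiddleware streaming responses do:

import time

def instrument_sse(body_iterator):
    async def wrapped():
        last = time.monotonic()
        async for chunk in body_iterator:
            now = time.monotonic()
            LATENCY_PER_TOKEN.observe(now - last)  # time since the previous chunk
            TOKENS_GENERATED.inc()                 # roughly one SSE message per chunk
            last = now
            yield chunk
    return wrapped()

class StreamingMetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        response = await call_next(request)
        if request.url.path.startswith("/chat"):
            response.body_iterator = instrument_sse(response.body_iterator)
        return response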

One last piece: deployment. FastAPI runs on Uvicorn, but for production you want Nginx in front to handle SSL termination, buffering, and connection limits. Configure Nginx to disable buffering for SSE:

location /chat {
    proxy_pass http://fastapi_app:8000;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    chunked_transfer_encoding on;
    proxy_buffering off;
    proxy_cache off;
}

That proxy_buffering off is critical — otherwise Nginx waits for the whole response before forwarding, killing the streaming illusion. I learned this the hard way when my first deployment showed tokens in bursts every 30 seconds.

Now, after all this engineering, what’s the user experience? I’ve seen teams spend weeks on the model logic but ignore the delivery. A slow, choppy stream feels worse than a delayed full response. Get the streaming right, and your users will feel like they’re talking to a thinking partner, not a machine grinding gears.

So here’s my final ask: if this article helped you understand the guts of LLM streaming, please like, share with your teammates who still use response.text, and comment with the biggest surprise you encountered while building your own pipeline. Every production horror story is a lesson we can all learn from. Let’s make AI feel instantaneous, one chunk at a time.

