How to Build Streaming LLM APIs with FastAPI, SSE, and Backpressure Control

Learn to build streaming LLM APIs with FastAPI and SSE, and to handle backpressure, client disconnects, and scaling issues for faster AI apps.


I was recently building a personal AI assistant that needed to stream responses token by token. The naive approach—wait for the entire response and then display it—made the interface feel sluggish, even if the network was fast. Users would watch a spinner for 10 seconds, then get a wall of text. The problem wasn’t the model’s speed; it was my pipeline. That’s when I started thinking about streaming APIs at a deeper level. Not just calling stream=True, but architecting the whole system to handle real-time token delivery, client disconnections, and memory overload gracefully. This article is the result of those experiments.


So why does streaming matter so much for LLM applications? Perceived latency matters more than total latency. When words appear one by one, the wait feels far shorter because the user sees progress immediately. A response that takes 15 seconds to generate feels responsive if the first token arrives in 200 milliseconds. But streaming introduces its own challenges. The server must keep connections open for potentially minutes, handle clients that disconnect halfway through, and prevent memory from ballooning when a slow reader falls behind.

Have you ever wondered what happens when your client is consuming tokens slower than the model is generating them? The server’s output buffer fills up, memory usage climbs, and eventually something breaks. That’s backpressure. And it’s one of the most overlooked aspects of streaming APIs.

Let’s start with the protocol choice. Server-Sent Events (SSE) is the simplest way to stream text to a browser or mobile app. Unlike WebSockets, SSE has built-in reconnection logic: the browser automatically retries if the connection drops. That’s perfect for LLM chat, where a temporary network blip shouldn’t kill the conversation. I’ve tried WebSockets too, but they require manual heartbeat and reconnection code; SSE removes that overhead. Plus, every modern browser supports EventSource out of the box, so no extra client library is required.

Here’s a minimal FastAPI endpoint that streams tokens using SSE:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

@app.get("/stream")
async def stream_llm(prompt: str, request: Request):
    async def event_generator():
        # Simulate token generation; a real endpoint would stream from the model using `prompt`
        for token in ["Hello", ", ", "world", "!"]:
            if await request.is_disconnected():
                break
            yield f"data: {token}\n\n"
            await asyncio.sleep(0.5)
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Nginx proxy fix
        }
    )

But real LLM backends aren’t simple loops. You need to handle multiple providers—OpenAI, Anthropic, even local models with Ollama. I built an abstract base class that normalizes the streaming interface. Each provider adapts its own API into a consistent AsyncGenerator[StreamChunk].
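Here’s a minimal sketch of that base class. The field names on StreamChunk and the LLMBackend name are illustrative, not the exact code from my project:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncGenerator, Optional

@dataclass
class StreamChunk:
    token: str
    usage: Optional[dict] = None   # filled on the final chunk when the provider reports it
    error: Optional[str] = None    # set when the provider fails mid-stream

class LLMBackend(ABC):
    @abstractmethod
    def stream(self, messages: list[dict]) -> AsyncGenerator[StreamChunk, None]:
        """Yield StreamChunk objects as the provider produces tokens."""
        ...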

The OpenAI provider looks clean because the official client already supports stream=True. But watch out: without stream_options={"include_usage": True}, you won’t get final token counts. And if the API raises a connection error, you need to propagate that properly to the client instead of crashing the generator. I learned that the hard way during a demo.
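Roughly what my OpenAI adapter looks like; treat the model name and the error-handling strategy as illustrative choices rather than the one right way:

from openai import AsyncOpenAI, APIConnectionError

class OpenAIBackend(LLMBackend):
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def stream(self, messages: list[dict]) -> AsyncGenerator[StreamChunk, None]:
        try:
            stream = await self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                stream=True,
                stream_options={"include_usage": True},  # without this, no final token counts
            )
            async for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    yield StreamChunk(token=chunk.choices[0].delta.content)
                if chunk.usage:  # final chunk: empty choices, usage populated
                    yield StreamChunk(token="", usage=chunk.usage.model_dump())
        except APIConnectionError as exc:
            # Propagate the failure to the consumer instead of crashing the generator
            yield StreamChunk(token="", error=str(exc))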

Anthropic’s Claude uses a different pattern: client.messages.stream() returns a context manager. You iterate over text_stream, then call get_final_message() to extract usage. It’s important to close the stream properly to avoid hanging connections. I wrap it in a try-finally block.
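The Anthropic adapter, roughly; the model name and max_tokens are placeholder values:

from anthropic import AsyncAnthropic

class AnthropicBackend(LLMBackend):
    def __init__(self, model: str = "claude-3-5-sonnet-latest"):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    async def stream(self, messages: list[dict]) -> AsyncGenerator[StreamChunk, None]:
        # The context manager closes the stream even if the consumer bails out early
        async with self.client.messages.stream(
            model=self.model,
            max_tokens=1024,
            messages=messages,
        ) as stream:
            async for text in stream.text_stream:
                yield StreamChunk(token=text)
            final = await stream.get_final_message()
            yield StreamChunk(token="", usage=final.usage.model_dump())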

For Ollama (running locally), you use their HTTP API with stream: true. The response comes as newline-delimited JSON. You parse each line, extract the token, and yield. This simplicity is great for testing, but you lose automatic retries and load balancing.
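And the Ollama adapter, assuming the default local endpoint and a model you’ve already pulled:

import json
import httpx

class OllamaBackend(LLMBackend):
    def __init__(self, model: str = "llama3", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    async def stream(self, messages: list[dict]) -> AsyncGenerator[StreamChunk, None]:
        payload = {"model": self.model, "messages": messages, "stream": True}
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", f"{self.base_url}/api/chat", json=payload) as resp:
                async for line in resp.aiter_lines():
                    if not line.strip():
                        continue
                    data = json.loads(line)  # newline-delimited JSON, one object per line
                    if data.get("done"):
                        break
                    yield StreamChunk(token=data["message"]["content"])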

Now, backpressure. I add an asyncio Queue with a maximum size between the provider and the SSE endpoint. The provider puts tokens into the queue, and the endpoint gets tokens from it. If the queue is full, the provider must wait—this naturally slows down the generation to match the client’s consumption speed. Here’s the pattern:

import asyncio
import json

async def provider_task(queue: asyncio.Queue, messages, backend):
    try:
        async for chunk in backend.stream(messages):
            await queue.put(chunk)  # blocks when the queue is full: that is the backpressure
    finally:
        await queue.put(None)  # sentinel: generation finished (or crashed)

async def sse_generator(queue: asyncio.Queue, request: Request):
    while not await request.is_disconnected():
        try:
            chunk = await asyncio.wait_for(queue.get(), timeout=30.0)
        except asyncio.TimeoutError:
            break  # provider stalled; stop instead of hanging forever
        if chunk is None:
            break
        yield f"data: {json.dumps({'token': chunk.token})}\n\n"

The wait_for prevents the generator from hanging forever if the provider crashes. And the client disconnection check ensures we don’t waste resources sending tokens into the void.
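Wired into an endpoint, it looks roughly like this; the queue size of 100 and the request payload shape are arbitrary choices for the sketch:

@app.post("/chat/stream")
async def chat_stream(payload: dict, request: Request):
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # the bounded size is the backpressure knob
    backend = OpenAIBackend()  # or AnthropicBackend() / OllamaBackend()
    task = asyncio.create_task(provider_task(queue, payload["messages"], backend))

    async def event_stream():
        try:
            async for event in sse_generator(queue, request):
                yield event
        finally:
            task.cancel()  # stop generating if the client went away or the stream ended

    return StreamingResponse(event_stream(), media_type="text/event-stream")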

Observability is another piece I add early. I log time-to-first-token, tokens per second, and the number of reconnections per session. This data helps me understand if the bottleneck is the model, the network, or the client.
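A thin wrapper like this, applied to the backend stream before tokens hit the queue, is enough for the first two metrics; the logger name and format are just placeholders:

import time
import logging

logger = logging.getLogger("llm_stream")

async def instrumented(token_stream):
    # Wrap any token generator and log time-to-first-token and tokens per second
    start = time.monotonic()
    first_token_at = None
    count = 0
    async for chunk in token_stream:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        count += 1
        yield chunk
    elapsed = time.monotonic() - start
    if count:
        logger.info("ttft=%.3fs tokens=%d tok/s=%.1f", first_token_at, count, count / elapsed)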

One question I keep asking myself: “What happens when the user switches tabs and the browser throttles the connection?” That’s where SSE’s native reconnection shines—the browser sends a Last-Event-ID header. I store the last successful token index in a Redis cache keyed by a session ID, so on reconnect I can resume from where we left off instead of restarting the stream.
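Here is a rough sketch of that resume path. It assumes the provider task keeps appending tokens to a Redis list even while the client is away; the key scheme, session handling, and polling interval are all illustrative:

import redis.asyncio as redis

redis_client = redis.Redis(decode_responses=True)

async def resume_events(session_id: str, request: Request):
    # EventSource sends Last-Event-ID automatically when it reconnects
    last_id = request.headers.get("Last-Event-ID")
    index = int(last_id) + 1 if last_id else 0
    key = f"session:{session_id}:tokens"

    while not await request.is_disconnected():
        # Send every token generated since the last one the client acknowledged
        for token in await redis_client.lrange(key, index, -1):
            yield f"id: {index}\ndata: {json.dumps({'token': token})}\n\n"
            index += 1
        if await redis_client.get(f"session:{session_id}:done"):
            yield "data: [DONE]\n\n"
            break
        await asyncio.sleep(0.1)  # simple polling; a pub/sub channel would avoid the busy loop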

Finally, deployment. Nginx needs proxy_buffering off, or at least to honor the X-Accel-Buffering: no header we set earlier. Otherwise, it will buffer the entire stream before forwarding it to the client, defeating the purpose of streaming. I also set proxy_read_timeout to a generous value like 300 seconds, because streaming connections can idle for a while between tokens.
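The relevant Nginx location block looks something like this; the upstream name and path are placeholders:

location /stream {
    proxy_pass http://fastapi_upstream;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;        # forward tokens as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;    # idle gaps between tokens shouldn't kill the connection
}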

If you found this useful, I’d love to hear your thoughts. Leave a comment below with your own streaming war stories. Share this with a teammate who’s building an LLM app. And if you want more deep dives into production AI architecture, hit that like button—it tells me what content to write next.

