Build a Production-Ready FastAPI SSE Streaming API for LLM Chatbots
Learn how to build a FastAPI SSE streaming API for LLM chatbots with backpressure, disconnect handling, and lower perceived latency.
I remember the first time I built a chatbot that took thirty seconds to say anything back to the user. I watched the loading spinner spin, and spin, and spin. The user had already clicked away. That’s when I realised that in the world of large language models, latency isn’t just a number—it’s the difference between a user staying or leaving. Streaming tokens as they are generated turns a sluggish experience into a conversation. But building that streaming API properly, with production-grade reliability, is a different story.
You can call OpenAI’s API, get a full response, and send it as one big JSON blob. That works for internal scripts. For a real application, where users expect to see words appear as they are being thought, you need a different approach. Every second of blank screen pushes more users toward closing the tab. Streaming cuts the wait from the full generation time down to the time the model needs to produce its first token.
Why do most tutorials skip this? Because streaming adds complexity: async generators, backpressure, graceful disconnections, and partial response persistence. They show you a three‑line snippet and call it done. That’s fine for a notebook. It’s not fine for a service that handles hundreds of concurrent users.
Let’s fix that.
Why Server‑Sent Events and Not WebSockets?
Every time I give a talk on streaming LLM APIs, someone asks: “Why not WebSockets?” The answer is simple. WebSockets are bidirectional, which an LLM response stream doesn’t need, and they leave reconnection, heartbeats, and proxy configuration for you to handle. Server‑Sent Events (SSE) are unidirectional (server → client), run over plain HTTP, are native in every modern browser via EventSource, and reconnect automatically when the connection drops. For the common case of an LLM response streaming from server to user, SSE is the right tool.
Here’s a tiny example of what an SSE stream looks like on the wire:
data: {"token": "The", "token_index": 0}
data: {"token": " quick", "token_index": 1}
data: {"token": " brown", "token_index": 2}
data: {"token": " fox", "token_index": 3}
Each event here is a single line starting with data:, carrying a JSON payload and terminated by a blank line. The client reads events as they arrive. No polling, no boilerplate.
But what happens when the client is slower than the model? Imagine a user on a slow mobile network. The model is generating tokens at 30 per second, but the client can only consume 10 per second. Without backpressure, the server will buffer those tokens in memory until the client catches up. That’s a memory leak waiting to happen.
Backpressure: The Hidden Ingredient
Backpressure means slowing down the producer when the consumer cannot keep up. In an async Python API, you can implement this with an asyncio.Queue that has a maximum size. When the queue is full, the token producer (the LLM stream) must wait.
Here’s a simple backpressure controller:
import asyncio
from typing import AsyncGenerator
from app.schemas import TokenChunk
class BackpressureController:
    def __init__(self, max_queue_size: int = 50):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue_size)

    async def put(self, chunk: TokenChunk):
        """Called by the producer. Blocks if queue is full."""
        await self.queue.put(chunk)

    async def get(self) -> TokenChunk:
        """Called by the consumer (SSE stream). Blocks if empty."""
        return await self.queue.get()

    def as_async_generator(self) -> AsyncGenerator[TokenChunk, None]:
        async def gen():
            while True:
                chunk = await self.get()
                if chunk.finish_reason:
                    break
                yield chunk
        return gen()
The maxsize=50 means the producer will pause when 50 tokens are waiting to be consumed. This prevents unbounded memory growth.
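To see the bound in action, here is a small, self-contained sketch: a fast producer feeding a deliberately slow consumer through the controller. The TokenChunk dataclass is only a stand-in for app.schemas.TokenChunk, and the numbers are arbitrary.

import asyncio
from dataclasses import dataclass

@dataclass
class TokenChunk:  # stand-in for app.schemas.TokenChunk
    token: str
    finish_reason: str | None = None

async def main():
    bp = BackpressureController(max_queue_size=5)

    async def producer():
        for i in range(20):
            await bp.put(TokenChunk(token=f"tok{i} "))  # pauses whenever 5 chunks are pending
        await bp.put(TokenChunk(token="", finish_reason="stop"))

    async def consumer():
        async for chunk in bp.as_async_generator():
            await asyncio.sleep(0.1)  # simulate a slow client
            print(chunk.token, end="", flush=True)

    await asyncio.gather(producer(), consumer())

asyncio.run(main())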
Now integrate this with the SSE endpoint.
Building the FastAPI Streaming Endpoint
FastAPI supports streaming responses via StreamingResponse; combine it with the sse-starlette library for proper SSE formatting. To put the backpressure controller to work, run the LLM stream as a producer task that fills the queue while the SSE generator drains it.
# app/routers/stream.py
import asyncio

from fastapi import APIRouter, Request
from sse_starlette.sse import EventSourceResponse

from app.core.backpressure import BackpressureController
from app.providers import get_provider  # factory returning the OpenAI/Anthropic/Ollama provider
from app.schemas import StreamRequest

router = APIRouter()

@router.post("/chat/stream")
async def stream_chat(request: StreamRequest, fastapi_request: Request):
    # Choose provider based on request
    provider = get_provider(request.provider)

    # Bounded queue between the LLM stream (producer) and the SSE stream (consumer)
    bp = BackpressureController(max_queue_size=50)

    async def produce():
        """Producer: pushes chunks into the queue, blocking while it is full.
        Assumes the provider ends its stream with a chunk whose finish_reason is set."""
        async for chunk in provider.stream(request):
            # Apply backpressure: block if queue full
            await bp.put(chunk)

    async def token_generator():
        """Consumer: drains the queue and formats each chunk as an SSE event."""
        producer_task = asyncio.create_task(produce())
        try:
            async for chunk in bp.as_async_generator():
                yield {
                    "event": "token",
                    "data": chunk.model_dump_json()
                }
        except Exception as e:
            yield {"event": "error", "data": str(e)}
        finally:
            # Stop the producer if the consumer exits early, then signal end of stream
            producer_task.cancel()
            yield {"event": "done", "data": ""}

    return EventSourceResponse(token_generator())
Notice EventSourceResponse handles the proper Content-Type: text/event-stream header and keeps the connection alive. And because the generator only pulls from the queue as fast as the client reads, a slow client slows the provider down instead of inflating server memory.
Structured Output Streaming: The Buffer Problem
When you stream JSON‑structured outputs (e.g., “give me a JSON with keys name and year”), the user expects valid JSON as it arrives. But token‑by‑token, the JSON is incomplete. You need a buffer that accumulates tokens and only yields complete “chunks” that are valid JSON when possible.
A simple approach: buffer tokens until you can parse a complete JSON value. This is a form of “chunked streaming.”
import json
from app.schemas import TokenChunk
class JSONBuffer:
    def __init__(self):
        self.buffer = ""

    def add_token(self, chunk: TokenChunk) -> str | None:
        """Accumulate one token; return normalised JSON once the buffer parses, else None."""
        self.buffer += chunk.token
        try:
            # If the buffer is not yet a complete JSON value, json.loads raises
            parsed = json.loads(self.buffer)
            return json.dumps(parsed)
        except json.JSONDecodeError:
            return None
In the generator, instead of yielding every raw token, group tokens until the buffer yields a complete JSON object. This is particularly useful for tool‑calling where the model outputs a function call.
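As a rough sketch of that wiring inside the generator (the structured flag on the request and the json event name are assumptions for illustration, not part of the schema above):

json_buffer = JSONBuffer()

async for chunk in provider.stream(request):
    if request.structured:  # hypothetical flag: the client asked for JSON output
        complete = json_buffer.add_token(chunk)
        if complete is not None:
            # Emit only once the accumulated buffer parses as valid JSON
            yield {"event": "json", "data": complete}
    else:
        yield {"event": "token", "data": chunk.model_dump_json()}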
Graceful Disconnection and Partial Persistence
Users close tabs. Clients drop connections. If the model is still generating, the server should stop wasting compute. FastAPI provides the Request object which has an is_disconnected() method.
async def token_generator():
    partial_buffer = ""
    async for chunk in provider.stream(request):
        if await fastapi_request.is_disconnected():
            # Save whatever was generated so far to Redis/RDB
            await save_partial_response(request.stream_id, partial_buffer)
            break
        partial_buffer += chunk.token
        # ... yield chunk
Saving partial responses allows the user to resume or review what was already generated. This is a nice touch for expensive models.
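save_partial_response itself can be a few lines. A minimal sketch, assuming redis.asyncio from redis-py and an arbitrary key scheme and TTL:

import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379")

async def save_partial_response(stream_id: str, partial_text: str) -> None:
    # Keep the partial output for an hour so the client can review or resume it
    await redis_client.set(f"partial:{stream_id}", partial_text, ex=3600)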
Client Consumption: Browser and curl
A front‑end can use the native EventSource API with no extra libraries. One caveat: EventSource only issues GET requests, so to pair it with the endpoint above you either expose a GET variant that takes its parameters in the query string, or use fetch with a readable stream when you need to POST a body.
const eventSource = new EventSource('/chat/stream');

eventSource.addEventListener('token', (e) => {
  const data = JSON.parse(e.data);
  document.getElementById('output').innerText += data.token;
});

eventSource.addEventListener('done', () => {
  eventSource.close();
});

eventSource.addEventListener('error', (e) => {
  console.error('Stream error:', e);
});
For server‑side or CLI consumption, curl works wonderfully:
curl -N http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"provider":"openai","model":"gpt-4o-mini","messages":[{"role":"user","content":"Tell me a story"}]}'
The -N flag disables buffering, so you see each line as it arrives.
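If the consumer is another Python service rather than a browser, httpx can read the same stream. A rough sketch, mirroring the curl payload and doing only minimal parsing of the data: lines:

import asyncio
import json
import httpx

async def consume():
    payload = {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Tell me a story"}],
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:8000/chat/stream", json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data:"):
                    continue  # skip event:/id: lines and keep-alive comments
                data = line.removeprefix("data:").strip()
                if not data:
                    continue  # the final "done" event carries an empty payload
                try:
                    print(json.loads(data).get("token", ""), end="", flush=True)
                except json.JSONDecodeError:
                    pass  # e.g. the plain-string "error" payload

asyncio.run(consume())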
Putting It All Together: Production Deployment
In production, you add:
- Rate limiting per user or per API key.
- Authentication via JWT or API keys (a rough sketch of this and rate limiting follows the list).
- TLS termination behind Nginx.
- Horizontal scaling with multiple workers, each running separate event loops.
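As a rough sketch of the first two items, assuming an X-API-Key header, a hard-coded key set, and an in-memory counter (a real deployment would back both with Redis so they survive multiple workers):

import time
from fastapi import Depends, Header, HTTPException

API_KEYS = {"demo-key"}  # stand-in for a real key store
_request_log: dict[str, list[float]] = {}

async def require_api_key(x_api_key: str = Header(...)) -> str:
    if x_api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key

async def rate_limit(api_key: str = Depends(require_api_key)) -> str:
    # Allow at most 30 requests per rolling minute per key
    now = time.monotonic()
    window = [t for t in _request_log.get(api_key, []) if now - t < 60]
    if len(window) >= 30:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    window.append(now)
    _request_log[api_key] = window
    return api_key

# Usage on the endpoint:
# @router.post("/chat/stream", dependencies=[Depends(rate_limit)])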
A minimal docker-compose.yml:
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
The API runs with uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4.
What About Latency and Cost?
Streaming doesn’t reduce total compute cost—the model still generates the same number of tokens. But it reduces perceived latency. Users can start reading the first sentence while the rest is being generated. And if they decide to interrupt (e.g., “stop, that’s enough”), you save the cost of the remaining tokens.
If you’re serving a model locally with Ollama, you can also implement speculative decoding, but that’s another article.
The Question I Keep Asking Myself
Why do so many production LLM APIs still block until the full response is ready? The answer is usually: “It was easier to build.” But easier isn’t better. Users have been trained by ChatGPT to expect streaming. Anything less feels broken.
So here’s my challenge to you: the next time you build an LLM endpoint, make it stream. Your users will thank you, and your logs will show lower bounce rates.
If this article helped you think differently about streaming APIs, click that like button. Share it with a colleague who still uses requests.post().json(). Drop a comment below telling me about the weirdest streaming bug you’ve encountered—I’d love to hear it.
Now go build something that streams.