How to Build a Production-Ready LLM Server with Streaming, Batching, and GPU Memory Control

Learn to build a production-ready LLM server with streaming, batching, and GPU memory control for low-latency, scalable inference.


I’ve spent the last few months watching brilliant prototype applications stumble at the finish line. A team crafts a beautiful interface around a powerful language model, only to see it buckle under the first wave of real users. The issue is almost never the model’s intelligence, but the infrastructure supporting it. Loading a model and calling .generate() is easy. Building a system that serves hundreds of concurrent requests with low latency and predictable memory use is the real challenge. That’s what we’re going to build today.

Think about the last time you used a chat interface that felt instantaneous. How do you think it delivers tokens so quickly, even under load? The answer lies in moving beyond single-request thinking.

Our goal is to create a server that can handle multiple conversations at once, start delivering responses in milliseconds, and use our expensive GPU hardware efficiently. We’ll start from a simple baseline and evolve it step-by-step into a robust system.

First, let’s set up our environment. We’ll use a quantized model to save memory, which is critical for serving larger models.
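If you’re following along, the dependency set below is one plausible starting point, inferred from the imports used throughout this guide (pin versions to whatever you’ve actually tested):

pip install torch transformers accelerate bitsandbytes fastapi uvicorn sse-starlette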

# 1_model_loader.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

def load_model_for_serving():
    # 4-bit quantization drastically reduces memory needs
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # Essential for correct batched generation

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    model.eval()  # Set to inference mode
    return model, tokenizer

Why does padding_side="left" matter so much? In a batch, shorter sequences need padding. For models that generate from the end of the input, padding on the right would mean the model starts generating from empty padding tokens, producing nonsense. Left padding aligns all the real content on the right, where generation begins.
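A quick way to see the effect, loading only the tokenizer (a small illustrative script, not part of the server):

# padding_demo.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(["Hi", "A much longer prompt here"], return_tensors="pt", padding=True)
print(batch["input_ids"][0])
# Pad token IDs fill the start of the shorter row, so the last position of every
# row holds real content for generation to continue from.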

Now, let’s establish our baseline. This is the simple approach that works for one request but fails with many.

# 2_naive_baseline.py
import torch
from model_loader import load_model_for_serving

model, tokenizer = load_model_for_serving()

def generate_one(prompt, max_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,    # temperature only takes effect when sampling is enabled
            temperature=0.7
        )

    # Decode only the new tokens
    new_text = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_text, skip_special_tokens=True)

# Try it
text = generate_one("Explain quantum computing simply.")
print(text)

This works, but what happens if two users send requests at the same time? The second user waits for the first to finish. Single-request decoding also leaves much of the GPU’s compute idle, because each step is memory-bound rather than compute-bound. More critically, the user stares at a blank screen for the entire generation time.

The first major improvement is streaming. Instead of waiting for the complete response, we send tokens as they’re produced. The user sees progress immediately, which feels much faster.

# 3_streaming_server.py
import asyncio
from threading import Thread

from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse
from transformers import TextIteratorStreamer

from model_loader import load_model_for_serving

app = FastAPI()
model, tokenizer = load_model_for_serving()

@app.get("/stream")
async def generate_stream(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Run generation in a separate thread so the event loop is never blocked
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=500)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Stream tokens as server-sent events, pulling from the blocking iterator in a worker thread
    async def event_generator():
        token_iter = iter(streamer)
        while True:
            token = await asyncio.to_thread(next, token_iter, None)
            if token is None:
                break
            yield {"event": "update", "data": token}
        yield {"event": "close", "data": ""}

    return EventSourceResponse(event_generator())

Run this with uvicorn streaming_server:app. Now, a frontend can connect and display tokens word-by-word. This improves user experience dramatically, but our server still processes only one request at a time. The GPU is busy, but not full.
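To watch the stream from a script rather than a browser, a small client like this sketch works (it assumes the requests library and the default uvicorn port; server-sent events arrive as plain data: lines):

# stream_client.py
import requests

with requests.get(
    "http://localhost:8000/stream",
    params={"prompt": "Explain quantum computing simply."},
    stream=True,
) as response:
    for line in response.iter_lines(decode_unicode=True):
        # Print only the payload of each "data: ..." frame
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)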

To truly scale, we need to process multiple requests simultaneously. This is where continuous batching comes in. Instead of waiting for a batch to finish before starting the next, we add new requests to the current batch as others complete. It keeps the GPU constantly working.

Implementing this from scratch is complex, but we can understand the principle by building a simplified version. The key insight is managing a queue of requests and a dynamic set of active generations.

# 4_batching_manager.py
import asyncio

class SimpleBatchManager:
    def __init__(self, model, tokenizer, max_batch_size=4):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.request_queue = asyncio.Queue()

    async def add_request(self, prompt: str) -> str:
        """Add a request and return its generated text."""
        # In a real system, each request would carry a future that a background
        # worker resolves once its batch has been generated; here we only
        # simulate the flow.
        inputs = self.tokenizer(prompt, return_tensors="pt")
        # A real implementation would collect these inputs, pad them, run batch
        # generation, then hand each caller its own slice of the output.
        return "Simulated batch response for: " + prompt[:50]

A production system uses optimized serving libraries like vLLM or TGI, which implement true continuous batching automatically. For our purposes, let’s see how a basic version can sit behind FastAPI, sketched below.
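Here is a minimal sketch of that idea, assuming the model, tokenizer, and left padding from load_model_for_serving(). It gathers requests for a short window, runs them as one padded batch, and resolves each caller’s future. This is dynamic batching rather than true continuous batching, which would also admit and evict sequences at every decoding step; names like BatchingEngine and max_wait_ms are illustrative.

# 4b_batching_engine.py
import asyncio
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI

from model_loader import load_model_for_serving

class BatchingEngine:
    def __init__(self, model, tokenizer, max_batch_size=4, max_wait_ms=20):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        # Each request carries a future that the worker resolves when its batch finishes
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, max_new_tokens, future))
        return await future

    async def worker(self):
        while True:
            # Wait for the first request, then collect more for a short window
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break

            prompts = [prompt for prompt, _, _ in batch]
            max_new = max(n for _, n, _ in batch)
            # Left padding (set in the tokenizer) keeps real content aligned on the right
            inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)

            def run_generation():
                with torch.no_grad():
                    return self.model.generate(**inputs, max_new_tokens=max_new)

            # Run the blocking generate call off the event loop
            outputs = await asyncio.to_thread(run_generation)
            prompt_len = inputs["input_ids"].shape[1]
            for (_, _, future), output in zip(batch, outputs):
                text = self.tokenizer.decode(output[prompt_len:], skip_special_tokens=True)
                future.set_result(text)

model, tokenizer = load_model_for_serving()
engine = BatchingEngine(model, tokenizer)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the batching worker alongside the web app
    worker_task = asyncio.create_task(engine.worker())
    yield
    worker_task.cancel()

app = FastAPI(lifespan=lifespan)

@app.post("/generate_batched")
async def generate_batched(prompt: str):
    return {"response": await engine.generate(prompt)}

The short wait window trades a few milliseconds of added latency for much better GPU utilization; vLLM and TGI push the same idea further by rescheduling at every decoding step.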

What about memory? As sequences grow longer, the KV cache—the memory storing previous attention keys and values—can become massive. We must manage it.
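A back-of-envelope estimate shows why. Using the published Mistral-7B configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit cache values; check your model’s config.json for the exact figures):

# kv_cache_estimate.py
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
print(f"{per_token / 1024:.0f} KiB per token per sequence")            # ~128 KiB
print(f"{per_token * 4096 / 1e9:.2f} GB for one 4096-token sequence")  # ~0.54 GB

A batch of eight full-length conversations therefore needs roughly 4 GB for the cache alone, on top of the quantized weights.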

# 5_memory_aware_generation.py
import torch
from model_loader import load_model_for_serving

model, tokenizer = load_model_for_serving()

def generate_with_memory_limits(prompt, max_model_len=4096):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    # Check that the prompt leaves room for generation within the context window
    if prompt_len > max_model_len - 100:
        raise ValueError("Prompt too long for safe generation")

    # Monitor memory before generation
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Cap new tokens so the KV cache never grows past the context window
            max_new_tokens=min(256, max_model_len - prompt_len)
        )

    print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

Putting it all together, a production-ready service combines streaming, batching, and memory awareness. Here’s a blueprint for the final integrated server.

# 6_final_server.py
from fastapi import FastAPI
from contextlib import asynccontextmanager
import asyncio
import torch
from model_loader import load_model_for_serving

# Global model reference
model, tokenizer = None, None

@asynccontextmanager
async def app_lifespan(app: FastAPI):
    # Load model on startup
    global model, tokenizer
    model, tokenizer = load_model_for_serving()
    yield
    # Cleanup on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(lifespan=app_lifespan)

@app.get("/health")
async def health_check():
    cuda_available = torch.cuda.is_available()
    memory_used = torch.cuda.memory_allocated() / 1e9 if cuda_available else 0
    return {
        "status": "healthy",
        "cuda": cuda_available,
        "gpu_memory_gb": round(memory_used, 2)
    }

@app.post("/generate")
async def generate_endpoint(prompt: str, max_tokens: int = 256):
    # In a full implementation, this would:
    # 1. Add request to batch manager queue
    # 2. Return a stream ID or async response
    # 3. Process in batch with other requests
    # For now, we return a simple response
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True
        )
    
    new_text = outputs[0][inputs["input_ids"].shape[1]:]
    return {"response": tokenizer.decode(new_text, skip_special_tokens=True)}

This structure gives you a foundation. The health endpoint is crucial for monitoring. In production, you’d add metrics for request latency, token throughput, and error rates.
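If you use Prometheus, the prometheus-client package makes this a few extra lines. A sketch, assuming it is added to final_server.py where app is defined, with illustrative metric names:

# added to final_server.py
from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter("llm_requests_total", "Generation requests served")
LATENCY = Histogram("llm_request_seconds", "End-to-end generation latency")

# Expose Prometheus metrics alongside the API
app.mount("/metrics", make_asgi_app())

# Then, inside the /generate endpoint:
#     with LATENCY.time():
#         outputs = await asyncio.to_thread(run_generation)
#     REQUESTS.inc()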

Consider this: if your GPU memory is full, is it because of model weights or the growing KV cache from long conversations? Understanding this distinction helps you tune effectively.
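A quick way to see the split is to snapshot allocated memory right after loading (that is essentially the weights) and compare it with the peak during a long generation (the difference is mostly KV cache and activations). A rough sketch, assuming a single CUDA device:

# memory_breakdown.py
import torch
from model_loader import load_model_for_serving

model, tokenizer = load_model_for_serving()
weights_gb = torch.cuda.memory_allocated() / 1e9  # mostly the quantized weights at this point
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("Write a long story about the sea.", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=1024)

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Weights: ~{weights_gb:.2f} GB, KV cache + activations: ~{peak_gb - weights_gb:.2f} GB")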

The final step is deployment. A Docker container ensures consistency. Here’s a minimal Dockerfile; the requirements.txt it copies simply lists the packages we installed at the start.

FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "final_server:app", "--host", "0.0.0.0", "--port", "8000"]

Build it with docker build -t llm-server . and run with docker run -p 8000:8000 --gpus all llm-server.

We’ve walked from a simple script to a server architecture ready for production. The key lessons are: stream tokens for better UX, batch requests for efficiency, and always monitor your resources. Each improvement builds upon the last, transforming a fragile demo into a robust service.

What challenges have you faced when moving models from prototype to production? I’d love to hear about your experiences. If this guide helped you, please share it with others who might be facing similar hurdles. Your comments and questions help make these resources better for everyone.

