I was recently building a chat interface that felt sluggish. A user would ask a question, and we’d all just sit there, staring at a loading spinner for what felt like an eternity. It was frustrating. The AI was thinking, but the user had no evidence anything was happening. This experience is common. It’s the classic problem of the buffered response, where the entire answer must be cooked before a single byte is served to the user.
So, I began to explore a better way. What if we could serve the response word by word, as it’s created? This is the core idea behind streaming. Instead of waiting for a full essay, you get the first sentence in less than a second. The perception of speed is incredible, even if the total generation time is the same.
Why does this matter for you? If you’re building any application where a human waits for an AI, streaming transforms the experience. It feels conversational, responsive, and respectful of the user’s time. It turns a monolithic wait into a flowing dialogue. Let me show you how it works.
The old way looks like this in code. You ask for everything at once.
# The user waits...and waits.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain the theory of relativity."}]
)

# Only now, after a full wait, does the text appear.
print(response.choices[0].message.content)
The streaming approach changes the game. You get pieces of the answer as they are ready.
# The user sees a response almost immediately.
for chunk in client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain the theory of relativity."}],
    stream=True  # This one parameter changes everything.
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
See the difference? The stream=True flag and the loop over chunk objects are the keys. Each chunk contains a piece of the final message, a “delta” of content. We print it as it arrives. Have you ever noticed how much more engaging a live typing indicator is compared to a static loading icon? This is that, for AI.
Let’s build a more practical example with OpenAI. We can even measure how well it performs.
from openai import OpenAI
import time

client = OpenAI()

def stream_with_metrics(prompt):
    start = time.time()
    first_token_time = None

    stream = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start
                print(f"\nFirst token in {first_token_time:.2f}s: ", end="")
            print(chunk.choices[0].delta.content, end="", flush=True)

    print(f"\n\nDone. Total time: {time.time() - start:.2f}s")

stream_with_metrics("Describe the ecosystem of a rainforest.")
This gives us immediate feedback. The “Time To First Token” (TTFT) is a critical metric. A good TTFT is under a second. It tells the user the system is alive and working. The rest of the text can follow at its own pace. What do you think happens to user patience when they see that first word appear quickly?
Streaming isn’t just for OpenAI. The pattern is similar across providers. Here’s how you might do it with a local model from Hugging Face, which is perfect for private or cost-sensitive projects.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
import torch

def stream_local_model(prompt):
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, device_map="auto", torch_dtype=torch.float16
    )

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Run the model generation in a separate thread.
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": 500, "streamer": streamer}
    )
    thread.start()

    # This loop runs in our main thread, printing tokens as the model generates them.
    for text in streamer:
        print(text, end="", flush=True)

    thread.join()

stream_local_model("Write a short poem about Python code.")
This code uses a TextIteratorStreamer. The key is running the model’s .generate() method in a background thread. That thread “feeds” tokens into the streamer, and our main thread consumes them. It makes running a large local model feel interactive.
But this is all in a terminal. The real test is putting it on the web. How do we stream to a browser? The answer is often Server-Sent Events (SSE) with a framework like FastAPI. SSE provides a simple way to push text updates from a server to a web client.
Imagine you’re building the backend for a ChatGPT-like website. Here’s a basic FastAPI endpoint that streams.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import asyncio
import json

app = FastAPI()
client = OpenAI()

async def openai_stream_generator(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if content := chunk.choices[0].delta.content:
            # Format the chunk as a Server-Sent Event.
            yield f"data: {json.dumps({'text': content})}\n\n"
            await asyncio.sleep(0)  # Lets other tasks run.

@app.post("/chat")
async def chat_stream(prompt: str):
    return StreamingResponse(
        openai_stream_generator(prompt),
        media_type="text/event-stream"
    )
The StreamingResponse is the hero here. It takes an async generator and sends each chunk as an event. A JavaScript frontend can listen to these events and append text to the screen in real time. This is the architecture behind many modern AI chat applications. Can you see how this simple endpoint creates a fluid user experience?
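You don't even need a frontend to try it. Here's a small testing sketch that consumes the same event stream from Python, assuming the server above is running on localhost:8000 and you have the httpx library installed.

import httpx
import json

def consume_chat_stream(prompt: str):
    # Stream the POST response instead of buffering it all in memory.
    with httpx.stream(
        "POST",
        "http://localhost:8000/chat",
        params={"prompt": prompt},
        timeout=None
    ) as response:
        for line in response.iter_lines():
            # Each SSE event arrives as a line like: data: {"text": "..."}
            if line.startswith("data: "):
                payload = json.loads(line[len("data: "):])
                print(payload["text"], end="", flush=True)

consume_chat_stream("Explain streaming in one paragraph.")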
Now, let’s think about optimization. Streaming gives you control. You can implement token buffering. Instead of sending every single character immediately—which can look choppy—you can collect a few words or a sentence fragment before sending a slightly larger, smoother update. It’s a balance between speed and visual comfort.
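Here's a rough sketch of what that buffering might look like inside the SSE generator from earlier. The 20-character threshold and the sentence-boundary check are arbitrary choices you would tune for your own UI.

async def buffered_stream_generator(prompt: str, min_chars: int = 20):
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if content := chunk.choices[0].delta.content:
            buffer += content
            # Flush on a sentence boundary or once the buffer is large enough.
            if len(buffer) >= min_chars or buffer.endswith((".", "!", "?", "\n")):
                yield f"data: {json.dumps({'text': buffer})}\n\n"
                buffer = ""
                await asyncio.sleep(0)
    if buffer:
        # Don't drop whatever is left at the end of the stream.
        yield f"data: {json.dumps({'text': buffer})}\n\n"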
You also need to handle problems. What if the user closes the browser tab mid-stream? Your server should catch the disconnect and stop the generation, saving precious compute resources. FastAPI and other frameworks often provide ways to detect a broken client connection.
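With FastAPI, one way to do this is to accept the Request object in the endpoint and check request.is_disconnected() inside the generator loop. A sketch building on the endpoint above; the /chat-safe route name is just for illustration.

from fastapi import Request

@app.post("/chat-safe")
async def chat_stream_safe(prompt: str, request: Request):
    async def guarded_generator():
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            # Stop pulling tokens if the browser tab is gone.
            if await request.is_disconnected():
                break
            if content := chunk.choices[0].delta.content:
                yield f"data: {json.dumps({'text': content})}\n\n"
                await asyncio.sleep(0)
    return StreamingResponse(guarded_generator(), media_type="text/event-stream")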
Finally, let’s talk about state. A surprising challenge is building a coherent stream when you need the AI to remember the conversation. You must carefully manage the chat history, appending each new user message and the AI’s streamed response, ensuring the context for the next turn is complete and accurate.
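A minimal sketch of that bookkeeping, assuming a single in-memory conversation; a real application would keep one history per user or session.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=history,  # The full conversation so far.
        stream=True
    )
    assistant_reply = ""
    for chunk in stream:
        if content := chunk.choices[0].delta.content:
            assistant_reply += content  # Accumulate while we stream to the user.
            print(content, end="", flush=True)
    # Only once the stream finishes do we have the complete reply to remember.
    history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply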
This journey from a silent loading screen to a lively, typing conversation is what modern LLM interaction is all about. It’s not just a technical detail; it’s a fundamental shift in how humans and AI communicate. The code patterns are straightforward, but the impact on user satisfaction is profound.
I hope this guide helps you make your AI applications feel instantly smarter and more responsive. Was there a moment when you first saw an AI stream text that made you think, “Wow, this is different”? I’d love to hear about your projects. If you found this useful, please share it with someone else who’s building the next wave of AI tools. Let me know in the comments what you’re working on!