How to Build a Production-Ready Document QA System with Unstructured and Semantic Chunking
Learn how to build a production-ready document QA system using Unstructured, semantic chunking, and routed retrieval for accurate answers.
I’ve spent the last few months working with different teams—engineers, product managers, even legal departments—and a common frustration keeps coming up. Everyone has documents. PDFs, scanned reports, Word files, messy HTML exports. They contain critical information, but finding a specific clause, extracting a set of figures, or just asking a simple question about the content feels like searching for a needle in a haystack. The usual approach of uploading a PDF to a simple chatbot often ends in disappointment. The answers are shallow, the context is lost, and tables or complex formats are ignored. That’s what led me here. I wanted to build something better, something that truly understands a document’s structure and meaning. This is a guide to building a production-ready system that does just that.
Why do most simple document question-answering systems fail when you move from a demo to real work? The problem isn’t the large language model itself. It’s usually everything that happens before the question is even asked. Think about a standard financial report. It has headlines, paragraphs, multi-page tables, footnotes, and embedded charts. A basic tool might just chop the entire PDF into 500-word chunks, splitting a table right down the middle. What use is half a table to an LLM? The context is destroyed before the model even sees it.
So, where do we start? We start by giving our system better eyes. Instead of treating a document as a plain text file, we use a library designed to see its parts. This is where intelligent parsing comes in. A tool like Unstructured.io acts like an advanced document reader. It can look at a PDF and identify what each piece is: a title, a narrative paragraph, a table, a list item, or even an image. This step is fundamental.
from unstructured.partition.auto import partition
# This single function handles PDFs, Word docs, HTML, and more.
elements = partition(filename="annual_report.pdf", strategy="hi_res")
for element in elements[:5]:
    print(f"{element.category}: {element.text[:60]}...")
This code gives us a list of structured elements. Now we know what we’re working with. We can handle tables separately from paragraphs, preserving their integrity. But have you considered what happens when a document is scanned, or when a table spans two pages? The ‘hi_res’ strategy is crucial here: it runs layout detection (and OCR for scanned pages), which is what lets complex layouts come through accurately.
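To make that concrete, here is a minimal sketch of separating tables from prose after partitioning. It assumes the same annual_report.pdf as above; with the ‘hi_res’ strategy and table-structure inference enabled, each Table element should carry an HTML rendering of its rows and columns in metadata:

from unstructured.partition.auto import partition

# infer_table_structure=True asks the hi_res pipeline to reconstruct
# row/column structure and store it as HTML in element metadata.
elements = partition(
    filename="annual_report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

tables = [el for el in elements if el.category == "Table"]
prose = [el for el in elements if el.category != "Table"]

for table in tables:
    # Fall back to plain text if no HTML rendering was produced.
    html = getattr(table.metadata, "text_as_html", None)
    print(html or table.text)

Keeping tables as HTML means you can embed or summarize them later without destroying the row and column relationships that give the numbers their meaning.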
Once we have these clean, identified pieces, we need to prepare them for our language model. This is the second major point where systems stumble: chunking. Throwing the entire document at the model is impossible due to length limits. Chopping it blindly into fixed sizes is inefficient. We need a smarter way. What if we could keep related ideas together?
Enter semantic chunking. Instead of counting characters, we can split text based on its natural boundaries—sentences, paragraphs, and sections. A technique I’ve found particularly effective is sentence-window chunking. Here’s the idea: we store individual sentences for efficient search, but when we find a relevant sentence, we also retrieve the few sentences that came before and after it. This provides the model with the immediate context it needs to generate a coherent answer.
from llama_index.core.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Keep 3 sentences of context on each side of the target.
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)

# `documents` are llama_index Document objects, built here from the parsed
# elements above. Each resulting node's main 'text' is a single sentence,
# while its 'window' metadata holds the surrounding context.
nodes = node_parser.get_nodes_from_documents(documents)
This approach pays off immediately. Retrieval stays precise because we match against single sentences, while answers stay grounded in a full context window rather than an out-of-context fragment.
Now we have clean data, intelligently chunked. The next step is making it searchable. We create a vector index, a database that captures the meaning of our text. When you ask a question, it doesn’t just look for keyword matches; it finds the text snippets that are semantically closest to your query’s intent.
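Here is a minimal sketch of wiring the sentence-window nodes into an index with llama_index, assuming an embedding model and LLM are already configured through its Settings. The MetadataReplacementPostProcessor is the piece that swaps each retrieved sentence for its surrounding window at query time:

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Build the index over the sentence-level nodes created earlier.
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    # Replace each node's text with its "window" metadata so the LLM
    # sees the surrounding sentences, not an isolated fragment.
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

print(query_engine.query("What were the main revenue drivers this year?"))

But what if your document collection is vast and varied? A single index for everything can be messy.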
Imagine you have legal contracts, marketing brochures, and engineering specs in the same pile. A question about “liability” should probably search the legal documents first. This is where a router comes in. We can create multiple, specialized indexes and build a query router that directs questions to the most appropriate knowledge base automatically. It makes the system much more accurate and efficient.
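With llama_index, routing can be sketched with a RouterQueryEngine. The legal_index and specs_index below are assumed to have been built per collection, the same way as the index above; the tool descriptions are what the LLM selector reads to pick a destination:

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

legal_tool = QueryEngineTool.from_defaults(
    query_engine=legal_index.as_query_engine(),
    description="Legal contracts: liability, indemnification, terms.",
)
specs_tool = QueryEngineTool.from_defaults(
    query_engine=specs_index.as_query_engine(),
    description="Engineering specs: requirements, tolerances, interfaces.",
)

# The selector compares the question against each tool description
# and routes it to the single most appropriate index.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[legal_tool, specs_tool],
)

print(router.query("What is our liability cap under the 2023 agreement?"))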
The final layer is the query engine itself. A good system shouldn’t just retrieve a chunk and ask the LLM to paraphrase it. For complex questions, it should break them down. “Compare the financial risks in Q3 and Q4” is really two questions. A sophisticated pipeline will decompose it, find answers for each sub-question, and then synthesize a final, comprehensive response. We can also teach it to extract structured data, like pulling all dates and product names into a JSON format, using defined Pydantic models.
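llama_index ships a SubQuestionQueryEngine that implements exactly this decompose-then-synthesize pattern. Here is a minimal sketch reusing the query_engine from earlier; the tool description is illustrative:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

report_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    description="Quarterly financial report: risks, revenue, and outlook.",
)

# The engine generates sub-questions, answers each one against the tool,
# then synthesizes a single combined response.
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[report_tool],
)

response = sub_question_engine.query(
    "Compare the financial risks in Q3 and Q4."
)
print(response)

Structured extraction works along similar lines: define a Pydantic model for the fields you want, and have the LLM fill it in, rather than parsing free text after the fact.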
Putting this all together, we move from a fragile script to a robust service. We can wrap it in a FastAPI application, add a job queue with Redis for processing large documents asynchronously, and stream answers back to the user in real-time. This is what turns a cool prototype into a tool people can rely on daily.
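As a minimal sketch of that service layer (the endpoint shape is illustrative, and the Redis worker for long-running ingestion is omitted here), streaming an answer with FastAPI and llama_index looks roughly like this:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
async def ask(question: Question):
    # streaming=True makes llama_index yield answer tokens as they are
    # generated instead of waiting for the full response. `index` is the
    # vector index (or router) built during ingestion.
    streaming_engine = index.as_query_engine(streaming=True)
    response = streaming_engine.query(question.text)
    return StreamingResponse(response.response_gen, media_type="text/plain")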
The journey from a messy PDF to clear, actionable insights requires thoughtful steps. It’s about respecting the structure of the original document, preserving context, and asking the right questions of your data. By building a pipeline that addresses parsing, intelligent chunking, and routed retrieval, we create a system that genuinely understands your content. What problem could you solve if your documents could finally talk back clearly?
If this breakdown of building a real document intelligence pipeline was helpful, please share it with a colleague who might be facing similar challenges. I’d love to hear about your experiences or questions in the comments below—what’s the most difficult document type you’ve had to work with?