
Beyond Basic RAG: Building Smarter AI Answering Systems with Hybrid Search

Learn how to improve RAG systems with query rewriting, hybrid search, and re-ranking for more accurate AI answers.

I’ve been thinking a lot about basic RAG systems lately. You know, the kind that take a question, find some text, and spit out an answer. They work, but often, they just miss the mark. The retrieved text doesn’t quite fit. The answer feels shallow. Have you ever asked a system a slightly complex question and gotten a generic, unhelpful reply back? That frustration is exactly why we need to move beyond the basics.

The core issue is simple: people don’t ask questions like a search bar. Our questions are messy, implied, or need pieces from different places. A basic system looking for literal matches or even just semantic similarity often stumbles. What if the answer requires connecting two ideas that are never mentioned in the same sentence? This gap between how we ask and how machines find is what led me down this path.

Let’s build something better. We’ll create a pipeline that doesn’t just retrieve; it thinks about the question first, searches in multiple smart ways, and then double-checks its work. Think of it like a skilled researcher, not just a keyword scanner.

First, we need our tools. You’ll want to set up a clean space for this project.

python -m venv advanced_rag
source advanced_rag/bin/activate
pip install llama-index llama-index-vector-stores-qdrant qdrant-client sentence-transformers rank-bm25 openai

Now, let’s talk about our documents. Throwing a giant text file at the system won’t do. We need to prepare them thoughtfully.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Let's say we have some raw content
documents = [
    Document(text="LlamaIndex helps you build LLM applications over your data. It connects data sources to large language models."),
    Document(text="Qdrant is a vector database. It stores vectors and allows fast similarity search based on them.")
]

# We split them into manageable pieces
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

for node in nodes:
    node.metadata["source"] = "our_docs"  # Adding helpful info

See what we did there? We broke the text down and added a simple label. This metadata becomes crucial later. It helps the system understand what it’s looking at.

But here’s a question: what if the user’s question is too short or uses different words than our documents? Imagine someone asks, “How do I make an AI read my files?” Our document says “connect data sources to LLMs.” The meaning is similar, but the words are different. A simple search might fail.

This is where our first smart trick comes in: query rewriting. Before we even search, we ask a small, fast language model to improve the question.

from openai import OpenAI
client = OpenAI()

def rewrite_query(user_query):
    prompt = f"""
    Improve this search query for a technical documentation database.
    Make it more descriptive and clear. Keep the core intent.
    User Query: {user_query}
    Improved Query:
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    return response.choices[0].message.content

# Example
simple_question = "AI read my files?"
better_question = rewrite_query(simple_question)
print(better_question)  # Example output: "How can I connect local document files to a large language model for data ingestion?"

The rewritten question is now much more likely to find the relevant text about “connecting data sources.” It’s a small step that makes a huge difference.

Now, where do we put our prepared text? We use a special database built for this job: a vector database. Qdrant is a great choice. It doesn’t just store text; it stores the meaning of text as numbers (called vectors). This lets us search by meaning, not just keywords.

from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from llama_index.embeddings.openai import OpenAIEmbedding

# Connect to Qdrant (could be local or cloud). We call it qdrant_client
# so it doesn't clash with the OpenAI client we created earlier.
qdrant_client = QdrantClient(location=":memory:")  # Simple in-memory for testing
vector_store = QdrantVectorStore(client=qdrant_client, collection_name="my_docs")

# We need to turn text into vectors. This is called embedding.
embed_model = OpenAIEmbedding()

# Let's create our search index, backed by the Qdrant store
from llama_index.core import StorageContext, VectorStoreIndex
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)

This gives us semantic search. But is meaning always enough? Sometimes you need to find a specific term, like a function name “get_nodes_from_documents.” A meaning-based search might overlook it. So, we add a second method: keyword search. This is the classic search engine style. We combine both methods. This is called hybrid search.
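
Here is a minimal sketch of that keyword side, using the rank-bm25 package we installed at the start. The keyword_search helper and the simple whitespace tokenization are my own illustrative choices; wrapping the hits as NodeWithScore objects keeps them compatible with the fusion step coming next.

from rank_bm25 import BM25Okapi
from llama_index.core.schema import NodeWithScore

# Build a BM25 keyword index over the same chunks we embedded
corpus_texts = [node.get_content() for node in nodes]
tokenized_corpus = [text.lower().split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)

def keyword_search(query, top_k=5):
    # Score every chunk against the query terms and keep the best matches
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [NodeWithScore(node=nodes[i], score=float(scores[i])) for i in ranked]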

How do we combine two different search results? We use a clever method called Reciprocal Rank Fusion. It looks at the position of each result in both lists and calculates a new, fair score.

from typing import List

def reciprocal_rank_fusion(dense_results: List, sparse_results: List, k=60):
    """
    Combines results from dense (vector) and sparse (keyword) search.
    """
    fused_scores = {}
    
    # Process dense search results
    for rank, item in enumerate(dense_results):
        doc_id = item.node.node_id
        fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (rank + k + 1)
    
    # Process sparse search results
    for rank, item in enumerate(sparse_results):
        doc_id = item.node.node_id
        fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (rank + k + 1)
    
    # Sort by the new combined score
    reranked_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return reranked_results[:5]  # Return top 5
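
To see the fusion in action, one way to wire it up looks like this, assuming the index we built and the keyword_search helper sketched earlier:

# Retrieve with both methods, then fuse the two ranked lists
dense_results = index.as_retriever(similarity_top_k=10).retrieve(better_question)
sparse_results = keyword_search(better_question, top_k=10)
top_results = reciprocal_rank_fusion(dense_results, sparse_results)
print(top_results)  # [(node_id, fused_score), ...] with the highest fused score first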

We have our top pieces of text. But are they the best ones? The final step is re-ranking. We take our best candidates from the fusion step and ask a more powerful model to judge each one strictly against the original question. It’s like having a final reviewer.

def simple_rerank(query, candidate_texts):
    """
    A basic re-ranker that uses an LLM to judge relevance.
    """
    judgments = []
    for text in candidate_texts:
        prompt = f"""
        Query: {query}
        Document Text: {text[:500]}...
        On a scale of 1-5, how relevant is this document to the query?
        Answer with the number only.
        """
        # In a real system, you'd use a dedicated re-ranking model for speed.
        # This is for illustration.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        # The model should return a single digit; fall back to the lowest score if it doesn't
        raw = response.choices[0].message.content.strip()
        score = int(raw[0]) if raw and raw[0].isdigit() else 1
        judgments.append((score, text))
    
    # Sort by score, highest first
    judgments.sort(key=lambda x: x[0], reverse=True)
    return [text for score, text in judgments]

Now, we put it all together. Our pipeline has clear, distinct stages: rewrite the question, search with multiple methods, combine the results, and re-rank for quality. This structure is powerful and adaptable.
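
As a rough sketch, the whole flow might look something like this; answer_question is just a name I've given it, and it reuses the pieces defined above:

def answer_question(user_query, top_k=10):
    query = rewrite_query(user_query)                                    # 1. rewrite the question
    dense = index.as_retriever(similarity_top_k=top_k).retrieve(query)   # 2a. semantic search
    sparse = keyword_search(query, top_k=top_k)                          # 2b. keyword search
    fused = reciprocal_rank_fusion(dense, sparse)                        # 3. combine the rankings
    id_to_text = {node.node_id: node.get_content() for node in nodes}
    candidates = [id_to_text[doc_id] for doc_id, _ in fused]
    best_passages = simple_rerank(query, candidates)[:3]                 # 4. re-rank, keep the best
    context = "\n\n".join(best_passages)
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {user_query}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    return response.choices[0].message.content

print(answer_question("How do I make an AI read my files?"))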

What does this get us? Answers that are more accurate, more reliable, and better grounded in the actual source material. The system works harder to understand you and find the right information.

The beauty of this approach is in its modularity. If one part isn’t working well—say, the query rewriter—you can improve it without breaking everything else. You can swap the vector database, try a different embedding model, or adjust the hybrid search balance.
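
For instance, adjusting the hybrid balance could be as simple as weighting each list's contribution in the fusion step. This is a hypothetical variant of the earlier function, not a standard API:

# Weight the dense and sparse lists differently before fusing
def weighted_rank_fusion(dense_results, sparse_results, dense_weight=0.7, sparse_weight=0.3, k=60):
    fused_scores = {}
    for weight, results in ((dense_weight, dense_results), (sparse_weight, sparse_results)):
        for rank, item in enumerate(results):
            doc_id = item.node.node_id
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + weight / (rank + k + 1)
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:5]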

Building this kind of system taught me that effective AI isn’t about one magical model. It’s about designing a thoughtful process. It’s about guiding the technology through steps that mimic careful consideration. The result feels less like talking to a database and more like consulting an expert.

I hope walking through this process gives you a clear path to improve your own projects. The tools are available, and the methods, as we’ve seen, are straightforward to piece together. The leap from a simple lookup to a robust answering engine is within reach.

If this guide helped you connect the dots, I’d love to hear about what you’re building. Did you try a different re-ranker? How are you handling your documents? Sharing your experience helps everyone learn. Please feel free to leave a comment below, and if you found this useful, pass it along to someone else who might be wrestling with these same challenges. Let’s keep building smarter systems, together.





