Advanced RAG: Hybrid Search and Reranking in Production AI Systems

Introduction

Basic RAG pipelines often rely purely on vector similarity search.

However, production AI systems rarely use vector search alone.

Instead, modern retrieval systems combine:

  • Dense vector embeddings
  • Keyword search
  • Reranking models

This architecture is called Hybrid Search.

It significantly improves retrieval quality, especially for:

  • Technical documentation
  • Structured datasets
  • Long enterprise knowledge bases

In this article, we walk through implementing hybrid search and reranking in Python.

Why Pure Vector Search Fails

Vector embeddings capture semantic similarity, but they often struggle with:

  • Exact keyword matching
  • Rare tokens
  • Identifiers
  • Product names

Example query:

How to configure Redis maxmemory policy?

Vector search might return documents about Redis configuration in general.

But keyword search will find documents containing maxmemory policy directly.

The solution is combining both retrieval methods.

Hybrid Retrieval Architecture

Production retrieval systems often look like this:

              User Query
                  ↓
            Query Processing
              ↓         ↓
┌─────────────────┐  ┌─────────────────┐
│ Vector Search   │  │ Keyword Search  │
│ (Embeddings)    │  │ (BM25)          │
└─────────────────┘  └─────────────────┘
              ↓         ↓
            Result Merging
                  ↓
               Reranker
                  ↓
              Top K Docs
                  ↓
              LLM Prompt

Step 1: Vector Search

Vector similarity search retrieves semantically relevant documents.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# The corpus of text chunks to index.
documents = [...]  # fill with your document strings

# Build the index once. Normalized embeddings with an inner-product
# index give cosine similarity.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(np.asarray(doc_embeddings, dtype=np.float32))

def embed(text):
    return model.encode([text], normalize_embeddings=True)[0]

def vector_search(query, index, documents, k=5):
    q = embed(query).reshape(1, -1).astype(np.float32)
    scores, ids = index.search(q, k)
    return [documents[i] for i in ids[0]]
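
A sample call, reusing the example query from earlier:

results = vector_search(
    "How to configure Redis maxmemory policy?", index, documents, k=5
)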

Step 2: Keyword Search (BM25)

BM25 is a classic lexical ranking algorithm: it scores documents by term frequency, inverse document frequency, and document length, so exact tokens like maxmemory match directly.
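
For reference, the standard BM25 score of document d for query q is (in LaTeX notation):

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
    \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot |d| / \mathrm{avgdl}\right)}

where f(t, d) is the frequency of term t in d, |d| is the document length, avgdl is the average document length in the corpus, and k1 ≈ 1.5, b ≈ 0.75 are the usual defaults (also the defaults in rank_bm25's BM25Okapi).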

from rank_bm25 import BM25Okapi

# Whitespace tokenization keeps the example simple; production systems
# usually lowercase and strip punctuation as well.
tokenized_docs = [doc.lower().split() for doc in documents]

bm25 = BM25Okapi(tokenized_docs)

def keyword_search(query, k=5):
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    top_ids = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True
    )[:k]
    return [documents[i] for i in top_ids]

Step 3: Hybrid Result Merging

Next we merge vector and keyword results.

Example:

Vector results:   [doc1, doc5, doc7]
Keyword results:  [doc2, doc5, doc9]

Merged results (doc5 ranks first because both retrievers returned it):
doc5
doc1
doc2
doc7
doc9

A simple way to produce this ordering is Reciprocal Rank Fusion (RRF): each list contributes a score of 1 / (k + rank), which rewards documents that rank highly in either list and boosts those that appear in both.

def rrf_merge(result_lists, k=60, top_n=8):
    # Each list contributes 1 / (k + rank) to a document's fused score,
    # so documents returned by both retrievers rise to the top.
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def hybrid_search(query):
    vector_results = vector_search(query, index, documents)
    keyword_results = keyword_search(query)
    return rrf_merge([vector_results, keyword_results])

Step 4: Reranking

Hybrid retrieval still produces noisy candidates.

To improve quality, we add a reranking model.

Rerankers score each query-document pair jointly (typically with a cross-encoder), which is slower than comparing embeddings but markedly more accurate.

Example:

Query: "Redis maxmemory policy"

Doc A: Redis configuration overview — Score: 0.42
Doc B: Redis maxmemory policy guide — Score: 0.91

Implementation example:

from sentence_transformers import CrossEncoder

# A cross-encoder fine-tuned on the MS MARCO passage-ranking dataset.
reranker = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"
)

def rerank(query, docs):
    # Score every (query, doc) pair jointly, then keep the top 4.
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(docs, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, _ in ranked[:4]]
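
To inspect raw scores for the example above (actual values depend on the model):

pairs = [
    ("Redis maxmemory policy", "Redis configuration overview"),
    ("Redis maxmemory policy", "Redis maxmemory policy guide"),
]
print(reranker.predict(pairs))  # higher score = more relevant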

For a deeper dive into reranking models for RAG, see our dedicated article on the topic.

Final Retrieval Pipeline

The full pipeline now looks like this:

        User Query
        ↓        ↓
Vector Search   Keyword Search
 (embedding)       (BM25)
        ↓        ↓
       Result Merge
            ↓
      Reranking Model
            ↓
       Top Documents
            ↓
        LLM Prompt

Example API Endpoint

We expose the retrieval pipeline through FastAPI.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str

@app.post("/search")
def search(request: SearchRequest):
    # A plain (non-async) handler lets FastAPI run this CPU-bound
    # pipeline in its threadpool instead of blocking the event loop.
    candidates = hybrid_search(request.query)
    ranked = rerank(request.query, candidates)
    return {"documents": ranked}
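
Assuming the module is saved as main.py and served locally with uvicorn (uvicorn main:app), a hypothetical client call looks like:

import requests

response = requests.post(
    "http://localhost:8000/search",
    json={"query": "How to configure Redis maxmemory policy?"},
)
print(response.json()["documents"])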

Production Considerations

Hybrid search introduces additional complexity.

Key considerations include:

Latency

Multiple retrieval stages increase response time.

Common optimizations (sketched after this list):

  • Caching query embeddings
  • Limiting candidate pool
  • Asynchronous retrieval
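
As a sketch of the first and third items, assuming the model, index, documents, and search functions defined earlier (embed_cached and hybrid_search_async are hypothetical helpers):

import asyncio
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text: str):
    # Repeated or popular queries skip the embedding model entirely.
    return model.encode([text], normalize_embeddings=True)[0]

async def hybrid_search_async(query: str):
    loop = asyncio.get_running_loop()
    # FAISS and BM25 calls are blocking, so run both retrievers
    # concurrently in the default threadpool.
    vector_task = loop.run_in_executor(
        None, vector_search, query, index, documents
    )
    keyword_task = loop.run_in_executor(None, keyword_search, query)
    vector_results, keyword_results = await asyncio.gather(
        vector_task, keyword_task
    )
    return rrf_merge([vector_results, keyword_results])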

Observability

Important metrics to track (a Recall@k sketch follows the list):

  • Retrieval latency
  • Reranker latency
  • Recall@k
  • LLM answer accuracy
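
Recall@k in particular needs a labeled evaluation set; the metric itself is simple:

def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the known-relevant documents that appear
    # in the top-k retrieved results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)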

Scaling

Large AI systems often separate:

  • Retrieval Service
  • Reranking Service
  • LLM Service

Separating these stages into services lets each scale independently and run on hardware suited to it.

Engineering Insight

One common mistake is sending too many documents to the LLM.

This increases cost and often reduces answer quality, since relevant passages get diluted in an overly long context.

A strong reranker lets you send only the handful of most relevant passages, keeping prompts small and answers focused.

Conclusion

Hybrid retrieval architectures significantly improve the quality of RAG systems.

By combining:

  • Vector embeddings
  • Keyword search
  • Reranking models

AI systems can retrieve more accurate context for LLM generation.

This architecture is widely used in production AI systems powering:

  • Internal knowledge assistants
  • Developer copilots
  • Enterprise search platforms

Choosing the right vector database for your AI system is also crucial for implementing hybrid search effectively.
