Introduction
Basic RAG pipelines often rely purely on vector similarity search.
However, production AI systems rarely use vector search alone.
Instead, modern retrieval systems combine:
- Dense vector embeddings
- Keyword search
- Reranking models
This architecture is called Hybrid Search.
It significantly improves retrieval quality, especially for:
- Technical documentation
- Structured datasets
- Long enterprise knowledge bases
In this article we explore how to implement hybrid search and reranking in Python.
Why Pure Vector Search Fails
Vector embeddings capture semantic similarity, but they often struggle with:
- Exact keyword matching
- Rare tokens
- Identifiers
- Product names
Example query: "How to configure Redis maxmemory policy?"
Vector search might return documents about Redis configuration in general.
Keyword search, by contrast, matches the exact term maxmemory directly.
The solution is combining both retrieval methods.
Hybrid Retrieval Architecture
Production retrieval systems often look like this:
                    User Query
                         ↓
                 Query Processing
                         ↓
           ┌─────────────┴─────────────┐
           ↓                           ↓
┌─────────────────────┐     ┌─────────────────────┐
│    Vector Search    │     │   Keyword Search    │
│    (Embeddings)     │     │       (BM25)        │
└─────────────────────┘     └─────────────────────┘
           └─────────────┬─────────────┘
                         ↓
                  Result Merging
                         ↓
                     Reranker
                         ↓
                    Top K Docs
                         ↓
                    LLM Prompt
Step 1: Vector Search
Vector similarity search retrieves semantically relevant documents.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Sentence embedding model used for both documents and queries
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text):
    # Encode a single string into a 1-D embedding vector
    return model.encode([text])[0]

def vector_search(query, index, documents, k=5):
    # Embed the query and look up the k nearest documents in the FAISS index
    q = embed(query).reshape(1, -1).astype(np.float32)
    scores, ids = index.search(q, k)
    results = [documents[i] for i in ids[0]]
    return results
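The snippet above assumes a documents list and a FAISS index already exist. As a minimal sketch, with a purely illustrative corpus, they could be built like this:

# Illustrative corpus -- replace with your own documents
documents = [
    "Redis configuration overview",
    "Redis maxmemory policy guide",
    "PostgreSQL memory tuning basics",
]

# Embed every document and index the vectors with a flat L2 index
doc_vectors = np.array([embed(doc) for doc in documents], dtype=np.float32)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)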
Step 2: Keyword Search (BM25)
BM25 is a classic information retrieval algorithm.
from rank_bm25 import BM25Okapi

# Build a BM25 index over whitespace-tokenized documents
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

def keyword_search(query, k=5):
    # Score every document against the tokenized query and keep the top k
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_ids = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True
    )[:k]
    return [documents[i] for i in top_ids]
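With the illustrative corpus above, the exact-term query from earlier now matches the right document (output shown for illustration only):

print(keyword_search("Redis maxmemory policy", k=1))
# ['Redis maxmemory policy guide']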
Step 3: Hybrid Result Merging
Next we merge vector and keyword results.
Example:
Vector results:  [doc1, doc5, doc7]
Keyword results: [doc2, doc5, doc9]
Merged results (doc5 ranks first because both retrievers returned it; the rest are interleaved in rank order):
doc5
doc1
doc2
doc7
doc9
Example implementation:
from itertools import zip_longest

def hybrid_search(query, k=8):
    vector_results = vector_search(query, index, documents)
    keyword_results = keyword_search(query)
    # Documents returned by both retrievers are the strongest candidates
    merged = [doc for doc in vector_results if doc in keyword_results]
    # Interleave the remaining results, preserving each retriever's rank order
    for v_doc, k_doc in zip_longest(vector_results, keyword_results):
        for doc in (v_doc, k_doc):
            if doc is not None and doc not in merged:
                merged.append(doc)
    return merged[:k]
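A more principled merge, commonly used in production systems, is reciprocal rank fusion (RRF), which rewards documents that rank highly in several result lists. A minimal sketch (the constant 60 is the value typically used in the RRF literature):

def reciprocal_rank_fusion(result_lists, k=60, top_n=8):
    # Each document earns 1 / (k + rank) for every list it appears in,
    # so documents ranked well by multiple retrievers float to the top
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

It can be dropped into hybrid_search in place of the interleaving logic, e.g. reciprocal_rank_fusion([vector_results, keyword_results]).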
Step 4: Reranking
Hybrid retrieval still produces noisy results.
To improve quality we add a reranking model.
Rerankers evaluate query-document pairs.
Example:
Query: "Redis maxmemory policy"
Doc A: Redis configuration overview — Score: 0.42
Doc B: Redis maxmemory policy guide — Score: 0.91
Implementation example:
from sentence_transformers import CrossEncoder

# Cross-encoder reranking model trained on MS MARCO passage ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=4):
    # Score every (query, document) pair jointly with the cross-encoder
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score, highest first, and keep the strongest candidates
    ranked = sorted(
        zip(docs, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, _ in ranked[:top_n]]
For a deeper dive into reranking models for RAG, see our dedicated article on the topic.
Final Retrieval Pipeline
The full pipeline now looks like this:
User Query
    ↓
Query Processing (embed + tokenize)
    ↓
Vector Search + Keyword Search (run in parallel)
    ↓
Result Merge
    ↓
Reranking Model
    ↓
Top Documents
    ↓
LLM Prompt
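The final step, turning the top documents into an LLM prompt, is not shown in code above. A minimal sketch, where call_llm is a hypothetical stand-in for whichever LLM client you use:

def answer(query):
    # Retrieve, merge, and rerank as described above
    candidates = hybrid_search(query)
    top_docs = rerank(query, candidates)
    # Concatenate the reranked documents into a grounded prompt
    context = "\n\n".join(top_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return call_llm(prompt)  # call_llm is hypothetical, not a real library call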
Example API Endpoint
We expose the retrieval pipeline through FastAPI.
from fastapi import FastAPI

app = FastAPI()

@app.post("/search")
async def search(query: str):
    # Hybrid retrieval followed by cross-encoder reranking
    candidates = hybrid_search(query)
    ranked = rerank(query, candidates)
    return {
        "documents": ranked
    }
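Assuming the service is running locally on Uvicorn's default port (the URL below is illustrative), it can be called like this:

import requests

# The endpoint declares a plain str argument, so FastAPI reads it as a query parameter
response = requests.post(
    "http://localhost:8000/search",
    params={"query": "How to configure Redis maxmemory policy?"},
)
print(response.json()["documents"])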
Production Considerations
Hybrid search introduces additional complexity.
Key considerations include:
Latency
Multiple retrieval stages increase response time.
Common optimizations (see the sketch after this list):
- Caching query embeddings
- Limiting candidate pool
- Asynchronous retrieval
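A minimal sketch of the first and third optimizations, assuming the embed, vector_search, and keyword_search functions defined earlier:

import asyncio
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embed(text):
    # Repeated queries reuse their embedding instead of being re-encoded
    return embed(text)

async def retrieve_async(query):
    # Run both retrievers concurrently in worker threads,
    # since neither library exposes a native async API
    vector_results, keyword_results = await asyncio.gather(
        asyncio.to_thread(vector_search, query, index, documents),
        asyncio.to_thread(keyword_search, query),
    )
    return vector_results, keyword_results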
Observability
Important metrics:
- Retrieval latency
- Reranker latency
- Recall@k
- LLM answer accuracy
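Recall@k, for example, can be computed offline against a labeled evaluation set; the helper below is illustrative, not part of any library:

def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant documents that appear in the top-k retrieved results
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)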
Scaling
Large AI systems often separate:
- Retrieval Service
- Reranking Service
- LLM Service
This microservice architecture improves scalability.
Engineering Insight
One common mistake is sending too many documents to the LLM.
This increases cost and often reduces answer quality.
A strong reranker lets you send only the most relevant context, which is where combining hybrid search with reranking pays off in production RAG systems.
Conclusion
Hybrid retrieval architectures significantly improve the quality of RAG systems.
By combining:
- Vector embeddings
- Keyword search
- Reranking models
AI systems can retrieve more accurate context for LLM generation.
This architecture is widely used in production AI systems powering:
- Internal knowledge assistants
- Developer copilots
- Enterprise search platforms
Choosing the right vector database for your AI system is also crucial for implementing hybrid search effectively.