Introduction
Large Language Models are powerful, but they have a major limitation: they don't know anything about your private data.
This is where Retrieval-Augmented Generation (RAG) comes in. Instead of relying only on the model's training data, RAG systems retrieve relevant information from a knowledge base and provide it as context to the LLM.
In this article we'll build a production-style RAG system in Python using:
- FastAPI for the API layer
- pgvector for vector search
- OpenAI embeddings for semantic indexing
- Async Python pipelines for scalability
By the end, you'll understand how to design a real-world RAG backend architecture, not just a toy demo.
1. RAG Architecture Overview
A typical RAG system consists of several layers:
User Query
↓
FastAPI API
↓
Retriever (Vector Search)
↓
Context Builder
↓
LLM Generation
↓
Final Answer
But production systems also include:
- Document ingestion
- Chunking
- Embedding pipelines
- Vector indexing
- Caching
- Monitoring
A simplified architecture has two paths:

Ingestion:

Documents
↓
Chunking
↓
Embeddings
↓
Vector Database (pgvector)

Query:

User Query
↓
Embedding
↓
Vector Search
↓
Context Assembly
↓
LLM
↓
Response
2. Setting Up the Project
Install dependencies:
pip install fastapi uvicorn asyncpg openai tiktoken
You'll also need PostgreSQL with pgvector installed.
Example Docker setup:
docker run -d \
-p 5432:5432 \
-e POSTGRES_PASSWORD=password \
ankane/pgvector
3. Designing the Vector Database Schema
We store document chunks with their embeddings.
Enable the extension, then create the table:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)
);
Add an index for fast similarity search:

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Note that ivfflat builds its cluster centroids from existing rows, so create this index after the table has data in it.
This allows efficient semantic search over embeddings.
4. Building the Embedding Pipeline
First, create a function that generates embeddings.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_text(text: str):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Now we can embed documents before storing them.
5. Document Chunking
Embedding models and retrievers work better with smaller, focused chunks: each chunk stays on one topic, and the retrieved chunks fit comfortably in the LLM's context window.
Example chunking function:
def chunk_text(text, chunk_size=500):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])
Then we embed and store each chunk.
async def index_document(conn, text):
    for chunk in chunk_text(text):
        embedding = await embed_text(chunk)
        # asyncpg has no built-in codec for the vector type, so pass the
        # embedding in its text form and cast it in SQL
        await conn.execute(
            """
            INSERT INTO documents (content, embedding)
            VALUES ($1, $2::vector)
            """,
            chunk,
            str(embedding)
        )
This creates the vector knowledge base.
6. Implementing Vector Search
When a user asks a question, we:
- Embed the query
- Search the vector database
- Retrieve the most similar chunks
Example:
async def search_documents(conn, query_embedding, k=5):
    rows = await conn.fetch(
        """
        SELECT content
        FROM documents
        ORDER BY embedding <=> $1::vector
        LIMIT $2
        """,
        str(query_embedding),
        k
    )
    return [r["content"] for r in rows]

The <=> operator computes cosine distance, which matches the vector_cosine_ops index created earlier. (The <-> operator is Euclidean distance and would not use that index.)
7. Context Assembly
Now we combine retrieved documents into a prompt.
def build_context(docs):
    return "\n\n".join(docs)
Example prompt:
Use the following context to answer the question.
Context:
{context}
Question:
{question}
8. Generating the Final Answer
Now we call the LLM.
async def generate_answer(context, question):
    prompt = f"""
    Use the following context to answer the question.

    Context:
    {context}

    Question:
    {question}
    """
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
9. Building the FastAPI Endpoint
Now we connect everything together.
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.on_event("startup")
async def startup():
    app.state.db = await asyncpg.create_pool(
        "postgresql://postgres:password@localhost:5432/postgres"
    )

@app.post("/ask")
async def ask(question: str):
    query_embedding = await embed_text(question)
    async with app.state.db.acquire() as conn:
        docs = await search_documents(conn, query_embedding)
    context = build_context(docs)
    answer = await generate_answer(context, question)
    return {"answer": answer}
This endpoint implements the full RAG pipeline.
10. Production Improvements
The simple version works, but real systems need additional optimizations.
1. Async Ingestion Pipelines
Process large document collections concurrently.
documents
↓
async ingestion
↓
embedding workers
↓
vector indexing
2. Batch Embeddings
Embedding one chunk per API request wastes most of its time on network overhead. Batching many chunks into a single request dramatically improves throughput.
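The embeddings endpoint accepts a list of inputs, so the ingestion pipeline can embed chunks in batches. A sketch, reusing the `client` created in the embedding pipeline section; the batch size of 100 is an arbitrary choice, not an API requirement:

```python
def batched(items, batch_size=100):
    """Split a list into consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

async def embed_batch(texts, batch_size=100):
    """Embed many chunks with one API request per batch.

    Assumes `client` is the AsyncOpenAI instance defined earlier.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,  # the endpoint accepts a list of strings
        )
        embeddings.extend(d.embedding for d in response.data)
    return embeddings
```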
3. Caching
Cache:
- Embeddings
- Retrieval results
- LLM responses
Common choices include Redis for exact-match caching, semantic caches that match similar queries, and prompt caches offered by LLM providers.
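As a concrete illustration of the first item, an embedding cache can wrap any async embedding function, keyed on a hash of the text so repeated chunks and repeated queries skip the API entirely. This wrapper is a hypothetical sketch, not a library API; in production the dict would typically be Redis with a TTL:

```python
import hashlib

def cached_embedder(embed_fn):
    """Wrap an async embedding function with an in-memory cache."""
    cache = {}

    async def embed(text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in cache:
            cache[key] = await embed_fn(text)
        return cache[key]

    return embed
```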
4. Reranking
Vector search sometimes retrieves noisy documents. A reranker model improves accuracy.
Vector Search
↓
Top 20 results
↓
Reranker
↓
Top 5 results
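The rerank step itself is just "score and keep the best". The sketch below is model-agnostic: `score_fn` is a stand-in for whatever scoring model you choose (a sentence-transformers cross-encoder, a reranking API, etc.), which is an assumption rather than part of the pipeline above.

```python
def rerank(query, docs, score_fn, top_k=5):
    """Re-order candidate docs by score_fn(query, doc), highest first."""
    scored = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_k]
```

A toy usage, scoring by shared words:

```python
score = lambda q, d: sum(w in d for w in q.split())
top = rerank("apple", ["apple pie", "car engine", "apple tart"], score, top_k=2)
# keeps the two apple documents
```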
5. Monitoring
Track:
- LLM latency
- Token usage
- Retrieval quality
- Errors
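A minimal in-process latency tracker illustrates the idea; this recorder is a hypothetical sketch, and real deployments would export these numbers to Prometheus or a similar system:

```python
import statistics

class LatencyTracker:
    """Record request durations (in seconds) and summarize them."""

    def __init__(self):
        self.samples = []

    def record(self, seconds):
        self.samples.append(seconds)

    def summary(self):
        # quantiles() needs at least two samples
        return {
            "count": len(self.samples),
            "mean": statistics.mean(self.samples),
            "p95": statistics.quantiles(self.samples, n=20)[-1],
        }
```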
Observability becomes critical as systems scale.
11. Final Architecture
A production RAG system typically looks like this:
+-------------+
User Request → | FastAPI |
+-------------+
↓
+-------------+
| Retriever |
+-------------+
↓
+-------------+
| Vector DB |
| pgvector |
+-------------+
↓
+-------------+
| Context |
| Builder |
+-------------+
↓
+-------------+
| OpenAI LLM |
+-------------+
↓
Response
Conclusion
Retrieval-Augmented Generation is quickly becoming the standard architecture for AI applications.
A production-ready RAG system typically includes:
- Document ingestion pipelines
- Chunking strategies
- Embedding pipelines
- Vector databases
- Async APIs
- Monitoring and scaling infrastructure
Python provides an excellent ecosystem for building these systems, especially with frameworks like FastAPI and databases like pgvector. Understanding the trade-offs between storage options such as pgvector, FAISS, and Pinecone helps you choose the right one for your workload.
As LLM applications continue to grow, engineers who understand RAG system design and AI backend architecture will be increasingly in demand.