Introduction
Large Language Models are powerful, but they have a major limitation: they don't know anything about your private data.
This is where Retrieval-Augmented Generation (RAG) comes in. Instead of relying only on the model's training data, RAG systems retrieve relevant information from a knowledge base and provide it as context to the LLM.
In this article we'll build a production-style RAG system in Python using:
- FastAPI for the API layer
- pgvector for vector search
- OpenAI embeddings for semantic indexing
- Async Python pipelines for scalability
By the end, you'll understand how to design a real-world RAG backend architecture, not just a toy demo.
1. RAG Architecture Overview
A typical RAG system consists of several layers:
User Query
↓
FastAPI API
↓
Retriever (Vector Search)
↓
Context Builder
↓
LLM Generation
↓
Final Answer
But production systems also include:
- Document ingestion
- Chunking
- Embedding pipelines
- Vector indexing
- Caching
- Monitoring
A simplified architecture has two paths:

Ingestion:

Documents
↓
Chunking
↓
Embeddings
↓
Vector Database (pgvector)

Query:

User Query
↓
Embedding
↓
Vector Search
↓
Context Assembly
↓
LLM
↓
Response
2. Setting Up the Project
Install dependencies:
pip install fastapi uvicorn asyncpg openai tiktoken
You'll also need PostgreSQL with pgvector installed.
Example Docker setup:
docker run -d \
-p 5432:5432 \
-e POSTGRES_PASSWORD=password \
ankane/pgvector
3. Designing the Vector Database Schema
We store document chunks with their embeddings.
Enable the extension, then create the table:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)
);
Add an index for fast similarity search:

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Note that ivfflat builds its cluster centroids from existing rows, so create this index after the table has data in it.
This allows efficient semantic search over embeddings.
4. Building the Embedding Pipeline
First, create a function that generates embeddings.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_text(text: str):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Now we can embed documents before storing them.
5. Document Chunking
Embedding models and retrievers work better with smaller, focused chunks: each chunk stays on one topic, and the retrieved chunks fit comfortably in the LLM's context window.
Example chunking function:
def chunk_text(text, chunk_size=500):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])
Then we embed and store each chunk.
async def index_document(conn, text):
    for chunk in chunk_text(text):
        embedding = await embed_text(chunk)
        # asyncpg has no built-in codec for the vector type, so pass the
        # embedding in its text form and cast it in SQL
        await conn.execute(
            """
            INSERT INTO documents (content, embedding)
            VALUES ($1, $2::vector)
            """,
            chunk,
            str(embedding)
        )
This creates the vector knowledge base.
6. Implementing Vector Search
When a user asks a question, we:
- Embed the query
- Search the vector database
- Retrieve the most similar chunks
Example:
async def search_documents(conn, query_embedding, k=5):
    rows = await conn.fetch(
        """
        SELECT content
        FROM documents
        ORDER BY embedding <=> $1::vector
        LIMIT $2
        """,
        str(query_embedding),
        k
    )
    return [r["content"] for r in rows]

The <=> operator computes cosine distance, which matches the vector_cosine_ops index created earlier. (The <-> operator is Euclidean distance and would not use that index.)
7. Context Assembly
Now we combine retrieved documents into a prompt.
def build_context(docs):
    return "\n\n".join(docs)
Example prompt:
Use the following context to answer the question.
Context:
{context}
Question:
{question}
8. Generating the Final Answer
Now we call the LLM.
async def generate_answer(context, question):
    prompt = f"""
    Use the following context to answer the question.

    Context:
    {context}

    Question:
    {question}
    """
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
9. Building the FastAPI Endpoint
Now we connect everything together.
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.on_event("startup")
async def startup():
    app.state.db = await asyncpg.create_pool(
        "postgresql://postgres:password@localhost:5432/postgres"
    )

@app.post("/ask")
async def ask(question: str):
    query_embedding = await embed_text(question)
    async with app.state.db.acquire() as conn:
        docs = await search_documents(conn, query_embedding)
    context = build_context(docs)
    answer = await generate_answer(context, question)
    return {"answer": answer}
This endpoint implements the full RAG pipeline.
10. Production Improvements
The simple version works, but real systems need additional optimizations.
1. Async Ingestion Pipelines
Process large document collections concurrently.
documents
↓
async ingestion
↓
embedding workers
↓
vector indexing
2. Batch Embeddings
Embedding one chunk per API request wastes most of its time on network overhead. Batching many chunks into a single request dramatically improves throughput.
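The embeddings endpoint accepts a list of inputs, so the ingestion pipeline can embed chunks in batches. A sketch, reusing the `client` created in the embedding pipeline section; the batch size of 100 is an arbitrary choice, not an API requirement:

```python
def batched(items, batch_size=100):
    """Split a list into consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

async def embed_batch(texts, batch_size=100):
    """Embed many chunks with one API request per batch.

    Assumes `client` is the AsyncOpenAI instance defined earlier.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,  # the endpoint accepts a list of strings
        )
        embeddings.extend(d.embedding for d in response.data)
    return embeddings
```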
3. Caching
Cache:
- Embeddings
- Retrieval results
- LLM responses
Common choices include Redis for exact-match caching, semantic caches that match similar queries, and prompt caches offered by LLM providers.
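As a concrete illustration of the first item, an embedding cache can wrap any async embedding function, keyed on a hash of the text so repeated chunks and repeated queries skip the API entirely. This wrapper is a hypothetical sketch, not a library API; in production the dict would typically be Redis with a TTL:

```python
import hashlib

def cached_embedder(embed_fn):
    """Wrap an async embedding function with an in-memory cache."""
    cache = {}

    async def embed(text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in cache:
            cache[key] = await embed_fn(text)
        return cache[key]

    return embed
```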
4. Reranking
Vector search sometimes retrieves noisy documents. A reranker model improves accuracy.
Vector Search
↓
Top 20 results
↓
Reranker
↓
Top 5 results
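The rerank step itself is just "score and keep the best". The sketch below is model-agnostic: `score_fn` is a stand-in for whatever scoring model you choose (a sentence-transformers cross-encoder, a reranking API, etc.), which is an assumption rather than part of the pipeline above.

```python
def rerank(query, docs, score_fn, top_k=5):
    """Re-order candidate docs by score_fn(query, doc), highest first."""
    scored = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_k]
```

A toy usage, scoring by shared words:

```python
score = lambda q, d: sum(w in d for w in q.split())
top = rerank("apple", ["apple pie", "car engine", "apple tart"], score, top_k=2)
# keeps the two apple documents
```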
5. Monitoring
Track:
- LLM latency
- Token usage
- Retrieval quality
- Errors
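A minimal in-process latency tracker illustrates the idea; this recorder is a hypothetical sketch, and real deployments would export these numbers to Prometheus or a similar system:

```python
import statistics

class LatencyTracker:
    """Record request durations (in seconds) and summarize them."""

    def __init__(self):
        self.samples = []

    def record(self, seconds):
        self.samples.append(seconds)

    def summary(self):
        # quantiles() needs at least two samples
        return {
            "count": len(self.samples),
            "mean": statistics.mean(self.samples),
            "p95": statistics.quantiles(self.samples, n=20)[-1],
        }
```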
Observability becomes critical as systems scale.
11. Final Architecture
A production RAG system typically looks like this:
+-------------+
User Request → | FastAPI |
+-------------+
↓
+-------------+
| Retriever |
+-------------+
↓
+-------------+
| Vector DB |
| pgvector |
+-------------+
↓
+-------------+
| Context |
| Builder |
+-------------+
↓
+-------------+
| OpenAI LLM |
+-------------+
↓
Response
Conclusion
Retrieval-Augmented Generation is quickly becoming the standard architecture for AI applications.
A production-ready RAG system typically includes:
- Document ingestion pipelines
- Chunking strategies
- Embedding pipelines
- Vector databases
- Async APIs
- Monitoring and scaling infrastructure
Python provides an excellent ecosystem for building these systems, especially with frameworks like FastAPI and databases like pgvector. Understanding the trade-offs between storage options such as pgvector, FAISS, and Pinecone helps you choose the right one for your workload.
As LLM applications continue to grow, engineers who understand RAG system design and AI backend architecture will be increasingly in demand.