Introduction
Large Language Models rarely operate directly on raw data.
Instead, modern AI systems rely on embeddings — vector representations of text, documents, images, or structured records.
These embeddings power:
- Semantic search
- Recommendation systems
- Retrieval-Augmented Generation (RAG)
- Clustering and similarity analysis
- Knowledge retrieval pipelines
However, generating embeddings at scale introduces several engineering challenges:
- API rate limits
- Batch processing
- Storage optimization
- Incremental updates
- Failure recovery
In this article we'll explore how to build production-ready embedding pipelines in Python that are scalable, fault-tolerant, and efficient.
Why Embedding Pipelines Matter
In small prototypes, developers often generate embeddings on the fly:

```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
)
embedding = response.data[0].embedding
```
This works for dozens of documents.
But production systems often process:
- 10k–10M documents
- Streaming updates
- Continuous ingestion
Without a proper pipeline you will face:
- API bottlenecks
- Expensive recomputation
- Inconsistent embeddings
- Poor search performance
A production embedding pipeline solves this by introducing:
- Batch generation
- Queue processing
- Persistent storage
- Retry logic
- Incremental updates
Architecture of an Embedding Pipeline
A typical production architecture looks like this:
```
Raw Data
    │
    ▼
Data Cleaning
    │
    ▼
Chunking
    │
    ▼
Embedding Generation
    │
    ▼
Vector Storage
    │
    ▼
Semantic Search / RAG
```
Core components:
| Component | Purpose |
|---|---|
| Data ingestion | Load raw documents |
| Chunking | Split text into smaller pieces |
| Embedding generation | Convert text to vectors |
| Storage | Persist vectors in DB |
| Retrieval | Similarity search |
For many AI backends this pipeline runs continuously.
Step 1 — Chunking Documents for Embeddings
LLMs have context limits, so large documents must be split into smaller chunks.
Typical chunk sizes:
| Use Case | Chunk Size |
|---|---|
| RAG | 200–500 tokens |
| Search | 300–800 tokens |
| Knowledge base | 500–1000 tokens |
A simple character-based chunking implementation:

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of `size` characters, sharing `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks
```
Chunk overlap is important because:
- It preserves semantic continuity across chunk boundaries
- It improves retrieval quality
Without overlap important context may be lost. For more on chunking strategies for RAG, see our detailed guide.
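Note that the chunk-size table above is measured in tokens, while the example function splits by characters. A rough token-aware variant can be sketched with whitespace tokenization (an approximation — a real pipeline would count tokens with the embedding model's tokenizer, e.g. tiktoken; the name `chunk_by_tokens` is illustrative):

```python
def chunk_by_tokens(text, size=500, overlap=50):
    """Split text into overlapping chunks of roughly `size` whitespace tokens.

    Whitespace splitting only approximates model tokenization; swap in the
    embedding model's real tokenizer for accurate token counts.
    """
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        start += size - overlap  # step back by `overlap` tokens to keep context
    return chunks
```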
Step 2 — Generating Embeddings in Batches
Calling the embedding API with one request per document is inefficient.
Instead, we send batch requests.
Example with OpenAI:
```python
from openai import OpenAI

client = OpenAI()

def generate_embeddings(texts):
    # The API accepts a list of inputs and returns one embedding per item
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
```
Batching provides several advantages:
- Fewer API calls
- Less per-request overhead
- Better throughput
Typical production batch sizes:
32 – 128 documents (depending on model limits)
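The batch flow can be sketched as follows. `batched` is a small stdlib helper (Python 3.12 ships `itertools.batched` with the same behavior), and `embed_fn` stands in for a batch embedding call like `generate_embeddings` above:

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

def embed_corpus(texts, embed_fn, batch_size=64):
    """Embed `texts` in fixed-size batches.

    `embed_fn` maps a list of texts to a list of vectors, one per input.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Keeping the batch size configurable makes it easy to stay under a given model's per-request input limit.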
Step 3 — Async Embedding Processing
Embedding pipelines often process thousands of documents, and issuing synchronous requests one at a time quickly becomes a bottleneck.
Instead, we process requests concurrently with async I/O.
Example:
```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_text(text):
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

async def process_batch(texts):
    tasks = [embed_text(t) for t in texts]
    return await asyncio.gather(*tasks)
```
Benefits:
- High throughput
- Efficient use of API quota
- No idle time while requests are in flight
Async pipelines can process documents many times faster than synchronous implementations, since requests overlap instead of waiting in sequence.
Step 4 — Storing Embeddings in Vector Databases
After generation, embeddings must be stored for retrieval.
Popular storage options:
| Database | Use Case |
|---|---|
| pgvector | PostgreSQL integration |
| FAISS | Local vector search |
| Pinecone | Managed vector DB |
| Weaviate | Enterprise vector platform |
Example schema for PostgreSQL + pgvector:

```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)  -- dimension of text-embedding-3-small
);
```

Inserting an embedding:

```sql
INSERT INTO documents (content, embedding)
VALUES ($1, $2);
```

Index for fast similarity search:

```sql
CREATE INDEX idx_embedding
ON documents
USING ivfflat (embedding vector_cosine_ops);
```

(pgvector recommends creating IVFFlat indexes after the table already contains data, so the index lists reflect the real distribution.)
This enables millisecond-level semantic search. Learn more about vector databases for AI systems in our comparison guide.
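With the cosine index in place, retrieval is a single query; pgvector's `<=>` operator computes cosine distance, and `$1` here is the query embedding passed as a parameter:

```sql
SELECT id, content
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```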
Step 5 — Incremental Embedding Updates
Production datasets are not static.
New documents appear constantly.
Instead of recomputing everything, pipelines should support:
- Incremental ingestion
- Deduplication
- Update detection
Example strategy:

```python
if document_hash not in database:
    generate_embedding()
```
This prevents unnecessary API calls and reduces cost.
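A minimal sketch of this hash check, using a SHA-256 content hash and an in-memory set standing in for the database lookup (the function names are illustrative):

```python
import hashlib

def document_hash(text):
    """Stable content hash used to detect already-embedded documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_new_documents(docs, seen_hashes, embed_fn):
    """Embed only documents whose content hash has not been stored yet."""
    new_vectors = {}
    for doc in docs:
        h = document_hash(doc)
        if h not in seen_hashes:
            new_vectors[h] = embed_fn(doc)  # API call happens only for new content
            seen_hashes.add(h)
    return new_vectors
```

In a real pipeline, `seen_hashes` would be a hash column (with a unique index) in the same database that stores the vectors.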
Step 6 — Failure Handling and Retries
Embedding APIs can fail due to:
- Rate limits
- Network errors
- Timeouts
Production pipelines must include retry mechanisms.
Example with the tenacity library:

```python
import tenacity
from openai import OpenAI

client = OpenAI()

@tenacity.retry(
    wait=tenacity.wait_exponential(multiplier=1),  # exponential backoff between attempts
    stop=tenacity.stop_after_attempt(5),
)
def generate_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```
This ensures robustness under load.
Engineering Insight
A common mistake when building embedding systems is recomputing embeddings too often.
In production pipelines embeddings should be treated as immutable artifacts.
Instead of regenerating them:
- Version them
- Cache them
- Update only when the source document changes
This approach dramatically reduces API costs and pipeline latency.
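One way to realize this immutability in practice is to key cached embeddings by (model name, content hash), so a vector is recomputed only when the source text or the model changes. A sketch under that assumption — the cache here is an in-memory dict, but the same key works as a database column:

```python
import hashlib

def cache_key(model, text):
    """Embeddings are immutable for a given (model, content) pair."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return (model, digest)

def get_embedding(cache, model, text, embed_fn):
    """Return a cached vector, computing it only on first sight of this content."""
    key = cache_key(model, text)
    if key not in cache:
        cache[key] = embed_fn(text)  # the only path that hits the API
    return cache[key]
```

Switching embedding models naturally produces new cache keys, which is exactly the versioning behavior described above.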
Final Thoughts
Embedding pipelines are a core component of modern AI systems.
A production-grade architecture should include:
- Document chunking
- Batch embedding generation
- Async processing
- Vector database storage
- Incremental updates
- Failure handling
When implemented correctly, these pipelines enable:
- Scalable semantic search
- Efficient RAG systems
- Real-time AI retrieval
And most importantly — they transform raw data into machine-understandable knowledge. For a complete guide on building production RAG systems, see our comprehensive article.