Introduction
Large Language Models enable powerful AI applications, but they introduce two major engineering challenges:
- High latency
- High operational cost
Each LLM request may take hundreds of milliseconds or even several seconds. When systems scale to thousands or millions of queries per day, these costs quickly become significant.
One effective optimization technique is semantic caching.
Unlike traditional caching mechanisms that rely on exact string matches, semantic caching stores responses based on meaning similarity. If a new query is semantically similar to a previously answered question, the system can reuse the cached result instead of calling the LLM again.
This approach can dramatically reduce:
- LLM API usage
- Response latency
- Infrastructure cost
In this article we will explore:
- How semantic caching works
- How to implement it in Python
- How to integrate it into RAG systems
- Engineering considerations for production systems
Why Traditional Caching Fails for LLM Systems
Traditional caching works well when requests are identical.
Example:
GET /users/123
But natural language queries rarely repeat exactly.
For example:
"What is the refund policy?"
vs
"Can you explain how refunds work?"
Although the meaning is the same, a traditional cache treats them as different keys.
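The mismatch is easy to demonstrate with an ordinary dictionary cache (a toy sketch; the stored answer is invented for illustration):

```python
# A plain dict cache keyed on the raw query string: two paraphrases of
# the same question produce a miss, illustrating why exact-match
# caching fails for natural language.
exact_cache = {
    "What is the refund policy?": "Refunds are available within 30 days."
}

def lookup_exact(query: str):
    return exact_cache.get(query)

hit = lookup_exact("What is the refund policy?")        # exact string: hit
miss = lookup_exact("Can you explain how refunds work?") # paraphrase: miss
```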
Semantic caching solves this problem by using vector similarity.
Semantic Cache Architecture
Instead of storing responses by string keys, semantic caches store:
- Query embeddings
- LLM responses
When a new query arrives, the system checks whether a similar query already exists.
Architecture Overview
User Query
↓
Embedding Generation
↓
Semantic Cache Lookup
↓
Cache Hit → Return Cached Response
↓
Cache Miss → Run RAG / LLM
↓
Store Response in Cache
This reduces unnecessary LLM calls when similar questions appear.
Where Semantic Caching Fits in the LLM Pipeline
Semantic caching typically sits before the RAG or LLM generation step.
Example Production Architecture
User Query
↓
API Layer (FastAPI)
↓
Semantic Cache
↓
Retriever (Vector Search)
↓
Prompt Construction
↓
LLM Generation
If the cache returns a result, the system skips the expensive generation step entirely.
Step 1: Generating Query Embeddings
To implement semantic caching we first need to convert queries into vector embeddings.
Example using OpenAI embeddings:
```python
from openai import OpenAI

client = OpenAI()

def create_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```
Each query is converted into a high-dimensional vector representation.
These vectors allow us to compute semantic similarity between queries.
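The usual similarity measure is cosine similarity. A minimal pure-Python sketch of the computation (production systems typically use NumPy or the database's vector operators instead):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
```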
Step 2: Designing the Cache Storage
A semantic cache typically stores:
- query_embedding
- query_text
- llm_response
- timestamp
Example schema using PostgreSQL with pgvector:
```sql
CREATE TABLE semantic_cache (
    id SERIAL PRIMARY KEY,
    query TEXT,
    embedding VECTOR(1536),
    response TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```
This allows us to perform vector similarity search directly inside the database.
Step 3: Implementing Semantic Cache Lookup
When a query arrives:
- Generate embedding
- Search for similar queries
- Return cached result if similarity is high enough
Example implementation:
```python
def search_cache(query_embedding, db, threshold=0.9):
    # `<=>` is pgvector's cosine-distance operator, so 1 - distance
    # gives cosine similarity. Assumes the pgvector adapter is registered
    # and the cursor returns dict-style rows (e.g. psycopg2's RealDictCursor).
    result = db.execute("""
        SELECT query, response,
               1 - (embedding <=> %s) AS similarity
        FROM semantic_cache
        ORDER BY embedding <=> %s
        LIMIT 1
    """, (query_embedding, query_embedding))
    row = result.fetchone()
    if row and row["similarity"] > threshold:
        return row["response"]
    return None
```
If similarity exceeds the threshold, the cached response is reused.
Step 4: Running the LLM on Cache Miss
If no similar query exists in the cache, the system proceeds with the normal LLM pipeline.
Example:
```python
def generate_answer(prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
```
After generating the response, we store it in the cache.
Step 5: Storing Responses in the Semantic Cache
```python
def store_cache(query, embedding, response, db):
    db.execute("""
        INSERT INTO semantic_cache
        (query, embedding, response)
        VALUES (%s, %s, %s)
    """, (query, embedding, response))
    db.commit()
```
Over time, the cache accumulates answers to frequently asked questions.
Full Request Flow
Putting everything together:
```python
def handle_query(question):
    embedding = create_embedding(question)
    cached = search_cache(embedding, db)
    if cached:
        return cached
    response = generate_answer(question)
    store_cache(question, embedding, response, db)
    return response
```
This simple pipeline can significantly reduce repeated LLM calls.
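The same flow can be exercised without any external services. The sketch below swaps in a toy word-overlap "embedder" and a stubbed LLM (both invented for illustration) so the hit/miss behavior is visible end to end:

```python
CACHE = []       # list of (embedding, response) pairs
THRESHOLD = 0.5  # deliberately loose for the toy similarity measure

def toy_embed(text: str) -> set:
    # Stand-in for a real embedding model: a bag of lowercase words.
    return set(text.lower().split())

def toy_similarity(a: set, b: set) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def toy_llm(question: str) -> str:
    return f"Answer to: {question}"

def handle_query_toy(question: str):
    embedding = toy_embed(question)
    for cached_emb, cached_resp in CACHE:
        if toy_similarity(embedding, cached_emb) >= THRESHOLD:
            return cached_resp, True   # cache hit
    response = toy_llm(question)
    CACHE.append((embedding, response))
    return response, False             # cache miss

first, hit1 = handle_query_toy("what is the refund policy")
second, hit2 = handle_query_toy("what is the refund policy please")
```

The second query is not an exact match, yet it reuses the first answer because the similarity clears the threshold.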
Latency Improvements
Without caching:
Query
↓
Embedding (50 ms)
↓
Vector search (30 ms)
↓
LLM generation (1000–3000 ms)
With semantic caching:
Query
↓
Embedding (50 ms)
↓
Cache lookup (10 ms)
↓
Return response
In many systems, semantic caching reduces latency by 10–50x for repeated queries.
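Using the illustrative timings above (with 2,000 ms as a mid-range generation time), the speedup for a cache hit works out to roughly:

```python
uncached_ms = 50 + 30 + 2000  # embedding + vector search + LLM generation
cached_ms = 50 + 10           # embedding + cache lookup
speedup = uncached_ms / cached_ms
# With these numbers, a cache hit is roughly 35x faster.
```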
Cost Optimization
LLM usage is often the most expensive part of AI systems.
Example:
1M requests per month
$0.002 per request
= $2000/month
If semantic caching handles 40% of queries, the cost drops to:
600k LLM requests
≈ $1200/month
Savings scale linearly with traffic.
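The arithmetic generalizes to a small helper (the request volume, per-request cost, and hit rate below are the article's illustrative numbers):

```python
def monthly_llm_cost(requests: int, cost_per_request: float,
                     cache_hit_rate: float) -> float:
    # Only cache misses reach the LLM, so only they are billed.
    misses = requests * (1 - cache_hit_rate)
    return misses * cost_per_request

baseline = monthly_llm_cost(1_000_000, 0.002, 0.0)    # no caching
with_cache = monthly_llm_cost(1_000_000, 0.002, 0.4)  # 40% hit rate
```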
Engineering Considerations
Cache Similarity Threshold
Setting the similarity threshold too low can produce incorrect responses (false cache hits).
Typical values:
0.85 – 0.95
Higher values improve accuracy but reduce cache hits.
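The trade-off is visible on a handful of hypothetical best-match scores: raising the threshold discards borderline matches along with the risky ones.

```python
def count_hits(similarities, threshold: float) -> int:
    # Each score is the best-match similarity for one incoming query.
    return sum(1 for s in similarities if s > threshold)

scores = [0.99, 0.92, 0.88, 0.86, 0.70]  # hypothetical best-match scores
hits_loose = count_hits(scores, 0.85)    # more hits, but 0.86 may be wrong
hits_strict = count_hits(scores, 0.95)   # fewer hits, higher precision
```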
Cache Expiration
Knowledge may change over time. Cache entries should expire periodically.
Example strategies:
- Time-based expiration
- Manual invalidation
- Versioned cache
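Time-based expiration can be as simple as filtering on entry age. A minimal sketch, assuming each cache entry carries the `created_at` timestamp from the schema above:

```python
from datetime import datetime, timedelta

def is_expired(created_at: datetime, ttl: timedelta, now: datetime) -> bool:
    # An entry is stale once its age exceeds the TTL.
    return now - created_at > ttl

now = datetime(2024, 1, 10)
fresh = is_expired(datetime(2024, 1, 9), timedelta(days=7), now)  # 1 day old
stale = is_expired(datetime(2024, 1, 1), timedelta(days=7), now)  # 9 days old
```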
Avoiding Stale Knowledge
When the knowledge base changes, cached responses may become outdated.
Solutions include:
- Cache invalidation on document updates
- Storing document version metadata
- TTL-based cache expiration
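Version metadata makes invalidation cheap: tag each cache entry with the knowledge-base version that produced it, and treat entries from older versions as misses. A minimal sketch (the field names are illustrative):

```python
KB_VERSION = 2  # bumped whenever the source documents change

def is_valid(entry: dict, current_version: int) -> bool:
    # Entries written against an older knowledge base are ignored.
    return entry["kb_version"] == current_version

old_entry = {"response": "...", "kb_version": 1}
new_entry = {"response": "...", "kb_version": 2}
```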
Semantic Caching in RAG Systems
Semantic caching works especially well in Retrieval-Augmented Generation systems.
Example architecture:
User Query
↓
Semantic Cache
↓
RAG Retriever
↓
Context Injection
↓
LLM
In many production RAG systems:
- 30–60% of queries are cacheable
- Large enterprise assistants see massive latency improvements
When building RAG systems in Python, adding semantic caching can significantly reduce latency and costs.
Engineering Insight
A common mistake is implementing semantic caching after the LLM call.
Instead, caching should occur before expensive operations, including:
- Document retrieval
- Reranking
- LLM generation
Correct architecture:
Query
↓
Semantic Cache
↓
Retriever
↓
Reranker
↓
LLM
This ensures that the system avoids unnecessary computation as early as possible.
Proper monitoring of LLM systems helps track cache hit rates and optimize performance.
Conclusion
Semantic caching is one of the most effective techniques for optimizing LLM systems.
By storing responses based on semantic similarity, AI systems can:
- Dramatically reduce LLM calls
- Improve response latency
- Lower infrastructure costs
In production environments, semantic caching often becomes a core component of AI system architecture.
When combined with RAG pipelines, vector databases, and efficient backend APIs, it enables scalable and cost-efficient AI applications. Understanding proper LLM API design is crucial for integrating caching effectively.
As LLM-powered systems continue to grow in scale, semantic caching will remain a critical optimization layer for production AI infrastructure.