Introduction
Large Language Models enable powerful AI applications, but they introduce two major engineering challenges:
- High latency
- High operational cost
Each LLM request may take hundreds of milliseconds or even several seconds. When systems scale to thousands or millions of queries per day, these costs quickly become significant.
One effective optimization technique is semantic caching.
Unlike traditional caching mechanisms that rely on exact string matches, semantic caching stores responses based on meaning similarity. If a new query is semantically similar to a previously answered question, the system can reuse the cached result instead of calling the LLM again.
This approach can dramatically reduce:
- LLM API usage
- Response latency
- Infrastructure cost
In this article we will explore:
- How semantic caching works
- How to implement it in Python
- How to integrate it into RAG systems
- Engineering considerations for production systems
Why Traditional Caching Fails for LLM Systems
Traditional caching works well when requests are identical.
Example:
GET /users/123
But natural language queries rarely repeat exactly.
For example:
"What is the refund policy?"
vs
"Can you explain how refunds work?"
Although the meaning is the same, a traditional cache treats them as different keys.
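The mismatch is easy to demonstrate with an ordinary dictionary cache (a toy sketch; the stored answer is invented for illustration):

```python
# A plain dict cache keyed on the raw query string: two paraphrases of
# the same question produce a miss, illustrating why exact-match
# caching fails for natural language.
exact_cache = {
    "What is the refund policy?": "Refunds are available within 30 days."
}

def lookup_exact(query: str):
    return exact_cache.get(query)

hit = lookup_exact("What is the refund policy?")        # exact string: hit
miss = lookup_exact("Can you explain how refunds work?") # paraphrase: miss
```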
Semantic caching solves this problem by using vector similarity.
Semantic Cache Architecture
Instead of storing responses by string keys, semantic caches store:
- Query embeddings
- LLM responses
When a new query arrives, the system checks whether a similar query already exists.
Architecture Overview
User Query
↓
Embedding Generation
↓
Semantic Cache Lookup
↓
Cache Hit → Return Cached Response
↓
Cache Miss → Run RAG / LLM
↓
Store Response in Cache
This reduces unnecessary LLM calls when similar questions appear.
Where Semantic Caching Fits in the LLM Pipeline
Semantic caching typically sits before the RAG or LLM generation step.
Example Production Architecture
User Query
↓
API Layer (FastAPI)
↓
Semantic Cache
↓
Retriever (Vector Search)
↓
Prompt Construction
↓
LLM Generation
If the cache returns a result, the system skips the expensive generation step entirely.
Step 1: Generating Query Embeddings
To implement semantic caching we first need to convert queries into vector embeddings.
Example using OpenAI embeddings:
```python
from openai import OpenAI

client = OpenAI()

def create_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```
Each query is converted into a high-dimensional vector representation.
These vectors allow us to compute semantic similarity between queries.
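The usual similarity measure is cosine similarity. A minimal pure-Python sketch of the computation (production systems typically use NumPy or the database's vector operators instead):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
```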
Step 2: Designing the Cache Storage
A semantic cache typically stores:
- query_embedding
- query_text
- llm_response
- timestamp
Example schema using PostgreSQL with pgvector:
```sql
CREATE TABLE semantic_cache (
    id SERIAL PRIMARY KEY,
    query TEXT,
    embedding VECTOR(1536),
    response TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```
This allows us to perform vector similarity search directly inside the database.
Step 3: Implementing Semantic Cache Lookup
When a query arrives:
- Generate embedding
- Search for similar queries
- Return cached result if similarity is high enough
Example implementation:
```python
def search_cache(query_embedding, db, threshold=0.9):
    # `<=>` is pgvector's cosine-distance operator, so 1 - distance
    # gives cosine similarity. Assumes the pgvector adapter is registered
    # and the cursor returns dict-style rows (e.g. psycopg2's RealDictCursor).
    result = db.execute("""
        SELECT query, response,
               1 - (embedding <=> %s) AS similarity
        FROM semantic_cache
        ORDER BY embedding <=> %s
        LIMIT 1
    """, (query_embedding, query_embedding))
    row = result.fetchone()
    if row and row["similarity"] > threshold:
        return row["response"]
    return None
```
If similarity exceeds the threshold, the cached response is reused.
Step 4: Running the LLM on Cache Miss
If no similar query exists in the cache, the system proceeds with the normal LLM pipeline.
Example:
```python
def generate_answer(prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
```
After generating the response, we store it in the cache.
Step 5: Storing Responses in the Semantic Cache
```python
def store_cache(query, embedding, response, db):
    db.execute("""
        INSERT INTO semantic_cache
        (query, embedding, response)
        VALUES (%s, %s, %s)
    """, (query, embedding, response))
    db.commit()
```
Over time, the cache accumulates answers to frequently asked questions.
Full Request Flow
Putting everything together:
```python
def handle_query(question):
    embedding = create_embedding(question)
    cached = search_cache(embedding, db)
    if cached:
        return cached
    response = generate_answer(question)
    store_cache(question, embedding, response, db)
    return response
```
This simple pipeline can significantly reduce repeated LLM calls.
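The same flow can be exercised without any external services. The sketch below swaps in a toy word-overlap "embedder" and a stubbed LLM (both invented for illustration) so the hit/miss behavior is visible end to end:

```python
CACHE = []       # list of (embedding, response) pairs
THRESHOLD = 0.5  # deliberately loose for the toy similarity measure

def toy_embed(text: str) -> set:
    # Stand-in for a real embedding model: a bag of lowercase words.
    return set(text.lower().split())

def toy_similarity(a: set, b: set) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def toy_llm(question: str) -> str:
    return f"Answer to: {question}"

def handle_query_toy(question: str):
    embedding = toy_embed(question)
    for cached_emb, cached_resp in CACHE:
        if toy_similarity(embedding, cached_emb) >= THRESHOLD:
            return cached_resp, True   # cache hit
    response = toy_llm(question)
    CACHE.append((embedding, response))
    return response, False             # cache miss

first, hit1 = handle_query_toy("what is the refund policy")
second, hit2 = handle_query_toy("what is the refund policy please")
```

The second query is not an exact match, yet it reuses the first answer because the similarity clears the threshold.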
Latency Improvements
Without caching:
Query
↓
Embedding (50 ms)
↓
Vector search (30 ms)
↓
LLM generation (1000–3000 ms)
With semantic caching:
Query
↓
Embedding (50 ms)
↓
Cache lookup (10 ms)
↓
Return response
In many systems, semantic caching reduces latency by 10–50x for repeated queries.
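Using the illustrative timings above (with 2,000 ms as a mid-range generation time), the speedup for a cache hit works out to roughly:

```python
uncached_ms = 50 + 30 + 2000  # embedding + vector search + LLM generation
cached_ms = 50 + 10           # embedding + cache lookup
speedup = uncached_ms / cached_ms
# With these numbers, a cache hit is roughly 35x faster.
```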
Cost Optimization
LLM usage is often the most expensive part of AI systems.
Example:
1M requests per month
$0.002 per request
= $2000/month
If semantic caching handles 40% of queries, the cost drops to:
600k LLM requests
≈ $1200/month
Savings scale linearly with traffic.
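The arithmetic generalizes to a small helper (the request volume, per-request cost, and hit rate below are the article's illustrative numbers):

```python
def monthly_llm_cost(requests: int, cost_per_request: float,
                     cache_hit_rate: float) -> float:
    # Only cache misses reach the LLM, so only they are billed.
    misses = requests * (1 - cache_hit_rate)
    return misses * cost_per_request

baseline = monthly_llm_cost(1_000_000, 0.002, 0.0)    # no caching
with_cache = monthly_llm_cost(1_000_000, 0.002, 0.4)  # 40% hit rate
```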
Engineering Considerations
Cache Similarity Threshold
Setting the similarity threshold too low can produce incorrect responses (false cache hits).
Typical values:
0.85 – 0.95
Higher values improve accuracy but reduce cache hits.
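The trade-off is visible on a handful of hypothetical best-match scores: raising the threshold discards borderline matches along with the risky ones.

```python
def count_hits(similarities, threshold: float) -> int:
    # Each score is the best-match similarity for one incoming query.
    return sum(1 for s in similarities if s > threshold)

scores = [0.99, 0.92, 0.88, 0.86, 0.70]  # hypothetical best-match scores
hits_loose = count_hits(scores, 0.85)    # more hits, but 0.86 may be wrong
hits_strict = count_hits(scores, 0.95)   # fewer hits, higher precision
```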
Cache Expiration
Knowledge may change over time. Cache entries should expire periodically.
Example strategies:
- Time-based expiration
- Manual invalidation
- Versioned cache
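Time-based expiration can be as simple as filtering on entry age. A minimal sketch, assuming each cache entry carries the `created_at` timestamp from the schema above:

```python
from datetime import datetime, timedelta

def is_expired(created_at: datetime, ttl: timedelta, now: datetime) -> bool:
    # An entry is stale once its age exceeds the TTL.
    return now - created_at > ttl

now = datetime(2024, 1, 10)
fresh = is_expired(datetime(2024, 1, 9), timedelta(days=7), now)  # 1 day old
stale = is_expired(datetime(2024, 1, 1), timedelta(days=7), now)  # 9 days old
```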
Avoiding Stale Knowledge
When the knowledge base changes, cached responses may become outdated.
Solutions include:
- Cache invalidation on document updates
- Storing document version metadata
- TTL-based cache expiration
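Version metadata makes invalidation cheap: tag each cache entry with the knowledge-base version that produced it, and treat entries from older versions as misses. A minimal sketch (the field names are illustrative):

```python
KB_VERSION = 2  # bumped whenever the source documents change

def is_valid(entry: dict, current_version: int) -> bool:
    # Entries written against an older knowledge base are ignored.
    return entry["kb_version"] == current_version

old_entry = {"response": "...", "kb_version": 1}
new_entry = {"response": "...", "kb_version": 2}
```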
Semantic Caching in RAG Systems
Semantic caching works especially well in Retrieval-Augmented Generation systems.
Example architecture:
User Query
↓
Semantic Cache
↓
RAG Retriever
↓
Context Injection
↓
LLM
In many production RAG systems:
- 30–60% of queries are cacheable
- Large enterprise assistants see massive latency improvements
When building RAG systems in Python, adding semantic caching can significantly reduce latency and costs.
Engineering Insight
A common mistake is implementing semantic caching after the LLM call.
Instead, caching should occur before expensive operations, including:
- Document retrieval
- Reranking
- LLM generation
Correct architecture:
Query
↓
Semantic Cache
↓
Retriever
↓
Reranker
↓
LLM
This ensures that the system avoids unnecessary computation as early as possible.
Proper monitoring of LLM systems helps track cache hit rates and optimize performance.
Conclusion
Semantic caching is one of the most effective techniques for optimizing LLM systems.
By storing responses based on semantic similarity, AI systems can:
- Dramatically reduce LLM calls
- Improve response latency
- Lower infrastructure costs
In production environments, semantic caching often becomes a core component of AI system architecture.
When combined with RAG pipelines, vector databases, and efficient backend APIs, it enables scalable and cost-efficient AI applications. Understanding proper LLM API design is crucial for integrating caching effectively.
As LLM-powered systems continue to grow in scale, semantic caching will remain a critical optimization layer for production AI infrastructure.