Set up semantic caching for LLM API responses. By the end you will have a cache that matches semantically similar queries to avoid redundant API calls.
Before you start, you will need:

- Neumann installed (see Installation)
- A running Neumann shell
Start the shell with a persistent cache directory:

```
neumann --wal-dir ./cache-data
```

Cache an LLM response with its exact prompt:
```
CACHE PUT 'What is machine learning?' 'Machine learning is a subset of artificial intelligence that enables systems to learn from data...'
```

Retrieve it:
```
CACHE GET 'What is machine learning?'
```

You should see the cached response.
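The exact-match behavior of CACHE PUT/GET can be sketched as a plain key-value lookup: the full prompt string is the key, so any change in wording is a miss. A minimal illustration of the pattern (not Neumann's internals):

```python
# Exact-match caching sketch: the prompt string itself is the key,
# so only a byte-identical prompt produces a hit.
cache = {}

def put(prompt, response):
    cache[prompt] = response

def get(prompt):
    return cache.get(prompt)  # None on a miss

put('What is machine learning?',
    'Machine learning is a subset of artificial intelligence...')
```

Even a trivial rewording of the prompt misses under exact matching, which is the gap semantic caching closes next.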
Store a response with an embedding vector for semantic matching:
```
CACHE SEMANTIC PUT 'What is machine learning?' 'Machine learning is a subset of artificial intelligence...' EMBEDDING [0.9, 0.1, 0.2, 0.05, 0.8, 0.15, 0.3, 0.02]
```

Try a semantically similar but differently worded query:
```
CACHE SEMANTIC GET 'Explain ML to me' THRESHOLD 0.8
```

If the embedding of "Explain ML to me" is similar enough to the stored embedding (above the 0.8 threshold), you get a cache hit and the stored response is returned.
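A hit at THRESHOLD 0.8 means the similarity between the query's embedding and a stored embedding is at least 0.8. A sketch using cosine similarity, the stored vector from above, and a hypothetical embedding for "Explain ML to me" (the real vector would come from your embedding model):

```python
import math

def cosine(a, b):
    # cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

stored = [0.9, 0.1, 0.2, 0.05, 0.8, 0.15, 0.3, 0.02]    # 'What is machine learning?'
query = [0.85, 0.15, 0.25, 0.1, 0.75, 0.2, 0.25, 0.05]  # hypothetical 'Explain ML to me'

print(cosine(stored, query))  # well above the 0.8 threshold, so a cache hit
```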
Build up a cache with several topics:
```
CACHE SEMANTIC PUT 'How do neural networks work?' 'Neural networks consist of layers of interconnected nodes...' EMBEDDING [0.85, 0.2, 0.3, 0.1, 0.7, 0.25, 0.15, 0.05]
CACHE SEMANTIC PUT 'What is Docker?' 'Docker is a platform for containerizing applications...' EMBEDDING [0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.1, 0.3]
CACHE SEMANTIC PUT 'Explain REST APIs' 'REST APIs use HTTP methods to perform CRUD operations...' EMBEDDING [0.2, 0.6, 0.1, 0.8, 0.15, 0.5, 0.3, 0.2]
```

List everything in the cache:

```
CACHE LIST
```

You should see all cached entries.
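The embeddings also keep topics apart: unrelated prompts should score far below the threshold, so a Docker question never returns the machine-learning answer. Checking with the vectors stored above (cosine similarity as the assumed metric):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

ml = [0.9, 0.1, 0.2, 0.05, 0.8, 0.15, 0.3, 0.02]    # 'What is machine learning?'
docker = [0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.1, 0.3]  # 'What is Docker?'

print(round(cosine(ml, docker), 2))  # roughly 0.31, far below the 0.8 threshold
```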
Remove a specific entry:
```
CACHE DELETE 'What is Docker?'
```

Verify it was removed:
```
CACHE GET 'What is Docker?'
```

The query should now miss. Track usage metadata in a relational table alongside the cache for richer context:
```
CREATE TABLE llm_usage (
    id INT PRIMARY KEY,
    prompt TEXT,
    model TEXT,
    tokens_used INT,
    cached INT
);

INSERT INTO llm_usage VALUES (1, 'What is machine learning?', 'gpt-4', 150, 0);
INSERT INTO llm_usage VALUES (2, 'Explain ML to me', 'gpt-4', 0, 1);
```

Track which queries hit the cache:
```
SELECT * FROM llm_usage WHERE cached = 1;
```

You should have:
- Exact cache entries stored and retrieved (CACHE PUT/GET)
- Semantic cache entries with embeddings
- Semantic similarity matching returning cached responses
- Cache deletion working
- Usage tracking in a relational table
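Putting the pieces together, an application-side loop over a semantic cache looks roughly like this. Everything here is a self-contained sketch, not Neumann's client API: `call_llm` is a stand-in for a real API call, cosine similarity is the assumed metric, and the embeddings are hand-written rather than model-generated.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """In-memory stand-in for CACHE SEMANTIC PUT/GET with a similarity threshold."""
    def __init__(self, threshold=0.8):
        self.entries = []  # list of (embedding, response) pairs
        self.threshold = threshold

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        # Return the best-matching response if it clears the threshold.
        scored = [(cosine(e, embedding), r) for e, r in self.entries]
        if scored:
            score, response = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return response
        return None

def call_llm(prompt):
    return f"LLM answer to: {prompt}"  # stand-in for a real API call

cache = SemanticCache(threshold=0.8)

def answer(prompt, embedding):
    hit = cache.get(embedding)
    if hit is not None:
        return hit, True           # served from cache, no API call
    response = call_llm(prompt)
    cache.put(embedding, response)
    return response, False
```

The first call pays for the API request and populates the cache; a semantically similar follow-up is served from the cache without another call.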
From here, you can continue with:

- Configure Semantic Cache -- tune cache parameters
- Vector Search with Filtering -- more embedding search patterns
- Use Cases -- more application patterns