feat: add disk-based embedding cache #331
Open
BeamNawapat wants to merge 3 commits into zilliztech:master from
Conversation
Cache embedding vectors to ~/.context/embedding-cache/ keyed by SHA256(content) per model. On re-index, only uncached chunks hit the API — cached chunks load from disk instantly. Logs cache hit rate per batch. Disable with EMBEDDING_CACHE=false.
Delete cached embeddings not modified in 30 days (configurable via EMBEDDING_CACHE_MAX_AGE_DAYS). Runs async on startup, non-blocking. Removes empty prefix directories after cleanup.
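The keying scheme described above (content hash, namespaced per model, with a prefix directory) can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code; the function name `cachePathFor` is hypothetical.

```typescript
import * as crypto from 'crypto';
import * as path from 'path';
import * as os from 'os';

// Sketch of the content-addressed cache layout described above:
// ~/.context/embedding-cache/{provider}_{dimension}/XX/{sha256}.json
// where XX is the first two hex chars of the hash (avoids one huge flat dir).
function cachePathFor(content: string, provider: string, dimension: number): string {
  const hash = crypto.createHash('sha256').update(content, 'utf-8').digest('hex');
  const modelDir = `${provider}_${dimension}`;
  return path.join(
    os.homedir(), '.context', 'embedding-cache',
    modelDir, hash.slice(0, 2), `${hash}.json`
  );
}
```

Because the key is derived only from the chunk content and the model namespace, byte-identical chunks map to the same file across runs, which is what makes re-indexing hit the cache.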
Pull request overview
Adds a transparent, disk-based embedding cache to reduce redundant embedding API calls during re-indexing by persisting embeddings keyed by content hash and routing batch embedding through the cache.
Changes:
- Introduces `EmbeddingCache` for disk-backed `get`/`set`/`getBatch`/`cleanup` of embeddings.
- Exports the cache from the embedding module and wires it into `Context` to cache `embedBatch()` results.
- Adds startup cache initialization plus background cleanup, and logs cache hit rates.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/core/src/embedding/index.ts | Exports the new EmbeddingCache entrypoint. |
| packages/core/src/embedding/embedding-cache.ts | Implements the disk-backed embedding cache and TTL cleanup. |
| packages/core/src/context.ts | Instantiates the cache and routes chunk batch embedding through it. |
```typescript
}

// Initialize embedding cache
const cacheModel = `${this.embedding.getProvider()}_${this.embedding.getDimension()}`;
```
Comment on lines +597 to +602
```typescript
const uncachedTexts = uncachedIndices.map(i => contents[i]);
const newEmbeddings = await this.embedding.embedBatch(uncachedTexts);

for (let j = 0; j < uncachedIndices.length; j++) {
  results[uncachedIndices[j]] = newEmbeddings[j];
  this.embeddingCache.set(contents[uncachedIndices[j]], newEmbeddings[j]);
}
```
```typescript
private getCachePath(contentHash: string): string {
  const prefix = contentHash.slice(0, 2);
  return path.join(this.cacheDir, prefix, contentHash.slice(0, 12) + '.json');
}
```
```typescript
async cleanup(maxAgeDays?: number): Promise<void> {
  if (!this.enabled) return;

  const days = maxAgeDays ?? parseInt(envManager.get('EMBEDDING_CACHE_MAX_AGE_DAYS') || '30', 10);
```
Comment on lines +104 to +121
```typescript
const prefixDirs = fs.readdirSync(this.cacheDir);
for (const prefix of prefixDirs) {
  const prefixPath = path.join(this.cacheDir, prefix);
  if (!fs.statSync(prefixPath).isDirectory()) continue;

  const files = fs.readdirSync(prefixPath);
  for (const file of files) {
    const filePath = path.join(prefixPath, file);
    const stat = fs.statSync(filePath);
    if (stat.mtimeMs < cutoff) {
      fs.unlinkSync(filePath);
      deleted++;
    }
  }

  // Remove empty prefix dirs
  if (fs.readdirSync(prefixPath).length === 0) {
    fs.rmdirSync(prefixPath);
```
Comment on lines +51 to +52
```typescript
const data = JSON.parse(fs.readFileSync(cachePath, 'utf-8'));
return { vector: data.v, dimension: data.d };
```
Contributor (Author)

Addressed Copilot review feedback in ea99ede:
- Use full SHA256 (64 chars) in the filename, not a 12-char prefix (collision risk)
- Validate JSON shape in `get()` (`Array.isArray`, dimension match) and pass `expectedDimension` to the constructor for stricter cross-model isolation
- `cleanup()` now uses `fs.promises` (truly async, no event-loop block)
- `cleanup()` guards against `maxAgeDays <= 0` / non-finite values (prevents purge-everything)
- `updateEmbedding()` now reinitializes the cache so model switches don't serve stale vectors from the previous model
- `cachedEmbedBatch()` dedupes duplicate strings within a single batch so identical chunks don't each hit the API

All checks still green.
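One of the fixes listed above, de-duplicating identical strings within a single batch, can be sketched like this. It is a reconstruction under stated assumptions, not the PR's exact code; `embed` stands in for the real provider call and `embedBatchDeduped` is a hypothetical name.

```typescript
// Embed each unique text at most once, then fan results back out to the
// original positions so the caller sees one vector per input string.
async function embedBatchDeduped(
  texts: string[],
  embed: (unique: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const firstIndex = new Map<string, number>(); // text -> index into `unique`
  const unique: string[] = [];
  for (const t of texts) {
    if (!firstIndex.has(t)) {
      firstIndex.set(t, unique.length);
      unique.push(t);
    }
  }
  const vectors = await embed(unique); // single provider call for unique texts
  return texts.map(t => vectors[firstIndex.get(t)!]);
}
```

With this shape, a batch of N chunks containing only K distinct strings costs K embeddings instead of N, before the disk cache is even consulted.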
Summary
Add a transparent, disk-based embedding cache to skip redundant embedding API calls when re-indexing the same code. On a re-index, only chunks whose content has not been embedded before hit the API; cached chunks load from disk in milliseconds.
Motivation
Embedding API calls are the slowest and most expensive step in indexing. When a codebase is re-indexed (after `force: true`, switching machines, or recovering from a failed run), every chunk is sent through the embedding provider again, even if its content is byte-identical to a previous run. In practice this wastes:
A small content-addressed cache keyed by `SHA256(content)` per `(provider, dimension)` eliminates this waste entirely for unchanged chunks.

Changes
- `packages/core/src/embedding/embedding-cache.ts` (new, ~130 LOC): `EmbeddingCache` class with `get`, `set`, `getBatch`, `cleanup` methods. Storage: `~/.context/embedding-cache/{provider}_{dimension}/XX/{sha256}.json` (hierarchical to avoid single-dir overflow). No external dependencies; uses Node `fs`/`crypto`/`path`/`os`.
- `packages/core/src/embedding/index.ts`: exports `EmbeddingCache`.
- `packages/core/src/context.ts`: initializes `EmbeddingCache` in the constructor, keyed by `${provider}_${dimension}`. New private `cachedEmbedBatch()` wraps `embedding.embedBatch()`: it returns cached vectors instantly and only sends uncached chunks to the API. The indexing path now calls `cachedEmbedBatch()` instead of `embedding.embedBatch()` directly. Async TTL cleanup runs once on startup (non-blocking).

Behavior:
- Caches are isolated per model (e.g. `voyage-code-3` vs `text-embedding-3-small`).
- Hit rate is logged per batch: `[Cache] 75% hit (3/4 cached, 1 embedded)`.
- Entries older than `EMBEDDING_CACHE_MAX_AGE_DAYS` (default 30) are cleaned up.

Configuration
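The per-batch hit-rate log line can be sketched as below. The format string is taken from the example in the PR text; the function name `formatHitRate` is hypothetical.

```typescript
// Build the "[Cache] 75% hit (3/4 cached, 1 embedded)" style log line
// from the counts of cached and total chunks in one batch.
function formatHitRate(cached: number, total: number): string {
  const pct = Math.round((cached / total) * 100);
  return `[Cache] ${pct}% hit (${cached}/${total} cached, ${total - cached} embedded)`;
}
```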
| Variable | Default | Notes |
|---|---|---|
| `EMBEDDING_CACHE` | `true` | Set to `false` to opt out completely. |
| `EMBEDDING_CACHE_DIR` | `~/.context/embedding-cache` | |
| `EMBEDDING_CACHE_MAX_AGE_DAYS` | `30` | Set to `0` to disable cleanup. |

Usage
Zero-config — cache is on by default. Re-indexing the same content shows the hit rate:
Disable temporarily:
Move cache to a shared location:
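The original command snippets were lost in extraction; the shell sketch below only shows the environment variables documented in this PR. The indexing command you would run afterwards is project-specific and omitted.

```shell
# Illustrative only: the env vars are the ones documented above;
# run your usual indexing command after setting them.
export EMBEDDING_CACHE=false                                # disable temporarily
export EMBEDDING_CACHE_DIR="$HOME/shared/embedding-cache"   # shared cache location
echo "EMBEDDING_CACHE=$EMBEDDING_CACHE EMBEDDING_CACHE_DIR=$EMBEDDING_CACHE_DIR"
```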
Test plan
- `pnpm build` passes (core + mcp)
- Switching `EMBEDDING_PROVIDER` → new cache directory created, old one untouched
- `EMBEDDING_CACHE=false` → no cache directory created, no `[Cache]` logs
- Deleting `~/.context/embedding-cache/` mid-run → next batch falls back to the API gracefully

Notes for reviewers
- Cache files store `{"v": [vector...], "d": dimension}`: small (~6KB per 1024-dim float vector), but consider compression in a follow-up if storage becomes a concern.
- A `clear_cache` MCP tool could be added in a follow-up.
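Given that on-disk format and the review fix about validating JSON shape, a defensive cache read might look like the sketch below. This is an assumption-laden reconstruction, not the PR's code; `readCachedVector` is a hypothetical name.

```typescript
import * as fs from 'fs';

// Read a cache file of shape {"v": [numbers...], "d": dimension} and
// validate it before trusting it; any problem is treated as a cache miss.
function readCachedVector(cachePath: string, expectedDimension: number): number[] | null {
  try {
    const data = JSON.parse(fs.readFileSync(cachePath, 'utf-8'));
    if (!Array.isArray(data.v) || typeof data.d !== 'number') return null;
    // Reject vectors written by a model with a different dimension.
    if (data.d !== expectedDimension || data.v.length !== expectedDimension) return null;
    return data.v;
  } catch {
    return null; // corrupt or missing file: fall back to the API
  }
}
```

Returning `null` rather than throwing keeps a corrupted cache file from failing an indexing run, matching the "falls back to API gracefully" item in the test plan.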