feat: add disk-based embedding cache#331

Open
BeamNawapat wants to merge 3 commits into zilliztech:master from BeamNawapat:pr/embedding-cache

Conversation

@BeamNawapat
Contributor

Summary

Add a transparent, disk-based embedding cache to skip redundant embedding API calls when re-indexing the same code. On a re-index, only chunks whose content has not been embedded before hit the API; cached chunks load from disk in milliseconds.

Motivation

Embedding API calls are the slowest and most expensive step in indexing. When a codebase is re-indexed (after force: true, switching machines, or recovering from a failed run), every chunk is sent through the embedding provider again — even if its content is byte-identical to a previous run.

In practice this wastes:

  • API quota / cost — large monorepos can re-burn $1–$5 per re-index
  • Latency — re-indexing a 10k-file repo against VoyageAI takes minutes purely for embeddings
  • Provider rate limits — easy to hit on Gemini / OpenAI free tiers

A small content-addressed cache keyed by SHA256(content) per (provider, dimension) eliminates this waste entirely for unchanged chunks.
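As a sketch of the scheme just described (illustrative names, not the exact code in this PR), the content-addressed key and on-disk path could be derived like this:

```typescript
import * as crypto from "crypto";
import * as os from "os";
import * as path from "path";

// Illustrative sketch of the content-addressed key scheme: one file per
// SHA256(content), namespaced by {provider}_{dimension}. The function name
// is hypothetical; the PR's actual implementation lives in embedding-cache.ts.
function cachePathFor(provider: string, dimension: number, content: string): string {
    const hash = crypto.createHash("sha256").update(content).digest("hex");
    const root = path.join(os.homedir(), ".context", "embedding-cache", `${provider}_${dimension}`);
    // Two-character prefix subdirectory keeps any single directory small.
    return path.join(root, hash.slice(0, 2), `${hash}.json`);
}
```

Identical chunk content always maps to the same file, so a re-index resolves unchanged chunks from disk without touching the provider.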

Changes

  • packages/core/src/embedding/embedding-cache.ts (new, ~130 LOC): EmbeddingCache class with get, set, getBatch, and cleanup methods. Storage: ~/.context/embedding-cache/{provider}_{dimension}/XX/{sha256}.json (hierarchical layout avoids single-directory overflow). No external dependencies — uses Node fs / crypto / path / os.
  • packages/core/src/embedding/index.ts — Export EmbeddingCache.
  • packages/core/src/context.ts — Initialize EmbeddingCache in constructor keyed by ${provider}_${dimension}. New private cachedEmbedBatch() wraps embedding.embedBatch(): returns cached vectors instantly, only sends uncached chunks to the API. Indexing path now calls cachedEmbedBatch() instead of embedding.embedBatch() directly. Async TTL cleanup runs once on startup (non-blocking).
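The wrapper's control flow can be sketched roughly as follows (simplified: a Map stands in for the disk cache and a callback for the provider's embedBatch(); the real method in context.ts differs in details):

```typescript
type Vector = number[];

// Sketch of the cachedEmbedBatch() flow described above: answer what we can
// from cache, send only the misses to the API, then write the fresh vectors
// back so the next run hits. `cache` and `embedBatch` are stand-ins.
async function cachedEmbedBatch(
    contents: string[],
    cache: Map<string, Vector>,                        // stand-in for EmbeddingCache
    embedBatch: (texts: string[]) => Promise<Vector[]> // stand-in for the API call
): Promise<Vector[]> {
    const results: Vector[] = new Array(contents.length);
    const uncachedIndices: number[] = [];

    contents.forEach((text, i) => {
        const hit = cache.get(text);
        if (hit !== undefined) results[i] = hit;
        else uncachedIndices.push(i);
    });

    if (uncachedIndices.length > 0) {
        const fresh = await embedBatch(uncachedIndices.map(i => contents[i]));
        uncachedIndices.forEach((origIdx, j) => {
            results[origIdx] = fresh[j];
            cache.set(contents[origIdx], fresh[j]); // write-through for next run
        });
    }
    return results;
}
```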

Behavior:

  • Per-model isolation prevents cross-contamination when switching providers (e.g., voyage-code-3 vs text-embedding-3-small).
  • Best-effort design: any cache I/O error falls back to a normal API call. Corrupted JSON, missing files, and permission errors all degrade gracefully.
  • Hit rate logged per batch: [Cache] 75% hit (3/4 cached, 1 embedded).
  • Stale entries auto-removed on startup based on EMBEDDING_CACHE_MAX_AGE_DAYS (default 30).
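The best-effort read path amounts to something like this (a sketch assuming each cache file is JSON of the form {"v": [...], "d": n}, the format noted later for reviewers; safeGet is a hypothetical name):

```typescript
import * as fs from "fs";

// Sketch of a best-effort cache read: any I/O or parse failure (missing file,
// corrupted JSON, permission error) returns null, so the caller falls back to
// a normal API call instead of crashing the indexing run.
function safeGet(cachePath: string, expectedDim: number): number[] | null {
    try {
        const data = JSON.parse(fs.readFileSync(cachePath, "utf-8"));
        // Shape check: reject anything that isn't a vector of the expected dimension.
        if (!Array.isArray(data.v) || data.d !== expectedDim) return null;
        return data.v;
    } catch {
        return null; // degrade gracefully: treat as a cache miss
    }
}
```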

Configuration

| Env var | Default | Purpose |
| --- | --- | --- |
| EMBEDDING_CACHE | true | Enable/disable. Set to false to opt out completely. |
| EMBEDDING_CACHE_DIR | ~/.context/embedding-cache | Storage location. |
| EMBEDDING_CACHE_MAX_AGE_DAYS | 30 | TTL for cleanup-on-startup. Set 0 to disable cleanup. |
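A sketch of how these variables might be parsed (defaults as documented above; the actual parsing lives inside the cache constructor and cleanup path, and the function name here is hypothetical):

```typescript
import * as os from "os";
import * as path from "path";

interface CacheConfig {
    enabled: boolean;
    dir: string;
    maxAgeDays: number;
}

// Illustrative parsing of the three env vars, with the documented defaults.
function cacheConfig(env: Record<string, string | undefined> = process.env): CacheConfig {
    return {
        // Anything other than the literal string "false" leaves the cache on.
        enabled: (env.EMBEDDING_CACHE ?? "true").toLowerCase() !== "false",
        dir: env.EMBEDDING_CACHE_DIR ?? path.join(os.homedir(), ".context", "embedding-cache"),
        // 0 disables cleanup-on-startup entirely.
        maxAgeDays: parseInt(env.EMBEDDING_CACHE_MAX_AGE_DAYS ?? "30", 10),
    };
}
```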

Usage

Zero-config — cache is on by default. Re-indexing the same content shows the hit rate:

[Context] 💾 Embedding cache enabled for model: VoyageAI_1024
[Cache] ✅ All 47 embeddings from cache
[Cache] 88% hit (44/50 cached, 6 embedded)

Disable temporarily:

EMBEDDING_CACHE=false npx @zilliz/claude-context-mcp@latest

Move cache to a shared location:

EMBEDDING_CACHE_DIR=/mnt/team-cache/embeddings npx @zilliz/claude-context-mcp@latest

Test plan

  • pnpm build passes (core + mcp)
  • Index a small repo, then re-index → cache hit rate should be ~100%
  • Edit one file, re-index → only changed chunks re-embedded
  • Switch EMBEDDING_PROVIDER → new cache directory created, old one untouched
  • EMBEDDING_CACHE=false → no cache directory created, no [Cache] logs
  • Delete ~/.context/embedding-cache/ mid-run → next batch falls back to API gracefully

Notes for reviewers

  • Cache files are JSON {"v": [vector...], "d": dimension} — small (~6KB per 1024-dim float vector) but consider compression in a follow-up if storage becomes a concern.
  • No write locking; concurrent indexers writing the same key would race, but the result is functionally identical so this is intentionally not guarded.
  • Old cache entries from previous models are not deleted on provider switch (only TTL cleanup applies). Trade-off: simpler logic vs slightly larger disk usage. A clear_cache MCP tool could be added in a follow-up.
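A quick back-of-the-envelope check of the per-file size noted in the first bullet, using dummy data rounded to 4 decimals (real embedding values serialize at similar lengths):

```typescript
// Rough size check for the cache-file format {"v": [...], "d": n} with a
// dummy 1024-dim vector. Exact size depends on how many digits each float
// serializes to, but it lands in the single-digit-KB range.
const vector = Array.from({ length: 1024 }, (_, i) => Number(Math.sin(i).toFixed(4)));
const payload = JSON.stringify({ v: vector, d: 1024 });
const sizeKB = payload.length / 1024;
```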

Cache embedding vectors to ~/.context/embedding-cache/ keyed by
SHA256(content) per model. On re-index, only uncached chunks hit
the API — cached chunks load from disk instantly.

Logs cache hit rate per batch. Disable with EMBEDDING_CACHE=false.
Delete cached embeddings not modified in 30 days (configurable via
EMBEDDING_CACHE_MAX_AGE_DAYS). Runs async on startup, non-blocking.
Removes empty prefix directories after cleanup.
Copilot AI review requested due to automatic review settings April 25, 2026 09:14

Copilot AI left a comment


Pull request overview

Adds a transparent, disk-based embedding cache to reduce redundant embedding API calls during re-indexing by persisting embeddings keyed by content hash and routing batch embedding through the cache.

Changes:

  • Introduces EmbeddingCache for disk-backed get/set/getBatch/cleanup of embeddings.
  • Exports the cache from the embedding module and wires it into Context to cache embedBatch() results.
  • Adds startup cache initialization + background cleanup and logs cache hit rates.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| packages/core/src/embedding/index.ts | Exports the new EmbeddingCache entrypoint. |
| packages/core/src/embedding/embedding-cache.ts | Implements the disk-backed embedding cache and TTL cleanup. |
| packages/core/src/context.ts | Instantiates the cache and routes chunk batch embedding through it. |


Comment thread packages/core/src/context.ts Outdated
}

// Initialize embedding cache
const cacheModel = `${this.embedding.getProvider()}_${this.embedding.getDimension()}`;
Comment thread packages/core/src/context.ts Outdated
Comment on lines +597 to +602
const uncachedTexts = uncachedIndices.map(i => contents[i]);
const newEmbeddings = await this.embedding.embedBatch(uncachedTexts);

for (let j = 0; j < uncachedIndices.length; j++) {
    results[uncachedIndices[j]] = newEmbeddings[j];
    this.embeddingCache.set(contents[uncachedIndices[j]], newEmbeddings[j]);
}

private getCachePath(contentHash: string): string {
    const prefix = contentHash.slice(0, 2);
    return path.join(this.cacheDir, prefix, contentHash.slice(0, 12) + '.json');
}

async cleanup(maxAgeDays?: number): Promise<void> {
    if (!this.enabled) return;

    const days = maxAgeDays ?? parseInt(envManager.get('EMBEDDING_CACHE_MAX_AGE_DAYS') || '30', 10);
Comment on lines +104 to +121
const prefixDirs = fs.readdirSync(this.cacheDir);
for (const prefix of prefixDirs) {
    const prefixPath = path.join(this.cacheDir, prefix);
    if (!fs.statSync(prefixPath).isDirectory()) continue;

    const files = fs.readdirSync(prefixPath);
    for (const file of files) {
        const filePath = path.join(prefixPath, file);
        const stat = fs.statSync(filePath);
        if (stat.mtimeMs < cutoff) {
            fs.unlinkSync(filePath);
            deleted++;
        }
    }

    // Remove empty prefix dirs
    if (fs.readdirSync(prefixPath).length === 0) {
        fs.rmdirSync(prefixPath);
    }
}
Comment on lines +51 to +52
const data = JSON.parse(fs.readFileSync(cachePath, 'utf-8'));
return { vector: data.v, dimension: data.d };
- Use full SHA256 (64 chars) in filename, not 12-char prefix (collision risk)
- Validate JSON shape in get() (Array.isArray, dimension match) and pass
  expectedDimension to constructor for stricter cross-model isolation
- cleanup() now uses fs.promises (truly async, no event-loop block)
- cleanup() guards maxAgeDays <= 0 / non-finite (prevents purge-everything)
- updateEmbedding() now reinitializes the cache so model switches don't
  serve stale vectors from the previous model
- cachedEmbedBatch() dedupes duplicate strings within a single batch so
  identical chunks don't each hit the API
@BeamNawapat
Contributor Author

Addressed Copilot review feedback in ea99ede:

  • Full SHA256 in filename (no truncation/collision risk)
  • get() validates JSON shape (Array.isArray, dimension match)
  • cleanup() rewritten with fs.promises (truly async, no event-loop block) and guards maxAgeDays <= 0
  • updateEmbedding() reinitializes the cache so model switches can't serve stale vectors
  • cachedEmbedBatch() dedupes duplicate strings within a single batch so identical chunks share one API call

All checks still green.
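For reference, the in-batch dedupe described in the last bullet could look roughly like this (an illustrative sketch with hypothetical names, not the actual ea99ede diff):

```typescript
// Sketch: collapse duplicate strings in a batch so each unique text is
// embedded once, then fan the vectors back out to the original positions.
async function embedDeduped(
    contents: string[],
    embedBatch: (texts: string[]) => Promise<number[][]>
): Promise<number[][]> {
    const firstIndex = new Map<string, number>(); // text -> index in `unique`
    const unique: string[] = [];
    for (const text of contents) {
        if (!firstIndex.has(text)) {
            firstIndex.set(text, unique.length);
            unique.push(text);
        }
    }
    const vectors = await embedBatch(unique);
    return contents.map(text => vectors[firstIndex.get(text)!]);
}
```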

