feat(code-index): experimental DuckDB + tree-sitter code-index plugin#58

Merged
stephane-segning merged 6 commits into main from claude/priceless-shtern-cf5900
Jun 17, 2026

@stephane-segning stephane-segning commented Jun 16, 2026

Contributor

1. Summary

Adds @vymalo/opencode-code-index — a personal, experimental, private (not published) OpenCode plugin that indexes a repo into an embedded DuckDB store and exposes structural code_* tools: code_symbol, code_callers, code_callees, code_references, code_blast_radius, plus index_refresh / index_status.

Source of truth: the design doc plans/code-index.md — this PR implements it.

  • Content-addressed by git blob, scoped per branch (a branch is a path→blob manifest); branch/worktree switches re-index only the delta and blast_radius stays branch-correct (names resolve at query time).
  • Call graph is tree-sitter-built, deliberately sound but partial (no type info; confidence-tagged edges for a future typed tier).
  • Docs: docs/code-index.md, plans/code-index.md, ADR-0002/0003/0004, a runnable demo/.
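
To make the "re-index only the delta" point concrete, here is a minimal sketch of how a content-addressed index can decide what to parse after a branch switch. It is an illustration under assumptions, not the plugin's actual code: `blobsToIndex` and the sample SHAs are hypothetical, and only the `path`/`blobSha` fields mirror the manifest entries quoted in the review further down.

```typescript
// Hypothetical sketch: names below are illustrative, not taken from the plugin.
interface ManifestEntry {
  path: string;
  blobSha: string;
}

/**
 * Return the unique blob SHAs in a branch manifest that are not indexed yet.
 * Blobs shared with an already-indexed branch are skipped entirely.
 */
function blobsToIndex(manifest: ManifestEntry[], indexed: ReadonlySet<string>): string[] {
  const pending = new Set<string>();
  for (const entry of manifest) {
    if (!indexed.has(entry.blobSha)) {
      pending.add(entry.blobSha);
    }
  }
  return Array.from(pending);
}

// Blobs already parsed while indexing another branch (made-up SHAs).
const indexed = new Set<string>(["aaa", "bbb"]);

const featureBranch: ManifestEntry[] = [
  { path: "src/a.ts", blobSha: "aaa" }, // unchanged on this branch
  { path: "src/b.ts", blobSha: "ccc" }, // edited on this branch
  { path: "src/b-copy.ts", blobSha: "ccc" } // same content, same blob
];

console.log(blobsToIndex(featureBranch, indexed)); // only "ccc" is pending
```

Because the key is the blob rather than the path, two paths with identical content cost one parse, and switching back to a previously indexed branch costs none.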

2. Intent

Give the agent the multi-hop structural queries grep can't cheaply answer ("who calls this", "blast radius") without standing up a DB server — embedded DuckDB, per-project, zero idle cost. It lives in the workspace for convenience and may be removed.

Source of truth / design rationale: https://github.com/vymalo/opencode-oauth2/blob/claude/priceless-shtern-cf5900/plans/code-index.md

3. Scope

In Scope

  • The plugin (packages/opencode-code-index), its tests, docs/ADRs, and the runnable demo.
  • pnpm-workspace.yaml: allow tree-sitter native builds.

Out of Scope

  • Semantic prose search (docs_search / embeddings) — designed in plans/code-index.md §10, deferred.
  • A precise method-dispatch (tsserver/SCIP) tier — designed, not built.

4. Verification

I verified this change by:

  • Running automated tests — 56 tests, coverage 98% stmts / 85% branches (incl. an end-to-end multi-branch blast_radius through real tree-sitter + DuckDB).
  • Running the full pre-push gate (build, typecheck, coverage, lint, format:check) — green.
  • Running the plugin against this repo via demo/demo.mjs.
  • Testing error cases (not-a-repo, unparseable blob, transaction rollback, concurrent-index de-dup).

Commands run:

pnpm --filter @vymalo/opencode-code-index coverage
pnpm -r build && pnpm -r typecheck && pnpm coverage && pnpm lint && pnpm format:check
pnpm --filter @vymalo/opencode-code-index demo

Results:

56 tests passed · 98.4% stmts / 85% branches · gate exit 0
demo: 178 files → 726 symbols, 4608 edges; blast_radius(extractFromSource) = ensureIndexed, execute, indexRepo
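
For readers unfamiliar with the term, blast_radius is a transitive-callers query. The commit message below says the plugin computes it with a DuckDB recursive CTE; the following is only an in-memory analogue of that fixpoint, written as a sketch. The `CallEdge` shape and the three sample edges are assumptions chosen to reproduce the demo line above, not data read from the real index.

```typescript
// Hypothetical edge shape: `caller` invokes `callee` (both are symbol names).
interface CallEdge {
  caller: string;
  callee: string;
}

/** Every symbol that directly or transitively calls `target`. */
function blastRadius(edges: CallEdge[], target: string): string[] {
  // Reverse adjacency: callee -> direct callers.
  const callersOf = new Map<string, string[]>();
  for (const { caller, callee } of edges) {
    const list = callersOf.get(callee) ?? [];
    list.push(caller);
    callersOf.set(callee, list);
  }

  // Breadth-first walk up the call graph; `seen` also guards against cycles.
  const seen = new Set<string>();
  const queue: string[] = [target];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const caller of callersOf.get(current) ?? []) {
      if (caller !== target && !seen.has(caller)) {
        seen.add(caller);
        queue.push(caller);
      }
    }
  }
  return Array.from(seen).sort();
}

// Illustrative edges only (assumed, not extracted from the repository).
const edges: CallEdge[] = [
  { caller: "indexRepo", callee: "extractFromSource" },
  { caller: "ensureIndexed", callee: "indexRepo" },
  { caller: "execute", callee: "ensureIndexed" }
];

console.log(blastRadius(edges, "extractFromSource"));
// [ 'ensureIndexed', 'execute', 'indexRepo' ]
```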

6. Risk Assessment

Risk level: Low — a private, opt-in, non-published plugin; not wired into any other package or CI gate. Adds native tree-sitter build deps (DuckDB ships prebuilt).

7. AI Usage Declaration

AI was used for:

  • Generating code
  • Generating tests
  • Drafting documentation
  • Reviewing the diff

Human verification:

  • I understand every meaningful change in this PR
  • I checked generated code and tests manually
  • Addressed the automated review (concurrency/atomicity/path hardening) after verifying each finding against the cited lines
  • I accept responsibility for this PR

🤖 Generated with Claude Code

Add @vymalo/opencode-code-index (private, experimental, not published): a
personal OpenCode plugin that indexes a repo into an embedded DuckDB store and
exposes code_* tools — code_symbol, code_callers, code_callees, code_references,
code_blast_radius, plus index_refresh / index_status.

The index is content-addressed by git blob and scoped per branch (a branch is a
path→blob manifest), so branch/worktree switches re-index only the delta and
blast_radius stays branch-correct. Names resolve at query time against the active
manifest. The call graph is built with tree-sitter and is deliberately sound but
partial (no type info — generic obj.method() is dropped); edges carry a confidence
tag (name | this | typed) for a future language-server/SCIP enrichment tier.

Engine and resolution choices were validated by throwaway spikes before building
(DuckDB recursive CTE for blast_radius; tree-sitter extraction on real source).

- packages/opencode-code-index: src + 50 vitest tests (98% stmts, 85% branches)
- pnpm-workspace: allow tree-sitter native builds
- docs/code-index.md (reference) + plans/code-index.md (design)
- README / CLAUDE.md / AGENTS.md / CHANGELOG: document the experimental plugin

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an experimental, private OpenCode plugin, @vymalo/opencode-code-index, which indexes repositories into an embedded DuckDB store using a tree-sitter symbol graph to expose structural code-intelligence tools. Feedback on these changes focuses on enhancing robustness and reliability. Key recommendations include coordinating concurrent indexing requests using a promise map to prevent race conditions, validating that environment-provided paths are absolute, resolving relative database paths against the active worktree root, wrapping database operations in transactions to ensure atomicity, and separating file-reading from database-writing to prevent masked errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +26 to +32
interface WorktreeContext {
  repo: GitRepo;
  store: CodeIndexStore;
  dbPath: string;
  /** Branches already indexed (or confirmed indexed) in this process. */
  ensured: Set<string>;
}

high

Add an indexingPromises map to WorktreeContext to track active indexing operations per branch and prevent concurrent indexing race conditions.

Suggested change
- interface WorktreeContext {
-   repo: GitRepo;
-   store: CodeIndexStore;
-   dbPath: string;
-   /** Branches already indexed (or confirmed indexed) in this process. */
-   ensured: Set<string>;
- }
+ interface WorktreeContext {
+   repo: GitRepo;
+   store: CodeIndexStore;
+   dbPath: string;
+   /** Branches already indexed (or confirmed indexed) in this process. */
+   ensured: Set<string>;
+   /** Active indexing promise per branch to prevent concurrent indexing. */
+   indexingPromises: Map<string, Promise<void>>;
+ }

Comment on lines +80 to +94
async function ensureIndexed(ctx: WorktreeContext, force = false): Promise<string> {
  const branch = await ctx.repo.currentBranch();
  if (!force && ctx.ensured.has(branch)) {
    return branch;
  }
  const status = await ctx.store.status(branch);
  if (force || status.files === 0) {
    await indexRepo(ctx.repo, ctx.store, {
      extensions: deps.options.extensions,
      logger: deps.logger
    });
  }
  ctx.ensured.add(branch);
  return branch;
}

high

Coordinate concurrent indexing requests for the same branch using a promise map. Parallel tool execution by AI agents can trigger ensureIndexed concurrently, leading to race conditions, PRIMARY KEY violations, or database locking errors.

  async function ensureIndexed(ctx: WorktreeContext, force = false): Promise<string> {
    const branch = await ctx.repo.currentBranch();
    if (!force && ctx.ensured.has(branch)) {
      return branch;
    }
    let promise = ctx.indexingPromises.get(branch);
    if (!promise) {
      const performIndex = async () => {
        const status = await ctx.store.status(branch);
        if (force || status.files === 0) {
          await indexRepo(ctx.repo, ctx.store, {
            extensions: deps.options.extensions,
            logger: deps.logger
          });
        }
        ctx.ensured.add(branch);
      };
      promise = performIndex().finally(() => {
        ctx.indexingPromises.delete(branch);
      });
      ctx.indexingPromises.set(branch, promise);
    }
    await promise;
    return branch;
  }
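
The pattern in this suggestion can be isolated from the plugin types. Below is a minimal sketch of per-key in-flight de-duplication; `InFlight` and `demoConcurrentIndex` are hypothetical names introduced here for illustration, and the merged fix (described later in the thread as `startIndex`) may be structured differently.

```typescript
/**
 * De-duplicates concurrent async work per key: a caller that arrives while
 * work for the same key is still running receives the existing promise.
 */
class InFlight<K> {
  private readonly pending = new Map<K, Promise<void>>();

  run(key: K, work: () => Promise<void>): Promise<void> {
    const existing = this.pending.get(key);
    if (existing) {
      return existing;
    }
    // Clear the slot on success and on failure so a later call can retry.
    const promise = work().finally(() => {
      this.pending.delete(key);
    });
    this.pending.set(key, promise);
    return promise;
  }
}

// Usage: two parallel "first touch" calls on one branch run the indexer once.
async function demoConcurrentIndex(): Promise<number> {
  const guard = new InFlight<string>();
  let runs = 0;
  const fakeIndex = async (): Promise<void> => {
    runs++;
    await new Promise((done) => setTimeout(done, 10)); // simulate slow indexing
  };
  await Promise.all([guard.run("main", fakeIndex), guard.run("main", fakeIndex)]);
  return runs;
}

demoConcurrentIndex().then((runs) => console.log("indexer runs:", runs)); // indexer runs: 1
```

One design consequence worth noting: a caller that joins an in-flight run inherits its outcome, including a rejection, so a forced refresh arriving mid-run would need separate handling.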

Comment on lines +1 to +2
import { homedir, platform } from "node:os";
import { join } from "node:path";

medium

Import isAbsolute from node:path to validate that environment-provided directory paths are absolute before using them.

Suggested change
- import { homedir, platform } from "node:os";
- import { join } from "node:path";
+ import { homedir, platform } from "node:os";
+ import { isAbsolute, join } from "node:path";

References
  1. When reading directory paths from environment variables (such as APPDATA or XDG_STATE_HOME), validate that the path is absolute (e.g., using isAbsolute) before using it. This prevents empty or relative environment values from resolving paths relative to the process current working directory (CWD).

Comment on lines +10 to +20
export function cacheDir(): string {
  const home = homedir();
  switch (platform()) {
    case "darwin":
      return join(home, "Library", "Caches", NAMESPACE);
    case "win32":
      return join(process.env.LOCALAPPDATA ?? join(home, "AppData", "Local"), NAMESPACE);
    default:
      return join(process.env.XDG_CACHE_HOME ?? join(home, ".cache"), NAMESPACE);
  }
}

medium

Validate that the directory paths read from environment variables (LOCALAPPDATA and XDG_CACHE_HOME) are absolute using isAbsolute before using them. This prevents empty or relative environment values from resolving paths relative to the process current working directory (CWD).

Suggested change
- export function cacheDir(): string {
-   const home = homedir();
-   switch (platform()) {
-     case "darwin":
-       return join(home, "Library", "Caches", NAMESPACE);
-     case "win32":
-       return join(process.env.LOCALAPPDATA ?? join(home, "AppData", "Local"), NAMESPACE);
-     default:
-       return join(process.env.XDG_CACHE_HOME ?? join(home, ".cache"), NAMESPACE);
-   }
- }
+ export function cacheDir(): string {
+   const home = homedir();
+   switch (platform()) {
+     case "darwin":
+       return join(home, "Library", "Caches", NAMESPACE);
+     case "win32": {
+       const appData = process.env.LOCALAPPDATA;
+       const base = appData && isAbsolute(appData) ? appData : join(home, "AppData", "Local");
+       return join(base, NAMESPACE);
+     }
+     default: {
+       const xdgCache = process.env.XDG_CACHE_HOME;
+       const base = xdgCache && isAbsolute(xdgCache) ? xdgCache : join(home, ".cache");
+       return join(base, NAMESPACE);
+     }
+   }
+ }

References
  1. When reading directory paths from environment variables (such as APPDATA or XDG_STATE_HOME), validate that the path is absolute (e.g., using isAbsolute) before using it. This prevents empty or relative environment values from resolving paths relative to the process current working directory (CWD).
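
The reason `??` is not enough here is that it only replaces `null` and `undefined`, so an empty string from the environment passes through and `join("", name)` becomes a CWD-relative path. A minimal sketch of the check follows; `absoluteOr` is a hypothetical helper name, and the fallback path is a placeholder.

```typescript
import { isAbsolute, join } from "node:path";

/** Use `candidate` only when it is a non-empty absolute path; otherwise fall back. */
function absoluteOr(candidate: string | undefined, fallback: string): string {
  return candidate && isAbsolute(candidate) ? candidate : fallback;
}

// Placeholder fallback for illustration; the plugin derives its own from homedir().
const fallback = join("/home/user", ".cache");

// `??` keeps the empty string, so the fallback is never used.
const viaNullish: string = "" ?? fallback;

console.log(JSON.stringify(viaNullish)); // ""
console.log(absoluteOr("", fallback) === fallback); // true
console.log(absoluteOr("relative/dir", fallback) === fallback); // true
console.log(absoluteOr("/var/cache", fallback)); // /var/cache
```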

Comment on lines +1 to +2
import { mkdir } from "node:fs/promises";
import { dirname } from "node:path";

medium

Import isAbsolute and resolve from node:path to handle relative user-provided database paths and resolve them against the worktree root.

Suggested change
- import { mkdir } from "node:fs/promises";
- import { dirname } from "node:path";
+ import { mkdir } from "node:fs/promises";
+ import { dirname, isAbsolute, resolve } from "node:path";

Comment on lines +54 to +66
async function buildContext(worktree: string): Promise<WorktreeContext> {
  const repo = makeRepo(worktree);
  if (!(await repo.isRepo())) {
    throw new Error(NOT_A_REPO);
  }
  const repoId = await repo.repoId();
  const dbPath = deps.options.dbPath ?? defaultDbPath(repoId);
  if (dbPath !== ":memory:") {
    await mkdir(dirname(dbPath), { recursive: true });
  }
  const store = await openStore(dbPath);
  return { repo, store, dbPath, ensured: new Set() };
}

medium

Resolve relative user-provided dbPath against the active worktree root instead of the process current working directory (CWD) to ensure consistency across different directories.

  async function buildContext(worktree: string): Promise<WorktreeContext> {
    const repo = makeRepo(worktree);
    if (!(await repo.isRepo())) {
      throw new Error(NOT_A_REPO);
    }
    const repoId = await repo.repoId();
    let dbPath = deps.options.dbPath ?? defaultDbPath(repoId);
    if (dbPath !== ":memory:" && !isAbsolute(dbPath)) {
      dbPath = resolve(worktree, dbPath);
    }
    if (dbPath !== ":memory:") {
      await mkdir(dirname(dbPath), { recursive: true });
    }
    const store = await openStore(dbPath);
    return { repo, store, dbPath, ensured: new Set(), indexingPromises: new Map() };
  }
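
The path rule in this suggestion can be checked in isolation. The sketch below extracts it into a pure function; `resolveDbPath` is a hypothetical name and the sample paths are placeholders, so treat it as an illustration of the rule rather than the merged code.

```typescript
import { isAbsolute, resolve } from "node:path";

/**
 * Leave ":memory:" and absolute paths untouched; anchor anything else at the
 * worktree root so the result does not depend on the process CWD.
 */
function resolveDbPath(dbPath: string, worktree: string): string {
  if (dbPath === ":memory:" || isAbsolute(dbPath)) {
    return dbPath;
  }
  return resolve(worktree, dbPath);
}

console.log(resolveDbPath(":memory:", "/repo")); // :memory:
console.log(resolveDbPath(".cache/index.duckdb", "/repo")); // anchored under /repo
console.log(resolveDbPath("/var/data/index.duckdb", "/repo")); // /var/data/index.duckdb
```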

Comment on lines +95 to +116
/** Insert a parsed blob (blob row + its symbols + its edges) in one shot. */
async insertBlob(blobSha: string, lang: string, extraction: Extraction): Promise<void> {
  await this.conn.run("INSERT INTO blob VALUES (?, ?)", [blobSha, lang]);
  for (const d of extraction.defs) {
    await this.conn.run("INSERT INTO symbol VALUES (?, ?, ?, ?)", [
      blobSha,
      d.name,
      d.kind,
      d.line
    ]);
  }
  for (const e of extraction.refs) {
    await this.conn.run("INSERT INTO ref VALUES (?, ?, ?, ?, ?, ?)", [
      blobSha,
      e.caller,
      e.dstName,
      e.kind,
      e.line,
      e.confidence
    ]);
  }
}

medium

Wrap the multiple database insert statements in a transaction to ensure atomicity. If any insert fails or the process is interrupted, the database won't be left in a corrupted or partially populated state.

  /** Insert a parsed blob (blob row + its symbols + its edges) in one shot. */
  async insertBlob(blobSha: string, lang: string, extraction: Extraction): Promise<void> {
    await this.conn.run("BEGIN TRANSACTION");
    try {
      await this.conn.run("INSERT INTO blob VALUES (?, ?)", [blobSha, lang]);
      for (const d of extraction.defs) {
        await this.conn.run("INSERT INTO symbol VALUES (?, ?, ?, ?)", [
          blobSha,
          d.name,
          d.kind,
          d.line
        ]);
      }
      for (const e of extraction.refs) {
        await this.conn.run("INSERT INTO ref VALUES (?, ?, ?, ?, ?, ?)", [
          blobSha,
          e.caller,
          e.dstName,
          e.kind,
          e.line,
          e.confidence
        ]);
      }
      await this.conn.run("COMMIT");
    } catch (err) {
      await this.conn.run("ROLLBACK");
      throw err;
    }
  }

Comment on lines +118 to +129
/** Replace a branch/root's manifest with `entries` (delete-then-insert). */
async replaceManifest(branch: string, root: string, entries: ManifestEntry[]): Promise<void> {
  await this.conn.run("DELETE FROM manifest WHERE branch = ? AND root = ?", [branch, root]);
  for (const e of entries) {
    await this.conn.run("INSERT INTO manifest VALUES (?, ?, ?, ?)", [
      branch,
      root,
      e.path,
      e.blobSha
    ]);
  }
}

medium

Wrap the delete-then-insert manifest replacement in a transaction to ensure atomicity and prevent leaving the branch manifest in a deleted or partially written state on failure.

  /** Replace a branch/root's manifest with `entries` (delete-then-insert). */
  async replaceManifest(branch: string, root: string, entries: ManifestEntry[]): Promise<void> {
    await this.conn.run("BEGIN TRANSACTION");
    try {
      await this.conn.run("DELETE FROM manifest WHERE branch = ? AND root = ?", [branch, root]);
      for (const e of entries) {
        await this.conn.run("INSERT INTO manifest VALUES (?, ?, ?, ?)", [
          branch,
          root,
          e.path,
          e.blobSha
        ]);
      }
      await this.conn.run("COMMIT");
    } catch (err) {
      await this.conn.run("ROLLBACK");
      throw err;
    }
  }
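
Both store suggestions repeat the same BEGIN / COMMIT / ROLLBACK scaffold, which could be factored into one helper. The sketch below is an assumption-laden illustration: `SqlConnection` is a minimal interface invented here (not DuckDB's client API), `withTransaction` and `RecordingConnection` are hypothetical names, and unlike the suggested code it swallows a failing ROLLBACK so the original error is the one that surfaces.

```typescript
/** Minimal connection shape assumed for this sketch (not a real driver API). */
interface SqlConnection {
  run(sql: string, params?: unknown[]): Promise<unknown>;
}

/** Run `body` inside a transaction; roll back and rethrow if anything fails. */
async function withTransaction<T>(conn: SqlConnection, body: () => Promise<T>): Promise<T> {
  await conn.run("BEGIN TRANSACTION");
  try {
    const result = await body();
    await conn.run("COMMIT");
    return result;
  } catch (err) {
    // Best effort: a failing ROLLBACK must not hide the original error.
    await conn.run("ROLLBACK").catch(() => undefined);
    throw err;
  }
}

/** Test double that records statements and fails on a marker table name. */
class RecordingConnection implements SqlConnection {
  readonly log: string[] = [];

  async run(sql: string): Promise<void> {
    this.log.push(sql);
    if (sql.startsWith("INSERT INTO fail")) {
      throw new Error("boom");
    }
  }
}

async function demoRollback(): Promise<string[]> {
  const conn = new RecordingConnection();
  try {
    await withTransaction(conn, () => conn.run("INSERT INTO fail VALUES (1)"));
  } catch {
    // Expected: the insert error propagates after the rollback.
  }
  return conn.log;
}

demoRollback().then((log) => console.log(log));
// [ 'BEGIN TRANSACTION', 'INSERT INTO fail VALUES (1)', 'ROLLBACK' ]
```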

Comment on lines +60 to +73
try {
  const source = await repo.readBlob(entry.blobSha);
  const extraction = extractFromSource(source, lang);
  await store.insertBlob(entry.blobSha, lang, extraction);
  indexed++;
} catch (err) {
  // Record the blob with no symbols so we don't retry a file we can't parse.
  await store.insertBlob(entry.blobSha, lang, { defs: [], refs: [] });
  options.logger?.warn("code_index_blob_failed", {
    path: entry.path,
    blob: entry.blobSha,
    error: err instanceof Error ? err.message : String(err)
  });
}

medium

Separate the read/parse operations from the database write operation in the try-catch block. If a fatal database error occurs during insertBlob, attempting to write to the database again in the catch block is highly likely to fail or mask the real database issue as a parsing failure.

    let extraction: Extraction;
    try {
      const source = await repo.readBlob(entry.blobSha);
      extraction = extractFromSource(source, lang);
    } catch (err) {
      options.logger?.warn("code_index_blob_failed", {
        path: entry.path,
        blob: entry.blobSha,
        error: err instanceof Error ? err.message : String(err)
      });
      // Record the blob with no symbols so we don't retry a file we can't parse.
      try {
        await store.insertBlob(entry.blobSha, lang, { defs: [], refs: [] });
      } catch (dbErr) {
        options.logger?.error("code_index_fallback_write_failed", {
          blob: entry.blobSha,
          error: dbErr instanceof Error ? dbErr.message : String(dbErr)
        });
      }
      continue;
    }

    await store.insertBlob(entry.blobSha, lang, extraction);
    indexed++;
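
The control flow being proposed can be exercised without a database. In the sketch below, `indexOne`, `SimpleExtraction`, and the injected `parse` / `write` / `warn` callbacks are hypothetical stand-ins. It also makes one assumption that differs from the suggestion above: a failure of the placeholder write is allowed to propagate rather than being logged and skipped.

```typescript
// Simplified stand-in for the plugin's extraction result (assumed shape).
interface SimpleExtraction {
  defs: string[];
  refs: string[];
}

/**
 * Parse failures are recoverable: log, store an empty placeholder, move on.
 * Write failures are not: they propagate so a DB problem is never mislabeled
 * as an unparseable file.
 */
async function indexOne(
  parse: () => Promise<SimpleExtraction>,
  write: (extraction: SimpleExtraction) => Promise<void>,
  warn: (message: string) => void
): Promise<"indexed" | "skipped"> {
  let extraction: SimpleExtraction;
  try {
    extraction = await parse();
  } catch (err) {
    warn(err instanceof Error ? err.message : String(err));
    await write({ defs: [], refs: [] }); // placeholder so the blob is not retried
    return "skipped";
  }
  await write(extraction); // outside the try: DB errors reach the caller
  return "indexed";
}

async function demoIndexOne(): Promise<string[]> {
  const outcomes: string[] = [];
  const noopWrite = async (): Promise<void> => {};
  const noopWarn = (): void => {};

  outcomes.push(await indexOne(async () => ({ defs: ["f"], refs: [] }), noopWrite, noopWarn));
  outcomes.push(
    await indexOne(async () => { throw new Error("syntax error"); }, noopWrite, noopWarn)
  );
  try {
    await indexOne(
      async () => ({ defs: [], refs: [] }),
      async () => { throw new Error("database is locked"); },
      noopWarn
    );
  } catch (err) {
    outcomes.push("fatal: " + (err instanceof Error ? err.message : String(err)));
  }
  return outcomes;
}

demoIndexOne().then((outcomes) => console.log(outcomes));
// [ 'indexed', 'skipped', 'fatal: database is locked' ]
```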

stephane-segning and others added 5 commits June 17, 2026 08:53
…ginConfig

OpenCode has no typed `pluginConfig` field; plugin options are passed via the
`[specifier, options]` tuple inside the `plugin` array (delivered as the 2nd-arg
PluginOptions, which is what this plugin actually reads). The previous example
used a non-existent `pluginConfig` block keyed by package name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
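
As a rough illustration of the `[specifier, options]` tuple that commit describes, the sketch below models the `plugin` array as a TypeScript value. Everything specific in it is an assumption: the `PluginEntry` type is invented here, the specifier and option values are placeholders, and only the option names `dbPath` and `extensions` come from the code quoted in the review above (the expected format of `extensions` entries is not shown in this thread).

```typescript
// Hypothetical typing of the config shape described in the commit message.
type PluginEntry = string | [string, Record<string, unknown>];

interface ConfigSketch {
  plugin: PluginEntry[];
}

const configSketch: ConfigSketch = {
  plugin: [
    // Tuple form: the second element is delivered to the plugin as its options.
    [
      "@vymalo/opencode-code-index", // placeholder specifier
      {
        dbPath: ".cache/code-index.duckdb", // placeholder path
        extensions: [".ts", ".tsx"] // placeholder values; format is assumed
      }
    ]
  ]
};

console.log(JSON.stringify(configSketch, null, 2));
```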
demo/demo.mjs indexes the current branch's HEAD tree into a throwaway DuckDB
file (OS temp dir, cleaned up) and drives the real code_* tools exactly as
OpenCode would, printing the model-facing text. Wired as the `demo` package
script and documented with sample output in docs/code-index.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Record the three alternative-closing choices behind the code-index plugin in
the MADR-style ADR format:
- 0002: embedded DuckDB over a graph-DB server (Neo4j/Qdrant; Cozo abandoned)
- 0003: content-addressed-by-blob + per-branch manifest indexing model
- 0004: tree-sitter-only "sound but partial" call-graph resolution

Index updated; docs/code-index.md cross-links them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Apply the Gemini review findings on #58 (all verified as real, not false positives):

- tools.ts: guard concurrent indexing. OpenCode runs tools in parallel, so two
  first-touch calls on a branch could both run indexRepo (duplicate INSERTs /
  lock contention). A per-branch in-flight promise (`startIndex`) now de-dupes
  concurrent runs; index_refresh shares the same guard.
- config.ts: only honor LOCALAPPDATA / XDG_CACHE_HOME when absolute (the `??`
  idiom doesn't catch ""), else fall back to home — no CWD-relative cache dir.
- tools.ts: resolve a relative dbPath override against the worktree, not the CWD.
- store.ts: wrap insertBlob and replaceManifest in a transaction (BEGIN/COMMIT/
  ROLLBACK) so a mid-write failure leaves no partial rows.
- indexer.ts: separate read/parse from the DB write so a genuine DB error
  propagates (fatal) instead of being masked as a parse failure and retried.

Tests: +4 (concurrent-index de-dup, relative-dbPath resolution, env-path
fallback, transaction rollback) + a fallback-write-failure case. 56 tests,
coverage 98% stmts / 85% branches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@github-actions

github-actions Bot commented Jun 17, 2026

✅ AI Governance check passed

This PR declares AI usage, references a source of truth, and provides verification evidence. Thank you.

@stephane-segning

Contributor Author

Review addressed — all five findings were real (verified, not false positives)

Merged main (resolved the CHANGELOG.md conflict — kept both the code-index entry and the trace-tier/fixes from #54–57) and applied every Gemini finding in 6a518e6:

| # | Finding | Fix |
|---|---------|-----|
| 1 (high) | Concurrent indexing race | tools.ts: per-branch in-flight promise (startIndex) de-dupes parallel runs; index_refresh shares the guard |
| 2 | Empty/relative env base resolves under CWD | config.ts: honor LOCALAPPDATA/XDG_CACHE_HOME only when isAbsolute (the ?? idiom does not catch ""), else home |
| 3 | Relative dbPath vs CWD | tools.ts: resolve a relative override against the worktree |
| 4 | Non-atomic multi-INSERT | store.ts: insertBlob + replaceManifest wrapped in BEGIN/COMMIT/ROLLBACK |
| 5 | indexer catch masks DB errors | indexer.ts: read/parse in its own try; the real write is outside, so a DB error propagates instead of being retried/masked |

Added 5 tests (concurrent-index de-dup, relative-dbPath resolution, env-path fallback, transaction rollback, fallback-write-failure). 56 tests, 98% stmts / 85% branches. Full pre-push gate green.

(Per the AI Governance doctrine just adopted in #59: findings treated as claims and verified against the cited lines before acting — here all five held up.)

@stephane-segning stephane-segning merged commit a5047eb into main Jun 17, 2026
4 of 6 checks passed
@stephane-segning stephane-segning deleted the claude/priceless-shtern-cf5900 branch June 17, 2026 17:03
@stephane-segning stephane-segning mentioned this pull request Jun 17, 2026
8 tasks