From a5729af3832ae86cf13828341d8f722733cf79e4 Mon Sep 17 00:00:00 2001 From: Dmitry Teryaev Date: Sun, 21 Jun 2026 23:42:42 +0300 Subject: [PATCH 1/2] docs(plans): execution plan for init/increment perf (PR-P1..PR-P3) Adds plans/active/PLAN-INIT-INCREMENT-PERF.md and the companion plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md implementing the approved proposal propose/active/INIT-INCREMENT-PERF-PROPOSE.md. Three PRs: - PR-P1: bulk in-memory-pyarrow COPY FROM for the full rebuild path; equivalence harness is the merge gate. - PR-P2: same primitive for the incremental path (Route-MERGE dedup retained). - PR-P3: lifespan-cached LayeredIgnore (ContextKey) + is_ignored _mega memo. No production code. Stacks behind proposal PR #338. Co-Authored-By: Claude --- plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md | 356 +++++++++++++++++ plans/active/PLAN-INIT-INCREMENT-PERF.md | 432 +++++++++++++++++++++ 2 files changed, 788 insertions(+) create mode 100644 plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md create mode 100644 plans/active/PLAN-INIT-INCREMENT-PERF.md diff --git a/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md b/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md new file mode 100644 index 00000000..c2e6e12d --- /dev/null +++ b/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md @@ -0,0 +1,356 @@ +# Agent task prompts — Faster init/increment (PR-P1 → PR-P3) + +Status: **active**. One self-contained prompt per PR. Copy the prompt verbatim +into the agent, attach the files in its `@-files` block, and let it execute. + +**Workflow per PR:** + +1. Create the branch named in the prompt off the stated base. +2. Read the cited plan section in full **before** writing code. +3. Implement step-by-step; run the listed tests after each step. +4. Run the sentinel greps — every "must return zero" line must be empty. +5. Paste the manual-evidence output into the PR description. +6. Open a PR with the exact title in the Definition of Done. + +**Universal rules for every prompt:** + +- Use only `.venv/bin/python` and `.venv/bin/pip` (never system python/pip). +- `server.py` is stdio — never write to stdout from anything reachable by a tool handler. +- Do not add a cocoindex dependency outside `java_index_flow_lancedb.py`. +- The plan is the source of truth — if this prompt and the plan disagree, the plan wins. +- Do not touch any file outside the prompt's `@-files` + the test files it names. If you think an adjacent file must change, **stop and ask** — don't ship it. +- Do not loosen any existing test assertion to make it pass. +- Breaking changes are allowed; no compatibility shims. + +--- + +## PR-P1 — Bulk `COPY FROM` for the full rebuild path + +**Branch:** `perf/bulk-graph-writes-p1` off `master`. +**Base:** `master`. +**Plan section:** `plans/active/PLAN-INIT-INCREMENT-PERF.md` § PR-P1 (read this first). +**Estimated diff size:** medium (one module + tests + a fixture). + +**Attach (`@-files`):** + +- `@build_ast_graph.py` (the full-rebuild write path: `_write_nodes`, `_write_edges`, `_write_routes_and_exposes`, `write_ladybug`, `_node_row`, `_SCHEMA_*`, `_callee_declaring_role_at_write`, `_populate_declares_rows`, `_populate_overrides_rows`) +- `@propose/active/INIT-INCREMENT-PERF-PROPOSE.md` (design + staging invariants) +- `@tests/test_ast_graph_build.py` (existing regression net + where the new tests go) +- `@tests/_builders.py` (graph-build helpers: `build_ladybug_full_into`, etc.) + +**Prompt:** + +```` +You are implementing PR-P1 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. + +Read the **PR-P1** section of the plan in full before writing any code, plus the +proposal's "Staging invariants" and "Equivalence" paragraphs. The plan is the +source of truth — if this prompt and the plan disagree, the plan wins. + +## Scope + +Implement PR-P1 exactly as specified: replace the per-row `conn.execute` writes +in the **full rebuild path** (`write_ladybug`) with bulk in-memory-pyarrow +`COPY FROM`. Concretely: + +1. **Step-1 spike (first commit):** confirm the exact REL `COPY FROM` column + order + `pa.Table.from_pylist` LIST typing with a throwaway 2-node + 1-edge + toy. Record the working incantation in the `_bulk_copy` docstring. +2. Add `_bulk_copy(conn, table_name, columns, rows)` + the `*_COLUMNS` / + `_REL_*_COLUMNS` constants (column order matches `_SCHEMA_*`; REL tables list + FROM/TO first). +3. Convert `_write_nodes` (full path) to stage all node rows then + `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. Delete `_CREATE_SYMBOL`. +4. Convert `_write_edges` to per-edge-type row staging (applying the SAME + `seen_calls`/`seen_ucs` dedup and SAME `_callee_declaring_role_at_write` + lookup, at staging time) then `_bulk_copy` each REL table. Delete the dead + `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL/UNRESOLVED/UNRESOLVED_AT` strings. +5. Convert `_write_routes_and_exposes` (+ Client/Producer/HTTP_CALLS/ASYNC_CALLS) + to bulk; delete the dead `_CREATE_ROUTE/EXPOSES/CLIENT/DECLARES_CLIENT/ + PRODUCER/DECLARES_PRODUCER/HTTP_CALL/ASYNC_CALL` strings. +6. Convert the `GraphMeta` write (`:3472-3473`) to a single-row `_bulk_copy`. +7. Generate + commit `tests/fixtures/graph_baseline_bank_chat.json` from the + **last per-row build** before you remove the per-row path. +8. Add the four named tests. + +## Out of scope (do NOT touch) + +- The incremental path: `_write_nodes_merge`, `_MERGE_SYMBOL`, `_delete_file_scope`, + `incremental_rebuild`, and the pass5/6 `MERGE (r:Route)` (`:3819-3821`) stay + exactly as-is. PR-P1 is full-rebuild only. +- `java_index_flow_lancedb.py`, `path_filtering.py`, `server.py`, `search_lancedb.py`. +- Any schema (`_SCHEMA_*` DDL), ontology, or re-index change. +- CSV or Parquet-file staging (pyarrow in-memory only). +- Loosening any existing test. + +If you find yourself wanting to touch any of the above, **stop and ask**. + +## Deliverables + +1. `_bulk_copy` helper + column-order constants in `build_ast_graph.py`. +2. `write_ladybug` full path bulk-loads all node tables then all REL tables. +3. Dead `_CREATE_*` / `_CREATE_SYMBOL` strings removed. +4. `tests/fixtures/graph_baseline_bank_chat.json` committed. +5. Four new tests in `tests/test_ast_graph_build.py`. + +## Tests + +Run, all must pass: +``` +.venv/bin/python -m pytest tests/test_ast_graph_build.py -v +.venv/bin/python -m pytest tests/test_incremental_graph.py tests/test_bank_chat_brownfield_integration.py tests/test_call_edges_e2e.py -q +.venv/bin/ruff check . +``` + +Sentinel greps — **must all return zero** (no output): +``` +grep -n "_CREATE_SYMBOL\b" build_ast_graph.py +grep -nE "conn\.execute\(_CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|ROUTE|EXPOSES|CLIENT|DECLARES_CLIENT|PRODUCER|DECLARES_PRODUCER|HTTP_CALL|ASYNC_CALL)" build_ast_graph.py +``` + +Sentinel greps — **must be non-zero** (guards against over-deletion): +``` +grep -n "_MERGE_SYMBOL\b" build_ast_graph.py # incremental path, kept +grep -n "MERGE (r:Route" build_ast_graph.py # pass5/6 dedup, kept +grep -n "COPY .*FROM \$rows" build_ast_graph.py # bulk path present +``` + +## Manual evidence (paste in PR description) + +Build the fixture via the bulk path and inspect meta: +```bash +rm -rf /tmp/p1 && .venv/bin/python build_ast_graph.py \ + --source-root tests/bank-chat-system \ + --ladybug-path /tmp/p1/code_graph.lbug --verbose +.venv/bin/java-codebase-rag meta --source-root tests/bank-chat-system --index-dir /tmp/p1 +``` +Expected: meta `counts_json` and node/edge counts **identical** to a pre-PR +per-row build (paste both side by side). Note the graph-write phase timing from +the `JCIRAG_PROGRESS` lines vs the pre-PR baseline. + +## Definition of Done + +- [ ] Step-1 spike result recorded in `_bulk_copy` docstring. +- [ ] Full path bulk-loads nodes-then-edges; no per-row `CREATE` remains in the full path. +- [ ] `_CREATE_SYMBOL` and the dead `_CREATE_*` edge/node strings deleted. +- [ ] All four new tests pass; full regression suites pass unchanged. +- [ ] All "must return zero" sentinel greps are empty; all "non-zero" greps hit. +- [ ] `.venv/bin/ruff check .` clean. +- [ ] Benchmark (before/after graph-write phase) pasted in the PR description. +- [ ] PR title: `perf(graph): bulk COPY FROM for the full rebuild path (PR-P1)`. +```` + +--- + +## PR-P2 — Bulk write for the incremental path + +**Branch:** `perf/bulk-graph-writes-p2` off PR-P1's branch (or `master` if PR-P1 has merged). +**Base:** PR-P1 merged. +**Plan section:** `plans/active/PLAN-INIT-INCREMENT-PERF.md` § PR-P2 (read this first). +**Estimated diff size:** small-medium. + +**Attach (`@-files`):** + +- `@build_ast_graph.py` (`_write_nodes_merge`, `_MERGE_SYMBOL`, `_delete_file_scope`, `incremental_rebuild`, `_bulk_copy` + `_REL_*_COLUMNS` from PR-P1) +- `@tests/test_incremental_graph.py` (regression net) + +**Prompt:** + +```` +You are implementing PR-P2 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. + +Read the **PR-P2** section in full. It depends on PR-P1's `_bulk_copy` primitive +and `_REL_*_COLUMNS` constants, which are already merged. The plan wins if this +prompt disagrees. + +## Scope + +Apply PR-P1's bulk primitive to the incremental path: + +1. Convert `_write_nodes_merge` (`:817`, uses `_MERGE_SYMBOL`) to stage rows then + `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. The incremental path is + delete-then-insert (`_delete_file_scope` already removed the old scope), so + plain `COPY` insert into the cleaned scope is correct. Delete `_MERGE_SYMBOL`; + remove `_write_nodes_impl` if it has no remaining caller. +2. Convert the incremental edge re-emit to per-type staging + `_bulk_copy`, + scoped to the re-emitted files (same pattern as PR-P1's `_write_edges`). +3. **Retain the pass5/6 `MERGE (r:Route)` dedup** (`:3819-3821`) verbatim and add + a one-line comment explaining it is intentionally kept (routes written during + the scoped step must MERGE, not duplicate, against the global step). Do NOT + bulk-convert the Route writes that this MERGE guards. +4. Add the two named tests. + +## Out of scope (do NOT touch) + +- The full-rebuild path (already bulk in PR-P1). +- Route dedup semantics — keep the `MERGE (r:Route)` exactly. +- `_delete_file_scope`, `incremental_rebuild` algorithm, dependent-expansion, + crash-marker logic. +- Anything outside `build_ast_graph.py` + `tests/test_incremental_graph.py`. + +If you find yourself wanting to touch any of the above, **stop and ask**. + +## Deliverables + +1. Incremental node re-emit via `_bulk_copy`; `_MERGE_SYMBOL` deleted. +2. Incremental edge re-emit via per-type `_bulk_copy`. +3. `MERGE (r:Route)` retained + commented. +4. Two new tests in `tests/test_incremental_graph.py`. + +## Tests + +Run, all must pass: +``` +.venv/bin/python -m pytest tests/test_incremental_graph.py -v +.venv/bin/python -m pytest tests/test_ast_graph_build.py -q +.venv/bin/ruff check . +``` + +Sentinel greps — **must return zero**: +``` +grep -n "_MERGE_SYMBOL\b" build_ast_graph.py +``` + +Sentinel greps — **must be non-zero** (Route dedup still present; bulk still used): +``` +grep -n "MERGE (r:Route" build_ast_graph.py +grep -n "COPY .*FROM \$rows" build_ast_graph.py +``` + +## Manual evidence (paste in PR description) + +Single-file change equivalence: +```bash +# set up an index, touch one file, increment, then full-rebuild the same state +# and diff the graphs (node count, per-type edge counts, GraphMeta counters). +``` +Expected: incremental(bulk) == full-rebuild(bulk) for that state. Paste the +side-by-side counts. + +## Definition of Done + +- [ ] Incremental node/edge re-emit uses `_bulk_copy`; `_MERGE_SYMBOL` deleted. +- [ ] `MERGE (r:Route)` retained and commented. +- [ ] `test_incremental_bulk_write_equivalent_to_full_rebuild`, + `test_incremental_route_merge_dedup_preserved` pass; full + `tests/test_incremental_graph.py` passes unchanged. +- [ ] Sentinel greps pass (zero where required, non-zero where required). +- [ ] `.venv/bin/ruff check .` clean. +- [ ] PR title: `perf(graph): bulk COPY FROM for the incremental path (PR-P2)`. +```` + +--- + +## PR-P3 — Cached `LayeredIgnore` + `is_ignored` memo + +**Branch:** `perf/cached-ignore-p3` off `master`. +**Base:** `master` (independent of PR-P1/P2). +**Plan section:** `plans/active/PLAN-INIT-INCREMENT-PERF.md` § PR-P3 (read this first). +**Estimated diff size:** small. + +**Attach (`@-files`):** + +- `@java_index_flow_lancedb.py` (`ContextKey` defs `:60-72`, `coco_lifespan` provide sites `~:287-306`, `process_java_file`/`process_sql_file`/`process_yaml_file` `:344`/`:416`/`:464`) +- `@path_filtering.py` (`LayeredIgnore`, `_mega`, `_mega_build_for_rel`, `is_ignored`) +- `@tests/test_lancedb_e2e.py` (ignore test to keep green) + +**Prompt:** + +```` +You are implementing PR-P3 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. + +Read the **PR-P3** section in full. It is independent of PR-P1/P2. The plan wins +if this prompt disagrees. + +## Scope + +1. Define `IGNORE = coco.ContextKey[LayeredIgnore]("java_lance_layered_ignore")` + alongside `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` (`:60-72`), reusing the SAME + `_ck_params` (`detect_change` vs `tracked`) detection block. +2. In `coco_lifespan`, add `builder.provide(IGNORE, LayeredIgnore(root))` next to + the other `builder.provide(...)` calls — built **once** per flow run. +3. In `process_java_file` (`:344`), `process_sql_file` (`:416`), `process_yaml_file` + (`:464`): add `ignore = coco.use_context(IGNORE)` and replace + `LayeredIgnore(project_root).is_ignored((project_root / file.file_path.path).resolve())` + with `ignore.is_ignored((project_root / file.file_path.path).resolve())`. + Keep `project_root` (still used for path resolution and `_parse_and_enrich_java`). +4. In `path_filtering.py` `LayeredIgnore`: add `self._mega_cache` in `__init__` + and memoize `_mega(rel)` keyed by `Path(rel_project).parent.as_posix()` (mega + depends only on the directory — `_mega_build_for_rel` reads `dir_parts` only). + `is_ignored`/`diagnose_dict` call `_mega` unchanged and benefit transparently. +5. Add the three named tests. + +## Out of scope (do NOT touch) + +- `build_ast_graph.py` and the graph write path (PR-P1/P2 own that). +- The ignore *decision* logic (`_mega_build_for_rel`, `_winning_row`, negation + scanning) — only memoize, do not alter semantics. +- Any schema, ontology, or re-index change. +- Loosening any existing test. + +If you find yourself wanting to touch any of the above, **stop and ask**. + +## Deliverables + +1. `IGNORE` ContextKey defined + provided once in `coco_lifespan`. +2. The three `process_*_file` functions consume it (no per-file construction). +3. `_mega` memoized by directory in `LayeredIgnore`. +4. Three new tests. + +## Tests + +Run, all must pass: +``` +.venv/bin/python -m pytest tests/test_lancedb_e2e.py -q +.venv/bin/python -m pytest tests -q -k "ignore or path_filter or vectors_progress" +.venv/bin/ruff check . +``` +Heavy (only if you can run cocoindex e2e locally): +``` +JAVA_CODEBASE_RAG_RUN_HEAVY=1 .venv/bin/python -m pytest tests/test_lancedb_e2e.py -q +``` + +Sentinel greps — **must return zero**: +``` +grep -nE "LayeredIgnore\(project_root\)" java_index_flow_lancedb.py +``` + +Sentinel greps — **must be non-zero**: +``` +grep -n "coco.use_context(IGNORE)" java_index_flow_lancedb.py # 3 sites +grep -n "_mega_cache" path_filtering.py # memo present +``` + +## Manual evidence (paste in PR description) + +Show the ignore object is built once: with `JAVA_CODEBASE_RAG_RUN_HEAVY=1`, run +the flow over a small corpus and log `id(ignore)` per file (temporary instrumentation), +confirm a single object id across all files. Then confirm a micro-benchmark of +`is_ignored` over N files drops with the `_mega` cache (same-directory files hit +the cache). + +## Definition of Done + +- [ ] `IGNORE` ContextKey defined (version-detected) + provided once in `coco_lifespan`. +- [ ] The three `process_*_file` consume it; no per-file `LayeredIgnore(project_root)`. +- [ ] `_mega` memoized by directory; `is_ignored`/`diagnose_dict` results unchanged. +- [ ] `test_is_ignored_mega_caches_by_directory`, + `test_layered_ignore_memo_preserves_decisions`, + `test_layered_ignore_provided_once_per_flow` pass. +- [ ] Existing ignore + vectors-progress tests pass unchanged. +- [ ] Sentinel greps pass. +- [ ] `.venv/bin/ruff check .` clean. +- [ ] PR title: `perf(vectors): lifespan-cached LayeredIgnore + is_ignored memo (PR-P3)`. +```` + +--- + +## Notes for the orchestrator + +- **Landing order:** PR-P1 → PR-P2 (P2 depends on P1's `_bulk_copy` + + `_REL_*_COLUMNS`). PR-P3 is independent and can go first, last, or between. +- **Review between PRs:** request code review after each PR lands (see + `superpowers:requesting-code-review`) — the equivalence harness is the gate + for P1/P2; the memo-parity test is the gate for P3. +- **Sentinel greps are binding:** a non-empty "must return zero" grep means scope + leaked; a empty "must be non-zero" grep means an over-deletion. Either blocks merge. diff --git a/plans/active/PLAN-INIT-INCREMENT-PERF.md b/plans/active/PLAN-INIT-INCREMENT-PERF.md new file mode 100644 index 00000000..60182a9e --- /dev/null +++ b/plans/active/PLAN-INIT-INCREMENT-PERF.md @@ -0,0 +1,432 @@ +# Plan: Faster init/increment — bulk graph writes + cached ignore + +Status: **active (planning)**. This plan implements +[`propose/active/INIT-INCREMENT-PERF-PROPOSE.md`](../../propose/active/INIT-INCREMENT-PERF-PROPOSE.md) +(see also PR #338). + +Depends on: proposal PR #338 should land first (or this plan tracks against the +proposal text directly). No code dependency on other in-flight work. + +## Goal + +- Cut `init` / `reprocess` wall-clock on a medium Java corpus from ~395s toward + ~120s by replacing the per-row graph write with bulk `COPY FROM` (PR-P1). +- Extend the bulk primitive to the incremental path so `increment` on large + change-sets is also fast (PR-P2). +- Remove the ~25s `LayeredIgnore` re-construction + `is_ignored` re-merge paid + per file in the cocoindex vectors phase (PR-P3). +- Preserve graph contents exactly: no ontology bump, no re-index, no query-result + change (proven by an equivalence harness). + +## Principles (do not relitigate in review) + +- **Byte-equivalent graph.** PR-P1/P2 change only the write mechanism; the graph + (node/edge rows, properties, `GraphMeta` counters) must be identical to today. + The equivalence harness is the merge gate — a failing equivalence test blocks + the PR. +- **Full rebuild path only in PR-P1.** The incremental path keeps its current + per-row writes until PR-P2. PR-P1 does not touch `increment`. +- **In-memory pyarrow `COPY FROM`** is the bulk mechanism (verified: the `ladybug` + wrapper forwards `COPY FROM` verbatim and accepts a pyarrow/pandas param — + `ladybug/connection.py:337/488`). Parquet-file staging is a fallback, not the + default. Do not propose CSV. +- **MPS device default is out of scope** — the flow already auto-selects MPS. Do + not add a device PR. +- **No new env vars, CLI flags, or public surface.** No compatibility shims. + +## PR breakdown - overview + +| PR | Scope | Ontology bump | Areas of concern | Test buckets | Independent of | +| --- | --- | --- | --- | --- | --- | +| **PR-P1** | Bulk `COPY FROM` for the full rebuild path (`init`/`reprocess`) in `build_ast_graph.py` | none | REL-table `COPY FROM` column order (FROM/TO first); LIST column arrow typing; CALLS dedup + `callee_declaring_role` materialized at staging; node-before-edge load order; `GraphMeta` bulk | equivalence + determinism + baseline + full `test_ast_graph_build.py` regression | — | +| **PR-P2** | Shared bulk primitive applied to the incremental `_delete_file_scope` → re-emit path | none | Preserve pass5/6 `MERGE (r:Route)` dedup (`build_ast_graph.py:3819-3821`); incremental delete-then-emit ordering; equivalence vs full rebuild | `test_incremental_graph.py` regression + new incremental-equivalence test | depends on **PR-P1** | +| **PR-P3** | Hoist `LayeredIgnore` to a cocoindex `ContextKey` + memoize `is_ignored`'s `_mega` | none | `ContextKey` lifespan scoping (build once, not per file); `_mega` memo correctness (mega depends on the file's directory only — match_file stays per-file); no change to any ignore *decision* | flow ignore tests + memo unit tests + heavy vectors-progress test | independent of PR-P1, PR-P2 | + +Landing order: **P1 → P2**; **P3** may land in any order (independent). + +## Resolved design decisions + +| Topic | Decision | +| --- | --- | +| Bulk mechanism | In-memory pyarrow: `conn.execute("COPY FROM $rows", {"rows": pa_table})`. Verified against `ladybug/connection.py:337` (`FROM $param`) and `:488` (pyarrow/pandas/polars accepted). | +| REL-table column rule | First two staged columns are the FROM/TO node primary keys (`Symbol.id` / `Route.id` / etc.). Exact column *naming* for REL `COPY FROM` is confirmed in PR-P1 step 1 (a 5-line spike); the contract below assumes positional FROM/TO. | +| `GraphMeta` MERGE (`build_ast_graph.py:3472-3473`) | Folded into PR-P1 bulk (it is in the full-rebuild path; the table is freshly created and empty, so `COPY` insert is correct). Resolves proposal Open Q1. | +| PR-P3 cache vehicle | cocoindex `ContextKey[LayeredIgnore]` (lifespan-scoped, built once). Resolves proposal Open Q2. | +| `is_ignored` memo | Cache `_mega(rel)` → `(mega, spec)` keyed by the file's project-relative **directory** (`Path(rel).parent.as_posix()`), because `_mega_build_for_rel` uses only `dir_parts = parts[:-1]`. `spec.match_file(rel)` stays per-file (cheap, filename-dependent). | + +--- + +# PR-P1 — Bulk `COPY FROM` for the full rebuild path + +**Goal:** replace the per-row `conn.execute` writes in `write_ladybug` (the +init/reprocess full-rebuild entry, `build_ast_graph.py:3893`) with bulk +in-memory-pyarrow `COPY FROM`. Graph contents stay byte-equivalent. + +## File-by-file changes + +### 1. `build_ast_graph.py` — bulk-write primitive + staging + +#### 1a. New helper `_bulk_copy` +Add near the other write helpers (after `_node_row`, ~`build_ast_graph.py:2994`): + +```python +import pyarrow as pa + +def _bulk_copy(conn: ladybug.Connection, table_name: str, columns: list[str], rows: list[dict]) -> None: + """Bulk-load rows into a (node or rel) table via in-memory pyarrow COPY FROM. + + `columns` fixes column order — for REL tables the first two MUST be the + FROM/TO node ids (kuzu requirement). Empty `rows` is a no-op. + """ + if not rows: + return + tbl = pa.Table.from_pylist(rows) # infers types from non-empty data + # from_pylist infers STRING / list correctly for non-empty rows; + # step-1 spike confirms LIST columns (modifiers/annotations/capabilities) + # infer as pa.list_(pa.string()). If any column is all-empty-list, pass an + # explicit pa.schema derived from the _SCHEMA_* strings instead. + conn.execute(f"COPY {table_name} FROM $rows", {"rows": tbl}) +``` + +Column-order constants (define next to the `_SCHEMA_*` strings, ~`:2812`), each +matching its `_SCHEMA_*` column order exactly — for REL tables the first two +entries are the endpoint ids: + +- `NODE_COLUMNS` = the `Symbol` property columns in `_SCHEMA_NODE` order. +- `ROUTE_COLUMNS`, `CLIENT_COLUMNS`, `PRODUCER_COLUMNS`, `GRAPHMETA_COLUMNS`, + `UNRESOLVED_CALL_SITE_COLUMNS` — same, from their `_SCHEMA_*`. +- For each REL table: `_REL_*_COLUMNS` = `["FROM", "TO", ]` + for `EXTENDS/IMPLEMENTS/INJECTS/DECLARES/OVERRIDES/CALLS/UNRESOLVED_AT/EXPOSES/ + DECLARES_CLIENT/DECLARES_PRODUCER/HTTP_CALLS/ASYNC_CALLS`. + +> **Step-1 spike (mandatory, first commit of PR-P1):** confirm (a) the exact +> REL `COPY FROM` column naming — build a 2-node + 1-edge toy, `COPY CALLS FROM +> $rows` with columns `["FROM","TO","source_file","resolved",...]`, assert the +> edge lands with correct endpoints; (b) `pa.Table.from_pylist` type inference +> for the `modifiers`/`annotations`/`capabilities` LIST columns. Record the +> working incantation in the helper docstring. If inference fails on empty +> lists, build an explicit `pa.schema` map keyed by table. + +#### 1b. Convert `_write_nodes` (full path) to bulk +`_write_nodes` (`:3096`) currently calls `_write_nodes_impl(..., symbol_query=_CREATE_SYMBOL)` +which loops `conn.execute(_CREATE_SYMBOL, _node_row(...))` per node. Replace the +full-path body: iterate packages / files / types / members exactly as +`_write_nodes_impl` does, but append each `_node_row(...)` to a list, then call +`_bulk_copy(conn, "Symbol", NODE_COLUMNS, node_rows)`. Keep the role/capability +resolution (`resolve_role_and_capabilities`) and `tables.type_role_by_node_id` +population identical (those are pure in-memory steps before staging). + +`_write_nodes_impl` + `_MERGE_SYMBOL` stay — they are the **incremental** path +(`_write_nodes_merge`, `:817`) used until PR-P2. `_CREATE_SYMBOL` becomes dead +(only `_write_nodes` used it) → delete it. + +#### 1c. Convert `_write_edges` to bulk +`_write_edges` (`:3244`) loops per edge, calling `conn.execute(_CREATE_EXT, +{...})` etc. with two dedup sets (`seen_calls` `:3282-3288`, `seen_ucs` +`:3317-3321`) and a `callee_declaring_role` lookup (`_callee_declaring_role_at_write`, +`:1647`/`:3302`). Restructure to accumulate **per-edge-type row lists**, applying +the *same* dedup and *same* `callee_declaring_role` materialization, then +`_bulk_copy` each REL table once: + +```python +extends_rows: list[dict] = [] +... +for ...: + extends_rows.append({"FROM": src_id, "TO": dst_id, "source_file": ..., "dst_name": ..., "dst_fqn": ..., "resolved": ...}) +... +calls_rows.append(...) # only when key not in seen_calls; include "callee_declaring_role" +... +_bulk_copy(conn, "EXTENDS", _REL_EXTENDS_COLUMNS, extends_rows) +_bulk_copy(conn, "CALLS", _REL_CALLS_COLUMNS, calls_rows) # CALLS_COLUMNS ends with callee_declaring_role +... +``` + +The dedup sets and `_callee_declaring_role_at_write` calls are computed +**before** appending (i.e. at staging), exactly preserving current semantics. +`_CREATE_EXT`/`_CREATE_IMPL`/`_CREATE_INJ`/`_CREATE_DECL`/`_CREATE_OVERRIDES`/ +`_CREATE_CALL`/`_CREATE_UNRESOLVED`/`_CREATE_UNRESOLVED_AT` become dead → delete. + +`_populate_declares_rows` (`:3189`) and `_populate_overrides_rows` (`:3206`) are +pure in-memory population — unchanged; they feed the DECLARES/OVERRIDES row lists. + +#### 1d. Convert `_write_routes_and_exposes` (and Client/Producer/HTTP_CALLS/ASYNC_CALLS) to bulk +Same pattern (`~`:3349-3408`): stage `Route`/`Client`/`Producer` node rows and +`EXPOSES`/`DECLARES_CLIENT`/`DECLARES_PRODUCER`/`HTTP_CALLS`/`ASYNC_CALLS` edge +rows (with FROM/TO ids), then `_bulk_copy` each. The pass5/6 `MERGE (r:Route)` +dedup (`:3819-3821`) is **incremental-only** and untouched in PR-P1; in the full +path routes are emitted once into a fresh table, so plain `COPY` insert is +correct. Delete the now-dead `_CREATE_ROUTE`/`_CREATE_EXPOSES`/`_CREATE_CLIENT`/ +`_CREATE_DECLARES_CLIENT`/`_CREATE_PRODUCER`/`_CREATE_DECLARES_PRODUCER`/ +`_CREATE_HTTP_CALL`/`_CREATE_ASYNC_CALL` strings. + +#### 1e. Convert `GraphMeta` write to bulk +The `MERGE (m:GraphMeta {key: $k})` loop (`:3472-3473`) becomes a single-row +`_bulk_copy(conn, "GraphMeta", GRAPHMETA_COLUMNS, [meta_row])` (table is freshly +created and empty in a full rebuild). Keep the row contents (ontology_version, +built_at, source_root, parse_errors, counts_json) identical. + +#### 1f. `write_ladybug` load order (node-before-edge) +`write_ladybug` (`:3893`) already calls `_write_nodes` → `_write_edges` → +`_write_routes_and_exposes` → GraphMeta. Preserve this order: all node tables +(`Symbol`, `Route`, `Client`, `Producer`, `UnresolvedCallSite`, `GraphMeta`) are +bulk-loaded before any REL table, so endpoint rows exist when kuzu validates REL +`COPY FROM`. Add an inline comment stating the ordering invariant. + +### 2. `tests/test_ast_graph_build.py` — equivalence harness + regression + +Add an equivalence harness and a committed baseline. The existing 26 tests in +this file (e.g. `test_schema_has_all_expected_tables`, +`test_graph_meta_present_and_versioned`, `test_each_edge_type_populated`, +`test_pass3_callee_declaring_role_bank_annotated_types`, +`test_pass3_unresolved_call_site_emitted`, `test_pass3_known_external_calls_preserved`) +must continue to pass unchanged — they ARE the regression net. + +Add a committed baseline artifact `tests/fixtures/graph_baseline_bank_chat.json` +(node count, per-type edge counts, `GraphMeta` counters, and N=3 sampled edge +property rows per type incl. `source_file` and CALLS `callee_declaring_role`). +Generated once during PR-P1 from the **last per-row build** before its removal, +committed, and regenerated only when `ontology_version` changes. + +### 3. `scripts/bench_init_graph_write.py` (new, dev-only) — benchmark +A small script that times `init` (or `build_ast_graph.py --source-root +`) on a medium corpus and prints the graph-write phase delta vs the +pre-PR baseline. Not shipped in the package; lives under `scripts/`. Documents +the measured speedup in the PR description. + +## Tests for PR-P1 + +1. `test_bulk_write_graph_matches_per_row_baseline` — build `tests/bank-chat-system` + via the bulk path, assert node count, per-type edge counts, `GraphMeta` + counters, and the sampled edge property rows equal `tests/fixtures/graph_baseline_bank_chat.json`. +2. `test_bulk_write_is_deterministic_double_build` — build bank-chat twice to two + DB paths via the bulk path, assert identical node count, per-type edge counts, + `GraphMeta` counters, and a query battery (`MATCH (s:Symbol) RETURN ...`, + `neighbors`-shaped reads). Models on existing + `test_29_determinism_pass4_route_ids` / `test_overrides_edge_set_deterministic_double_build`. +3. `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` — assert + CALLS rows are deduped by `(src,dst,argc,line)` and carry the correct + `callee_declaring_role` (reuse the `@Service` callee assertion from + `test_pass3_callee_declaring_role_bank_annotated_types` against a bulk build). +4. `test_bulk_write_empty_rel_table_is_noop` — a corpus with no `EXTENDS` edges + must not error (`_bulk_copy` no-ops on empty rows) and the table is empty. + +**Must-still-pass (regression, do not loosen):** the full `tests/test_ast_graph_build.py`, +`tests/test_bank_chat_brownfield_integration.py`, `tests/test_call_edges_e2e.py`, +and `tests/test_incremental_graph.py` suites (incremental path untouched). + +## Definition of done (PR-P1) + +- [ ] `_bulk_copy` helper added; step-1 spike result recorded in its docstring. +- [ ] `write_ladybug` full path writes all node tables then all REL tables via `_bulk_copy`; no per-row `CREATE` remains in the full path. +- [ ] Dead `_CREATE_*` strings + `_CREATE_SYMBOL` deleted. +- [ ] `test_bulk_write_graph_matches_per_row_baseline`, + `test_bulk_write_is_deterministic_double_build`, + `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role`, + `test_bulk_write_empty_rel_table_is_noop` added and pass. +- [ ] Full existing graph/edge/brownfield/incremental test suites pass unchanged. +- [ ] `.venv/bin/ruff check .` clean. +- [ ] Benchmark numbers (before/after graph-write phase) pasted in the PR description. +- [ ] PR title: `perf(graph): bulk COPY FROM for the full rebuild path (PR-P1)`. + +## Implementation step list + +| # | Step | File(s) | Done when | +| --- | --- | --- | --- | +| 1 | Spike: confirm REL `COPY FROM` column order + `from_pylist` LIST typing | `build_ast_graph.py` (throwaway) | toy 2-node+1-edge `COPY CALLS FROM $rows` lands with correct endpoints; result in `_bulk_copy` docstring | +| 2 | Add `_bulk_copy` + column-order constants | `build_ast_graph.py` | helper + `_REL_*_COLUMNS`/`*_COLUMNS` defined | +| 3 | Convert `_write_nodes` (full) to bulk; delete `_CREATE_SYMBOL` | `build_ast_graph.py` | node tables bulk-loaded; Symbol count matches baseline | +| 4 | Convert `_write_edges` to per-type staging + bulk; delete dead `_CREATE_*` edge strings | `build_ast_graph.py` | edge counts match baseline; dedup + callee_declaring_role preserved | +| 5 | Convert `_write_routes_and_exposes` (+ Client/Producer/HTTP_CALLS/ASYNC_CALLS) to bulk | `build_ast_graph.py` | route/client/producer/call-edge counts match baseline | +| 6 | Convert GraphMeta write to bulk | `build_ast_graph.py` | `test_graph_meta_present_and_versioned` passes | +| 7 | Generate + commit baseline fixture | `tests/fixtures/graph_baseline_bank_chat.json` | produced from last per-row build before removal | +| 8 | Add 4 equivalence/determinism tests | `tests/test_ast_graph_build.py` | all pass | +| 9 | Run full regression + ruff; capture benchmark | repo | suites green; numbers in PR description | + +--- + +# PR-P2 — Bulk write for the incremental path + +**Goal:** apply the PR-P1 bulk primitive to the incremental rebuild path so +`increment` on a large change-set is also fast. **Depends on PR-P1** (`_bulk_copy` ++ column constants + the staging pattern). + +## File-by-file changes + +### 1. `build_ast_graph.py` — incremental path bulk conversion +- `_write_nodes_merge` (`:817`, uses `_MERGE_SYMBOL`): the incremental path is + delete-then-insert (`_delete_file_scope`, `:673`, removes changed/dependent + files' nodes+edges first), so re-emit is into a cleaned scope → convert the + per-row `_MERGE_SYMBOL` loop to `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. + `_MERGE_SYMBOL` becomes dead → delete; `_write_nodes_impl` is removed if no + caller remains (it does not, after P1+P2). +- Incremental edge re-emit: apply the same per-type staging + `_bulk_copy` as + PR-P1's `_write_edges`, scoped to the re-emitted files. +- **Preserve the pass5/6 `MERGE (r:Route)` dedup** (`:3819-3821`): routes written + during the scoped step must still MERGE (not duplicate) against routes from the + global step. Keep those specific `MERGE` statements; only convert the + non-Route, non-dedup writes to bulk. Add a comment that this MERGE is + intentionally retained. +- `incremental_rebuild` (`:3535`): no algorithmic change; it calls into the + converted write helpers. + +### 2. `tests/test_incremental_graph.py` — incremental equivalence +The existing 28 tests (e.g. `test_incremental_single_file_change`, +`test_incremental_new_file`, `test_incremental_deleted_file`, +`test_incremental_phantom_nodes_preserved`, `test_incremental_dependent_expansion`, +`test_incremental_expansion_cap_fallback`, `test_incremental_crash_marker_*`, +`test_incremental_no_changes_is_noop`, `test_delete_file_scope_preserves_dependent_nodes`) +must pass unchanged. + +## Tests for PR-P2 + +1. `test_incremental_bulk_write_equivalent_to_full_rebuild` — make a single-file + change, run `increment` (bulk incremental), then build the same state via a + full rebuild (bulk full), assert identical node count, per-type edge counts, + and `GraphMeta` counters. +2. `test_incremental_route_merge_dedup_preserved` — a corpus where pass5/6 would + re-emit an existing route; assert no duplicate `Route` rows after `increment` + (the retained `MERGE (r:Route)` still dedups). + +**Must-still-pass:** full `tests/test_incremental_graph.py`. + +## Definition of done (PR-P2) + +- [ ] Incremental node/edge re-emit uses `_bulk_copy`; `_MERGE_SYMBOL` deleted. +- [ ] pass5/6 `MERGE (r:Route)` dedup retained and commented. +- [ ] `test_incremental_bulk_write_equivalent_to_full_rebuild`, + `test_incremental_route_merge_dedup_preserved` added and pass. +- [ ] Full `tests/test_incremental_graph.py` passes unchanged. +- [ ] `.venv/bin/ruff check .` clean. +- [ ] PR title: `perf(graph): bulk COPY FROM for the incremental path (PR-P2)`. + +## Implementation step list + +| # | Step | File(s) | Done when | +| --- | --- | --- | --- | +| 1 | Convert `_write_nodes_merge` to bulk; delete `_MERGE_SYMBOL` | `build_ast_graph.py` | incremental node re-emit via `_bulk_copy` | +| 2 | Convert incremental edge re-emit to per-type staging + bulk | `build_ast_graph.py` | incremental edge counts match full rebuild | +| 3 | Retain + comment the pass5/6 `MERGE (r:Route)` dedup | `build_ast_graph.py` | no duplicate Route rows after increment | +| 4 | Add incremental-equivalence + route-dedup tests | `tests/test_incremental_graph.py` | both pass; full suite green | +| 5 | ruff + full regression | repo | clean + green | + +--- + +# PR-P3 — Cached `LayeredIgnore` (+ `is_ignored` memo) as a `ContextKey` + +**Goal:** remove the ~25s paid per-flow-run from re-constructing +`LayeredIgnore(project_root)` per file and re-merging `_mega(spec)` per file. +**Independent** of PR-P1/P2. + +## File-by-file changes + +### 1. `java_index_flow_lancedb.py` — `ContextKey` for the ignore object +- Define `IGNORE = coco.ContextKey[LayeredIgnore]("java_lance_layered_ignore")` + alongside the existing `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` definitions + (`:60-72`), using the **same** `_ck_params` (`detect_change` vs `tracked`) + detection block so it works across cocoindex versions. +- In `coco_lifespan` (where `builder.provide(PROJECT_ROOT, root)` / + `builder.provide(EMBEDDER, embedder)` are called, `~`:287-306`), add + `builder.provide(IGNORE, LayeredIgnore(root))` — built **once** per flow run. +- In `process_java_file` (`:344`), `process_sql_file` (`:416`), + `process_yaml_file` (`:464`): replace + `if LayeredIgnore(project_root).is_ignored((project_root / file.file_path.path).resolve()):` + (`:351`/`:423`/`:471`) with + `ignore = coco.use_context(IGNORE)` then + `if ignore.is_ignored((project_root / file.file_path.path).resolve()):`. + `project_root` stays (still needed for `_parse_and_enrich_java` and path + resolution). + +### 2. `path_filtering.py` — memoize `_mega` by directory +- Add an instance cache `self._mega_cache: dict[str, tuple[list[str], GitIgnoreSpec, list]] = {}` + in `LayeredIgnore.__init__`. +- In `LayeredIgnore._mega` (`:334`): key on + `dir_rel = Path(rel_project).parent.as_posix()`. If present, return the cached + `(mega, spec, meta)`; else compute via `_mega_build_for_rel` + compile, store, + return. Correctness rests on `_mega_build_for_rel` reading only + `dir_parts = parts[:-1]` (verified: it never reads the filename). +- `is_ignored` (`:345`) and `diagnose_dict` (`:377`) keep calling `_mega(rel)` + unchanged — they transparently benefit from the cache. +- Add a module/env knob `LAYERED_IGNORE_MEGA_CACHE` only if a disable path is + needed for debugging — **default enabled**, no new public surface otherwise. + +## Tests for PR-P3 + +1. `test_is_ignored_mega_caches_by_directory` — build a `LayeredIgnore` over a + fixture with nested ignore files; call `is_ignored` for several files in the + same directory; assert `_mega` is computed once (spy on `_mega_build_for_rel` + or count `GitIgnoreSpec.from_lines` calls) and that decisions match the + uncached path. +2. `test_layered_ignore_memo_preserves_decisions` — for a corpus of paths + (including nested `.java-codebase-rag/ignore` + `.gitignore` with negations), + assert `LayeredIgnore(...).is_ignored(p)` is identical with and without the + memo cache (correctness invariant). +3. `test_layered_ignore_provided_once_per_flow` (HEAVY, + `JAVA_CODEBASE_RAG_RUN_HEAVY=1`) — run the real cocoindex flow over a small + corpus; assert the `LayeredIgnore` provided via `IGNORE` is a single instance + (identity check), not reconstructed per file. + +**Must-still-pass:** `tests/test_lancedb_e2e.py::test_lancedb_ignore_file_reduces_indexed_java_files`, +any `tests/test_path_filtering*` / ignore tests, and the heavy +`test_flow_emits_vectors_progress_per_file`. + +## Definition of done (PR-P3) + +- [ ] `IGNORE` `ContextKey` defined (version-detected) + provided once in `coco_lifespan`. +- [ ] The three `process_*_file` functions consume it; no per-file `LayeredIgnore(project_root)`. +- [ ] `_mega` memoized by directory; `is_ignored`/`diagnose_dict` unchanged in result. +- [ ] `test_is_ignored_mega_caches_by_directory`, + `test_layered_ignore_memo_preserves_decisions`, + `test_layered_ignore_provided_once_per_flow` added and pass. +- [ ] Existing ignore + vectors-progress tests pass unchanged. +- [ ] `.venv/bin/ruff check .` clean. +- [ ] PR title: `perf(vectors): lifespan-cached LayeredIgnore + is_ignored memo (PR-P3)`. + +## Implementation step list + +| # | Step | File(s) | Done when | +| --- | --- | --- | --- | +| 1 | Add `IGNORE` ContextKey (version-detected) + provide in lifespan | `java_index_flow_lancedb.py` | `IGNORE` resolvable in process_*_file | +| 2 | Switch the three `process_*_file` to `coco.use_context(IGNORE)` | `java_index_flow_lancedb.py` | no per-file construction | +| 3 | Add `_mega` dirname cache | `path_filtering.py` | repeated same-dir calls hit cache | +| 4 | Add memo + correctness + once-per-flow tests | `tests/` | all pass | +| 5 | ruff + regression (incl. heavy ignore/vectors tests) | repo | clean + green | + +--- + +# Cross-PR risks and mitigations + +| # | Risk | Severity | Mitigation | +| --- | --- | --- | --- | +| 1 | REL-table `COPY FROM` column ordering/naming differs from assumption | High | PR-P1 step-1 spike locks the exact incantation before any conversion; recorded in `_bulk_copy` docstring. | +| 2 | LIST columns (`modifiers`/`annotations`/`capabilities`) mis-type under `from_pylist` | Medium | Spike asserts LIST inference; explicit `pa.schema` fallback documented. Equivalence test catches any row diff. | +| 3 | CALLS dedup / `callee_declaring_role` drift when moving to staging | High | `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` + the sampled-edge baseline check; semantics computed identically, just earlier. | +| 4 | Bulk write changes observable query results | High | Equivalence + determinism tests + full existing graph/edge/brownfield suites are the gate. | +| 5 | PR-P2 breaks the pass5/6 Route dedup | Medium | `MERGE (r:Route)` retained by name; `test_incremental_route_merge_dedup_preserved` guards it. | +| 6 | PR-P3 `_mega` memo returns stale rules if dirParts logic changes | Low | `_mega_build_for_rel` reads only `dir_parts`; `test_layered_ignore_memo_preserves_decisions` asserts parity with uncached path. | +| 7 | `ContextKey` lifespan differs across cocoindex versions | Low | Reuse the existing `_ck_params` `detect_change`/`tracked` detection block used by `PROJECT_ROOT`. | + +# Out of scope + +- MPS embedding default (already auto-selected — see proposal Out of scope). +- ANN vector index (#337) and `watch` mode (#336). +- Replacing/restructuring the cocoindex flow; changing embedding model/dim. +- Parallelizing graph analysis passes (pass1–pass6). +- Converting the incremental path in PR-P1 (it is PR-P2). +- Parquet-file or CSV bulk paths (pyarrow in-memory only). + +# Whole-plan done definition + +1. `init`/`reprocess` graph-write phase on the medium corpus drops from ~321s to + tens of seconds (benchmark in PR-P1 description). +2. `increment` on a large change-set uses the bulk path (PR-P2). +3. The vectors phase pays no per-file `LayeredIgnore`/`_mega` cost (PR-P3). +4. No ontology bump (`ontology_version` stays 17); no re-index required; all + existing graph/edge/brownfield/incremental/ignore/vectors tests pass. +5. Proposal moved to `propose/completed/` and this plan to `plans/completed/` + once all three PRs land. + +# Tracking + +- `PR-P1`: _pending_ +- `PR-P2`: _pending_ (blocked by PR-P1) +- `PR-P3`: _pending_ From 3839f364d4cd41ae324b9bb2ef15f9a109ed1763 Mon Sep 17 00:00:00 2001 From: Dmitry Teryaev Date: Mon, 22 Jun 2026 00:06:44 +0300 Subject: [PATCH 2/2] docs(plans): apply review feedback to init/increment perf plan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 5-lens subagent review of the plan found the PR-P1/P2 boundary was architecturally wrong: the graph write helpers are SHARED between the full and incremental paths, so a "full-path-only" split is impossible. - Verified call graph: _write_edges/_write_routes_and_exposes/_write_nodes_impl/ _write_meta are each called by BOTH paths; _write_clients_producers_and_calls is incremental-only (global pass5/6). - Re-split by write-FUNCTION: PR-P1 = _bulk_copy + _write_edges (the ~250s prize, accelerates both paths); PR-P2 = _write_nodes_impl + _write_routes_and_exposes + _write_clients_producers_and_calls; PR-P3 = ignore cache (independent). - GraphMeta (_write_meta) left on MERGE (shared, one row) — reverses Open Q1. - Fixed all binding sentinel greps: PR-P1 zeros the edge _CREATE_* only; PR-P2 zeros node/route/client constants + _MERGE_SYMBOL only after both routes functions convert; PR-P3 sentinel narrowed to LayeredIgnore(project_root).is_ignored (the bare-constructor grep wrongly matched once-per-run sites :177/:569, which are correctly left alone). - Load-order §1f corrected (UnresolvedCallSite before UNRESOLVED_AT; Route/Client/Producer before their edges). Test files qualified (test_brownfield_routes / test_mcp_v2_compose / test_vectors_progress / test_path_filtering). PR-P2 tests placed in TestIncrementalOrchestrator. Baseline flagged as equivalence anchor, not production invariant. PR-P1 DoD lists the four test names. Co-Authored-By: Claude --- plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md | 309 +++++------ plans/active/PLAN-INIT-INCREMENT-PERF.md | 566 ++++++++++----------- 2 files changed, 433 insertions(+), 442 deletions(-) diff --git a/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md b/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md index c2e6e12d..d1c602fb 100644 --- a/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md +++ b/plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md @@ -3,12 +3,18 @@ Status: **active**. One self-contained prompt per PR. Copy the prompt verbatim into the agent, attach the files in its `@-files` block, and let it execute. +> **Revised after the PR #339 subagent review.** PRs are split by **write-function**, +> not by path — the graph write helpers are shared between the full and incremental +> paths, so converting one accelerates both. Sentinel greps were corrected (the +> PR-P3 "must return zero" grep previously over-matched once-per-run sites). + **Workflow per PR:** 1. Create the branch named in the prompt off the stated base. 2. Read the cited plan section in full **before** writing code. 3. Implement step-by-step; run the listed tests after each step. -4. Run the sentinel greps — every "must return zero" line must be empty. +4. Run the sentinel greps — every "must return zero" line must be empty, every + "must be non-zero" line must hit. 5. Paste the manual-evidence output into the PR description. 6. Open a PR with the exact title in the Definition of Done. @@ -18,13 +24,13 @@ into the agent, attach the files in its `@-files` block, and let it execute. - `server.py` is stdio — never write to stdout from anything reachable by a tool handler. - Do not add a cocoindex dependency outside `java_index_flow_lancedb.py`. - The plan is the source of truth — if this prompt and the plan disagree, the plan wins. -- Do not touch any file outside the prompt's `@-files` + the test files it names. If you think an adjacent file must change, **stop and ask** — don't ship it. +- Do not touch any file outside the prompt's `@-files` + the test files it names. If you think an adjacent file must change, **stop and ask**. - Do not loosen any existing test assertion to make it pass. - Breaking changes are allowed; no compatibility shims. --- -## PR-P1 — Bulk `COPY FROM` for the full rebuild path +## PR-P1 — Bulk `COPY FROM` for `_write_edges` (shared; the ~250s prize) **Branch:** `perf/bulk-graph-writes-p1` off `master`. **Base:** `master`. @@ -33,63 +39,62 @@ into the agent, attach the files in its `@-files` block, and let it execute. **Attach (`@-files`):** -- `@build_ast_graph.py` (the full-rebuild write path: `_write_nodes`, `_write_edges`, `_write_routes_and_exposes`, `write_ladybug`, `_node_row`, `_SCHEMA_*`, `_callee_declaring_role_at_write`, `_populate_declares_rows`, `_populate_overrides_rows`) -- `@propose/active/INIT-INCREMENT-PERF-PROPOSE.md` (design + staging invariants) -- `@tests/test_ast_graph_build.py` (existing regression net + where the new tests go) -- `@tests/_builders.py` (graph-build helpers: `build_ladybug_full_into`, etc.) +- `@build_ast_graph.py` (focus: `_write_edges:3244`, `_node_row:2994`, `_SCHEMA_*:2812-2940`, `_callee_declaring_role_at_write:1647`, `seen_calls:3282`, `seen_ucs:3317`, `_populate_declares_rows:3189`) +- `@propose/active/INIT-INCREMENT-PERF-PROPOSE.md` (design; on PR #338's branch — if absent, the staging invariants are inlined in the plan §PR-P1) +- `@tests/test_ast_graph_build.py` (regression net + where new tests go) +- `@tests/test_incremental_graph.py` (`_write_edges` is shared — incremental regression is binding) +- `@tests/_builders.py` (graph-build helpers) **Prompt:** ```` You are implementing PR-P1 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. -Read the **PR-P1** section of the plan in full before writing any code, plus the -proposal's "Staging invariants" and "Equivalence" paragraphs. The plan is the -source of truth — if this prompt and the plan disagree, the plan wins. +Read the **PR-P1** section in full. The plan wins if this prompt disagrees. -## Scope +KEY FACT: `_write_edges` (build_ast_graph.py:3244) is a SHARED helper called by +BOTH the full path (write_ladybug:3926) and the incremental path (:805). So +converting it accelerates both — there is no "full-only" edge conversion. PR-P1 +converts ONLY `_write_edges` (+ adds the `_bulk_copy` primitive). Nodes, routes, +clients/producers, and GraphMeta are PR-P2. -Implement PR-P1 exactly as specified: replace the per-row `conn.execute` writes -in the **full rebuild path** (`write_ladybug`) with bulk in-memory-pyarrow -`COPY FROM`. Concretely: +## Scope 1. **Step-1 spike (first commit):** confirm the exact REL `COPY FROM` column - order + `pa.Table.from_pylist` LIST typing with a throwaway 2-node + 1-edge + naming + `pa.Table.from_pylist` typing with a throwaway 2-Symbol + 1-CALLS toy. Record the working incantation in the `_bulk_copy` docstring. -2. Add `_bulk_copy(conn, table_name, columns, rows)` + the `*_COLUMNS` / - `_REL_*_COLUMNS` constants (column order matches `_SCHEMA_*`; REL tables list - FROM/TO first). -3. Convert `_write_nodes` (full path) to stage all node rows then - `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. Delete `_CREATE_SYMBOL`. -4. Convert `_write_edges` to per-edge-type row staging (applying the SAME +2. Add `_bulk_copy(conn, table_name, columns, rows)` + the `_REL_*_COLUMNS` / + column constants (REL tables list FROM/TO first; match `_SCHEMA_*` order). +3. Convert `_write_edges` to per-edge-type row staging (apply the SAME `seen_calls`/`seen_ucs` dedup and SAME `_callee_declaring_role_at_write` - lookup, at staging time) then `_bulk_copy` each REL table. Delete the dead - `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL/UNRESOLVED/UNRESOLVED_AT` strings. -5. Convert `_write_routes_and_exposes` (+ Client/Producer/HTTP_CALLS/ASYNC_CALLS) - to bulk; delete the dead `_CREATE_ROUTE/EXPOSES/CLIENT/DECLARES_CLIENT/ - PRODUCER/DECLARES_PRODUCER/HTTP_CALL/ASYNC_CALL` strings. -6. Convert the `GraphMeta` write (`:3472-3473`) to a single-row `_bulk_copy`. -7. Generate + commit `tests/fixtures/graph_baseline_bank_chat.json` from the - **last per-row build** before you remove the per-row path. -8. Add the four named tests. + lookup at staging, before appending), then `_bulk_copy` each REL table. + Bulk-load UnresolvedCallSite NODE rows before UNRESOLVED_AT edges (Symbol + nodes are already loaded by `_write_nodes`). +4. Delete the dead module constants `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL` + and the local `_CREATE_UNRESOLVED`/`_CREATE_UNRESOLVED_AT` (defined inside + `_write_edges` at :3307/:3313 — removed with the rewrite). +5. Generate + commit `tests/fixtures/graph_baseline_bank_chat.json` from the + last per-row `_write_edges` build before removal. +6. Add the four named tests. ## Out of scope (do NOT touch) -- The incremental path: `_write_nodes_merge`, `_MERGE_SYMBOL`, `_delete_file_scope`, - `incremental_rebuild`, and the pass5/6 `MERGE (r:Route)` (`:3819-3821`) stay - exactly as-is. PR-P1 is full-rebuild only. -- `java_index_flow_lancedb.py`, `path_filtering.py`, `server.py`, `search_lancedb.py`. -- Any schema (`_SCHEMA_*` DDL), ontology, or re-index change. -- CSV or Parquet-file staging (pyarrow in-memory only). -- Loosening any existing test. +- Node writes (`_write_nodes`, `_write_nodes_impl`, `_write_nodes_merge`, + `_CREATE_SYMBOL`, `_MERGE_SYMBOL`) — PR-P2. +- Routes/clients/producers/calls (`_write_routes_and_exposes`, + `_write_clients_producers_and_calls`, and their `_CREATE_*` constants + including `_CREATE_CLIENT`/`_CREATE_PRODUCER`/`_CREATE_ROUTE` etc.) — PR-P2. +- `_write_meta` / GraphMeta — leave the MERGE alone (PR-P2 also leaves it). +- `java_index_flow_lancedb.py`, `path_filtering.py`, `server.py`. +- Any schema/ontology/re-index change. CSV or Parquet-file staging. If you find yourself wanting to touch any of the above, **stop and ask**. ## Deliverables -1. `_bulk_copy` helper + column-order constants in `build_ast_graph.py`. -2. `write_ladybug` full path bulk-loads all node tables then all REL tables. -3. Dead `_CREATE_*` / `_CREATE_SYMBOL` strings removed. +1. `_bulk_copy` helper + column-order constants. +2. `_write_edges` stages per-type rows and bulk-loads (CALLS dedup + callee_declaring_role at staging; UnresolvedCallSite before UNRESOLVED_AT). +3. Dead `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL` + locals removed. 4. `tests/fixtures/graph_baseline_bank_chat.json` committed. 5. Four new tests in `tests/test_ast_graph_build.py`. @@ -97,93 +102,99 @@ If you find yourself wanting to touch any of the above, **stop and ask**. Run, all must pass: ``` -.venv/bin/python -m pytest tests/test_ast_graph_build.py -v -.venv/bin/python -m pytest tests/test_incremental_graph.py tests/test_bank_chat_brownfield_integration.py tests/test_call_edges_e2e.py -q +.venv/bin/python -m pytest tests/test_ast_graph_build.py tests/test_incremental_graph.py -v +.venv/bin/python -m pytest tests/test_bank_chat_brownfield_integration.py tests/test_call_edges_e2e.py -q .venv/bin/ruff check . ``` -Sentinel greps — **must all return zero** (no output): +Sentinel greps — **must return zero**: ``` -grep -n "_CREATE_SYMBOL\b" build_ast_graph.py -grep -nE "conn\.execute\(_CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|ROUTE|EXPOSES|CLIENT|DECLARES_CLIENT|PRODUCER|DECLARES_PRODUCER|HTTP_CALL|ASYNC_CALL)" build_ast_graph.py +grep -nE "_CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)\b" build_ast_graph.py ``` -Sentinel greps — **must be non-zero** (guards against over-deletion): +Sentinel greps — **must be non-zero** (guards against over-deletion; these belong to PR-P2 / are retained): ``` -grep -n "_MERGE_SYMBOL\b" build_ast_graph.py # incremental path, kept -grep -n "MERGE (r:Route" build_ast_graph.py # pass5/6 dedup, kept -grep -n "COPY .*FROM \$rows" build_ast_graph.py # bulk path present +grep -n "_MERGE_SYMBOL\b" build_ast_graph.py # node upsert, kept until PR-P2 +grep -n "_CREATE_CLIENT\b" build_ast_graph.py # routes/clients, PR-P2 +grep -n "MERGE (r:Route" build_ast_graph.py # pass5/6 dedup, kept +grep -n "COPY .*FROM \$rows" build_ast_graph.py # bulk path present ``` ## Manual evidence (paste in PR description) -Build the fixture via the bulk path and inspect meta: ```bash rm -rf /tmp/p1 && .venv/bin/python build_ast_graph.py \ --source-root tests/bank-chat-system \ --ladybug-path /tmp/p1/code_graph.lbug --verbose .venv/bin/java-codebase-rag meta --source-root tests/bank-chat-system --index-dir /tmp/p1 ``` -Expected: meta `counts_json` and node/edge counts **identical** to a pre-PR -per-row build (paste both side by side). Note the graph-write phase timing from -the `JCIRAG_PROGRESS` lines vs the pre-PR baseline. +Expected: meta `counts_json` + node/edge counts identical to a pre-PR per-row +build (paste both). Note the graph-write phase timing from `JCIRAG_PROGRESS` +lines vs the pre-PR baseline. ## Definition of Done - [ ] Step-1 spike result recorded in `_bulk_copy` docstring. -- [ ] Full path bulk-loads nodes-then-edges; no per-row `CREATE` remains in the full path. -- [ ] `_CREATE_SYMBOL` and the dead `_CREATE_*` edge/node strings deleted. -- [ ] All four new tests pass; full regression suites pass unchanged. -- [ ] All "must return zero" sentinel greps are empty; all "non-zero" greps hit. -- [ ] `.venv/bin/ruff check .` clean. -- [ ] Benchmark (before/after graph-write phase) pasted in the PR description. -- [ ] PR title: `perf(graph): bulk COPY FROM for the full rebuild path (PR-P1)`. +- [ ] `_write_edges` stages per-type rows (CALLS dedup + callee_declaring_role at staging); UnresolvedCallSite bulk-loaded before UNRESOLVED_AT. +- [ ] `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL` + local `_CREATE_UNRESOLVED/_UNRESOLVED_AT` deleted. +- [ ] `test_bulk_write_edges_match_per_row_baseline`, `test_bulk_write_is_deterministic_double_build`, `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role`, `test_bulk_write_empty_rel_table_is_noop` pass. +- [ ] Full `test_ast_graph_build.py` + `test_incremental_graph.py` pass unchanged. +- [ ] Sentinel greps: zero where required, non-zero where required. +- [ ] `.venv/bin/ruff check .` clean; benchmark in PR description. +- [ ] PR title: `perf(graph): bulk COPY FROM for _write_edges (PR-P1)`. ```` --- -## PR-P2 — Bulk write for the incremental path +## PR-P2 — Bulk write for nodes + routes/clients/producers/calls -**Branch:** `perf/bulk-graph-writes-p2` off PR-P1's branch (or `master` if PR-P1 has merged). +**Branch:** `perf/bulk-graph-writes-p2` off PR-P1's branch (or `master` if PR-P1 merged). **Base:** PR-P1 merged. **Plan section:** `plans/active/PLAN-INIT-INCREMENT-PERF.md` § PR-P2 (read this first). -**Estimated diff size:** small-medium. +**Estimated diff size:** medium. **Attach (`@-files`):** -- `@build_ast_graph.py` (`_write_nodes_merge`, `_MERGE_SYMBOL`, `_delete_file_scope`, `incremental_rebuild`, `_bulk_copy` + `_REL_*_COLUMNS` from PR-P1) -- `@tests/test_incremental_graph.py` (regression net) +- `@build_ast_graph.py` (`_write_nodes_impl:3029`, `_write_nodes:3096`, `_write_nodes_merge:817`, `_CREATE_SYMBOL`, `_MERGE_SYMBOL`, `_write_routes_and_exposes:3338`, `_write_clients_producers_and_calls:3810`, the Route `MERGE (r:Route` at `:3819`, `_write_meta:3421`, `_bulk_copy` + `_REL_*_COLUMNS` from PR-P1) +- `@tests/test_incremental_graph.py` (regression; add to `TestIncrementalOrchestrator`) **Prompt:** ```` You are implementing PR-P2 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. - -Read the **PR-P2** section in full. It depends on PR-P1's `_bulk_copy` primitive -and `_REL_*_COLUMNS` constants, which are already merged. The plan wins if this -prompt disagrees. +Read the **PR-P2** section in full. It reuses PR-P1's `_bulk_copy` + +`_REL_*_COLUMNS`. The plan wins. ## Scope -Apply PR-P1's bulk primitive to the incremental path: - -1. Convert `_write_nodes_merge` (`:817`, uses `_MERGE_SYMBOL`) to stage rows then - `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. The incremental path is - delete-then-insert (`_delete_file_scope` already removed the old scope), so - plain `COPY` insert into the cleaned scope is correct. Delete `_MERGE_SYMBOL`; - remove `_write_nodes_impl` if it has no remaining caller. -2. Convert the incremental edge re-emit to per-type staging + `_bulk_copy`, - scoped to the re-emitted files (same pattern as PR-P1's `_write_edges`). -3. **Retain the pass5/6 `MERGE (r:Route)` dedup** (`:3819-3821`) verbatim and add - a one-line comment explaining it is intentionally kept (routes written during - the scoped step must MERGE, not duplicate, against the global step). Do NOT - bulk-convert the Route writes that this MERGE guards. -4. Add the two named tests. +1. Convert `_write_nodes_impl` (:3029, the shared workhorse called by + `_write_nodes` full + `_write_nodes_merge` incremental) to stage Symbol rows + then `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. Delete `_CREATE_SYMBOL` + and `_MERGE_SYMBOL` (both dead once the workhorse is bulk). Do the existing + `resolve_role_and_capabilities` + `type_role_by_node_id` population before + staging, unchanged. +2. Convert `_write_routes_and_exposes` (:3338, shared) to per-table staging + + `_bulk_copy` for Route/EXPOSES/Client/Producer/DECLARES_CLIENT/DECLARES_PRODUCER/ + HTTP_CALLS/ASYNC_CALLS (keep the existing `_file_by_node_id`/`_file_by_client_id`/ + `_file_by_producer_id` source_file resolution). Bulk-load Route/Client/Producer + NODES before the EXPOSES/DECLARES_*/HTTP_CALLS/ASYNC_CALLS edges. Delete + `_CREATE_ROUTE`/`_CREATE_EXPOSES`. +3. Convert `_write_clients_producers_and_calls` (:3810, incremental-only global + pass5/6) Client/Producer/edge writes to per-type staging + `_bulk_copy` (keep + the `member_by_id`/`client_by_id`/`producer_by_id` resolution). **Retain the + `MERGE (r:Route {id:$id}) …` dedup (:3819-3828) verbatim** + add a one-line + comment it is intentionally kept. Now delete the 6 shared constants — + `_CREATE_CLIENT`/`_CREATE_PRODUCER`/`_CREATE_DECLARES_CLIENT`/ + `_CREATE_DECLARES_PRODUCER`/`_CREATE_HTTP_CALL`/`_CREATE_ASYNC_CALL` — which + are dead only after BOTH routes/exposes and clients_producers functions convert. +4. Leave `_write_meta` (:3421) and its `MERGE (m:GraphMeta …)` UNTOUCHED. +5. Add the two named tests as methods of `TestIncrementalOrchestrator` in + `tests/test_incremental_graph.py`. ## Out of scope (do NOT touch) -- The full-rebuild path (already bulk in PR-P1). -- Route dedup semantics — keep the `MERGE (r:Route)` exactly. +- `_write_edges` (done in PR-P1). +- `_write_meta` / GraphMeta MERGE — leave it. - `_delete_file_scope`, `incremental_rebuild` algorithm, dependent-expansion, crash-marker logic. - Anything outside `build_ast_graph.py` + `tests/test_incremental_graph.py`. @@ -192,28 +203,30 @@ If you find yourself wanting to touch any of the above, **stop and ask**. ## Deliverables -1. Incremental node re-emit via `_bulk_copy`; `_MERGE_SYMBOL` deleted. -2. Incremental edge re-emit via per-type `_bulk_copy`. -3. `MERGE (r:Route)` retained + commented. -4. Two new tests in `tests/test_incremental_graph.py`. +1. `_write_nodes_impl` bulk; `_CREATE_SYMBOL` + `_MERGE_SYMBOL` deleted. +2. `_write_routes_and_exposes` bulk; `_CREATE_ROUTE`/`_CREATE_EXPOSES` deleted. +3. `_write_clients_producers_and_calls` Client/Producer/edges bulk; `MERGE (r:Route)` retained + commented; 6 shared `_CREATE_*` deleted. +4. `_write_meta` untouched. +5. Two new tests in `TestIncrementalOrchestrator`. ## Tests Run, all must pass: ``` -.venv/bin/python -m pytest tests/test_incremental_graph.py -v -.venv/bin/python -m pytest tests/test_ast_graph_build.py -q +.venv/bin/python -m pytest tests/test_incremental_graph.py tests/test_ast_graph_build.py -v .venv/bin/ruff check . ``` Sentinel greps — **must return zero**: ``` -grep -n "_MERGE_SYMBOL\b" build_ast_graph.py +grep -nE "_CREATE_(SYMBOL|ROUTE|EXPOSES|CLIENT|PRODUCER|DECLARES_CLIENT|DECLARES_PRODUCER|HTTP_CALL|ASYNC_CALL)\b" build_ast_graph.py +grep -nE "_MERGE_SYMBOL\b" build_ast_graph.py ``` -Sentinel greps — **must be non-zero** (Route dedup still present; bulk still used): +Sentinel greps — **must be non-zero** (Route dedup + GraphMeta MERGE retained; bulk present): ``` grep -n "MERGE (r:Route" build_ast_graph.py +grep -n "MERGE (m:GraphMeta" build_ast_graph.py grep -n "COPY .*FROM \$rows" build_ast_graph.py ``` @@ -221,22 +234,21 @@ grep -n "COPY .*FROM \$rows" build_ast_graph.py Single-file change equivalence: ```bash -# set up an index, touch one file, increment, then full-rebuild the same state -# and diff the graphs (node count, per-type edge counts, GraphMeta counters). +# set up an index, touch one file, increment (bulk), then full-rebuild the same +# state (bulk) and diff graphs (node count, per-type edge counts, GraphMeta). ``` -Expected: incremental(bulk) == full-rebuild(bulk) for that state. Paste the -side-by-side counts. +Expected: incremental(bulk) == full-rebuild(bulk) for that state. Paste side-by-side counts. ## Definition of Done -- [ ] Incremental node/edge re-emit uses `_bulk_copy`; `_MERGE_SYMBOL` deleted. -- [ ] `MERGE (r:Route)` retained and commented. -- [ ] `test_incremental_bulk_write_equivalent_to_full_rebuild`, - `test_incremental_route_merge_dedup_preserved` pass; full - `tests/test_incremental_graph.py` passes unchanged. -- [ ] Sentinel greps pass (zero where required, non-zero where required). +- [ ] `_write_nodes_impl` bulk; `_CREATE_SYMBOL` + `_MERGE_SYMBOL` deleted. +- [ ] `_write_routes_and_exposes` bulk (Route/Client/Producer before edges); `_CREATE_ROUTE`/`_CREATE_EXPOSES` deleted. +- [ ] `_write_clients_producers_and_calls` Client/Producer/edges bulk; `MERGE (r:Route)` retained + commented; 6 shared `_CREATE_*` deleted. +- [ ] `_write_meta` untouched. +- [ ] `test_incremental_bulk_write_equivalent_to_full_rebuild`, `test_incremental_route_merge_dedup_preserved` (both in `TestIncrementalOrchestrator`) pass; full `test_incremental_graph.py` + `test_ast_graph_build.py` green. +- [ ] Sentinel greps pass. - [ ] `.venv/bin/ruff check .` clean. -- [ ] PR title: `perf(graph): bulk COPY FROM for the incremental path (PR-P2)`. +- [ ] PR title: `perf(graph): bulk COPY FROM for nodes, routes, clients/producers (PR-P2)`. ```` --- @@ -250,58 +262,63 @@ side-by-side counts. **Attach (`@-files`):** -- `@java_index_flow_lancedb.py` (`ContextKey` defs `:60-72`, `coco_lifespan` provide sites `~:287-306`, `process_java_file`/`process_sql_file`/`process_yaml_file` `:344`/`:416`/`:464`) -- `@path_filtering.py` (`LayeredIgnore`, `_mega`, `_mega_build_for_rel`, `is_ignored`) -- `@tests/test_lancedb_e2e.py` (ignore test to keep green) +- `@java_index_flow_lancedb.py` (`ContextKey` defs `:60-72`, `coco_lifespan` provide sites `:287-306`, `process_java_file:345`/`process_sql_file:417`/`process_yaml_file:465`) +- `@path_filtering.py` (`LayeredIgnore`, `_mega:334`, `_mega_build_for_rel:193`, `is_ignored:345`, `diagnose_dict:377`) +- `@tests/test_path_filtering.py` (where the two memo unit tests go) +- `@tests/test_lancedb_e2e.py` (HEAVY once-per-flow test) **Prompt:** ```` You are implementing PR-P3 from `plans/active/PLAN-INIT-INCREMENT-PERF.md`. -Read the **PR-P3** section in full. It is independent of PR-P1/P2. The plan wins -if this prompt disagrees. +Read the **PR-P3** section in full. Independent of PR-P1/P2. The plan wins. + +KEY FACT: `LayeredIgnore(project_root)` appears at FIVE sites in +java_index_flow_lancedb.py — :177, :351, :423, :471, :569. PR-P3 converts ONLY +the three `process_*_file` sites (:351/:423/:471). The other two (:177 in +`_approximate_vectors_total`, :569 in the app_main pre-walk) call +`cocoindex_excluded_patterns()` ONCE PER RUN — leave them alone. ## Scope 1. Define `IGNORE = coco.ContextKey[LayeredIgnore]("java_lance_layered_ignore")` - alongside `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` (`:60-72`), reusing the SAME + alongside `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` (:60-72), reusing the SAME `_ck_params` (`detect_change` vs `tracked`) detection block. -2. In `coco_lifespan`, add `builder.provide(IGNORE, LayeredIgnore(root))` next to - the other `builder.provide(...)` calls — built **once** per flow run. -3. In `process_java_file` (`:344`), `process_sql_file` (`:416`), `process_yaml_file` - (`:464`): add `ignore = coco.use_context(IGNORE)` and replace +2. In `coco_lifespan` (:287-306), add `builder.provide(IGNORE, LayeredIgnore(root))` + — built ONCE per flow run. +3. In `process_java_file`/`process_sql_file`/`process_yaml_file`: add + `ignore = coco.use_context(IGNORE)` and replace `LayeredIgnore(project_root).is_ignored((project_root / file.file_path.path).resolve())` with `ignore.is_ignored((project_root / file.file_path.path).resolve())`. - Keep `project_root` (still used for path resolution and `_parse_and_enrich_java`). + Keep `project_root`. DO NOT touch :177 or :569. 4. In `path_filtering.py` `LayeredIgnore`: add `self._mega_cache` in `__init__` - and memoize `_mega(rel)` keyed by `Path(rel_project).parent.as_posix()` (mega - depends only on the directory — `_mega_build_for_rel` reads `dir_parts` only). - `is_ignored`/`diagnose_dict` call `_mega` unchanged and benefit transparently. -5. Add the three named tests. + and memoize `_mega(rel)` keyed by `Path(rel_project).parent.as_posix()` + (`_mega_build_for_rel` reads only `dir_parts`, so this is correct). +5. Add the three named tests in the right files. ## Out of scope (do NOT touch) -- `build_ast_graph.py` and the graph write path (PR-P1/P2 own that). +- `build_ast_graph.py` and the graph write path (PR-P1/P2). - The ignore *decision* logic (`_mega_build_for_rel`, `_winning_row`, negation - scanning) — only memoize, do not alter semantics. -- Any schema, ontology, or re-index change. -- Loosening any existing test. + scanning) — only memoize. +- Sites :177 and :569. +- Any schema/ontology/re-index change. Loosening any existing test. If you find yourself wanting to touch any of the above, **stop and ask**. ## Deliverables -1. `IGNORE` ContextKey defined + provided once in `coco_lifespan`. -2. The three `process_*_file` functions consume it (no per-file construction). +1. `IGNORE` ContextKey (version-detected) + provided once in `coco_lifespan`. +2. The three `process_*_file` consume it; :177/:569 untouched. 3. `_mega` memoized by directory in `LayeredIgnore`. -4. Three new tests. +4. Three tests: two in `tests/test_path_filtering.py`, one (HEAVY) in `tests/test_lancedb_e2e.py`. ## Tests Run, all must pass: ``` -.venv/bin/python -m pytest tests/test_lancedb_e2e.py -q +.venv/bin/python -m pytest tests/test_path_filtering.py tests/test_lancedb_e2e.py -q .venv/bin/python -m pytest tests -q -k "ignore or path_filter or vectors_progress" .venv/bin/ruff check . ``` @@ -310,9 +327,9 @@ Heavy (only if you can run cocoindex e2e locally): JAVA_CODEBASE_RAG_RUN_HEAVY=1 .venv/bin/python -m pytest tests/test_lancedb_e2e.py -q ``` -Sentinel greps — **must return zero**: +Sentinel greps — **must return zero** (matches ONLY the 3 process sites; :177/:569 use the bare constructor + `cocoindex_excluded_patterns`, not `.is_ignored`): ``` -grep -nE "LayeredIgnore\(project_root\)" java_index_flow_lancedb.py +grep -nE "LayeredIgnore\(project_root\)\.is_ignored" java_index_flow_lancedb.py ``` Sentinel greps — **must be non-zero**: @@ -323,20 +340,17 @@ grep -n "_mega_cache" path_filtering.py # memo present ## Manual evidence (paste in PR description) -Show the ignore object is built once: with `JAVA_CODEBASE_RAG_RUN_HEAVY=1`, run -the flow over a small corpus and log `id(ignore)` per file (temporary instrumentation), -confirm a single object id across all files. Then confirm a micro-benchmark of -`is_ignored` over N files drops with the `_mega` cache (same-directory files hit -the cache). +With `JAVA_CODEBASE_RAG_RUN_HEAVY=1`, run the flow over a small corpus and log +`id(ignore)` per file (temporary instrumentation) — confirm a single object id +across all files. Then micro-benchmark `is_ignored` over N files: same-directory +files hit the `_mega` cache. ## Definition of Done -- [ ] `IGNORE` ContextKey defined (version-detected) + provided once in `coco_lifespan`. -- [ ] The three `process_*_file` consume it; no per-file `LayeredIgnore(project_root)`. +- [ ] `IGNORE` ContextKey (version-detected) + provided once in `coco_lifespan`. +- [ ] The three `process_*_file` consume it; :177/:569 untouched. - [ ] `_mega` memoized by directory; `is_ignored`/`diagnose_dict` results unchanged. -- [ ] `test_is_ignored_mega_caches_by_directory`, - `test_layered_ignore_memo_preserves_decisions`, - `test_layered_ignore_provided_once_per_flow` pass. +- [ ] `test_is_ignored_mega_caches_by_directory`, `test_layered_ignore_memo_preserves_decisions` (in `tests/test_path_filtering.py`), `test_layered_ignore_provided_once_per_flow` (in `tests/test_lancedb_e2e.py`, HEAVY) pass. - [ ] Existing ignore + vectors-progress tests pass unchanged. - [ ] Sentinel greps pass. - [ ] `.venv/bin/ruff check .` clean. @@ -347,10 +361,13 @@ the cache). ## Notes for the orchestrator -- **Landing order:** PR-P1 → PR-P2 (P2 depends on P1's `_bulk_copy` + - `_REL_*_COLUMNS`). PR-P3 is independent and can go first, last, or between. -- **Review between PRs:** request code review after each PR lands (see - `superpowers:requesting-code-review`) — the equivalence harness is the gate - for P1/P2; the memo-parity test is the gate for P3. -- **Sentinel greps are binding:** a non-empty "must return zero" grep means scope - leaked; a empty "must be non-zero" grep means an over-deletion. Either blocks merge. +- **Landing order:** PR-P1 → PR-P2 (P2 needs `_bulk_copy` + `_REL_*_COLUMNS`). + PR-P3 is independent (can go first, last, or between). +- **Shared-helper awareness:** `_write_edges`, `_write_routes_and_exposes`, + `_write_nodes_impl`, `_write_meta` are each called by BOTH paths. Converting + one accelerates both — so `test_incremental_graph.py` is a binding gate for + PR-P1 and PR-P2, not just PR-P2. +- **Review between PRs** (`superpowers:requesting-code-review`): the equivalence + harness gates P1/P2; the memo-parity test gates P3. +- **Sentinel greps are binding:** a non-empty "must return zero" grep = scope + leak; an empty "must be non-zero" grep = over-deletion. Either blocks merge. diff --git a/plans/active/PLAN-INIT-INCREMENT-PERF.md b/plans/active/PLAN-INIT-INCREMENT-PERF.md index 60182a9e..b6cff842 100644 --- a/plans/active/PLAN-INIT-INCREMENT-PERF.md +++ b/plans/active/PLAN-INIT-INCREMENT-PERF.md @@ -1,46 +1,64 @@ # Plan: Faster init/increment — bulk graph writes + cached ignore Status: **active (planning)**. This plan implements -[`propose/active/INIT-INCREMENT-PERF-PROPOSE.md`](../../propose/active/INIT-INCREMENT-PERF-PROPOSE.md) -(see also PR #338). - -Depends on: proposal PR #338 should land first (or this plan tracks against the -proposal text directly). No code dependency on other in-flight work. +[`propose/active/INIT-INCREMENT-PERF-PROPOSE.md`](../../propose/active/INIT-INCREMENT-PERF-PROPOSE.md). +The proposal lands via PR #338; this plan lands via PR #339 and **stacks behind +#338** (the proposal file is on #338's branch until it merges). The staging +invariants are inlined below so this plan is self-contained if #338 has not +merged yet. + +Depends on: PR #338 (proposal) — non-blocking for reading this plan (key +invariants inlined), but the proposal should merge first. + +> **Revised after a 5-lens subagent review (PR #339 thread).** The original +> draft assumed `_write_edges` / `_write_routes_and_exposes` / `_write_meta` +> were full-rebuild-only and split PRs by *path*. They are **shared by both +> paths** (verified: `_write_edges` `build_ast_graph.py:3244` is called at +> `:3926` full + `:805` incremental; `_write_routes_and_exposes` `:3338` at +> `:3930` full + `:811` incremental; `_write_meta` `:3421` at `:3933` full + +> `:3733` incremental). PRs are now split by **write-function**, not by path, +> and GraphMeta is left on MERGE (not bulk-converted). ## Goal -- Cut `init` / `reprocess` wall-clock on a medium Java corpus from ~395s toward - ~120s by replacing the per-row graph write with bulk `COPY FROM` (PR-P1). -- Extend the bulk primitive to the incremental path so `increment` on large - change-sets is also fast (PR-P2). +- Cut `init` / `reprocess` / `increment` graph-write wall-clock by replacing + per-row `conn.execute` writes with bulk `COPY FROM` — the ~81% init lever. - Remove the ~25s `LayeredIgnore` re-construction + `is_ignored` re-merge paid - per file in the cocoindex vectors phase (PR-P3). -- Preserve graph contents exactly: no ontology bump, no re-index, no query-result - change (proven by an equivalence harness). + per file in the cocoindex vectors phase. +- Preserve graph contents exactly: no ontology bump, no re-index, no + query-result change (proven by an equivalence harness). ## Principles (do not relitigate in review) -- **Byte-equivalent graph.** PR-P1/P2 change only the write mechanism; the graph - (node/edge rows, properties, `GraphMeta` counters) must be identical to today. - The equivalence harness is the merge gate — a failing equivalence test blocks - the PR. -- **Full rebuild path only in PR-P1.** The incremental path keeps its current - per-row writes until PR-P2. PR-P1 does not touch `increment`. -- **In-memory pyarrow `COPY FROM`** is the bulk mechanism (verified: the `ladybug` - wrapper forwards `COPY FROM` verbatim and accepts a pyarrow/pandas param — - `ladybug/connection.py:337/488`). Parquet-file staging is a fallback, not the +- **Byte-equivalent graph.** Every PR changes only the write mechanism; the + graph (node/edge rows, properties, `GraphMeta` counters) must be identical to + today. The equivalence harness + the full `test_incremental_graph.py` suite + are the merge gate. +- **Split by write-function, not by path.** The graph write helpers are shared + between the full path (`write_ladybug:3893`) and the incremental path + (`incremental_rebuild:3535`): `_write_edges`, `_write_routes_and_exposes`, + `_write_nodes_impl`, `_write_meta` are each called by BOTH. Converting a + shared helper accelerates **both** paths at once — there is no "full-only" + conversion for edges/routes. `_write_clients_producers_and_calls:3810` is + **incremental-only** (the global pass5/6 step; sole caller `:3716`). +- **GraphMeta stays on MERGE.** `_write_meta:3421` is shared and recomputes + counters before a single `MERGE (m:GraphMeta {key:$k})` (`:3472`). It is one + row and not worth the risk — **do not bulk-convert it**. (This reverses the + proposal's Open Q1 recommendation.) +- **In-memory pyarrow `COPY FROM`** is the bulk mechanism (verified: the + `ladybug` wrapper forwards `COPY FROM` and accepts a pyarrow param — + `ladybug/connection.py:337`/`:488`). Parquet-file is a fallback, not the default. Do not propose CSV. -- **MPS device default is out of scope** — the flow already auto-selects MPS. Do - not add a device PR. +- **MPS device default is out of scope** — the flow already auto-selects MPS. - **No new env vars, CLI flags, or public surface.** No compatibility shims. ## PR breakdown - overview -| PR | Scope | Ontology bump | Areas of concern | Test buckets | Independent of | +| PR | Scope (write-functions converted) | Ontology bump | Areas of concern | Test buckets | Independent of | | --- | --- | --- | --- | --- | --- | -| **PR-P1** | Bulk `COPY FROM` for the full rebuild path (`init`/`reprocess`) in `build_ast_graph.py` | none | REL-table `COPY FROM` column order (FROM/TO first); LIST column arrow typing; CALLS dedup + `callee_declaring_role` materialized at staging; node-before-edge load order; `GraphMeta` bulk | equivalence + determinism + baseline + full `test_ast_graph_build.py` regression | — | -| **PR-P2** | Shared bulk primitive applied to the incremental `_delete_file_scope` → re-emit path | none | Preserve pass5/6 `MERGE (r:Route)` dedup (`build_ast_graph.py:3819-3821`); incremental delete-then-emit ordering; equivalence vs full rebuild | `test_incremental_graph.py` regression + new incremental-equivalence test | depends on **PR-P1** | -| **PR-P3** | Hoist `LayeredIgnore` to a cocoindex `ContextKey` + memoize `is_ignored`'s `_mega` | none | `ContextKey` lifespan scoping (build once, not per file); `_mega` memo correctness (mega depends on the file's directory only — match_file stays per-file); no change to any ignore *decision* | flow ignore tests + memo unit tests + heavy vectors-progress test | independent of PR-P1, PR-P2 | +| **PR-P1** | Add `_bulk_copy`; convert **`_write_edges`** (shared → both paths' Symbol→Symbol edges + UnresolvedCallSite/UNRESOLVED_AT) | none | REL-table `COPY FROM` column order (FROM/TO first); CALLS dedup + `callee_declaring_role` materialized at staging; UnresolvedCallSite loaded before UNRESOLVED_AT | equivalence + determinism + baseline; full `test_ast_graph_build.py` **and** `test_incremental_graph.py` (shared helper) | — | +| **PR-P2** | Convert **`_write_nodes_impl`** (shared nodes), **`_write_routes_and_exposes`** (shared routes/clients/producers/calls), **`_write_clients_producers_and_calls`** (incremental-only global; Route MERGE preserved) | none | Route/Client/Producer nodes loaded before EXPOSES/DECLARES_*/HTTP_CALLS/ASYNC_CALLS; the 6 client/producer `_CREATE_*` constants are shared with the incremental-only function — delete only after BOTH convert; pass5/6 Route MERGE retained | `test_incremental_graph.py` regression + new incremental-equivalence test | depends on **PR-P1** (`_bulk_copy`) | +| **PR-P3** | Hoist `LayeredIgnore` to a cocoindex `ContextKey` + memoize `is_ignored`'s `_mega` by directory | none | `ContextKey` lifespan scoping; `_mega` memo correctness (mega depends on directory only); leave the once-per-run sites (`:177`, `:569`) alone; no change to any ignore *decision* | `tests/test_path_filtering.py` memo tests + `tests/test_lancedb_e2e.py` (HEAVY) once-per-flow | independent of PR-P1, PR-P2 | Landing order: **P1 → P2**; **P3** may land in any order (independent). @@ -48,336 +66,291 @@ Landing order: **P1 → P2**; **P3** may land in any order (independent). | Topic | Decision | | --- | --- | -| Bulk mechanism | In-memory pyarrow: `conn.execute("COPY
FROM $rows", {"rows": pa_table})`. Verified against `ladybug/connection.py:337` (`FROM $param`) and `:488` (pyarrow/pandas/polars accepted). | -| REL-table column rule | First two staged columns are the FROM/TO node primary keys (`Symbol.id` / `Route.id` / etc.). Exact column *naming* for REL `COPY FROM` is confirmed in PR-P1 step 1 (a 5-line spike); the contract below assumes positional FROM/TO. | -| `GraphMeta` MERGE (`build_ast_graph.py:3472-3473`) | Folded into PR-P1 bulk (it is in the full-rebuild path; the table is freshly created and empty, so `COPY` insert is correct). Resolves proposal Open Q1. | -| PR-P3 cache vehicle | cocoindex `ContextKey[LayeredIgnore]` (lifespan-scoped, built once). Resolves proposal Open Q2. | -| `is_ignored` memo | Cache `_mega(rel)` → `(mega, spec)` keyed by the file's project-relative **directory** (`Path(rel).parent.as_posix()`), because `_mega_build_for_rel` uses only `dir_parts = parts[:-1]`. `spec.match_file(rel)` stays per-file (cheap, filename-dependent). | +| Bulk mechanism | In-memory pyarrow: `conn.execute("COPY
FROM $rows", {"rows": pa_table})`. Verified `ladybug/connection.py:337` (`FROM $param`) + `:488` (pyarrow accepted). | +| REL-table column rule | First two staged columns are the FROM/TO node primary keys. Exact column naming for REL `COPY FROM` locked by the PR-P1 step-1 spike. | +| GraphMeta (`_write_meta`) | **Leave on MERGE.** Shared helper, one row, recomputes counters (`build_ast_graph.py:3421`). Reverses proposal Open Q1. | +| `_write_nodes_impl` (shared workhorse) | Converted in PR-P2; both `_write_nodes` (full, `_CREATE_SYMBOL`) and `_write_nodes_merge` (incremental, `_MERGE_SYMBOL`) call it, so converting it once kills both constants. | +| PR-P3 cache vehicle | cocoindex `ContextKey[LayeredIgnore]` (lifespan-scoped, built once). | +| `is_ignored` memo | Cache `_mega(rel)` → `(mega, spec, meta)` keyed by the file's project-relative **directory** (`Path(rel).parent.as_posix()`); `_mega_build_for_rel` reads only `dir_parts = parts[:-1]` (`path_filtering.py:226-227`), so this is correct. `spec.match_file(rel)` stays per-file. | +| Sites `:177` / `:569` | Left alone — both call `cocoindex_excluded_patterns()` **once per run** (the `_approximate_vectors_total` helper and the app_main pre-walk), not per-file. | --- -# PR-P1 — Bulk `COPY FROM` for the full rebuild path +# PR-P1 — Bulk `COPY FROM` for `_write_edges` (shared; the ~250s prize) -**Goal:** replace the per-row `conn.execute` writes in `write_ladybug` (the -init/reprocess full-rebuild entry, `build_ast_graph.py:3893`) with bulk -in-memory-pyarrow `COPY FROM`. Graph contents stay byte-equivalent. +**Goal:** add the bulk primitive and convert the shared `_write_edges` helper. +Because `_write_edges` is called by both paths, this accelerates the +Symbol→Symbol edges + UnresolvedCallSite writes for **both** `init` and +`increment` (the largest graph-write cost: ~250s of ~321s). ## File-by-file changes -### 1. `build_ast_graph.py` — bulk-write primitive + staging - -#### 1a. New helper `_bulk_copy` -Add near the other write helpers (after `_node_row`, ~`build_ast_graph.py:2994`): +### 1. `build_ast_graph.py` — `_bulk_copy` primitive + `_write_edges` conversion +#### 1a. New helper `_bulk_copy` (add near `_node_row`, ~`:2994`) ```python import pyarrow as pa -def _bulk_copy(conn: ladybug.Connection, table_name: str, columns: list[str], rows: list[dict]) -> None: - """Bulk-load rows into a (node or rel) table via in-memory pyarrow COPY FROM. +def _bulk_copy(conn, table_name, columns, rows): + """Bulk-load rows into a node/rel table via in-memory pyarrow COPY FROM. - `columns` fixes column order — for REL tables the first two MUST be the - FROM/TO node ids (kuzu requirement). Empty `rows` is a no-op. + `columns` fixes column order; for REL tables the first two MUST be the + FROM/TO node primary keys (kuzu requirement). Empty `rows` is a no-op. """ if not rows: return - tbl = pa.Table.from_pylist(rows) # infers types from non-empty data - # from_pylist infers STRING / list correctly for non-empty rows; - # step-1 spike confirms LIST columns (modifiers/annotations/capabilities) - # infer as pa.list_(pa.string()). If any column is all-empty-list, pass an - # explicit pa.schema derived from the _SCHEMA_* strings instead. + tbl = pa.Table.from_pylist(rows) conn.execute(f"COPY {table_name} FROM $rows", {"rows": tbl}) ``` +Column-order constants next to the `_SCHEMA_*` strings (`~:2812`), each matching +its `_SCHEMA_*` order; for REL tables the first two entries are the endpoint ids: +`_REL_EXTENDS_COLUMNS = ["FROM","TO","source_file","dst_name","dst_fqn","resolved"]`, +`_REL_CALLS_COLUMNS = ["FROM","TO","source_file","callee_declaring_role", <…resolved props>]`, +`UNRESOLVED_CALL_SITE_COLUMNS`, `_REL_UNRESOLVED_AT_COLUMNS = ["FROM","TO","source_file"]`, etc. + +> **Step-1 spike (mandatory, first commit of PR-P1):** confirm (a) the exact REL +> `COPY FROM` column naming — toy 2-Symbol + 1-CALLS `COPY CALLS FROM $rows` +> with `["FROM","TO",…]`, assert the edge lands with correct endpoints and +> `callee_declaring_role`; (b) `pa.Table.from_pylist` type inference for any LIST +> columns this helper touches. Record the working incantation in the docstring. +> (On kuzu 0.11.3 + the repo's pybind backend, all-empty LIST columns infer as +> `list` and are accepted — confirmed by review — so an explicit pa.schema +> is a fallback, not required.) + +#### 1b. Convert `_write_edges` (`:3244`, shared) to per-type staging + bulk +Today it loops `conn.execute(_CREATE_EXT, {…})` etc. with two dedup sets — +`seen_calls` (`:3282-3288`, key `(src_id,dst_id,arg_count,call_site_line)` — +verified) and `seen_ucs` (`:3317-3321`) — and a `callee_declaring_role` lookup +(`_callee_declaring_role_at_write`, `:1647`/`:3302`). Restructure to accumulate +per-edge-type row lists, applying the **same** dedup and **same** +`callee_declaring_role` materialization **before appending**, then `_bulk_copy` +each REL table once. REL row dicts get `FROM`/`TO` = src/dst node ids plus the +properties in `_SCHEMA_*` order. + +**Within-helper load order (kuzu validates endpoint existence):** Symbol nodes +are already loaded by `_write_nodes` (called before `_write_edges` in both +paths). So bulk-load the **UnresolvedCallSite node rows before the UNRESOLVED_AT +edge rows** (UNRESOLVED_AT is Symbol→UnresolvedCallSite). The Symbol→Symbol edges +(EXTENDS/IMPLEMENTS/INJECTS/DECLARES/OVERRIDES/CALLS) reference only already-loaded +Symbol nodes. -Column-order constants (define next to the `_SCHEMA_*` strings, ~`:2812`), each -matching its `_SCHEMA_*` column order exactly — for REL tables the first two -entries are the endpoint ids: - -- `NODE_COLUMNS` = the `Symbol` property columns in `_SCHEMA_NODE` order. -- `ROUTE_COLUMNS`, `CLIENT_COLUMNS`, `PRODUCER_COLUMNS`, `GRAPHMETA_COLUMNS`, - `UNRESOLVED_CALL_SITE_COLUMNS` — same, from their `_SCHEMA_*`. -- For each REL table: `_REL_*_COLUMNS` = `["FROM", "TO", ]` - for `EXTENDS/IMPLEMENTS/INJECTS/DECLARES/OVERRIDES/CALLS/UNRESOLVED_AT/EXPOSES/ - DECLARES_CLIENT/DECLARES_PRODUCER/HTTP_CALLS/ASYNC_CALLS`. - -> **Step-1 spike (mandatory, first commit of PR-P1):** confirm (a) the exact -> REL `COPY FROM` column naming — build a 2-node + 1-edge toy, `COPY CALLS FROM -> $rows` with columns `["FROM","TO","source_file","resolved",...]`, assert the -> edge lands with correct endpoints; (b) `pa.Table.from_pylist` type inference -> for the `modifiers`/`annotations`/`capabilities` LIST columns. Record the -> working incantation in the helper docstring. If inference fails on empty -> lists, build an explicit `pa.schema` map keyed by table. - -#### 1b. Convert `_write_nodes` (full path) to bulk -`_write_nodes` (`:3096`) currently calls `_write_nodes_impl(..., symbol_query=_CREATE_SYMBOL)` -which loops `conn.execute(_CREATE_SYMBOL, _node_row(...))` per node. Replace the -full-path body: iterate packages / files / types / members exactly as -`_write_nodes_impl` does, but append each `_node_row(...)` to a list, then call -`_bulk_copy(conn, "Symbol", NODE_COLUMNS, node_rows)`. Keep the role/capability -resolution (`resolve_role_and_capabilities`) and `tables.type_role_by_node_id` -population identical (those are pure in-memory steps before staging). - -`_write_nodes_impl` + `_MERGE_SYMBOL` stay — they are the **incremental** path -(`_write_nodes_merge`, `:817`) used until PR-P2. `_CREATE_SYMBOL` becomes dead -(only `_write_nodes` used it) → delete it. - -#### 1c. Convert `_write_edges` to bulk -`_write_edges` (`:3244`) loops per edge, calling `conn.execute(_CREATE_EXT, -{...})` etc. with two dedup sets (`seen_calls` `:3282-3288`, `seen_ucs` -`:3317-3321`) and a `callee_declaring_role` lookup (`_callee_declaring_role_at_write`, -`:1647`/`:3302`). Restructure to accumulate **per-edge-type row lists**, applying -the *same* dedup and *same* `callee_declaring_role` materialization, then -`_bulk_copy` each REL table once: - -```python -extends_rows: list[dict] = [] -... -for ...: - extends_rows.append({"FROM": src_id, "TO": dst_id, "source_file": ..., "dst_name": ..., "dst_fqn": ..., "resolved": ...}) -... -calls_rows.append(...) # only when key not in seen_calls; include "callee_declaring_role" -... -_bulk_copy(conn, "EXTENDS", _REL_EXTENDS_COLUMNS, extends_rows) -_bulk_copy(conn, "CALLS", _REL_CALLS_COLUMNS, calls_rows) # CALLS_COLUMNS ends with callee_declaring_role -... -``` - -The dedup sets and `_callee_declaring_role_at_write` calls are computed -**before** appending (i.e. at staging), exactly preserving current semantics. `_CREATE_EXT`/`_CREATE_IMPL`/`_CREATE_INJ`/`_CREATE_DECL`/`_CREATE_OVERRIDES`/ -`_CREATE_CALL`/`_CREATE_UNRESOLVED`/`_CREATE_UNRESOLVED_AT` become dead → delete. - -`_populate_declares_rows` (`:3189`) and `_populate_overrides_rows` (`:3206`) are -pure in-memory population — unchanged; they feed the DECLARES/OVERRIDES row lists. - -#### 1d. Convert `_write_routes_and_exposes` (and Client/Producer/HTTP_CALLS/ASYNC_CALLS) to bulk -Same pattern (`~`:3349-3408`): stage `Route`/`Client`/`Producer` node rows and -`EXPOSES`/`DECLARES_CLIENT`/`DECLARES_PRODUCER`/`HTTP_CALLS`/`ASYNC_CALLS` edge -rows (with FROM/TO ids), then `_bulk_copy` each. The pass5/6 `MERGE (r:Route)` -dedup (`:3819-3821`) is **incremental-only** and untouched in PR-P1; in the full -path routes are emitted once into a fresh table, so plain `COPY` insert is -correct. Delete the now-dead `_CREATE_ROUTE`/`_CREATE_EXPOSES`/`_CREATE_CLIENT`/ -`_CREATE_DECLARES_CLIENT`/`_CREATE_PRODUCER`/`_CREATE_DECLARES_PRODUCER`/ -`_CREATE_HTTP_CALL`/`_CREATE_ASYNC_CALL` strings. - -#### 1e. Convert `GraphMeta` write to bulk -The `MERGE (m:GraphMeta {key: $k})` loop (`:3472-3473`) becomes a single-row -`_bulk_copy(conn, "GraphMeta", GRAPHMETA_COLUMNS, [meta_row])` (table is freshly -created and empty in a full rebuild). Keep the row contents (ontology_version, -built_at, source_root, parse_errors, counts_json) identical. - -#### 1f. `write_ladybug` load order (node-before-edge) -`write_ladybug` (`:3893`) already calls `_write_nodes` → `_write_edges` → -`_write_routes_and_exposes` → GraphMeta. Preserve this order: all node tables -(`Symbol`, `Route`, `Client`, `Producer`, `UnresolvedCallSite`, `GraphMeta`) are -bulk-loaded before any REL table, so endpoint rows exist when kuzu validates REL -`COPY FROM`. Add an inline comment stating the ordering invariant. - -### 2. `tests/test_ast_graph_build.py` — equivalence harness + regression - -Add an equivalence harness and a committed baseline. The existing 26 tests in -this file (e.g. `test_schema_has_all_expected_tables`, -`test_graph_meta_present_and_versioned`, `test_each_edge_type_populated`, +`_CREATE_CALL` (module constants) become dead → delete. `_CREATE_UNRESOLVED` and +`_CREATE_UNRESOLVED_AT` are **locals** defined inside `_write_edges` (`:3307`/`:3313`), +removed when the function is rewritten to bulk. `_populate_declares_rows` +(`:3189`) / `_populate_overrides_rows` (`:3206`) are pure in-memory population — +unchanged. + +### 2. `tests/test_ast_graph_build.py` — equivalence harness + baseline +The existing 26 tests are the regression net (e.g. +`test_schema_has_all_expected_tables`, `test_each_edge_type_populated`, `test_pass3_callee_declaring_role_bank_annotated_types`, -`test_pass3_unresolved_call_site_emitted`, `test_pass3_known_external_calls_preserved`) -must continue to pass unchanged — they ARE the regression net. - -Add a committed baseline artifact `tests/fixtures/graph_baseline_bank_chat.json` -(node count, per-type edge counts, `GraphMeta` counters, and N=3 sampled edge -property rows per type incl. `source_file` and CALLS `callee_declaring_role`). -Generated once during PR-P1 from the **last per-row build** before its removal, -committed, and regenerated only when `ontology_version` changes. +`test_pass3_unresolved_call_site_emitted`, +`test_pass3_known_external_calls_preserved`). Add a committed baseline +`tests/fixtures/graph_baseline_bank_chat.json` (node count, per-type edge counts, +`GraphMeta` counters, and N=3 sampled edge property rows per type incl. +`source_file` and CALLS `callee_declaring_role`). **It is an equivalence anchor, +not a production invariant** — regenerated from the last per-row build before +removal, and regenerated only when `ontology_version` changes (it does not in +this plan). Asserting invariants here is acceptable because PR-P1 is a +behavior-preserving write-mechanism swap. ### 3. `scripts/bench_init_graph_write.py` (new, dev-only) — benchmark -A small script that times `init` (or `build_ast_graph.py --source-root -`) on a medium corpus and prints the graph-write phase delta vs the -pre-PR baseline. Not shipped in the package; lives under `scripts/`. Documents -the measured speedup in the PR description. +Times `init`/`build_ast_graph.py` on a medium corpus; prints the graph-write +phase delta. Not packaged; documents the measured speedup in the PR description. ## Tests for PR-P1 -1. `test_bulk_write_graph_matches_per_row_baseline` — build `tests/bank-chat-system` +1. `test_bulk_write_edges_match_per_row_baseline` — build `tests/bank-chat-system` via the bulk path, assert node count, per-type edge counts, `GraphMeta` - counters, and the sampled edge property rows equal `tests/fixtures/graph_baseline_bank_chat.json`. + counters, and sampled edge rows equal `graph_baseline_bank_chat.json`. 2. `test_bulk_write_is_deterministic_double_build` — build bank-chat twice to two - DB paths via the bulk path, assert identical node count, per-type edge counts, - `GraphMeta` counters, and a query battery (`MATCH (s:Symbol) RETURN ...`, - `neighbors`-shaped reads). Models on existing - `test_29_determinism_pass4_route_ids` / `test_overrides_edge_set_deterministic_double_build`. -3. `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` — assert - CALLS rows are deduped by `(src,dst,argc,line)` and carry the correct - `callee_declaring_role` (reuse the `@Service` callee assertion from - `test_pass3_callee_declaring_role_bank_annotated_types` against a bulk build). + DBs via the bulk path, assert identical counts + query battery. Models on + `tests/test_brownfield_routes.py::test_29_determinism_pass4_route_ids` and + `tests/test_mcp_v2_compose.py::test_overrides_edge_set_deterministic_double_build`. +3. `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` — CALLS rows + deduped by `(src,dst,argc,line)`; carry correct `callee_declaring_role` + (reuse the `@Service` callee assertion against a bulk build). 4. `test_bulk_write_empty_rel_table_is_noop` — a corpus with no `EXTENDS` edges - must not error (`_bulk_copy` no-ops on empty rows) and the table is empty. + must not error (`_bulk_copy` no-ops on empty rows). -**Must-still-pass (regression, do not loosen):** the full `tests/test_ast_graph_build.py`, -`tests/test_bank_chat_brownfield_integration.py`, `tests/test_call_edges_e2e.py`, -and `tests/test_incremental_graph.py` suites (incremental path untouched). +**Must-still-pass (regression — `_write_edges` is shared, so both paths):** full +`tests/test_ast_graph_build.py`, `tests/test_incremental_graph.py` (28 tests), +`tests/test_bank_chat_brownfield_integration.py`, `tests/test_call_edges_e2e.py`. ## Definition of done (PR-P1) -- [ ] `_bulk_copy` helper added; step-1 spike result recorded in its docstring. -- [ ] `write_ladybug` full path writes all node tables then all REL tables via `_bulk_copy`; no per-row `CREATE` remains in the full path. -- [ ] Dead `_CREATE_*` strings + `_CREATE_SYMBOL` deleted. -- [ ] `test_bulk_write_graph_matches_per_row_baseline`, +- [ ] `_bulk_copy` helper added; step-1 spike result in its docstring. +- [ ] `_write_edges` stages per-type rows (CALLS dedup + `callee_declaring_role` at staging) and bulk-loads UnresolvedCallSite before UNRESOLVED_AT. +- [ ] `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL` deleted; local `_CREATE_UNRESOLVED/_UNRESOLVED_AT` gone with the rewrite. +- [ ] `test_bulk_write_edges_match_per_row_baseline`, `test_bulk_write_is_deterministic_double_build`, `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role`, - `test_bulk_write_empty_rel_table_is_noop` added and pass. -- [ ] Full existing graph/edge/brownfield/incremental test suites pass unchanged. + `test_bulk_write_empty_rel_table_is_noop` pass. +- [ ] Full `test_ast_graph_build.py` + `test_incremental_graph.py` pass unchanged. - [ ] `.venv/bin/ruff check .` clean. -- [ ] Benchmark numbers (before/after graph-write phase) pasted in the PR description. -- [ ] PR title: `perf(graph): bulk COPY FROM for the full rebuild path (PR-P1)`. +- [ ] Sentinel greps pass (see AGENT-PROMPTS); benchmark numbers in PR description. +- [ ] PR title: `perf(graph): bulk COPY FROM for _write_edges (PR-P1)`. ## Implementation step list | # | Step | File(s) | Done when | | --- | --- | --- | --- | -| 1 | Spike: confirm REL `COPY FROM` column order + `from_pylist` LIST typing | `build_ast_graph.py` (throwaway) | toy 2-node+1-edge `COPY CALLS FROM $rows` lands with correct endpoints; result in `_bulk_copy` docstring | -| 2 | Add `_bulk_copy` + column-order constants | `build_ast_graph.py` | helper + `_REL_*_COLUMNS`/`*_COLUMNS` defined | -| 3 | Convert `_write_nodes` (full) to bulk; delete `_CREATE_SYMBOL` | `build_ast_graph.py` | node tables bulk-loaded; Symbol count matches baseline | -| 4 | Convert `_write_edges` to per-type staging + bulk; delete dead `_CREATE_*` edge strings | `build_ast_graph.py` | edge counts match baseline; dedup + callee_declaring_role preserved | -| 5 | Convert `_write_routes_and_exposes` (+ Client/Producer/HTTP_CALLS/ASYNC_CALLS) to bulk | `build_ast_graph.py` | route/client/producer/call-edge counts match baseline | -| 6 | Convert GraphMeta write to bulk | `build_ast_graph.py` | `test_graph_meta_present_and_versioned` passes | -| 7 | Generate + commit baseline fixture | `tests/fixtures/graph_baseline_bank_chat.json` | produced from last per-row build before removal | -| 8 | Add 4 equivalence/determinism tests | `tests/test_ast_graph_build.py` | all pass | -| 9 | Run full regression + ruff; capture benchmark | repo | suites green; numbers in PR description | +| 1 | Spike: REL `COPY FROM` column order + `from_pylist` typing | `build_ast_graph.py` (throwaway) | toy CALLS edge lands with correct endpoints; result in `_bulk_copy` docstring | +| 2 | Add `_bulk_copy` + `_REL_*_COLUMNS`/column constants | `build_ast_graph.py` | helper + constants defined | +| 3 | Convert `_write_edges` to per-type staging + bulk; load UnresolvedCallSite before UNRESOLVED_AT | `build_ast_graph.py` | edge counts match baseline; dedup + callee_declaring_role preserved | +| 4 | Delete dead `_CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL` + locals | `build_ast_graph.py` | sentinel greps pass | +| 5 | Generate + commit baseline | `tests/fixtures/graph_baseline_bank_chat.json` | from last per-row `_write_edges` build | +| 6 | Add 4 tests | `tests/test_ast_graph_build.py` | all pass | +| 7 | Full regression (ast_graph_build + incremental) + ruff; benchmark | repo | green; numbers in PR description | --- -# PR-P2 — Bulk write for the incremental path +# PR-P2 — Bulk write for nodes + routes/clients/producers/calls -**Goal:** apply the PR-P1 bulk primitive to the incremental rebuild path so -`increment` on a large change-set is also fast. **Depends on PR-P1** (`_bulk_copy` -+ column constants + the staging pattern). +**Goal:** convert the remaining graph writes. **Depends on PR-P1's `_bulk_copy`.** ## File-by-file changes -### 1. `build_ast_graph.py` — incremental path bulk conversion -- `_write_nodes_merge` (`:817`, uses `_MERGE_SYMBOL`): the incremental path is - delete-then-insert (`_delete_file_scope`, `:673`, removes changed/dependent - files' nodes+edges first), so re-emit is into a cleaned scope → convert the - per-row `_MERGE_SYMBOL` loop to `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. - `_MERGE_SYMBOL` becomes dead → delete; `_write_nodes_impl` is removed if no - caller remains (it does not, after P1+P2). -- Incremental edge re-emit: apply the same per-type staging + `_bulk_copy` as - PR-P1's `_write_edges`, scoped to the re-emitted files. -- **Preserve the pass5/6 `MERGE (r:Route)` dedup** (`:3819-3821`): routes written - during the scoped step must still MERGE (not duplicate) against routes from the - global step. Keep those specific `MERGE` statements; only convert the - non-Route, non-dedup writes to bulk. Add a comment that this MERGE is - intentionally retained. -- `incremental_rebuild` (`:3535`): no algorithmic change; it calls into the - converted write helpers. +### 1. `build_ast_graph.py` + +#### 1a. Convert `_write_nodes_impl` (`:3029`, shared workhorse) +It is called by `_write_nodes` (`:3103`, full, `_CREATE_SYMBOL`) and +`_write_nodes_merge` (`:825`, incremental, `_MERGE_SYMBOL`). Convert its body to +stage all Symbol rows (packages/files/types/members, with the existing +`resolve_role_and_capabilities` + `type_role_by_node_id` population done before +staging) then `_bulk_copy(conn, "Symbol", NODE_COLUMNS, rows)`. Both wrappers now +share one bulk path; `_CREATE_SYMBOL` and `_MERGE_SYMBOL` become dead → delete +both. (`_write_nodes_impl`'s per-row loop is gone; the two wrappers may collapse +to one, but keep both names if they differ in non-write setup — minimize churn.) +> Note: this removes the ~40-line node-row-building duplication that a +> full-path-only conversion would have created — converting the shared workhorse +> avoids it entirely. + +#### 1b. Convert `_write_routes_and_exposes` (`:3338`, shared) to bulk +Today it loops over `routes_rows`/`exposes_rows`/`client_rows`/`declares_client_rows`/ +`producer_rows`/`declares_producer_rows`/`http_call_rows`/`async_call_rows`, calling +`_CREATE_ROUTE`/`_CREATE_EXPOSES`/`_CREATE_CLIENT`/`_CREATE_DECLARES_CLIENT`/ +`_CREATE_PRODUCER`/`_CREATE_DECLARES_PRODUCER`/`_CREATE_HTTP_CALL`/`_CREATE_ASYNC_CALL` +with the existing `_file_by_node_id`/`_file_by_client_id`/`_file_by_producer_id` +source_file resolution. Stage each table's rows (applying that resolution) and +`_bulk_copy`. **Load order within the helper:** bulk-load Route/Client/Producer +NODE rows before the EXPOSES/DECLARES_CLIENT/DECLARES_PRODUCER/HTTP_CALLS/ +ASYNC_CALLS edges (those edges reference those nodes + already-loaded Symbol). +`_CREATE_ROUTE`/`_CREATE_EXPOSES` become dead → delete. + +#### 1c. Convert `_write_clients_producers_and_calls` (`:3810`, incremental-only) to bulk +This is the global pass5/6 step (sole caller `incremental_rebuild:3716`). It +writes Route (via `MERGE (r:Route {id:$id}) …` to dedup against the scoped step, +`:3819-3828`), Client/Producer nodes (`_CREATE_CLIENT`/`_CREATE_PRODUCER`), and +DECLARES_CLIENT/DECLARES_PRODUCER/HTTP_CALLS/ASYNC_CALLS edges. Convert the +Client/Producer/edge writes to per-type staging + `_bulk_copy` (same +`member_by_id`/`client_by_id`/`producer_by_id` source_file resolution). +**Retain the `MERGE (r:Route …)` dedup verbatim** — routes written during the +scoped step must not duplicate; add a one-line comment that it is intentionally +kept. (Route upsert stays MERGE; only the Client/Producer/edge writes bulk.) +Now that BOTH `_write_routes_and_exposes` and `_write_clients_producers_and_calls` +are converted, the 6 shared constants — `_CREATE_CLIENT`/`_CREATE_PRODUCER`/ +`_CREATE_DECLARES_CLIENT`/`_CREATE_DECLARES_PRODUCER`/`_CREATE_HTTP_CALL`/ +`_CREATE_ASYNC_CALL` — are dead → delete all six. + +#### 1d. `_write_meta` — UNCHANGED +Leave `_write_meta` (`:3421`) on its `MERGE (m:GraphMeta …)`. Do not touch. ### 2. `tests/test_incremental_graph.py` — incremental equivalence -The existing 28 tests (e.g. `test_incremental_single_file_change`, -`test_incremental_new_file`, `test_incremental_deleted_file`, -`test_incremental_phantom_nodes_preserved`, `test_incremental_dependent_expansion`, -`test_incremental_expansion_cap_fallback`, `test_incremental_crash_marker_*`, -`test_incremental_no_changes_is_noop`, `test_delete_file_scope_preserves_dependent_nodes`) -must pass unchanged. +Add the new tests as **methods of `TestIncrementalOrchestrator`** (same class as +`test_incremental_single_file_change`, `:230`-ish). The existing 28 tests must +pass unchanged. ## Tests for PR-P2 -1. `test_incremental_bulk_write_equivalent_to_full_rebuild` — make a single-file - change, run `increment` (bulk incremental), then build the same state via a - full rebuild (bulk full), assert identical node count, per-type edge counts, - and `GraphMeta` counters. -2. `test_incremental_route_merge_dedup_preserved` — a corpus where pass5/6 would - re-emit an existing route; assert no duplicate `Route` rows after `increment` - (the retained `MERGE (r:Route)` still dedups). +1. `test_incremental_bulk_write_equivalent_to_full_rebuild` (in + `TestIncrementalOrchestrator`) — single-file change → `increment` (bulk) → + full rebuild of that state (bulk) → assert identical node count, per-type edge + counts, `GraphMeta` counters. +2. `test_incremental_route_merge_dedup_preserved` (in + `TestIncrementalOrchestrator`) — a corpus where pass5/6 re-emits an existing + route → no duplicate `Route` rows after `increment` (the retained MERGE + dedups). -**Must-still-pass:** full `tests/test_incremental_graph.py`. +**Must-still-pass:** full `tests/test_incremental_graph.py`, `test_ast_graph_build.py`. ## Definition of done (PR-P2) -- [ ] Incremental node/edge re-emit uses `_bulk_copy`; `_MERGE_SYMBOL` deleted. -- [ ] pass5/6 `MERGE (r:Route)` dedup retained and commented. -- [ ] `test_incremental_bulk_write_equivalent_to_full_rebuild`, - `test_incremental_route_merge_dedup_preserved` added and pass. -- [ ] Full `tests/test_incremental_graph.py` passes unchanged. -- [ ] `.venv/bin/ruff check .` clean. -- [ ] PR title: `perf(graph): bulk COPY FROM for the incremental path (PR-P2)`. +- [ ] `_write_nodes_impl` bulk; `_CREATE_SYMBOL` + `_MERGE_SYMBOL` deleted. +- [ ] `_write_routes_and_exposes` bulk; `_CREATE_ROUTE`/`_CREATE_EXPOSES` deleted. +- [ ] `_write_clients_producers_and_calls` Client/Producer/edge writes bulk; `MERGE (r:Route)` retained + commented; 6 shared `_CREATE_*` deleted. +- [ ] `_write_meta` untouched. +- [ ] Both new tests pass (in `TestIncrementalOrchestrator`); full incremental + ast_graph_build suites green. +- [ ] Sentinel greps pass; `.venv/bin/ruff check .` clean. +- [ ] PR title: `perf(graph): bulk COPY FROM for nodes, routes, clients/producers (PR-P2)`. ## Implementation step list | # | Step | File(s) | Done when | | --- | --- | --- | --- | -| 1 | Convert `_write_nodes_merge` to bulk; delete `_MERGE_SYMBOL` | `build_ast_graph.py` | incremental node re-emit via `_bulk_copy` | -| 2 | Convert incremental edge re-emit to per-type staging + bulk | `build_ast_graph.py` | incremental edge counts match full rebuild | -| 3 | Retain + comment the pass5/6 `MERGE (r:Route)` dedup | `build_ast_graph.py` | no duplicate Route rows after increment | -| 4 | Add incremental-equivalence + route-dedup tests | `tests/test_incremental_graph.py` | both pass; full suite green | +| 1 | Convert `_write_nodes_impl` to bulk; delete `_CREATE_SYMBOL`+`_MERGE_SYMBOL` | `build_ast_graph.py` | node counts match baseline; both wrappers share bulk | +| 2 | Convert `_write_routes_and_exposes` to bulk; load Route/Client/Producer before edges; delete `_CREATE_ROUTE`/`_CREATE_EXPOSES` | `build_ast_graph.py` | route/client/producer/call counts match | +| 3 | Convert `_write_clients_producers_and_calls` Client/Producer/edges to bulk; retain Route MERGE; delete 6 shared `_CREATE_*` | `build_ast_graph.py` | no duplicate routes; sentinel greps pass | +| 4 | Add 2 tests to `TestIncrementalOrchestrator` | `tests/test_incremental_graph.py` | both pass; full suite green | | 5 | ruff + full regression | repo | clean + green | --- # PR-P3 — Cached `LayeredIgnore` (+ `is_ignored` memo) as a `ContextKey` -**Goal:** remove the ~25s paid per-flow-run from re-constructing -`LayeredIgnore(project_root)` per file and re-merging `_mega(spec)` per file. -**Independent** of PR-P1/P2. +**Goal:** remove the ~25s from re-constructing `LayeredIgnore(project_root)` per +file and re-merging `_mega` per file. **Independent** of PR-P1/P2. ## File-by-file changes -### 1. `java_index_flow_lancedb.py` — `ContextKey` for the ignore object +### 1. `java_index_flow_lancedb.py` - Define `IGNORE = coco.ContextKey[LayeredIgnore]("java_lance_layered_ignore")` - alongside the existing `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` definitions - (`:60-72`), using the **same** `_ck_params` (`detect_change` vs `tracked`) - detection block so it works across cocoindex versions. -- In `coco_lifespan` (where `builder.provide(PROJECT_ROOT, root)` / - `builder.provide(EMBEDDER, embedder)` are called, `~`:287-306`), add - `builder.provide(IGNORE, LayeredIgnore(root))` — built **once** per flow run. -- In `process_java_file` (`:344`), `process_sql_file` (`:416`), - `process_yaml_file` (`:464`): replace - `if LayeredIgnore(project_root).is_ignored((project_root / file.file_path.path).resolve()):` - (`:351`/`:423`/`:471`) with - `ignore = coco.use_context(IGNORE)` then - `if ignore.is_ignored((project_root / file.file_path.path).resolve()):`. - `project_root` stays (still needed for `_parse_and_enrich_java` and path - resolution). - -### 2. `path_filtering.py` — memoize `_mega` by directory -- Add an instance cache `self._mega_cache: dict[str, tuple[list[str], GitIgnoreSpec, list]] = {}` + alongside `PROJECT_ROOT`/`EMBEDDER`/`LANCE_DB` (`:60-72`), reusing the SAME + `_ck_params` (`detect_change` vs `tracked`) detection block. +- In `coco_lifespan` (`:287-306`), add `builder.provide(IGNORE, LayeredIgnore(root))` + — built **once** per flow run. +- In `process_java_file` (`:345`), `process_sql_file` (`:417`), `process_yaml_file` + (`:465`): add `ignore = coco.use_context(IGNORE)` and replace + `LayeredIgnore(project_root).is_ignored((project_root / file.file_path.path).resolve())` + (`:351`/`:423`/`:471`) with `ignore.is_ignored((project_root / file.file_path.path).resolve())`. + Keep `project_root` (still used for path resolution + `_parse_and_enrich_java`). +- **Leave `:177` and `:569` alone** — they call `cocoindex_excluded_patterns()` + once per run (the `_approximate_vectors_total` helper and the app_main + pre-walk), not per file. + +### 2. `path_filtering.py` +- Add `self._mega_cache: dict[str, tuple[list[str], GitIgnoreSpec, list]] = {}` in `LayeredIgnore.__init__`. -- In `LayeredIgnore._mega` (`:334`): key on - `dir_rel = Path(rel_project).parent.as_posix()`. If present, return the cached - `(mega, spec, meta)`; else compute via `_mega_build_for_rel` + compile, store, - return. Correctness rests on `_mega_build_for_rel` reading only - `dir_parts = parts[:-1]` (verified: it never reads the filename). -- `is_ignored` (`:345`) and `diagnose_dict` (`:377`) keep calling `_mega(rel)` - unchanged — they transparently benefit from the cache. -- Add a module/env knob `LAYERED_IGNORE_MEGA_CACHE` only if a disable path is - needed for debugging — **default enabled**, no new public surface otherwise. +- In `_mega` (`:334`): key on `Path(rel_project).parent.as_posix()`; return cached + `(mega, spec, meta)` if present, else compute via `_mega_build_for_rel` + + `GitIgnoreSpec.from_lines`, store, return. Correctness rests on + `_mega_build_for_rel` reading only `dir_parts` (`:226-227`). +- `is_ignored` (`:345`) and `diagnose_dict` (`:377`) call `_mega` unchanged and + benefit transparently (both consume the full `(mega, spec, meta)` tuple). ## Tests for PR-P3 -1. `test_is_ignored_mega_caches_by_directory` — build a `LayeredIgnore` over a - fixture with nested ignore files; call `is_ignored` for several files in the - same directory; assert `_mega` is computed once (spy on `_mega_build_for_rel` - or count `GitIgnoreSpec.from_lines` calls) and that decisions match the - uncached path. -2. `test_layered_ignore_memo_preserves_decisions` — for a corpus of paths - (including nested `.java-codebase-rag/ignore` + `.gitignore` with negations), - assert `LayeredIgnore(...).is_ignored(p)` is identical with and without the - memo cache (correctness invariant). -3. `test_layered_ignore_provided_once_per_flow` (HEAVY, - `JAVA_CODEBASE_RAG_RUN_HEAVY=1`) — run the real cocoindex flow over a small - corpus; assert the `LayeredIgnore` provided via `IGNORE` is a single instance - (identity check), not reconstructed per file. +1. `test_is_ignored_mega_caches_by_directory` — in `tests/test_path_filtering.py`; + assert `_mega` is computed once per directory (spy on `_mega_build_for_rel`) + and decisions match the uncached path. +2. `test_layered_ignore_memo_preserves_decisions` — in `tests/test_path_filtering.py`; + for a corpus with nested ignore + gitignore negations, assert `is_ignored` is + identical with and without the cache. +3. `test_layered_ignore_provided_once_per_flow` — in `tests/test_lancedb_e2e.py` + (HEAVY, `JAVA_CODEBASE_RAG_RUN_HEAVY=1`); run the real flow, assert a single + `LayeredIgnore` instance (identity check), not per-file. **Must-still-pass:** `tests/test_lancedb_e2e.py::test_lancedb_ignore_file_reduces_indexed_java_files`, -any `tests/test_path_filtering*` / ignore tests, and the heavy -`test_flow_emits_vectors_progress_per_file`. +`tests/test_path_filtering.py`, and the heavy +`tests/test_vectors_progress.py::test_flow_emits_vectors_progress_per_file`. ## Definition of done (PR-P3) -- [ ] `IGNORE` `ContextKey` defined (version-detected) + provided once in `coco_lifespan`. -- [ ] The three `process_*_file` functions consume it; no per-file `LayeredIgnore(project_root)`. -- [ ] `_mega` memoized by directory; `is_ignored`/`diagnose_dict` unchanged in result. -- [ ] `test_is_ignored_mega_caches_by_directory`, - `test_layered_ignore_memo_preserves_decisions`, - `test_layered_ignore_provided_once_per_flow` added and pass. -- [ ] Existing ignore + vectors-progress tests pass unchanged. +- [ ] `IGNORE` ContextKey (version-detected) + provided once in `coco_lifespan`. +- [ ] The three `process_*_file` consume it; sites `:177`/`:569` untouched. +- [ ] `_mega` memoized by directory; `is_ignored`/`diagnose_dict` results unchanged. +- [ ] The 3 named tests pass; existing ignore/vectors-progress tests pass unchanged. - [ ] `.venv/bin/ruff check .` clean. - [ ] PR title: `perf(vectors): lifespan-cached LayeredIgnore + is_ignored memo (PR-P3)`. @@ -385,10 +358,10 @@ any `tests/test_path_filtering*` / ignore tests, and the heavy | # | Step | File(s) | Done when | | --- | --- | --- | --- | -| 1 | Add `IGNORE` ContextKey (version-detected) + provide in lifespan | `java_index_flow_lancedb.py` | `IGNORE` resolvable in process_*_file | -| 2 | Switch the three `process_*_file` to `coco.use_context(IGNORE)` | `java_index_flow_lancedb.py` | no per-file construction | +| 1 | Add `IGNORE` ContextKey (version-detected) + provide in lifespan | `java_index_flow_lancedb.py` | resolvable in process_*_file | +| 2 | Switch the three `process_*_file` to `coco.use_context(IGNORE)` | `java_index_flow_lancedb.py` | only `:351`/`:423`/`:471` changed; `:177`/`:569` untouched | | 3 | Add `_mega` dirname cache | `path_filtering.py` | repeated same-dir calls hit cache | -| 4 | Add memo + correctness + once-per-flow tests | `tests/` | all pass | +| 4 | Add memo + correctness tests in `test_path_filtering.py`; once-per-flow in `test_lancedb_e2e.py` | `tests/` | all pass | | 5 | ruff + regression (incl. heavy ignore/vectors tests) | repo | clean + green | --- @@ -397,32 +370,33 @@ any `tests/test_path_filtering*` / ignore tests, and the heavy | # | Risk | Severity | Mitigation | | --- | --- | --- | --- | -| 1 | REL-table `COPY FROM` column ordering/naming differs from assumption | High | PR-P1 step-1 spike locks the exact incantation before any conversion; recorded in `_bulk_copy` docstring. | -| 2 | LIST columns (`modifiers`/`annotations`/`capabilities`) mis-type under `from_pylist` | Medium | Spike asserts LIST inference; explicit `pa.schema` fallback documented. Equivalence test catches any row diff. | -| 3 | CALLS dedup / `callee_declaring_role` drift when moving to staging | High | `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` + the sampled-edge baseline check; semantics computed identically, just earlier. | -| 4 | Bulk write changes observable query results | High | Equivalence + determinism tests + full existing graph/edge/brownfield suites are the gate. | -| 5 | PR-P2 breaks the pass5/6 Route dedup | Medium | `MERGE (r:Route)` retained by name; `test_incremental_route_merge_dedup_preserved` guards it. | -| 6 | PR-P3 `_mega` memo returns stale rules if dirParts logic changes | Low | `_mega_build_for_rel` reads only `dir_parts`; `test_layered_ignore_memo_preserves_decisions` asserts parity with uncached path. | -| 7 | `ContextKey` lifespan differs across cocoindex versions | Low | Reuse the existing `_ck_params` `detect_change`/`tracked` detection block used by `PROJECT_ROOT`. | +| 1 | REL-table `COPY FROM` column order/naming differs from assumption | High | PR-P1 step-1 spike locks it before conversion. | +| 2 | Shared-helper conversion changes incremental output | High | `_write_edges`/`_write_routes_and_exposes`/`_write_nodes_impl` are shared → full `test_incremental_graph.py` (28 tests) is a merge gate for P1 and P2, plus the new incremental-equivalence test. | +| 3 | CALLS dedup / `callee_declaring_role` drift when moved to staging | High | `test_bulk_write_preserves_calls_dedup_and_callee_declaring_role` + sampled-edge baseline. | +| 4 | Deleting a `_CREATE_*` constant still used by the other path | High | PR-P2 deletes the 6 client/producer constants only AFTER both `_write_routes_and_exposes` and `_write_clients_producers_and_calls` are converted (same PR). Sentinel greps enforce. | +| 5 | pass5/6 Route dedup breaks | Medium | `MERGE (r:Route)` retained by name; `test_incremental_route_merge_dedup_preserved` guards it. | +| 6 | PR-P3 `_mega` memo returns stale rules | Low | `_mega_build_for_rel` reads only `dir_parts`; `test_layered_ignore_memo_preserves_decisions` asserts parity. | +| 7 | PR-P3 sentinel over-matches `:177`/`:569` | Medium | Sentinel is `LayeredIgnore\(project_root\)\.is_ignored` (matches only the 3 process sites), NOT the bare constructor. | +| 8 | `ContextKey` lifespan differs across cocoindex versions | Low | Reuse the existing `_ck_params` `detect_change`/`tracked` detection block. | # Out of scope -- MPS embedding default (already auto-selected — see proposal Out of scope). -- ANN vector index (#337) and `watch` mode (#336). +- MPS embedding default (already auto-selected). +- GraphMeta bulk conversion (left on MERGE — shared, one row). +- ANN index (#337) and `watch` mode (#336). - Replacing/restructuring the cocoindex flow; changing embedding model/dim. - Parallelizing graph analysis passes (pass1–pass6). -- Converting the incremental path in PR-P1 (it is PR-P2). - Parquet-file or CSV bulk paths (pyarrow in-memory only). +- Converting `:177`/`:569` (once-per-run setup, not hot paths). # Whole-plan done definition -1. `init`/`reprocess` graph-write phase on the medium corpus drops from ~321s to - tens of seconds (benchmark in PR-P1 description). -2. `increment` on a large change-set uses the bulk path (PR-P2). -3. The vectors phase pays no per-file `LayeredIgnore`/`_mega` cost (PR-P3). -4. No ontology bump (`ontology_version` stays 17); no re-index required; all +1. `init`/`reprocess`/`increment` graph-write phase on the medium corpus drops + from ~321s to tens of seconds (benchmark in PR-P1; completed in PR-P2). +2. The vectors phase pays no per-file `LayeredIgnore`/`_mega` cost (PR-P3). +3. No ontology bump (`ontology_version` stays 17); no re-index required; all existing graph/edge/brownfield/incremental/ignore/vectors tests pass. -5. Proposal moved to `propose/completed/` and this plan to `plans/completed/` +4. Proposal moved to `propose/completed/` and this plan to `plans/completed/` once all three PRs land. # Tracking