feat: incremental graph rebuild (3-PR sequence: G1+G2+G3) #284
…ntal graph rebuild (PR-G1)

This commit implements PR-G1 of the incremental graph rebuild plan:
- Bump ONTOLOGY_VERSION to 17 (requires re-index)
- Add source_file STRING column to all 12 edge DDL constants
- Update _write_edges() to pass source_file for EXTENDS, IMPLEMENTS, INJECTS, DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT
- Update _write_routes_and_exposes() to pass source_file for EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS
- Add FileHashTracker class for detecting file changes (added, changed, removed)
- Add 9 tests for FileHashTracker and edge schema validation

Scope: PR-G1 (Hash tracker + source_file edge schema)
Plan: plans/active/PLAN-INCREMENTAL-GRAPH.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
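As a rough illustration of the hash tracker idea (SHA-256 per file, change detection, atomic save), here is a minimal sketch. It is not the PR's FileHashTracker: the class name HashTrackerSketch, the JSON state file, and the method signatures are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import json
import logging
import os
import tempfile
from pathlib import Path

log = logging.getLogger(__name__)


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashTrackerSketch:
    """Hypothetical stand-in for FileHashTracker (names and layout assumed)."""

    def __init__(self, state_path: Path) -> None:
        self.state_path = state_path
        self.hashes: dict[str, str] = {}
        if state_path.exists():
            self.hashes = json.loads(state_path.read_text(encoding="utf-8"))

    def detect_changes(self, files: list[Path]):
        """Return (added, changed, removed, current) relative to stored hashes."""
        current = {str(p): sha256_of(p) for p in files}
        added = set(current) - set(self.hashes)
        removed = set(self.hashes) - set(current)
        changed = {
            f for f in set(current) & set(self.hashes)
            if current[f] != self.hashes[f]
        }
        return added, changed, removed, current

    def save(self, current: dict[str, str]) -> None:
        """Persist hashes atomically: write a temp file, then os.replace it."""
        tmp_name = None
        try:
            fd, tmp_name = tempfile.mkstemp(
                dir=str(self.state_path.parent), suffix=".tmp"
            )
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(current, fh)
            os.replace(tmp_name, self.state_path)  # atomic on the same filesystem
            self.hashes = dict(current)
        except OSError as exc:
            # Log instead of failing silently, so repeated write errors are visible.
            log.warning("could not save hash state to %s: %s", self.state_path, exc)
            if tmp_name and os.path.exists(tmp_name):
                os.unlink(tmp_name)
```

Writing to a temporary file in the same directory and then calling os.replace means a reader never sees a half-written state file, which matters if the process dies mid-save.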
Tests that verify EXTENDS edge dependent expansion were missing pass2_edges() calls in their setup, resulting in no EXTENDS edges being written to the graph. Also fixed the crash marker not being cleaned up in the _fallback_to_full code path, and invalid Kuzu SHOW_TABLES syntax in test_incremental_pass5_6_always_global.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
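The crash-marker cleanup mentioned in this commit can be pictured with a small sketch. Only the marker filename comes from the PR description. The function name, the callables, and placing the marker inside the DB directory are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable

MARKER_NAME = ".graph_increment_in_progress"  # filename from the PR description


def rebuild_with_marker(
    db_dir: Path,
    incremental: Callable[[], None],
    full: Callable[[], None],
) -> str:
    """Run an incremental update guarded by a marker file (hypothetical helper).

    Returns "incremental" or "full" depending on which path completed.
    """
    marker = db_dir / MARKER_NAME
    if marker.exists():
        # A previous run died mid-update, so the graph may be half-written.
        full()
        marker.unlink()
        return "full"

    marker.touch()
    try:
        incremental()
        mode = "incremental"
    except Exception:
        # Fallback path. If full() also raises, the marker is left in place
        # on purpose so that the next run rebuilds from scratch.
        full()
        mode = "full"
    # Cleanup happens on BOTH paths; skipping it on the fallback path would
    # force a full rebuild on every later run.
    marker.unlink()
    return mode
```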
Process edge deletions across all scope files before deleting any nodes. The previous per-file loop could fail when file B has an EXTENDS edge to file A: deleting A's nodes first was rejected by Kuzu because B's edge into A had not been removed yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
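The ordering problem can be reproduced without a database. The sketch below uses a toy in-memory graph that, like a plain (non-DETACH) DELETE in Kuzu as I understand it, refuses to delete a node that still has edges. ToyGraph and both delete functions are hypothetical and are not the PR's _delete_file_scope.

```python
class ToyGraph:
    """In-memory stand-in for a graph DB that rejects deleting connected nodes."""

    def __init__(self) -> None:
        self.nodes = {}   # node id -> filename that defines the node
        self.edges = []   # (src id, dst id, source_file) tuples

    def delete_edges_from_file(self, filename: str) -> None:
        # source_file is the origin-side file, as in the PR's edge schema.
        self.edges = [e for e in self.edges if e[2] != filename]

    def delete_nodes_in_file(self, filename: str) -> None:
        doomed = {n for n, f in self.nodes.items() if f == filename}
        for src, dst, _ in self.edges:
            if src in doomed or dst in doomed:
                raise RuntimeError(f"node still has connected edge {src}->{dst}")
        for n in doomed:
            del self.nodes[n]


def delete_scope_per_file(graph: ToyGraph, files: list) -> None:
    """Old shape: edges then nodes, one file at a time (can fail)."""
    for f in files:
        graph.delete_edges_from_file(f)
        graph.delete_nodes_in_file(f)


def delete_scope_two_phase(graph: ToyGraph, files: list) -> None:
    """New shape: all edges across the scope first, then all nodes."""
    for f in files:
        graph.delete_edges_from_file(f)
    for f in files:
        graph.delete_nodes_in_file(f)
```

With nodes A (A.java) and B (B.java) and an edge B -> A whose source_file is B.java, the per-file version fails on A.java because B's edge is still present, while the two-phase version removes that edge before touching any node.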
…-G2)

Add subprocess wrapper that passes the --incremental flag to build_ast_graph.py. Part of incremental graph rebuild implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Update increment command to run both CocoIndex catch-up and incremental Kuzu graph update
- Add --vectors-only flag to preserve old Lance-only behavior
- Update CLI help texts and documentation
- Emit JSON output from incremental_rebuild for mode detection
- Add/update tests for new increment behavior

Increment now updates both Lance and Kuzu by default. The old stale warning is only emitted when --vectors-only is used.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mental rebuild

- Fix _write_clients_producers_and_calls: use correct parameter names ($sid/$cid/$pid/$rid) matching Cypher templates, add missing fields (strategy, method_call, raw_uri, match, direction, raw_topic)
- Use dict lookup instead of O(n) list scan for client/producer source_file
- Use keyword args for MemberEntry placeholder construction
- Delete db+conn before fallback to avoid file lock
- Remove redundant import json inside main()
- Remove stale duplicate comment in pass1_parse

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix _write_clients_producers_and_calls: use asdict(row) for Client/Producer nodes instead of manually constructed dicts with wrong field names (kind vs client_kind, target vs target_service, missing 10+ fields)
- Fix _delete_file_scope: add ALL edge tables to Phase 1 deletion (was missing EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS; would crash on any Spring codebase with controllers)
- Use DETACH DELETE for Route/Client/Producer nodes as safety net
- Fix N+1 query in dependent expansion: single IN-query instead of per-file

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
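Mismatches between query placeholders and supplied parameters, like the kind vs client_kind case above, can be caught with a small check before the query is executed. The sketch below is one possible approach, not something the PR contains. The ClientRow fields are a hypothetical subset and CREATE_CLIENT is an invented template.

```python
import re
from dataclasses import asdict, dataclass

# Matches $name style placeholders. A "$" inside a quoted string literal in
# the query would be a false positive; this sketch ignores that case.
_PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def missing_params(query: str, params: dict) -> set:
    """Return placeholder names used in the query but absent from params."""
    return set(_PLACEHOLDER.findall(query)) - set(params)


@dataclass
class ClientRow:  # hypothetical subset of the real row's fields
    id: str
    client_kind: str
    target_service: str
    filename: str


# Hypothetical template; the real Cypher templates are not shown in this PR page.
CREATE_CLIENT = (
    "CREATE (c:Client {id: $id, client_kind: $client_kind, "
    "target_service: $target_service, filename: $filename})"
)

row = ClientRow("c1", "feign", "accounts", "AccountsClient.java")

# asdict(row) keeps parameter names in lockstep with the dataclass fields.
good = missing_params(CREATE_CLIENT, asdict(row))

# A hand-built dict with drifted names leaves placeholders unbound.
bad = missing_params(CREATE_CLIENT, {"id": "c1", "kind": "feign", "target": "accounts"})
```

Here good is empty, while bad lists the three placeholders the hand-built dict fails to supply. Calling such a check from a unit test over every template would flag this class of error even when fixtures contain no clients or producers.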
PR #284 Review: Incremental Graph Rebuild (G1+G2+G3)
+2003 / -67 lines across 8 files. The architecture is sound, but there's a critical runtime bug that must be fixed before merge.
Critical Bug:

| Edge type | Missing params |
|---|---|
| DECLARES_CLIENT | strategy |
| DECLARES_PRODUCER | strategy |
| HTTP_CALLS | strategy, method_call, raw_uri, match |
| ASYNC_CALLS | strategy, direction, raw_topic, match |

3. O(n²) lookup — Source file resolution for HTTP_CALLS/ASYNC_CALLS builds a new list and does .index() for every row:
tables.client_rows[[c.id for c in tables.client_rows].index(row.client_id)].filename
Should use a dict lookup like the existing _write_routes_and_exposes does.
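A minimal sketch of the two lookup shapes, assuming client ids are unique. The row classes below are simplified, hypothetical stand-ins for the real table rows.

```python
from dataclasses import dataclass


@dataclass
class ClientRow:  # simplified stand-in
    id: str
    filename: str


@dataclass
class HttpCallRow:  # simplified stand-in
    client_id: str


def source_files_list_scan(client_rows, call_rows):
    """Flagged pattern: rebuilds the id list and scans it for every call row."""
    return [
        client_rows[[c.id for c in client_rows].index(r.client_id)].filename
        for r in call_rows
    ]


def source_files_dict_lookup(client_rows, call_rows):
    """Suggested pattern: build the id -> filename map once, then look up."""
    file_by_client_id = {c.id: c.filename for c in client_rows}
    return [file_by_client_id[r.client_id] for r in call_rows]
```

Both return the same list when ids are unique. With duplicate ids they would differ, since .index() finds the first match and the dict keeps the last one. An unknown client_id raises ValueError in the first version and KeyError in the second.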
Why tests don't catch this: The minimal test fixtures (class A {}, class B extends A {}) have no routes/clients/producers, so these loops are never entered. A test with Spring-annotated classes would surface the crash.
Significant Issues (should fix)
4. Duplicated _load_existing_types / _load_existing_types_filtered — These two functions (and their _members counterparts) are ~50 lines each and differ only by the AND NOT (s.filename IN $exclude_files) clause. Extract common logic into a single function with an optional exclude_files param.
5. Repetitive _find_dependents — Six nearly identical if/elif branches that only differ by the edge label name. Loop over a list of edge type strings instead.
6. _write_nodes_merge duplicates _write_nodes — These ~70-line functions differ only by using _MERGE_SYMBOL vs _CREATE_SYMBOL. Factor to a shared helper that accepts the query template as a parameter.
7. _file_by_node_id built twice independently — Once in _write_edges and once in _write_routes_and_exposes. Compute once and share.
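For item 4, one way to fold the filtered and unfiltered loaders into a single function is to make the exclusion clause optional. The sketch below only builds the query text and parameters. Apart from the AND NOT (s.filename IN $exclude_files) clause quoted above, the query shape, the Symbol label, and the function name are assumptions.

```python
from typing import Optional, Sequence

# Hypothetical base query; only the exclusion clause is taken from the review.
_BASE = "MATCH (s:Symbol) WHERE s.kind IN $kinds"
_EXCLUDE = " AND NOT (s.filename IN $exclude_files)"
_RETURN = " RETURN s.id, s.name, s.filename"


def existing_types_query(
    kinds: Sequence[str],
    exclude_files: Optional[Sequence[str]] = None,
):
    """Return (query, params); the filter is added only when files are given."""
    params = {"kinds": list(kinds)}
    query = _BASE
    if exclude_files:
        query += _EXCLUDE
        params["exclude_files"] = list(exclude_files)
    return query + _RETURN, params
```

Callers that previously used the unfiltered variant simply omit exclude_files, so there is one code path to maintain.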
Minor Issues (nice to fix)
8. Duplicate comment in pass1_parse — "Skip files not in scope" appears twice.
9. import json inside main() and _cmd_increment — should be top-level imports.
10. FileHashTracker.save() silently swallows OSError — consider logging a warning so it's discoverable when writes fail repeatedly.
11. _fallback_to_full duplicates hash-init logic from the no-DB branch of incremental_rebuild — extract to a shared helper.
12. AGENTS.md says ontology_version is 15 but is now 17 — worth updating.
Test Coverage Gap
No test with Spring annotations (routes, clients, producers) that exercises the pass 5-6 global path with actual Client/Producer/HTTP_CALLS/ASYNC_CALLS edges. This is why the critical bug above slipped through. Adding one test against the bank-chat-system fixture (which has Spring controllers) would catch it.
Summary
The design — phase-based deletion, single-hop dependent expansion with cap, crash marker, automatic fallback — is well-considered. The critical bug in _write_clients_producers_and_calls (wrong param names + missing params) must be fixed before merge. The duplication issues are worth addressing but not blocking.
- Merge _load_existing_types/_load_existing_types_filtered into single function with optional exclude_files parameter (same for members)
- Simplify _find_dependents: loop over edge type strings instead of six identical if/elif branches
- Factor _write_nodes and _write_nodes_merge to shared _write_nodes_impl accepting the query template as parameter (~70 lines deduplicated)
- Extract _build_file_by_node_id and share between _write_edges and _write_routes_and_exposes (was built twice independently)
- Extract _init_hash_tracker helper for duplicated hash-init logic in _fallback_to_full and no-DB branch of incremental_rebuild
- Add warning log in FileHashTracker.save() instead of silent OSError
- Update AGENTS.md ontology_version references from 15 to 17

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
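The _find_dependents simplification might look roughly like the sketch below: one loop over edge labels, and one IN-query per label instead of one query per file. Which six labels the real function uses is not stated, so the tuple is a guess, and run_query is a hypothetical callable standing in for the Kuzu connection.

```python
from typing import Callable, Iterable

# ASSUMPTION: the actual six labels are not listed in the PR page.
DEPENDENT_EDGE_LABELS = (
    "EXTENDS", "IMPLEMENTS", "INJECTS", "OVERRIDES", "CALLS", "DECLARES",
)


def find_dependents(
    changed_files: Iterable[str],
    run_query: Callable[[str, dict], Iterable[str]],
) -> set:
    """Return files with an edge pointing into any changed file (single hop)."""
    changed = set(changed_files)
    dependents = set()
    for label in DEPENDENT_EDGE_LABELS:
        # Relationship labels cannot be passed as query parameters, so the
        # label is interpolated. It comes from the fixed tuple above, never
        # from user input.
        query = (
            f"MATCH (src)-[:{label}]->(dst) "
            "WHERE dst.filename IN $files "
            "RETURN DISTINCT src.filename"
        )
        dependents.update(run_query(query, {"files": sorted(changed)}))
    # Files already in scope are not dependents of themselves.
    return dependents - changed
```

The 50-file cap from the PR description is left out because the page does not say what happens when it is exceeded.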
…eview round 2

- Write phantom Route nodes in _write_clients_producers_and_calls using MERGE (pass5 creates phantom routes for cross-service calls that were never persisted to Kuzu, silently dropping HTTP_CALLS/ASYNC_CALLS edges)
- Remove redundant _load_existing_members before full pass1_parse in global pass 5-6 step (was creating duplicate stub members alongside full entries)
- Use conn.close() + del instead of bare del for Kuzu handle cleanup on fallback (avoids relying on CPython ref-counting for file lock release)
- Add FileNotFoundError handling in FileHashTracker.detect_changes for files that vanish between listing and hashing
- Remove redundant inline import json in cli.py _cmd_increment
- Update stale _delete_file_scope docstring to match implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
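The listing-versus-hashing race can be handled by treating a file that disappears mid-scan as absent. This is a sketch under that assumption; how the PR's detect_changes classifies such a file is not shown, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def hash_if_present(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest, or None if the file vanished."""
    try:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    except FileNotFoundError:
        # The file was listed a moment ago but has since been deleted.
        return None


def hash_listed_files(paths: Iterable[Path]) -> dict:
    """Hash every path that still exists; silently skip vanished ones."""
    hashes = {}
    for p in paths:
        digest = hash_if_present(p)
        if digest is not None:
            hashes[str(p)] = digest
    return hashes
```

A skipped file is missing from the result, so a diff against the previously stored hashes would report it as removed if it had been tracked before.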
Scope
Implements the full incremental graph rebuild plan from plans/active/PLAN-INCREMENTAL-GRAPH.md as a single PR containing three logically distinct layers:

PR-G1: Hash tracker + source_file edge schema
- Add source_file STRING column to all 12 edge table DDLs for file-scoped deletion
- Add FileHashTracker class (SHA-256, atomic save, change detection)
- Bump ONTOLOGY_VERSION from 16 → 17 (re-index required)

PR-G2: Incremental rebuild orchestrator
- Add incremental_rebuild() function with scoped pass 1–4 and global pass 5–6
- Add _load_existing_types(), _load_existing_members() for cross-file resolution
- Add _find_dependents() for single-hop dependent expansion (cap: 50 files)
- Add _delete_file_scope() (all edges before any nodes)
- Add _scoped_write() for writing into existing DB without schema drop
- Add .graph_increment_in_progress marker file with fallback to full rebuild
- Add --incremental CLI flag to build_ast_graph.py
- Add run_incremental_graph() wrapper to pipeline.py

PR-G3: CLI integration
- Update _cmd_increment() to run incremental graph update after CocoIndex
- Add --vectors-only flag to preserve old Lance-only behavior
- _INCREMENT_WARNING_LINES / _emit_increment_kuzu_warning()
- Update docs/JAVA-CODEBASE-RAG-CLI.md

Files changed

| File | Change |
|---|---|
| ast_java.py | ONTOLOGY_VERSION 16 → 17 |
| build_ast_graph.py | source_file STRING on all 12 edge DDLs + edge-write queries + FileHashTracker |
| build_ast_graph.py | incremental_rebuild(), _load_existing_*(), _find_dependents(), _delete_file_scope(), _scoped_write(), --incremental flag |
| java_codebase_rag/pipeline.py | run_incremental_graph() wrapper |
| java_codebase_rag/cli.py | increment calls graph update, --vectors-only, removes stale warning |
| tests/test_incremental_graph.py | |
| tests/test_java_codebase_rag_cli.py | |
| README.md | |
| docs/JAVA-CODEBASE-RAG-CLI.md | increment command docs |

Manual Evidence
Design Notes
- _delete_file_scope deletes ALL edges across all scope files first, then deletes nodes. This prevents Kuzu errors when file A's nodes have incoming edges from file B that haven't been cleaned up yet.
- source_file semantics: Origin-side file only (e.g., for CALLS edges, it's the caller's filename). Dependent expansion covers target-side changes.

Reindex Required
Existing installations must run java-codebase-rag reprocess once after upgrading to add the source_file column to edge tables. This is a one-time migration triggered by the ONTOLOGY_VERSION bump from 16 to 17.

🤖 Generated with Claude Code