feat: incremental graph rebuild (3-PR sequence: G1+G2+G3) by HumanBean17 · Pull Request #284 · HumanBean17/java-codebase-rag

HumanBean17 · 2026-06-07T12:43:43Z

Scope

Implements the full incremental graph rebuild plan from plans/active/PLAN-INCREMENTAL-GRAPH.md as a single PR containing three logically distinct layers:

PR-G1: Hash tracker + `source_file` edge schema

Adds source_file STRING column to all 12 edge table DDLs for file-scoped deletion
Implements FileHashTracker class (SHA-256, atomic save, change detection)
Bumps ONTOLOGY_VERSION from 16 → 17 (re-index required)
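
A hash tracker of the kind listed above could look roughly like the following. This is a minimal sketch, not the PR's code: only the class name `FileHashTracker`, the use of SHA-256, the atomic save, and the added/changed/removed classification come from this description. The constructor signature, the JSON state file, and the method bodies are assumptions.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


class FileHashTracker:
    """Illustrative sketch only; the PR's real class may differ."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        self.hashes = {}    # path -> hex digest from the last saved run
        self._current = {}  # digests computed by the latest detect_changes()
        if self.state_path.exists():
            self.hashes = json.loads(self.state_path.read_text(encoding="utf-8"))

    @staticmethod
    def _sha256(path):
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def detect_changes(self, paths):
        """Return (added, changed, removed) relative to the saved state."""
        current = {}
        for path in paths:
            try:
                current[str(path)] = self._sha256(path)
            except FileNotFoundError:
                continue  # file vanished between listing and hashing
        added = sorted(p for p in current if p not in self.hashes)
        changed = sorted(
            p for p in current if p in self.hashes and self.hashes[p] != current[p]
        )
        removed = sorted(p for p in self.hashes if p not in current)
        self._current = current
        return added, changed, removed

    def save(self):
        """Persist the latest digests via a temp file plus os.replace()."""
        parent = str(self.state_path.parent)
        fd, tmp_name = tempfile.mkstemp(dir=parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self._current, fh)
        os.replace(tmp_name, self.state_path)  # readers never see a partial file
        self.hashes = dict(self._current)
```

Writing to a temporary file in the same directory and then calling `os.replace()` means an interrupted save leaves the previous state file intact rather than a truncated one.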

PR-G2: Incremental rebuild orchestrator

Adds incremental_rebuild() function with scoped passes 1–4 and global passes 5–6
Implements _load_existing_types(), _load_existing_members() for cross-file resolution
Implements _find_dependents() for single-hop dependent expansion (cap: 50 files)
Implements phase-based _delete_file_scope() (all edges before any nodes)
Implements _scoped_write() for writing into existing DB without schema drop
Crash safety via .graph_increment_in_progress marker file with fallback to full rebuild
Adds --incremental CLI flag to build_ast_graph.py
Adds run_incremental_graph() wrapper to pipeline.py
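
The crash-safety item above can be pictured with a small sketch. Only the marker file name `.graph_increment_in_progress` and the idea of falling back to a full rebuild come from this PR. The function name `run_with_crash_marker`, its callback parameters, and the choice to fall back on any exception are assumptions made for illustration.

```python
from pathlib import Path

MARKER_NAME = ".graph_increment_in_progress"  # name taken from the PR description


def run_with_crash_marker(db_dir, incremental, full_rebuild):
    """Hypothetical wrapper: returns which mode actually ran."""
    marker = Path(db_dir) / MARKER_NAME
    if marker.exists():
        # A previous incremental run never finished, so the graph may be
        # half-updated. Clear the marker and rebuild from scratch.
        marker.unlink()
        full_rebuild()
        return "full"
    marker.touch()
    try:
        incremental()
    except Exception:
        # Assumption: any failure triggers the fallback. The marker is
        # removed here too, so the next run does not rebuild again.
        marker.unlink(missing_ok=True)
        full_rebuild()
        return "full"
    marker.unlink(missing_ok=True)
    return "incremental"
```

The marker is removed on every exit path, including the fallback path, which is the cleanup one of the later commits in this PR describes fixing.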

PR-G3: CLI integration

Updates _cmd_increment() to run incremental graph update after CocoIndex
Adds --vectors-only flag to preserve old Lance-only behavior
Removes stale _INCREMENT_WARNING_LINES / _emit_increment_kuzu_warning()
Updates README CLI cheat sheet and roadmap
Updates docs/JAVA-CODEBASE-RAG-CLI.md

Files changed

File	PR	Changes
`ast_java.py`	G1	`ONTOLOGY_VERSION` 16 → 17
`build_ast_graph.py`	G1	`source_file STRING` on all 12 edge DDLs + edge-write queries + `FileHashTracker`
`build_ast_graph.py`	G2	`incremental_rebuild()`, `_load_existing_*()`, `_find_dependents()`, `_delete_file_scope()`, `_scoped_write()`, `--incremental` flag
`java_codebase_rag/pipeline.py`	G2/G3	`run_incremental_graph()` wrapper
`java_codebase_rag/cli.py`	G3	`increment` calls graph update, `--vectors-only`, removes stale warning
`tests/test_incremental_graph.py`	G1+G2	22 tests (9 G1 + 13 G2)
`tests/test_java_codebase_rag_cli.py`	G3	Updated stale warning test + 5 new CLI tests
`README.md`	G3	CLI cheat sheet + roadmap update
`docs/JAVA-CODEBASE-RAG-CLI.md`	G3	`increment` command docs

Manual Evidence

# All 22 incremental tests pass
.venv/bin/python -m pytest tests/test_incremental_graph.py -v
============================== 22 passed ==============================

# CLI tests pass
.venv/bin/python -m pytest tests/test_java_codebase_rag_cli.py -v
============================== 19 passed ==============================

# Lint checks pass
.venv/bin/ruff check build_ast_graph.py ast_java.py tests/test_incremental_graph.py java_codebase_rag/cli.py java_codebase_rag/pipeline.py README.md docs/JAVA-CODEBASE-RAG-CLI.md tests/test_java_codebase_rag_cli.py
All checks passed!

# Full test suite passes
.venv/bin/python -m pytest tests -v
============ 670 passed, 9 skipped, 0 failed ============

Design Notes

Phase-based deletion: _delete_file_scope deletes ALL edges across all scope files first, then deletes nodes. This prevents Kuzu errors when file A's nodes have incoming edges from file B that haven't been cleaned up yet.
Pass 5–6 always global: Client/producer extraction and cross-service matching iterate all members/routes — cheap in-memory operations that ensure consistency.
source_file semantics: Origin-side file only (e.g., for CALLS edges, it's the caller's filename). Dependent expansion covers target-side changes.
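
The ordering problem behind phase-based deletion can be reproduced without a database. The sketch below uses a tiny in-memory stand-in (`TinyGraph` is hypothetical and is not Kuzu's API) that refuses to drop a node while edges are still attached, and compares a per-file loop with the two-phase order described above. Edges are keyed by their origin-side `source_file`, as in the design note.

```python
class TinyGraph:
    """In-memory stand-in for the graph store (not Kuzu's API)."""

    def __init__(self):
        self.nodes = {}  # node id -> filename that declares it
        self.edges = []  # (src id, dst id, source_file of the origin side)

    def delete_node(self, node_id):
        # Mimic a store that refuses to drop a node with attached edges.
        if any(node_id in (src, dst) for src, dst, _ in self.edges):
            raise RuntimeError(f"node {node_id!r} still has edges")
        del self.nodes[node_id]

    def delete_edges_from(self, files):
        self.edges = [e for e in self.edges if e[2] not in files]

    def delete_nodes_in(self, files):
        for node_id in [n for n, f in self.nodes.items() if f in files]:
            self.delete_node(node_id)


def delete_per_file(graph, scope_files):
    """Naive order: finish one file (edges, then nodes) before the next."""
    for filename in scope_files:
        graph.delete_edges_from({filename})
        graph.delete_nodes_in({filename})


def delete_phase_based(graph, scope_files):
    """Two-phase order: all edges in scope first, then all nodes."""
    scope = set(scope_files)
    graph.delete_edges_from(scope)
    graph.delete_nodes_in(scope)


def sample_graph():
    graph = TinyGraph()
    graph.nodes = {"A": "A.java", "B": "B.java"}
    graph.edges = [("B", "A", "B.java")]  # class B extends A
    return graph
```

With `sample_graph()`, `delete_per_file(graph, ["A.java", "B.java"])` raises because node `A` still has the incoming edge owned by `B.java`, while `delete_phase_based` empties both collections.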

Reindex Required

Existing installations must run java-codebase-rag reprocess once after upgrading to add the source_file column to edge tables. This is a one-time migration triggered by the ONTOLOGY_VERSION bump from 16 to 17.

🤖 Generated with Claude Code

…ntal graph rebuild (PR-G1)

This commit implements PR-G1 of the incremental graph rebuild plan:
- Bump ONTOLOGY_VERSION to 17 (requires re-index)
- Add source_file STRING column to all 12 edge DDL constants
- Update _write_edges() to pass source_file for EXTENDS, IMPLEMENTS, INJECTS, DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT
- Update _write_routes_and_exposes() to pass source_file for EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS
- Add FileHashTracker class for detecting file changes (added, changed, removed)
- Add 9 tests for FileHashTracker and edge schema validation

Scope: PR-G1 (Hash tracker + source_file edge schema)
Plan: plans/active/PLAN-INCREMENTAL-GRAPH.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Tests that verify EXTENDS edge dependent expansion were missing pass2_edges() calls in their setup, resulting in no EXTENDS edges being written to the graph. Also fixed crash marker not being cleaned up in the _fallback_to_full code path and invalid Kuzu SHOW_TABLES syntax in test_incremental_pass5_6_always_global. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Process edge deletions across all scope files before deleting any nodes. The previous per-file loop could fail when file B has an EXTENDS edge to file A — deleting A's nodes first left dangling edges that Kuzu refused to drop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-G2) Add subprocess wrapper that passes --incremental flag to build_ast_graph.py. Part of incremental graph rebuild implementation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Update increment command to run both CocoIndex catch-up and incremental Kuzu graph update
- Add --vectors-only flag to preserve old Lance-only behavior
- Update CLI help texts and documentation
- Emit JSON output from incremental_rebuild for mode detection
- Add/update tests for new increment behavior

Increment now updates both Lance and Kuzu by default. The old stale warning is only emitted when --vectors-only is used.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…mental rebuild

- Fix _write_clients_producers_and_calls: use correct parameter names ($sid/$cid/$pid/$rid) matching Cypher templates, add missing fields (strategy, method_call, raw_uri, match, direction, raw_topic)
- Use dict lookup instead of O(n) list scan for client/producer source_file
- Use keyword args for MemberEntry placeholder construction
- Delete db+conn before fallback to avoid file lock
- Remove redundant import json inside main()
- Remove stale duplicate comment in pass1_parse

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Fix _write_clients_producers_and_calls: use asdict(row) for Client/Producer nodes instead of manually constructed dicts with wrong field names (kind vs client_kind, target vs target_service, missing 10+ fields)
- Fix _delete_file_scope: add ALL edge tables to Phase 1 deletion (was missing EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS; would crash on any Spring codebase with controllers)
- Use DETACH DELETE for Route/Client/Producer nodes as safety net
- Fix N+1 query in dependent expansion: single IN-query instead of per-file

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

HumanBean17 · 2026-06-07T18:25:12Z

PR #284 Review: Incremental Graph Rebuild (G1+G2+G3)

+2003 / -67 lines across 8 files — architecture is sound, but there's a critical runtime bug that must be fixed before merge.

Critical Bug: `_write_clients_producers_and_calls` will crash at runtime

This function has multiple issues that would cause failures on any real codebase with routes/clients/producers:

1. Wrong parameter names — The Cypher templates use $sid/$cid/$pid/$rid for MATCH node lookups, but the function passes $src/$dst:

# _CREATE_DECLARES_CLIENT expects: MATCH (s:Symbol {id: $sid}), (c:Client {id: $cid})
conn.execute(_CREATE_DECLARES_CLIENT, {
    "src": row.symbol_id,   # BUG: should be "sid"
    "dst": row.client_id,   # BUG: should be "cid"
    ...
})

Same mismatch for DECLARES_PRODUCER ($sid/$pid → "src"/"dst"), HTTP_CALLS ($cid/$rid → "src"/"dst"), and ASYNC_CALLS ($pid/$rid → "src"/"dst").

2. Missing parameters — All four edge-writing loops omit required template parameters:

Edge type	Missing params
`DECLARES_CLIENT`	`strategy`
`DECLARES_PRODUCER`	`strategy`
`HTTP_CALLS`	`strategy`, `method_call`, `raw_uri`, `match`
`ASYNC_CALLS`	`strategy`, `direction`, `raw_topic`, `match`

3. O(n²) lookup — Source file resolution for HTTP_CALLS/ASYNC_CALLS builds a new list and does .index() for every row:

tables.client_rows[[c.id for c in tables.client_rows].index(row.client_id)].filename

Should use a dict lookup like the existing _write_routes_and_exposes does.
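
For item 3, a minimal sketch of the dict-based lookup. `ClientRow` here is a hypothetical stand-in with only the two attributes the review mentions (`id` and `filename`), and `build_file_by_client_id` is an illustrative helper name, not a function from the PR.

```python
from dataclasses import dataclass


@dataclass
class ClientRow:
    """Hypothetical stand-in for the real client row type."""
    id: str
    filename: str


def build_file_by_client_id(client_rows):
    # One pass to build the index; every later lookup is O(1) on average.
    return {row.id: row.filename for row in client_rows}


rows = [ClientRow("c1", "OrderClient.java"), ClientRow("c2", "UserClient.java")]
file_by_client_id = build_file_by_client_id(rows)
source_file = file_by_client_id["c2"]  # replaces the rebuild-list-and-index() scan
```

A missing id raises `KeyError` here, where the list version raises `ValueError`. If a default is preferable, `file_by_client_id.get(client_id)` returns `None` for an unknown id.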

Why tests don't catch this: The minimal test fixtures (class A {}, class B extends A {}) have no routes/clients/producers, so these loops are never entered. A test with Spring-annotated classes would surface the crash.
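
Independently of richer fixtures, the name mismatches in items 1 and 2 are the kind of error a small static check can flag, because it compares a template's `$placeholders` with the keys of the parameter dict without executing anything. The sketch below is an assumption-laden illustration: `check_params` is a hypothetical helper, and only the MATCH clause of the example template is quoted from this review (the CREATE clause is an assumed shape).

```python
import re

_PLACEHOLDER = re.compile(r"\$([A-Za-z_]\w*)")


def check_params(template, params):
    """Return (missing, unexpected) parameter names for a query template."""
    expected = set(_PLACEHOLDER.findall(template))
    given = set(params)
    return sorted(expected - given), sorted(given - expected)


# MATCH clause as quoted in the review; CREATE clause is assumed for the example.
EXAMPLE_TEMPLATE = (
    "MATCH (s:Symbol {id: $sid}), (c:Client {id: $cid}) "
    "CREATE (s)-[:DECLARES_CLIENT "
    "{strategy: $strategy, source_file: $source_file}]->(c)"
)

missing, unexpected = check_params(
    EXAMPLE_TEMPLATE,
    {"src": "sym-1", "dst": "client-1", "source_file": "A.java"},
)
print(missing)     # ['cid', 'sid', 'strategy']
print(unexpected)  # ['dst', 'src']
```

The regular expression is deliberately simple and would also match a `$name` inside a string literal, so it suits a test helper better than production validation.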

Significant Issues (should fix)

4. Duplicated _load_existing_types / _load_existing_types_filtered — These two functions (and their _members counterparts) are ~50 lines each and differ only by the AND NOT (s.filename IN $exclude_files) clause. Extract common logic into a single function with an optional exclude_files param.

5. Repetitive _find_dependents — Six nearly identical if/elif branches that only differ by the edge label name. Loop over a list of edge type strings instead.

6. _write_nodes_merge duplicates _write_nodes — These ~70-line functions differ only by using _MERGE_SYMBOL vs _CREATE_SYMBOL. Factor to a shared helper that accepts the query template as a parameter.

7. _file_by_node_id built twice independently — Once in _write_edges and once in _write_routes_and_exposes. Compute once and share.
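
Item 5 can be illustrated with an in-memory sketch of single-hop dependent expansion that loops over edge labels instead of branching per label. The label tuple below is an illustrative subset rather than the PR's actual list, edges are modelled as plain `(label, source_file, target_file)` tuples instead of Kuzu queries, and the 50-file cap is not modelled.

```python
# Illustrative subset; the real list of six labels is not given in this review.
DEPENDENT_EDGE_LABELS = ("EXTENDS", "IMPLEMENTS", "INJECTS", "CALLS")


def find_dependents(edges, changed_files, labels=DEPENDENT_EDGE_LABELS):
    """Return files with an edge of a listed label pointing into changed_files.

    Single hop only: dependents of dependents are not followed.
    """
    changed = set(changed_files)
    dependents = set()
    for label in labels:
        # In a database-backed version, each iteration would presumably run
        # one query with the label substituted into a shared template.
        for edge_label, source_file, target_file in edges:
            if edge_label != label:
                continue
            if target_file in changed and source_file not in changed:
                dependents.add(source_file)
    return sorted(dependents)


edges = [
    ("EXTENDS", "B.java", "A.java"),   # B extends A: B depends on A
    ("CALLS", "C.java", "B.java"),     # second hop, ignored when only A changes
    ("DECLARES", "D.java", "A.java"),  # label not in the list, ignored
]
dependents = find_dependents(edges, ["A.java"])
```

Adding or removing a label then becomes a one-line change to the tuple rather than another copied branch.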

Minor Issues (nice to fix)

8. Duplicate comment in pass1_parse — "Skip files not in scope" appears twice.

9. import json inside main() and _cmd_increment — should be top-level imports.

10. FileHashTracker.save() silently swallows OSError — consider logging a warning so it's discoverable when writes fail repeatedly.

11. _fallback_to_full duplicates hash-init logic from the no-DB branch of incremental_rebuild — extract to a shared helper.

12. AGENTS.md says ontology_version is 15, but it is now 17. Worth updating.
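
For item 10, a minimal sketch of logging instead of silently swallowing the error. `save_hashes` is a hypothetical free function standing in for `FileHashTracker.save()`, and the atomic temp-file step is omitted to keep the example short.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def save_hashes(path, hashes):
    """Write the hash map as JSON; warn and return False if the write fails."""
    try:
        Path(path).write_text(json.dumps(hashes), encoding="utf-8")
        return True
    except OSError as exc:
        # Still non-fatal, but now visible in logs when it happens repeatedly.
        logger.warning("could not save file hashes to %s: %s", path, exc)
        return False
```

Returning a boolean is one option for letting the caller react to a failed save. Whether the real method should return anything is a design choice this sketch does not settle.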

Test Coverage Gap

No test with Spring annotations (routes, clients, producers) that exercises the pass 5-6 global path with actual Client/Producer/HTTP_CALLS/ASYNC_CALLS edges. This is why the critical bug above slipped through. Adding one test against the bank-chat-system fixture (which has Spring controllers) would catch it.

Summary

The design — phase-based deletion, single-hop dependent expansion with cap, crash marker, automatic fallback — is well-considered. The critical bug in _write_clients_producers_and_calls (wrong param names + missing params) must be fixed before merge. The duplication issues are worth addressing but not blocking.

- Merge _load_existing_types/_load_existing_types_filtered into single function with optional exclude_files parameter (same for members)
- Simplify _find_dependents: loop over edge type strings instead of six identical if/elif branches
- Factor _write_nodes and _write_nodes_merge to shared _write_nodes_impl accepting the query template as parameter (~70 lines deduplicated)
- Extract _build_file_by_node_id and share between _write_edges and _write_routes_and_exposes (was built twice independently)
- Extract _init_hash_tracker helper for duplicated hash-init logic in _fallback_to_full and no-DB branch of incremental_rebuild
- Add warning log in FileHashTracker.save() instead of silent OSError
- Update AGENTS.md ontology_version references from 15 to 17

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…eview round 2

- Write phantom Route nodes in _write_clients_producers_and_calls using MERGE (pass5 creates phantom routes for cross-service calls that were never persisted to Kuzu, silently dropping HTTP_CALLS/ASYNC_CALLS edges)
- Remove redundant _load_existing_members before full pass1_parse in global pass 5-6 step (was creating duplicate stub members alongside full entries)
- Use conn.close() + del instead of bare del for Kuzu handle cleanup on fallback (avoids relying on CPython ref-counting for file lock release)
- Add FileNotFoundError handling in FileHashTracker.detect_changes for files that vanish between listing and hashing
- Remove redundant inline import json in cli.py _cmd_increment
- Update stale _delete_file_scope docstring to match implementation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

HumanBean17 and others added 5 commits June 7, 2026 15:42

feat: add run_incremental_graph() wrapper for incremental rebuild (PR…

0e4fa31


HumanBean17 changed the title ~~feat: add source_file to edge schemas and FileHashTracker for incremental graph rebuild (PR-G1)~~ feat: incremental graph rebuild (3-PR sequence: G1+G2+G3) Jun 7, 2026

HumanBean17 and others added 2 commits June 7, 2026 18:06

HumanBean17 and others added 2 commits June 7, 2026 21:48

HumanBean17 merged commit 67a76de into master Jun 7, 2026
1 check passed

This was referenced Jun 7, 2026

chore: release v0.4.0 #286

Merged

fix(update): run incremental graph rebuild, drop stale "not implemented" warning #310

Merged

HumanBean17 merged 9 commits into master from feat/incremental-graph
