
fix(cli): erase removes graph/cocoindex.db/.graph_hashes.json by type (#346) #348

Merged
HumanBean17 merged 2 commits into master from bugfix/erase
Jun 30, 2026

@HumanBean17


Summary

Fixes #346.

java-codebase-rag erase reported success: true but did not delete the LadybugDB graph (code_graph.lbug), because the deletion was type-blind:

| Path | Real type | Code used | Result |
|---|---|---|---|
| code_graph.lbug | regular file | shutil.rmtree(path, ignore_errors=True) | rmtree on a file raises → swallowed → no-op |
| cocoindex.db | directory | Path.unlink() | IsADirectoryError → swallowed by except OSError → no-op |
| .graph_hashes.json | regular file | (never targeted) | survives |

The surviving code_graph.lbug then made the next init refuse (exit 2, pointing back at erase --yes), a deadloop of the documented clean-slate workflow (erase --yes → init).
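To make the failure mode concrete, here is a minimal sketch that mimics the two type-blind deletes on a throwaway layout. The function name `type_blind_erase` and the temp-dir setup are illustrative assumptions, not the project's code; only the two standard-library calls are taken from the table above.

```python
import shutil
import tempfile
from pathlib import Path


def type_blind_erase(index_dir: Path) -> dict[str, bool]:
    """Mimic the pre-fix deletes and report which artifacts still exist.

    Hypothetical stand-in for illustration; not the project's erase command.
    """
    graph = index_dir / "code_graph.lbug"  # regular file
    coco = index_dir / "cocoindex.db"      # directory

    # rmtree aimed at a regular file fails internally; ignore_errors hides it.
    shutil.rmtree(graph, ignore_errors=True)

    # unlink aimed at a directory raises an OSError subclass
    # (IsADirectoryError on Linux); the except clause hides it.
    try:
        coco.unlink()
    except OSError:
        pass

    return {"graph_survives": graph.exists(), "cocoindex_survives": coco.exists()}


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / "code_graph.lbug").write_bytes(b"graph")
    (index_dir / "cocoindex.db").mkdir()
    result = type_blind_erase(index_dir)
    print(result)
```

Both artifacts are still on disk afterwards and no exception reaches the caller, which is how erase could report success while leaving the graph behind.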

Fix

Add a _rm_any(path) helper that dispatches on type (file / dir / symlink), so both the file-backed and directory-backed LadybugDB layouts are handled. erase now removes code_graph.lbug, cocoindex.db, and .graph_hashes.json, and lists the hash store in the Will delete: preview.

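import shutil
from pathlib import Path

def _rm_any(path: Path) -> None:
    """Remove path whether it is a file, a directory, or a symlink."""
    try:
        if path.is_dir() and not path.is_symlink():
            # Real directory: remove it recursively.
            shutil.rmtree(path, ignore_errors=True)
        elif path.exists() or path.is_symlink():
            # Regular file or symlink (even a dangling one): remove the entry itself.
            path.unlink()
    except OSError:
        pass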

Does reprocess have the same problem? — No

Investigated explicitly per the task. reprocess (default path → run_refresh_pipeline) rebuilds in place and never relies on the broken deletion:

  • Its full rebuild opens the existing .lbug and calls _drop_all() on every node + edge/REL table (Symbol, Route, Client, Producer, GraphMeta, CALLS, HTTP_CALLS, ASYNC_CALLS, EXTENDS, IMPLEMENTS, DECLARES, …), then recreates the schema and rewrites fresh.
  • _init_hash_tracker deliberately resets .graph_hashes.json to mirror exactly the indexed files (no stale hashes).
  • cocoindex --full-reprocess rebuilds Lance + cocoindex.db.

A repo-wide grep confirms cli.py:625,628 were the only type-blind deletes of ladybug_path/cocoindex_db — so init/increment/reprocess are all unaffected. No change to reprocess.

Tests

  • New (always-on): test_erase_removes_graph_file_cocoindex_dir_and_hash_store — creates a real on-disk layout (code_graph.lbug file, cocoindex.db/ dir, .graph_hashes.json file), runs erase --yes, asserts all three are gone. No embedding-model dependency → runs on every CI job. Watched this fail (erase left code_graph.lbug on disk) before the fix and pass after (TDD red→green).
  • Fixed false-green: test_init_after_erase_succeeds previously erased an empty index dir and then inited (so it never erased a real graph). Converted to a real init → erase → re-init lifecycle: asserts code_graph.lbug is gone after erase and that the second init succeeds (rc 0).
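The false-green pattern can be sketched as follows. `buggy_erase` and `graph_gone_after_erase` are hypothetical stand-ins (the real tests drive the CLI); the sketch only shows why an assertion made against an empty index dir cannot detect the bug, while the same assertion against a real layout can.

```python
import shutil
import tempfile
from pathlib import Path


def buggy_erase(index_dir: Path) -> None:
    """Stand-in for the pre-fix erase: rmtree aimed at a regular file."""
    shutil.rmtree(index_dir / "code_graph.lbug", ignore_errors=True)


def graph_gone_after_erase(index_dir: Path) -> bool:
    """Run the stand-in erase and check that the graph file is absent."""
    buggy_erase(index_dir)
    return not (index_dir / "code_graph.lbug").exists()


with tempfile.TemporaryDirectory() as tmp:
    # False-green: nothing exists to erase, so the check passes despite the bug.
    passes_on_empty = graph_gone_after_erase(Path(tmp))

with tempfile.TemporaryDirectory() as tmp:
    real_dir = Path(tmp)
    (real_dir / "code_graph.lbug").write_bytes(b"graph")
    # Real layout: the same check now exposes the surviving file.
    passes_on_real = graph_gone_after_erase(real_dir)

print(passes_on_empty, passes_on_real)  # True False
```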

Manual evidence (issue reproduction)

IDX=/tmp/erase-bug2
java-codebase-rag init  --source-root tests/bank-chat-system --index-dir "$IDX" --quiet
java-codebase-rag erase --source-root tests/bank-chat-system --index-dir "$IDX" --yes
ls "$IDX"                  # empty (previously: code_graph.lbug + cocoindex.db/ + .graph_hashes.json survived)
java-codebase-rag init  --source-root tests/bank-chat-system --index-dir "$IDX" --quiet   # success (previously exit 2)

Validation

  • .venv/bin/ruff check . → clean
  • .venv/bin/python -m pytest tests -q (serial) → 851 passed, 14 skipped, 0 errors
  • No reindex / env-var / ontology change (CLI-only fix).

🤖 Generated with Claude Code

…#346)

`erase` reported success but left code_graph.lbug on disk because its
deletion was type-blind: shutil.rmtree silently no-ops on a regular file
(code_graph.lbug) and Path.unlink raises IsADirectoryError on a directory
(cocoindex.db), both swallowed; .graph_hashes.json was never targeted.
The next init then refused (exit 2), deadlooping the documented
`erase --yes` -> `init` clean-slate workflow.

Replace the type-blind deletes with a _rm_any helper that dispatches on
type (file/dir/symlink — a symlinked dir is unlinked, never recursed into,
so the target is not followed), so both the file-backed and dir-backed
LadybugDB layouts are handled. erase now also removes .graph_hashes.json
and lists it in the "Will delete:" preview. Deletion failures are warned
to stderr instead of swallowed, so erase no longer reports success while
leaving an artifact behind (the same silent-failure class as #346).

`reprocess` is unaffected: its full rebuild opens the existing .lbug and
_drop_all()s every node + edge table in place, and _init_hash_tracker
resets .graph_hashes.json — it never relies on the broken deletion.

Tests: add an always-on regression that creates a real lbug-file /
cocoindex.db-dir / hash-store layout and asserts erase removes all three;
convert the false-green test_init_after_erase_succeeds into a real
build -> erase -> re-init lifecycle check.

Co-Authored-By: Claude <noreply@anthropic.com>
@HumanBean17


Self-review before requesting review

I ran a high-effort code review on this diff (3 finder angles + verification) and addressed the actionable findings:

Addressed

  • Deletion failures now warn to stderr instead of being swallowed. The original _rm_any caught OSError: pass, so on a read-only fs / permission denial / EBUSY, erase would still emit success: true while leaving an artifact, the same silent-failure class as #346. Now each failure prints warning: failed to remove <path>: <err>, matching the existing Lance drop_table warning convention in the same function. Verified empirically (chmod 555 on the index dir → 3 warnings).
  • path.unlink(missing_ok=True) for cleaner intent vs. relying on the catch for a TOCTOU vanish.
  • Deduped the .graph_hashes.json literal via a local graph_hashes_path var (was spelled out twice).
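A minimal sketch of a helper along the lines of the points above, assuming the warning format quoted in the first bullet. The name `rm_any_with_warning`, the boolean return value, and calling `shutil.rmtree` without `ignore_errors` are assumptions made for illustration; the merged code may differ.

```python
import shutil
import sys
import tempfile
from pathlib import Path


def rm_any_with_warning(path: Path) -> bool:
    """Remove a file, directory, or symlink; warn on stderr if that fails.

    Returns True when nothing is left at `path`. The return value is a
    hypothetical addition that makes the outcome easy to check.
    """
    try:
        if path.is_dir() and not path.is_symlink():
            # Real directory: remove recursively and let errors surface.
            shutil.rmtree(path)
        else:
            # Regular file or symlink; missing_ok covers a path that
            # vanished between the type check and the delete.
            path.unlink(missing_ok=True)
    except OSError as err:
        print(f"warning: failed to remove {path}: {err}", file=sys.stderr)
    return not (path.exists() or path.is_symlink())


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / "code_graph.lbug").write_bytes(b"graph")
    (index_dir / "cocoindex.db").mkdir()
    (index_dir / "cocoindex.db" / "state").write_text("x")
    graph_hashes_path = index_dir / ".graph_hashes.json"
    graph_hashes_path.write_text("{}")

    targets = [index_dir / "code_graph.lbug", index_dir / "cocoindex.db", graph_hashes_path]
    removed = [rm_any_with_warning(p) for p in targets]
    leftovers = sorted(p.name for p in index_dir.iterdir())
    print(removed, leftovers)
```

Returning a status lets a caller downgrade its own success report when any target survives, which is the failure class the warning is meant to expose.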

Considered and intentionally left out (out of scope for #346)

  • .graph_increment_in_progress (incremental crash marker, build_ast_graph.py:3814) and .graph_hashes.json.tmp (atomic-write temp) are not removed by erase. They only exist after a crashed build, are self-healing on the next run, and are not part of the deadloop that #346 reports. Removing the whole index_dir to close the class was rejected because it would bypass the clean lancedb.drop_table API (rm-ing .lance dirs directly corrupts LanceDB metadata). Happy to address these in a follow-up if you want a stricter "clean slate".
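  • Kept the symlink-aware branch (is_dir() and not is_symlink()): dropping it would route a symlinked dir to shutil.rmtree, which refuses to operate on symbolic links (it raises OSError), so the link would not be removed. The current logic unlinks the link itself and never touches its target, which is correct here.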
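The symlink handling can be checked with a small sketch. `remove_entry` is a hypothetical copy of the dispatch logic, and the directory names are made up for the example; it shows that a symlinked directory is unlinked while its target stays intact.

```python
import shutil
import tempfile
from pathlib import Path


def remove_entry(path: Path) -> None:
    """Dispatch on type: recurse into real directories, unlink everything else."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
    elif path.exists() or path.is_symlink():
        path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "shared_data"  # hypothetical directory outside the index
    target.mkdir()
    (target / "keep.txt").write_text("important")

    link = root / "cocoindex.db"   # index artifact that is really a symlink
    link.symlink_to(target, target_is_directory=True)

    remove_entry(link)

    link_gone = not link.is_symlink()
    target_intact = (target / "keep.txt").read_text() == "important"
    print(link_gone, target_intact)  # True True
```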

Validation

  • ruff check . clean · pytest tests -q (serial) → 851 passed, 14 skipped, 0 errors · issue reproduction now succeeds end-to-end (empty index dir after erase; re-init no longer exits 2).

reprocess was investigated per the task and is not affected — it rebuilds in place (_drop_all on every node+edge table, _init_hash_tracker resets hashes) and never relies on the broken deletion.

Co-Authored-By: Claude <noreply@anthropic.com>
@HumanBean17 HumanBean17 merged commit 66bb3a0 into master Jun 30, 2026
1 check passed
Linked issue: Bug: erase leaves the LadybugDB graph on disk; subsequent init refuses (exit 2) (#346)