
serialize lance optimize to fix reprocess commit-conflict race #309

Merged
HumanBean17 merged 1 commit into master from fix/lance-optimize-race on Jun 13, 2026
@HumanBean17

Owner

Scope

Fixes #308. java-codebase-rag reprocess floods stderr with:

ERROR cocoindex.connectors.lancedb._target: Exception in optimizing LanceDB table javacodeindex_java_code
RuntimeError: lance error: Retryable commit conflict for version 4424: This Rewrite transaction was preempted by concurrent transaction Delete at version 4424. Please retry.

Root cause

cocoindex 1.0.7 schedules table.optimize() (a LanceDB Rewrite/compaction transaction) as a background asyncio task, concurrently with mutation batches that issue table.delete() (Delete transactions). LanceDB does not allow a Rewrite to commit concurrently with a Delete (upstream lancedb#1504 — "We do not support concurrent deletes right now. I'd recommend serializing…"). cocoindex's _run_optimize logs and never retries on this conflict, so the table is left un-optimized/fragmented and stderr floods. lancedb.AsyncTable.optimize() has no retry parameter.
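The race can be illustrated with a toy model. The sketch below is not LanceDB's conflict-resolution logic; `ToyTable` and `CommitConflict` are hypothetical names, and the rule "a Rewrite fails if a Delete committed after its snapshot" is a simplifying assumption that mirrors the error message above. It shows why running the Rewrite concurrently with a Delete fails while running it after the writers finish does not.

```python
import asyncio


class CommitConflict(RuntimeError):
    """Raised by the toy table when a Rewrite loses to a concurrent Delete."""


class ToyTable:
    """Toy in-memory stand-in for a versioned table (not LanceDB's real logic)."""

    def __init__(self) -> None:
        self.version = 0
        self.delete_versions: list[int] = []  # versions created by Delete commits

    async def delete(self) -> None:
        # A Delete commits immediately and bumps the table version.
        self.version += 1
        self.delete_versions.append(self.version)

    async def optimize(self) -> None:
        read_version = self.version  # snapshot the Rewrite is based on
        await asyncio.sleep(0)       # yield control: compaction work happens here
        # Reject the Rewrite if any Delete committed after the snapshot.
        if any(v > read_version for v in self.delete_versions):
            raise CommitConflict("Rewrite preempted by concurrent Delete")
        self.version += 1


async def concurrent_run() -> str:
    """Background optimize racing a delete, as in the failing flow."""
    table = ToyTable()
    results = await asyncio.gather(
        table.optimize(), table.delete(), return_exceptions=True
    )
    return "conflict" if isinstance(results[0], CommitConflict) else "ok"


async def serialized_run() -> str:
    """All writers finish first, then the Rewrite runs alone."""
    table = ToyTable()
    await table.delete()
    await table.optimize()
    return "ok"
```

In this model `asyncio.run(concurrent_run())` yields `"conflict"` and `asyncio.run(serialized_run())` yields `"ok"`, which is the ordering the fix below relies on.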

The fix (3 parts)

  1. Disable cocoindex's concurrent background optimize at the source (java_index_flow_lancedb.py): add _NUM_TXN_BEFORE_OPTIMIZE = 10**12 (with a comment citing the race + cocoindex 1.0.7) and pass num_transactions_before_optimize=_NUM_TXN_BEFORE_OPTIMIZE to all three lancedb.mount_table_target(...) calls. This stops any background optimize() from running during the flow, so the Rewrite-vs-Delete race cannot occur. Safe: optimize() is purely maintenance (compact/prune/index); upsert/delete correctness via merge_insert does not depend on it.

  2. New serialized optimize helper with retry guard (java_codebase_rag/lance_optimize.py):

    • LANCE_TABLE_NAMES constant (the three tables) — single source of truth, imported by the flow instead of the inline literals.
    • async def optimize_lance_tables(index_dir, *, quiet=False) -> dict[str, str]: lazy import of lancedb (the flow imports this module for the constant and must not pay the lancedb import cost); connect_async → list_tables → per-table open_table + optimize(). Retry loop (6 attempts, exponential backoff 0.1 * 2**attempt) on errors whose str(exc) contains "Retryable commit conflict" OR "preempted by concurrent transaction"; non-conflict errors are not retried. Missing tables (e.g. a repo with no SQL/YAML) are reported skipped. db.close() runs in finally (it is a sync method in lancedb 0.30.x). All diagnostics go to stderr (this is callable from the stdio MCP / JSON-stdout paths); per-table status returned as a dict, errors captured as "error: <text>".
  3. Wire the post-optimize into both cocoindex chokepoints — run optimize only after cocoindex returns exit 0 (no concurrent writers → clean optimize):

    • pipeline.run_cocoindex_update (java_codebase_rag/pipeline.py, used by init / increment / reprocess --vectors-only): after the subprocess completes with code == 0, asyncio.run(optimize_lance_tables(...)). Index dir resolved from the passed env (JAVA_CODEBASE_RAG_INDEX_DIR, set by config.subprocess_env / apply_to_os_environ — the same key the flow's lifespan reads). If absent, skip with a stderr warning (do not crash). The CompletedProcess return is unchanged on optimize failure; outcome logged to stderr.
    • server.run_refresh_pipeline (server.py, default reprocess): in the if ok: branch, before the graph-build step, await optimize_lance_tables(<resolved index_dir>, quiet=quiet). Index dir resolved the same way the server does (env var → <root>/.java-codebase-rag). New optional field optimize_error: str | None on RefreshIndexOutput; an optimize failure is surfaced via that field + message + stderr, but never flips success/exit semantics for a vectors phase that succeeded.
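The gating described in part 3 for the subprocess path can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in pipeline.py: `optimize_after_flow` is a hypothetical name, the `optimize` argument stands in for optimize_lance_tables, and the exact warning text is invented.

```python
import asyncio
import subprocess
import sys
from typing import Any, Callable, Coroutine, Mapping

INDEX_DIR_ENV = "JAVA_CODEBASE_RAG_INDEX_DIR"  # env key named in the PR description


def optimize_after_flow(
    proc: subprocess.CompletedProcess,
    env: Mapping[str, str],
    optimize: Callable[[str], Coroutine[Any, Any, dict[str, str]]],
) -> subprocess.CompletedProcess:
    """Run `optimize` only after a successful flow; always return `proc` unchanged."""
    if proc.returncode != 0:
        return proc  # flow failed: do not touch the tables
    index_dir = env.get(INDEX_DIR_ENV)
    if not index_dir:
        # Missing index dir is a warning, never a crash.
        print(f"warning: {INDEX_DIR_ENV} not set; skipping optimize", file=sys.stderr)
        return proc
    try:
        statuses = asyncio.run(optimize(index_dir))
        print(f"lance optimize: {statuses}", file=sys.stderr)
    except Exception as exc:
        # An optimize failure is logged but must not change the flow's result.
        print(f"lance optimize failed: {exc}", file=sys.stderr)
    return proc
```

Returning the same CompletedProcess object on every path is what keeps exit semantics tied to the flow alone; the optimize outcome only ever reaches stderr.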

Manual / test evidence

$ .venv/bin/ruff check .
All checks passed!

$ .venv/bin/python -m pytest tests -q
746 passed, 11 skipped, 18 warnings in 480.72s

(The 11 skips are the heavy e2e tests gated behind JAVA_CODEBASE_RAG_RUN_HEAVY=1, per tests/README.md.)

New tests in tests/test_lance_optimize.py (fakes the lancedb async conn/table — no real LanceDB needed):

  • test_optimize_retries_commit_conflict_then_succeeds — 2 conflicts then ok → asserts 3 calls, status ok.
  • test_optimize_does_not_retry_non_conflict_error — a ValueError is captured per-table, not retried (1 call).
  • test_optimize_reports_missing_table_as_skipped — absent tables come back skipped, no exception.
  • test_optimize_closes_connection_even_on_open_failure: db.close() runs in finally.
  • test_lance_table_names_constant_matches_search_lancedb_tables — single source of truth agrees with search_lancedb.TABLES.
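The behaviours these tests pin down can be sketched with fakes alone. The block below is an illustration, not the contents of lance_optimize.py or the test file: `optimize_tables`, `FakeTable` and `FakeDb` are hypothetical names, and the fake `list_tables()` returning a plain list of names is an assumption about shape rather than a statement about the lancedb API.

```python
import asyncio

# Substrings the PR description treats as marking a retryable conflict.
CONFLICT_MARKERS = ("Retryable commit conflict", "preempted by concurrent transaction")


async def _optimize_with_retry(table, attempts, base_delay):
    """Retry optimize() on conflict errors only, with exponential backoff."""
    for attempt in range(attempts):
        try:
            await table.optimize()
            return "ok"
        except Exception as exc:
            retryable = any(marker in str(exc) for marker in CONFLICT_MARKERS)
            if not retryable or attempt == attempts - 1:
                return f"error: {exc}"
            await asyncio.sleep(base_delay * 2**attempt)
    return "error: no attempts made"  # only reachable when attempts <= 0


async def optimize_tables(db, names, *, attempts=6, base_delay=0.1):
    """Return {name: "ok" | "skipped" | "error: <text>"} for each requested table."""
    statuses = {}
    try:
        existing = set(await db.list_tables())
        for name in names:
            if name not in existing:
                statuses[name] = "skipped"  # e.g. a repo with no SQL/YAML
                continue
            try:
                table = await db.open_table(name)
                statuses[name] = await _optimize_with_retry(table, attempts, base_delay)
            except Exception as exc:
                statuses[name] = f"error: {exc}"
    finally:
        db.close()  # sync call in this sketch, per the note on lancedb 0.30.x
    return statuses


class FakeTable:
    """Raises the queued exceptions in order, then succeeds."""

    def __init__(self, errors):
        self.errors = list(errors)
        self.calls = 0

    async def optimize(self):
        self.calls += 1
        if self.errors:
            raise self.errors.pop(0)


class FakeDb:
    """Minimal async connection fake holding a name -> FakeTable mapping."""

    def __init__(self, tables):
        self.tables = tables
        self.closed = False

    async def list_tables(self):
        return list(self.tables)

    async def open_table(self, name):
        return self.tables[name]

    def close(self):
        self.closed = True
```

With two queued conflict errors, `optimize_tables` calls `optimize()` three times and reports `ok`; with a queued `ValueError` it calls once and reports `error: ...`. Passing `base_delay=0.0` keeps such tests fast.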

tests/test_lancedb_e2e.py (heavy, runs --full-reprocess): added an assertion that the cocoindex flow stderr contains no "Retryable commit conflict" / "preempted by concurrent transaction" markers after the fix — this is the fixture where a race regression would surface.
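A check of this kind could look like the sketch below. `conflict_lines` is a hypothetical helper name, not the assertion used in the e2e test; it only assumes the captured stderr is available as a string.

```python
# Markers quoted in the PR description.
CONFLICT_MARKERS = ("Retryable commit conflict", "preempted by concurrent transaction")


def conflict_lines(stderr_text: str) -> list[str]:
    """Return every stderr line that contains one of the conflict markers."""
    return [
        line
        for line in stderr_text.splitlines()
        if any(marker in line for marker in CONFLICT_MARKERS)
    ]
```

A test would then assert `conflict_lines(captured_stderr) == []`; returning the offending lines rather than a boolean makes a regression easier to read in the failure output.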

Updated tests/fixtures/cli_progress_stdout/reprocess_quiet_success.stdout.txt baseline for the new additive optimize_error field (sorted-key JSON, as serialized by the CLI).
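As a small illustration of why the baseline changes: with sorted-key JSON, an additive field lands at its alphabetical position instead of at the end. The payload below is hypothetical apart from optimize_error (the other field names and the compact formatting are assumptions, not the CLI's actual output).

```python
import json


def serialize_result(payload: dict) -> str:
    """Serialize with sorted keys so fixture diffs are stable across runs."""
    return json.dumps(payload, sort_keys=True)


# Hypothetical payload before and after the additive field.
before = {"success": True, "message": "ok"}
after = {**before, "optimize_error": None}

print(serialize_result(after))
# prints {"message": "ok", "optimize_error": null, "success": true}
```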

Notes

  • No schema/ontology bump; no re-index-required callout (this is not a schema change — optimize() is pure maintenance).
  • No deviation from the plan. AsyncConnection.close() is a sync method in lancedb 0.30.x (verified in .venv/.../lancedb/db.py), so the helper calls db.close() directly rather than await db.close(). The helper uses list_tables() (the non-deprecated API, matching cocoindex's own _list_table_names helper) with a table_names() fallback.

🤖 Generated with Claude Code

cocoindex 1.0.7 schedules table.optimize() (a LanceDB Rewrite transaction)
as a background asyncio task that races concurrent table.delete() (Delete)
transactions, which LanceDB rejects (upstream lancedb#1504), flooding
reprocess stderr with "Retryable commit conflict ... preempted by concurrent
transaction Delete" and leaving tables un-optimized.

- Disable the in-flow background optimize by setting
  num_transactions_before_optimize=10**12 on all three mount_table_target
  calls in java_index_flow_lancedb.py (optimize is pure maintenance; upsert/
  delete correctness via merge_insert does not depend on it).
- Add java_codebase_rag/lance_optimize.py with a serialized
  optimize_lance_tables() helper that runs table.optimize() once per table
  after the flow returns (no concurrent writers), with retry + exponential
  backoff on the residual commit-conflict. LANCE_TABLE_NAMES becomes the
  single source of truth, imported by the flow.
- Wire the post-flow optimize into both cocoindex chokepoints:
  pipeline.run_cocoindex_update (used by init/increment/reprocess
  --vectors-only) and server.run_refresh_pipeline (default reprocess). An
  optimize failure is surfaced via stderr / the new RefreshIndexOutput
  optimize_error field / message; success is never flipped.

No schema/ontology bump; no re-index-required callout.

Co-Authored-By: Claude <noreply@anthropic.com>
@HumanBean17 force-pushed the fix/lance-optimize-race branch from dcaddb7 to 5c51baa on June 13, 2026 15:18
@HumanBean17 merged commit c26b94f into master on Jun 13, 2026
1 check failed
Development

Successfully merging this pull request may close these issues.

Reprocess error
