serialize lance optimize to fix reprocess commit-conflict race #309
Merged
cocoindex 1.0.7 schedules table.optimize() (a LanceDB Rewrite transaction) as a background asyncio task that races concurrent table.delete() (Delete) transactions, which LanceDB rejects (upstream lancedb#1504), flooding reprocess stderr with "Retryable commit conflict ... preempted by concurrent transaction Delete" and leaving tables un-optimized.

- Disable the in-flow background optimize by setting num_transactions_before_optimize=10**12 on all three mount_table_target calls in java_index_flow_lancedb.py (optimize is pure maintenance; upsert/delete correctness via merge_insert does not depend on it).
- Add java_codebase_rag/lance_optimize.py with a serialized optimize_lance_tables() helper that runs table.optimize() once per table after the flow returns (no concurrent writers), with retry + exponential backoff on the residual commit-conflict. LANCE_TABLE_NAMES becomes the single source of truth, imported by the flow.
- Wire the post-flow optimize into both cocoindex chokepoints: pipeline.run_cocoindex_update (used by init/increment/reprocess --vectors-only) and server.run_refresh_pipeline (default reprocess). An optimize failure is surfaced via stderr / the new RefreshIndexOutput optimize_error field / message; success is never flipped.

No schema/ontology bump; no re-index-required callout.

Co-Authored-By: Claude <noreply@anthropic.com>
Scope
Fixes #308.
java-codebase-rag reprocess floods stderr with "Retryable commit conflict ... preempted by concurrent transaction Delete" messages.

Root cause
cocoindex 1.0.7 schedules table.optimize() (a LanceDB Rewrite/compaction transaction) as a background asyncio task, concurrently with mutation batches that issue table.delete() (Delete transactions). LanceDB does not allow a Rewrite to commit concurrently with a Delete (upstream lancedb#1504: "We do not support concurrent deletes right now. I'd recommend serializing…"). cocoindex's _run_optimize logs the conflict and never retries, so the table is left un-optimized/fragmented and stderr floods. lancedb.AsyncTable.optimize() has no retry parameter.

The fix (3 parts)
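As a rough illustration of why serializing helps, the toy model below mimics the conflict rule described under Root cause: a rewrite fails to commit if a delete committed while it was running. ToyTable, CommitConflict, and both runner functions are hypothetical names invented for this sketch. It does not use LanceDB or cocoindex, and the real conflict detection is more involved.

```python
import asyncio


class CommitConflict(Exception):
    """Hypothetical error standing in for the real commit-conflict failure."""


class ToyTable:
    """Toy model only: a rewrite cannot commit if a delete landed meanwhile."""

    def __init__(self):
        self.version = 0

    async def delete(self):
        await asyncio.sleep(0.001)  # pretend to do some I/O
        self.version += 1           # commit the Delete

    async def optimize(self):
        seen = self.version
        await asyncio.sleep(0.05)   # pretend compaction takes a while
        if self.version != seen:
            raise CommitConflict("preempted by concurrent transaction Delete")
        self.version += 1           # commit the Rewrite


async def run_with_background_optimize():
    """Pre-fix shape: optimize runs as a background task beside the deletes."""
    table = ToyTable()
    background = asyncio.create_task(table.optimize())
    for _ in range(3):
        await table.delete()
    try:
        await background
        return "ok"
    except CommitConflict:
        return "conflict"


async def run_with_serialized_optimize():
    """Post-fix shape: optimize runs once, after all writers have finished."""
    table = ToyTable()
    for _ in range(3):
        await table.delete()
    await table.optimize()
    return "ok"
```

In this model the background variant always hits the conflict because the deletes commit while the rewrite is still in flight, whereas the serialized variant has nothing left to race against.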
- Disable cocoindex's concurrent background optimize at the source (java_index_flow_lancedb.py): add _NUM_TXN_BEFORE_OPTIMIZE = 10**12 (with a comment citing the race + cocoindex 1.0.7) and pass num_transactions_before_optimize=_NUM_TXN_BEFORE_OPTIMIZE to all three lancedb.mount_table_target(...) calls. This stops any background optimize() from running during the flow, so the Rewrite-vs-Delete race cannot occur. Safe: optimize() is purely maintenance (compact/prune/index); upsert/delete correctness via merge_insert does not depend on it.
- New serialized optimize helper with retry guard (java_codebase_rag/lance_optimize.py):
  - LANCE_TABLE_NAMES constant (the three tables): single source of truth, imported by the flow instead of the inline literals.
  - async def optimize_lance_tables(index_dir, *, quiet=False) -> dict[str, str]: lazy import lancedb (the flow imports this module for the constant and must not pay the lancedb import cost); connect_async → list_tables → per-table open_table + optimize().
  - Retry loop (6 attempts, exponential backoff 0.1 * 2**attempt) on errors whose str(exc) contains "Retryable commit conflict" OR "preempted by concurrent transaction"; non-conflict errors are not retried.
  - Missing tables (e.g. a repo with no SQL/YAML) are reported skipped.
  - db.close() runs in finally (it is a sync method in lancedb 0.30.x).
  - All diagnostics go to stderr (this is callable from the stdio MCP / JSON-stdout paths); per-table status returned as a dict, errors captured as "error: <text>".
- Wire the post-optimize into both cocoindex chokepoints, running optimize only after cocoindex returns exit 0 (no concurrent writers → clean optimize):
  - pipeline.run_cocoindex_update (java_codebase_rag/pipeline.py, used by init/increment/reprocess --vectors-only): after the subprocess completes with code == 0, asyncio.run(optimize_lance_tables(...)). Index dir resolved from the passed env (JAVA_CODEBASE_RAG_INDEX_DIR, set by config.subprocess_env / apply_to_os_environ, the same key the flow's lifespan reads). If absent, skip with a stderr warning (do not crash). The CompletedProcess return is unchanged on optimize failure; outcome logged to stderr.
  - server.run_refresh_pipeline (server.py, default reprocess): in the if ok: branch, before the graph-build step, await optimize_lance_tables(<resolved index_dir>, quiet=quiet). Index dir resolved the same way the server does (env var → <root>/.java-codebase-rag). New optional field optimize_error: str | None on RefreshIndexOutput; an optimize failure is surfaced via that field + message + stderr, but never flips success/exit semantics for a vectors phase that succeeded.

Manual / test evidence
(The 11 skips are the heavy e2e tests gated behind JAVA_CODEBASE_RAG_RUN_HEAVY=1, per tests/README.md.)

New tests in tests/test_lance_optimize.py (fakes the lancedb async conn/table; no real LanceDB needed):
- test_optimize_retries_commit_conflict_then_succeeds: 2 conflicts then ok → asserts 3 calls, status ok.
- test_optimize_does_not_retry_non_conflict_error: a ValueError is captured per-table, not retried (1 call).
- test_optimize_reports_missing_table_as_skipped: absent tables come back skipped, no exception.
- test_optimize_closes_connection_even_on_open_failure: db.close() runs in finally.
- test_lance_table_names_constant_matches_search_lancedb_tables: single source of truth agrees with search_lancedb.TABLES.

tests/test_lancedb_e2e.py (heavy, runs --full-reprocess): added an assertion that the cocoindex flow stderr contains no "Retryable commit conflict" / "preempted by concurrent transaction" markers after the fix. This is the fixture where a race regression would surface.

Updated tests/fixtures/cli_progress_stdout/reprocess_quiet_success.stdout.txt baseline for the new additive optimize_error field (sorted-key JSON, as serialized by the CLI).

Notes
- No schema/ontology bump and no re-index-required callout (optimize() is pure maintenance).
- AsyncConnection.close() is a sync method in lancedb 0.30.x (verified in .venv/.../lancedb/db.py), so the helper calls db.close() directly rather than await db.close().
- The helper uses list_tables() (the non-deprecated API, matching cocoindex's own _list_table_names helper) with a table_names() fallback.

🤖 Generated with Claude Code
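The retry guard described for optimize_lance_tables can be sketched roughly as follows. This is a minimal sketch under assumptions, not the PR's code: optimize_with_retry, FlakyTable, the base_delay parameter, and the fake error message are hypothetical, attempts are assumed to be numbered from 0, and the real helper additionally opens the connection, lists tables, and closes the connection in finally.

```python
import asyncio

# Substrings that mark a retryable commit conflict (quoted in the PR text).
CONFLICT_MARKERS = (
    "Retryable commit conflict",
    "preempted by concurrent transaction",
)
MAX_ATTEMPTS = 6  # attempt count stated in the PR description


async def optimize_with_retry(table, *, base_delay=0.1):
    """Call table.optimize(), retrying only on commit-conflict errors.

    Returns "ok" or "error: <text>" and never raises, so one failing
    table cannot abort the caller. `base_delay` is an assumed knob that
    lets tests shorten the backoff.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            await table.optimize()
            return "ok"
        except Exception as exc:
            is_conflict = any(marker in str(exc) for marker in CONFLICT_MARKERS)
            is_last_attempt = attempt == MAX_ATTEMPTS - 1
            if not is_conflict or is_last_attempt:
                # Non-conflict errors are not retried; exhausted retries give up.
                return f"error: {exc}"
            # Exponential backoff: base_delay * 1, 2, 4, ...
            await asyncio.sleep(base_delay * 2**attempt)


class FlakyTable:
    """Hypothetical fake: raises a conflict `failures` times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def optimize(self):
        self.calls += 1
        if self.calls <= self.failures:
            # Made-up message that merely contains the real marker substrings.
            raise RuntimeError(
                "Retryable commit conflict: "
                "preempted by concurrent transaction Delete"
            )
```

With FlakyTable(failures=2) the sketch calls optimize() three times and reports ok, which mirrors the shape of test_optimize_retries_commit_conflict_then_succeeds. Matching on message substrings is brittle, but the PR description indicates that this is how the helper recognizes the conflict.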