
Concurrent deferred-remap compaction + optimize_indices can corrupt table; list_indices panics #6623

@wjones127


Summary

A deferred-remap compaction (compact_files(defer_index_remap=True)) committed concurrently with an optimize_indices against an older dataset version can leave the table in a corrupt state where any call that walks the indices — including Dataset.list_indices() — panics with:

called `Result::unwrap()` on an `Err` value: InvalidInput {
  source: "The compaction plan included a rewrite group that was a split of indexed and non-indexed data: [...]",
  location: rust/lance-index/src/frag_reuse.rs:330
}

The panic site is the .unwrap() in Dataset::load_indices (rust/lance/src/index.rs:824 on released builds, :902 on main) when applying the FRI's remap_fragment_bitmap to a user index whose fragment_bitmap straddles a rewrite group's old_frags.

This is a table-corruption bug: once committed, the table can no longer be read via the index APIs. PR #6610 prevents new occurrences by rejecting the conflicting commit, but does not repair tables that were already written in the broken state.

Reproduction

Reliably reproduces on main prior to #6610 with the following sequence:

  1. Write frag0, build a vector index over it.
  2. Append frag1, snapshot a stale Dataset handle.
  3. Append frag2 on the up-to-date handle.
  4. plan_compaction + rewrite_files of [frag1, frag2] with defer_index_remap=true (compute RewriteResult but do not commit).
  5. On the stale handle, run optimize_indices — commits a CreateIndex covering frag1 only (frag2 didn't exist at that version).
  6. Commit the rewrite via commit_compaction. On builds prior to #6610 ("fix: reject Rewrite vs CreateIndex when FRI groups straddle bitmap") this succeeds.

Resulting state:

  • index segment A: bitmap = {frag0} (original)
  • index segment B: bitmap = {frag1} (from stale optimize)
  • FRI group: old=[frag1, frag2] → new=[frag3]

Segment B's bitmap covers frag1 but not frag2, so it straddles the FRI group's old fragments, and any load_indices call panics.
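
To make the straddle condition concrete, here is a minimal pure-Python model of the resulting state. It is a sketch, not Lance's implementation: `RewriteGroup`, `straddles`, and `remap_bitmap` are hypothetical names, fragment bitmaps are modelled as plain sets of fragment IDs, and the error condition is inferred from the message quoted in the Summary.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteGroup:
    """Hypothetical stand-in for one FRI rewrite group."""
    old_frags: tuple[int, ...]
    new_frags: tuple[int, ...]


def straddles(bitmap: set[int], group: RewriteGroup) -> bool:
    """True when the bitmap covers some, but not all, of the group's old fragments."""
    covered = bitmap & set(group.old_frags)
    return bool(covered) and len(covered) != len(group.old_frags)


def remap_bitmap(bitmap: set[int], groups: list[RewriteGroup]) -> set[int]:
    """Model of the remap: swap fully covered old fragments for the new ones.

    Raises ValueError on a partially covered group, mirroring the
    InvalidInput error that load_indices unwraps.
    """
    out = set(bitmap)
    for group in groups:
        old = set(group.old_frags)
        covered = out & old
        if not covered:
            continue  # this index segment is unaffected by the group
        if covered != old:
            raise ValueError(
                "rewrite group was a split of indexed and non-indexed data: "
                f"{list(group.old_frags)}"
            )
        out -= old
        out |= set(group.new_frags)
    return out


# State from the reproduction above.
segment_a = {0}  # original index over frag0
segment_b = {1}  # from the stale optimize_indices
fri_group = RewriteGroup(old_frags=(1, 2), new_frags=(3,))
```

In this model segment A remaps cleanly, while segment B raises because it covers frag1 but not frag2. An index that covered both old fragments would instead be remapped to the new fragment.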

A self-contained Rust test that reproduces this on main is in flight — link to draft PR will follow.

Production trigger

Reported on a real table with a rewrite group old=[382843, 382844, 382845, 382846]. ds.list_indices() panics; the table is unreadable through the index path.

Impact

Proposed fix

Add a Dataset.repair() API (Rust + Python) that:

  1. Detects this corruption non-destructively (reads via read_manifest_indexes, bypassing the FRI auto-remap in load_indices).
  2. Repairs by removing the straddling old fragment IDs from each affected index segment's fragment_bitmap. Previously-indexed rows in the merged new fragment fall through to flat scan until the next optimize_indices — no data loss, no retraining required.
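
The repair step can be sketched on the same kind of set-based model. This is only an illustration of the proposed behaviour, not the eventual Dataset.repair() implementation: `repair_bitmap` is a hypothetical helper, bitmaps are plain sets of fragment IDs, and each group is represented just by its list of old fragment IDs.

```python
def repair_bitmap(bitmap, groups_old_frags):
    """Drop the old fragment IDs of every group the bitmap only partially covers.

    `bitmap` is a set of fragment IDs for one index segment;
    `groups_old_frags` is a list of old-fragment-ID lists, one per rewrite group.
    Fully covered or untouched groups are left alone.
    """
    repaired = set(bitmap)
    for old_frags in groups_old_frags:
        old = set(old_frags)
        covered = repaired & old
        if covered and covered != old:
            # Straddling group: forget the partially covered old fragments.
            repaired -= covered
    return repaired


# State from the reproduction: segment B straddles old=[frag1, frag2].
segments = {"A": {0}, "B": {1}}
groups = [[1, 2]]
repaired = {name: repair_bitmap(bm, groups) for name, bm in segments.items()}
```

After the repair, segment B no longer claims frag1, so no bitmap straddles the group and the remap has nothing to reject. Rows that came from frag1 are served by flat scan until the next optimize_indices, as described above.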

Also:

Labels

bug, critical-fix
