Summary
Writing a list<uint64> column with one very long row (50,000 random uint64 values) and the rest empty panics the v2.1 miniblock encoder. Same data writes fine with data_storage_version='2.0' and '2.2'. Since v2.1 is the default on pylance==6.0.0-beta.3, any caller that doesn't explicitly pin a version is exposed.
This looks related to #6184 / fixed by #6234, but PR #6234's repdef_too_sparse_for_miniblock heuristic only inspects the global levels/values ratio. It doesn't catch the case where a single row's list dominates one chunk's repetition buffer, even when the global ratio looks healthy.
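To make the global-versus-local distinction concrete, here is a rough sketch. It assumes a Dremel-style layout in which every list item contributes one level and one value, and every empty list contributes one level and no value. The helper names and the fixed 4,096-level window are hypothetical. This is not how Lance forms chunks, and it does not reproduce the repdef_too_sparse_for_miniblock check; it only shows that one summary ratio can hide very uneven local structure.

```python
def level_slots(row_lengths):
    """Return one flag per level slot: True if the slot carries a value.

    Assumption: an empty list occupies one level slot with no value,
    and each list item occupies one level slot with one value.
    """
    flags = []
    for n in row_lengths:
        if n == 0:
            flags.append(False)        # empty list: a level but no value
        else:
            flags.extend([True] * n)   # one level per list item
    return flags


def values_per_window(flags, window):
    """Count value-carrying slots in each fixed-size window of levels."""
    return [sum(flags[i:i + window]) for i in range(0, len(flags), window)]


# Same shape as the reproduction: 50,000 rows, one of them 50,000 items long.
row_lengths = [0] * 50_000
row_lengths[25_000] = 50_000

flags = level_slots(row_lengths)
global_ratio = len(flags) / sum(flags)        # levels per value, whole column
per_window = values_per_window(flags, 4096)   # 4096 is an arbitrary window size

print(f"global levels/values: {global_ratio:.2f}")
print(f"windows with no values: {per_window.count(0)} of {len(per_window)}")
```

Under these assumptions the global ratio comes out at about 2.0, while 12 of the 25 windows carry levels and no values at all, so a check on the global ratio alone says little about any individual chunk.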
Reproduction
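import random
import lance
import pyarrow as pa
rng = random.Random(0)  # fixed seed so the values are reproducible
# 50,000 rows, all of them empty lists except one
rows = [[] for _ in range(50_000)]
# A single row holding 50,000 random 64-bit values
rows[25_000] = [rng.getrandbits(64) for _ in range(50_000)]
table = pa.table({"ids": pa.array(rows, type=pa.list_(pa.uint64()))})
# Panics with "2.1"; "2.0" and "2.2" succeed
lance.write_dataset(table, "/tmp/repro.lance", data_storage_version="2.1")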
Expected
write_dataset succeeds, or returns a clean Python error indicating the encoder cannot represent this data in v2.1 (so callers can fall back).
Actual
thread 'lance-cpu' panicked at rust/lance-encoding/src/encodings/logical/primitive.rs:3973:13:
assertion failed: chunk_bytes <= max_chunk_size
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at rust/lance-core/src/utils/tokio.rs:126:24:
called `Result::unwrap()` on an `Err` value: RecvError(())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: RecvError(())
The downstream RecvError(()) (the tokio.rs:126 panic) is the channel receiver dying after the upstream lance-cpu panic kills the encoder thread; it's not a separate bug.
A closely related variant of this trigger surfaces as an OSError at primitive.rs:3921:
OSError: Encountered internal error. Definition buffer size (NN bytes) too large,
rust/lance-encoding/src/encodings/logical/primitive.rs:3921:21
This has the same root cause, hit through the def-buffer size check rather than the chunk-bytes assertion, depending on the data shape.
Environment
- pylance == 6.0.0-beta.3
- macOS aarch64 and Linux x86_64
- Python 3.11
Workarounds confirmed
| data_storage_version | Result |
| --- | --- |
| "2.0" | ✅ writes fine (no miniblock) |
| "2.1" | ❌ panics |
| "2.2" | ✅ writes fine (support_large_chunk → u32 chunk size from #4959) |
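Until the encoder is fixed, a caller could pin one of the working versions explicitly. The sketch below is one possible shape for that, not an official recipe: write_with_pinned_version is a hypothetical helper, and it takes the write function as a parameter so it can be exercised without Lance installed. It deliberately never tries "2.1", because pyo3's PanicException derives from BaseException and is not caught by an ordinary `except Exception`.

```python
def write_with_pinned_version(write_fn, table, uri, versions=("2.2", "2.0")):
    """Try each storage version in order and return the first successful result.

    write_fn is expected to accept (table, uri, data_storage_version=...),
    the way lance.write_dataset is called in the reproduction.
    """
    failures = []
    for version in versions:
        try:
            return write_fn(table, uri, data_storage_version=version)
        except Exception as exc:
            # Ordinary errors (for example the OSError variant) land here.
            # A Rust panic surfaces as a BaseException and is NOT caught.
            failures.append(f"{version}: {exc!r}")
    raise RuntimeError("no storage version succeeded: " + "; ".join(failures))


# Intended use (not executed here):
# import lance
# write_with_pinned_version(lance.write_dataset, table, "/tmp/repro.lance")
```

The sketch does not clean up anything a failed attempt may have left at the target path; a real caller would need to decide how to handle that.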