
write_dataset panics on a list<uint64> column with one very long row (default v2.1 encoder) #6626

@pengw0048


Summary

Writing a list<uint64> column with one very long row (50,000 random uint64 values) and the rest empty panics the v2.1 miniblock encoder. Same data writes fine with data_storage_version='2.0' and '2.2'. Since v2.1 is the default on pylance==6.0.0-beta.3, any caller that doesn't explicitly pin a version is exposed.

This looks related to #6184 / fixed by #6234, but PR #6234's repdef_too_sparse_for_miniblock heuristic only inspects the global levels/values ratio. It doesn't catch the case where a single row's list dominates one chunk's repetition buffer, even when the global ratio looks healthy.
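To make the "global ratio looks healthy" point concrete, here is a rough toy calculation for the repro shape. It assumes a simplified Dremel-style accounting (one level entry per value, plus one per empty list); the helper `shape_stats` is hypothetical and is not how `repdef_too_sparse_for_miniblock` actually computes its ratio.

```python
def shape_stats(rows):
    """Toy accounting (an assumption, not Lance's real logic):
    one level entry per value, plus one level entry per empty list."""
    lengths = [len(r) for r in rows]
    values = sum(lengths)
    levels = values + sum(1 for n in lengths if n == 0)
    # Global levels/values ratio, the kind of signal a global heuristic sees.
    global_ratio = levels / values if values else float("inf")
    # Fraction of all values that live in the single longest row.
    max_row_share = max(lengths, default=0) / values if values else 0.0
    return global_ratio, max_row_share


# Same shape as the reproduction; the values themselves do not matter here.
rows = [[] for _ in range(50_000)]
rows[25_000] = [0] * 50_000

global_ratio, max_row_share = shape_stats(rows)
print(f"levels/values = {global_ratio:.2f}, "
      f"largest row holds {max_row_share:.0%} of values")
```

Under this toy model the global ratio is only about 2, even though every value sits in one row, so a check on the global ratio alone would presumably not flag the data.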

Reproduction

import random
import lance
import pyarrow as pa

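# Fixed seed so the repro is deterministic.
rng = random.Random(0)
# 50,000 rows, all empty lists except one.
rows = [[] for _ in range(50_000)]
# The single long row: 50,000 random 64-bit values.
rows[25_000] = [rng.getrandbits(64) for _ in range(50_000)]
table = pa.table({"ids": pa.array(rows, type=pa.list_(pa.uint64()))})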

lance.write_dataset(table, "/tmp/repro.lance", data_storage_version="2.1")

Expected

write_dataset succeeds, or returns a clean Python error indicating the encoder cannot represent this data in v2.1 (so callers can fall back).

Actual

thread 'lance-cpu' panicked at rust/lance-encoding/src/encodings/logical/primitive.rs:3973:13:
assertion failed: chunk_bytes <= max_chunk_size
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at rust/lance-core/src/utils/tokio.rs:126:24:
called `Result::unwrap()` on an `Err` value: RecvError(())

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: RecvError(())

The downstream RecvError(()) (the tokio.rs:126 panic) is the channel receiver dying after the upstream lance-cpu panic kills the encoder thread; it's not a separate bug.

A closely related variant of this trigger surfaces as the OSError form at primitive.rs:3921:

OSError: Encountered internal error. Definition buffer size (NN bytes) too large,
rust/lance-encoding/src/encodings/logical/primitive.rs:3921:21

This has the same root cause. Depending on the data shape, it is hit through the definition-buffer size check rather than the chunk-bytes assertion.

Environment

  • pylance == 6.0.0-beta.3
  • macOS aarch64 and Linux x86_64
  • Python 3.11

Workarounds confirmed

| data_storage_version | Result |
| --- | --- |
| "2.0" | ✅ writes fine (no miniblock) |
| "2.1" | ❌ panics |
| "2.2" | ✅ writes fine (support_large_chunk → u32 chunk size from #4959) |
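Until this is fixed, a caller could wrap the write in a version fallback along these lines. This is a minimal sketch, not a tested recipe: `write_with_fallback` and `_is_rust_panic` are hypothetical helpers, the version order is only an example, and it assumes a retry in the same process is safe after the panic and that the failed attempt leaves nothing committed at the target path, neither of which I have verified. The panic arrives as pyo3's `PanicException`, which derives from `BaseException`, so a plain `except Exception` would not catch it.

```python
def _is_rust_panic(exc):
    # Match by name so this sketch does not need to import pyo3_runtime.
    return type(exc).__name__ == "PanicException"


def write_with_fallback(write_fn, versions=("2.1", "2.2", "2.0")):
    """Call write_fn(data_storage_version=v) for each v until one succeeds.

    Returns the version that worked; raises RuntimeError if none did.
    """
    errors = []
    for version in versions:
        try:
            write_fn(data_storage_version=version)
            return version
        except Exception as exc:
            # Covers the OSError variant ("Definition buffer size ... too large").
            errors.append((version, exc))
        except BaseException as exc:
            # Only swallow Rust panics; let KeyboardInterrupt etc. propagate.
            if not _is_rust_panic(exc):
                raise
            errors.append((version, exc))
    summary = "; ".join(f"{v}: {e!r}" for v, e in errors)
    raise RuntimeError(f"all storage versions failed: {summary}")


# Usage with the repro above (assumes `table` from the Reproduction section):
#   import functools, lance
#   used = write_with_fallback(
#       functools.partial(lance.write_dataset, table, "/tmp/repro.lance"))
```

Passing the writer as a callable keeps the helper independent of lance itself, so it can be exercised with a stub.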
