What's broken
I have an Intel Xeon E5-2609 (Sandy Bridge, 2012 — has AVX and SSE4.2, no AVX2 and no FMA). The published lancedb wheel embeds lance, and `import lancedb` crashes immediately:

```
$ python -c "import lancedb"
[1] Illegal instruction (core dumped)
```
A friend hit the same thing on an AMD FX-7500 (Steamroller, 2014 — has AVX + FMA, no AVX2). Both CPUs are pre-Haswell on the AVX2 timeline.
The cause is that the workspace `.cargo/config.toml` compiles with `target-cpu=haswell` + `target-feature=+avx2,+fma,+f16c`, which bakes AVX2 and FMA into all of the compiled code — both the explicit SIMD kernels and any auto-vectorized loop in plain Rust. The existing runtime SIMD dispatch in `lance-core::utils::cpu::SIMD_SUPPORT` never gets a chance to run; the binary traps on its first AVX2 instruction at load time.
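To make the failure mode concrete, here is a small illustrative sketch (my own example, not lance code). `cfg!(target_feature = ...)` is resolved at compile time and is unconditionally true in a binary built with `target-cpu=haswell`, while `is_x86_feature_detected!` probes the CPU that is actually running. When the whole crate is compiled for Haswell, the compiler may emit AVX2 in any function, so a runtime check by itself cannot protect a pre-Haswell host:

```rust
// Illustrative sketch only (not lance code), targeting x86_64.
#[cfg(target_arch = "x86_64")]
fn report_avx2() {
    // Resolved at compile time: always true in a binary built with
    // target-cpu=haswell, regardless of the CPU it later runs on.
    let baked_in = cfg!(target_feature = "avx2");

    // Resolved at run time: probes CPUID on the executing host.
    let host_has_it = is_x86_feature_detected!("avx2");

    println!("compiled assuming avx2: {baked_in}, host supports avx2: {host_has_it}");
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    report_avx2();
}
```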
Why it's worth fixing
The neighboring libraries in any data-science user's import path don't have this problem. On the same Sandy Bridge box:
| Library | What it ships | What we saw |
| --- | --- | --- |
| pyarrow | `runtime_info.simd_level == 'avx'` | Imports cleanly at AVX tier |
| numpy | `baseline=X86_V2`, AVX2/AVX-512 listed under "not found" | Imports cleanly at V2 baseline |
| lancedb (embeds lance) | heavy AVX2 + AVX-512 instructions, no runtime guard | `Illegal instruction (core dumped)` |
A user who can `import numpy as np; import pyarrow as pa` cannot necessarily `import lancedb`. Lance is the outlier in the trio.
Affected hardware
Anything pre-Haswell on the AVX2 timeline:
- Intel: Sandy Bridge (2011), Ivy Bridge (2012), Westmere (2010), Nehalem (2008)
- AMD: Bulldozer / Piledriver (2011-2012), Steamroller (2014, has FMA but no AVX2 — e.g. FX-7500)
Modern data-center hosts are all AVX2 or better, so this isn't blocking production. It does block lance on workstations, homelabs, older laptops, and any environment where someone is using lance alongside numpy and pyarrow expecting parity with how those libraries handle the hardware.
The fix
The implementation:
- Lowers the workspace x86_64 baseline from `target-cpu=haswell` to `target-cpu=x86-64-v2` (matches numpy's published-wheel baseline — Nehalem-class)
- Adds runtime SIMD dispatch with 5-tier coverage (scalar / AVX / AVX+FMA / AVX2+FMA / AVX-512) to the f32/f64 hot distance kernels in lance-linalg
- Uses the same dispatch shape lance already uses for its u8 distance kernels (`dot_u8.rs`, `cosine_u8.rs`, `l2_u8.rs`) and for the f16/bf16 paths in `norm_l2.rs` — no new external dependencies (a minimal sketch of this shape follows the list)
- Adds a `lance.simd_info()` Python API that mirrors `pyarrow.runtime_info()` so users can verify which tier dispatch picked on their host
- Adds a `qemu-x86_64 -cpu Nehalem` CI job so any future SIGILL leak fails CI before shipping
- The existing AVX2 path is preserved as one of the per-tier kernels — modern-CPU compiled output is unchanged from today, so no regression by construction; the only execution change is for hosts that today SIGILL, which now land on the AVX or scalar tier instead
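For readers unfamiliar with the pattern, here is the dispatch shape in miniature, under the assumption of an x86-64-v2 compile baseline. It is illustrative only; the function names and auto-vectorized bodies are invented for this example and are not the lance-linalg kernels. The point is that `#[target_feature]` confines the wider instructions to the annotated function, so shipping an AVX2 kernel in the binary cannot by itself trap an older CPU; only calling it without the CPUID guard could.

```rust
// Minimal sketch of per-tier runtime dispatch, assuming an x86-64-v2 baseline.
// Illustrative only; not the actual lance-linalg kernels.
#[cfg(target_arch = "x86_64")]
mod dot {
    /// Baseline kernel: compiled with only the x86-64-v2 feature set,
    /// safe on any host the wheel claims to support.
    fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    /// AVX2+FMA kernel: the attribute lets the compiler use those instructions
    /// inside this function only, so merely linking it does not trap a
    /// pre-Haswell CPU; calling it unguarded would.
    #[target_feature(enable = "avx2,fma")]
    unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
        // Auto-vectorized under the wider feature set; real kernels use intrinsics.
        a.iter().zip(b).fold(0.0, |acc, (x, y)| x.mul_add(*y, acc))
    }

    /// Chooses a tier based on what the running CPU actually supports.
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: guarded by the runtime CPUID checks above.
            unsafe { dot_avx2_fma(a, b) }
        } else {
            dot_scalar(a, b)
        }
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let a = vec![1.0_f32, 2.0, 3.0];
        println!("dot = {}", dot::dot(&a, &a));
    }
}
```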
Side benefit — the same workspace config change automatically fixes lance Java JNI users (the JNI build inherits the workspace baseline; no separate config there).
PR (on my fork, not upstream): tobocop2#2. Per-kernel design rationale, asm evidence, and bench methodology are in the PR description.
Verified end-to-end on the failing hardware
On my Intel Xeon E5-2609 (Sandy Bridge — same CPU class the published wheel SIGILLs on):

```
$ grep -m1 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
$ grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse4_2|avx|avx2|fma|avx512f)$' | sort -u
avx
sse4_2
```
Pre-fix — install the published wheel, observe SIGILL:

```
$ pip install lancedb
$ python -c "import lancedb"
[1] Illegal instruction (core dumped)
```
Post-fix — build the wheel from this branch (via tobocop2/lancedb#2, which embeds tobocop2/lance#2), install, run a vector-search round-trip:

```
$ git clone -b fix/runtime-simd-pre-haswell https://github.com/tobocop2/lancedb.git
$ cd lancedb/python
$ maturin build --release
$ pip install ./target/wheels/lancedb-*.whl
$ python <<'PY'
import lancedb, tempfile, pyarrow as pa
schema = pa.schema([pa.field("vec", pa.list_(pa.float32(), 3))])
with tempfile.TemporaryDirectory() as d:
    t = lancedb.connect(d).create_table("t",
        pa.Table.from_pylist([{"vec": [1.0, 2.0, 3.0]}], schema=schema))
    print(t.search([1.0, 2.0, 3.0]).limit(1).to_arrow().to_pylist())
PY
[{'vec': [1.0, 2.0, 3.0], '_distance': 0.0}]
```
Import succeeds, table create + vector search succeed. Runtime dispatch picks the AVX tier; the AVX2/AVX-512 kernels stay dormant on this CPU and the binary doesn't trap.
I would really appreciate this. I'm working on https://github.com/tobocop2/lilbee and I'm invested in this project; supporting older architectures would make both of our projects accessible to much more hardware, and I'd love to see that happen.
To be transparent: the PRs are on my fork (not opened upstream) because this isn't my domain of expertise and the implementation is AI-generated. I verified it works end-to-end on the failing hardware (the Sandy Bridge box above), but wanted to be upfront about that.
I'd be happy to open these PRs upstream if that makes sense, and to incorporate any feedback.