Skip to content

Commit 142d231

Browse files
Dandandanclaude
andcommitted
perf(take): Vectorise bounds check in take_native (-8-10%)
The non-null path of `take_native` performed a bounds check per index via `values[index.as_usize()]`. The per-lane branch blocks autovectorisation and dominates the hot loop for primitive take. Reduce each CHUNK=16 indices to their maximum via `fold`+`max` (no short-circuit, so LLVM SIMD-reduces it to two `ldp q` + three `umax.4s` + one `umaxv.4s` on aarch64) and bounds-check the max once per chunk. The panic path is a `#[cold]` helper so `max_idx` does not need to be kept live for format args on the hot path (no stack spill per chunk). Signed index types sign-extend to `usize::MAX` on `as_usize()`, so negative indices still fail the check. Measured on aarch64 (Apple Silicon) with `cargo bench --bench take_kernels`: take i32 512 309 ns → 279 ns (−9.7%) take i32 1024 469 ns → 431 ns (−8.1%) No change to `take` panic semantics (still panics on OOB) or to the null-indices branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 89b1497 commit 142d231

File tree

1 file changed

+44
-5
lines changed

1 file changed

+44
-5
lines changed

arrow-select/src/take.rs

Lines changed: 44 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -443,11 +443,50 @@ fn take_native<T: ArrowNativeType, I: ArrowPrimitiveType>(
443443
},
444444
})
445445
.collect(),
446-
None => indices
447-
.values()
448-
.iter()
449-
.map(|index| values[index.as_usize()])
450-
.collect(),
446+
None => {
447+
// Vectorised bounds check: reduce each chunk to its maximum index
448+
// using `fold`+`max` (no short-circuit, so LLVM can SIMD-reduce it)
449+
// and assert once per chunk. Signed natives sign-extend to `usize`,
450+
// so negative indices become `usize::MAX` and are still rejected.
451+
// Output is written into a preallocated buffer via `chunks_exact_mut`
452+
// so the gather compiles to a straight SIMD store.
453+
#[cold]
454+
#[inline(never)]
455+
fn oob(max_idx: usize, values_len: usize) -> ! {
456+
panic!("index out of bounds: the len is {values_len} but the index is {max_idx}")
457+
}
458+
const CHUNK: usize = 16;
459+
let idx = indices.values();
460+
let values_len = values.len();
461+
let len = idx.len();
462+
let mut out: Vec<T> = Vec::with_capacity(len);
463+
// SAFETY: `T: ArrowNativeType` is `Copy` with no `Drop`; every slot
464+
// is written before `out` is read, so uninitialised memory is never
465+
// observed.
466+
unsafe { out.set_len(len) };
467+
let rem_len = len % CHUNK;
468+
let (out_full, out_tail) = out.split_at_mut(len - rem_len);
469+
let idx_chunks = idx.chunks_exact(CHUNK);
470+
let idx_remainder = idx_chunks.remainder();
471+
for (out_chunk, idx_chunk) in
472+
out_full.chunks_exact_mut(CHUNK).zip(idx_chunks)
473+
{
474+
let max_idx = idx_chunk
475+
.iter()
476+
.fold(0usize, |acc, &i| acc.max(i.as_usize()));
477+
if max_idx >= values_len {
478+
oob(max_idx, values_len);
479+
}
480+
for (o, &i) in out_chunk.iter_mut().zip(idx_chunk) {
481+
// SAFETY: max_idx < values_len ⇒ every index in chunk is in bounds.
482+
*o = unsafe { *values.get_unchecked(i.as_usize()) };
483+
}
484+
}
485+
for (o, &i) in out_tail.iter_mut().zip(idx_remainder) {
486+
*o = values[i.as_usize()];
487+
}
488+
out.into()
489+
}
451490
}
452491
}
453492

0 commit comments

Comments
 (0)