Commit 142d231
perf(take): Vectorise bounds check in `take_native` (-8-10%)
The non-null path of `take_native` performed a bounds check per index
via `values[index.as_usize()]`. The per-lane branch blocks
autovectorisation and dominates the hot loop for primitive take.
Reduce each chunk of CHUNK=16 indices to its maximum via `fold`+`max` (no
short-circuit, so LLVM SIMD-reduces it to two `ldp q` + three `umax.4s`
+ one `umaxv.4s` on aarch64) and bounds-check that maximum once per chunk.
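
The chunked check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: `take_chunked` is a hypothetical name, the index type is fixed to `u32`, and the use of `get_unchecked` after the check is a guess at how the per-lane branch is removed.

```rust
/// Chunk width taken from the commit message (CHUNK = 16).
const CHUNK: usize = 16;

/// Hypothetical stand-in for the non-null path of a primitive take:
/// gathers `values[indices[i]]`, bounds-checking once per chunk.
/// Panics if any index is out of bounds.
pub fn take_chunked(values: &[i32], indices: &[u32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(indices.len());
    let mut chunks = indices.chunks_exact(CHUNK);

    for chunk in chunks.by_ref() {
        // `fold` + `max` visits every lane with no early exit, which is
        // what leaves the compiler free to turn it into a SIMD reduction.
        let max_idx = chunk.iter().fold(0usize, |acc, &i| acc.max(i as usize));

        // One bounds check for the whole chunk.
        assert!(max_idx < values.len(), "take index out of bounds");

        // SAFETY: every index in `chunk` is <= max_idx, and
        // max_idx < values.len() was checked just above.
        out.extend(chunk.iter().map(|&i| unsafe { *values.get_unchecked(i as usize) }));
    }

    // Fewer than CHUNK indices remain: fall back to checked indexing.
    for &i in chunks.remainder() {
        out.push(values[i as usize]);
    }
    out
}
```

An out-of-bounds index anywhere in a chunk still panics, only later than with per-index checks: after the whole chunk has been reduced instead of at the offending lane.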
The panic path is a `#[cold]` helper so `max_idx` does not need to be
kept live for format args on the hot path (no stack spill per chunk).
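
A minimal sketch of the cold-helper pattern, assuming hypothetical names (`oob_panic`, `check_chunk`) and a made-up panic message; the helper in the actual patch may differ in both. `#[inline(never)]` is an addition of this sketch, not something the commit message mentions.

```rust
/// Out-of-line panic path. `#[cold]` tells the optimiser this call is
/// unlikely; the formatting machinery lives here, not in the caller.
#[cold]
#[inline(never)]
fn oob_panic(max_idx: usize, len: usize) -> ! {
    panic!("take index {max_idx} out of bounds for values of length {len}");
}

/// Hot-path check: one comparison, and a call that is only reached
/// when the chunk maximum is out of range.
#[inline]
pub fn check_chunk(max_idx: usize, len: usize) {
    if max_idx >= len {
        oob_panic(max_idx, len);
    }
}
```

With the message built inside the helper, the hot loop only has to pass two integers on the failing branch.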
Negative values of signed index types sign-extend to very large `usize`
values on `as_usize()` (e.g. -1 becomes `usize::MAX`), so negative
indices still fail the check.
Measured on aarch64 (Apple Silicon) with
`cargo bench --bench take_kernels`:
  take i32 512    309 ns → 279 ns (−9.7%)
  take i32 1024   469 ns → 431 ns (−8.1%)
No change to `take` panic semantics (still panics on OOB) or to the
null-indices branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent: 89b1497
1 file changed, +44 −5: old lines 446-450 replaced by new lines 446-489 (hunk `@@ -443,11 +443,50 @@`).