Conversation
Cortex-A520 is the big winner here, going from around 4.21GiB/s to around 7.73GiB/s (83.44% improvement), but this filter is already stupidly fast.
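As a sanity check on the quoted speedup (using the rounded GiB/s figures above, so this only approximates the quoted 83.44%):

```rust
// Percent improvement = (new throughput / old throughput - 1) * 100.
// Inputs are the rounded GiB/s figures from the comment above, so the
// result is only approximately the quoted 83.44%.
fn percent_improvement(before_gib_s: f64, after_gib_s: f64) -> f64 {
    (after_gib_s / before_gib_s - 1.0) * 100.0
}

fn main() {
    let gain = percent_improvement(4.21, 7.73);
    println!("{gain:.1}%"); // about 83.6% with the rounded inputs
}
```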
AI disclosure: I wrote a sliding-window …

```rust
let mut x: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(current_chunk);
let b: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(previous_chunk);
x = x + b; // Wrapping addition
x.copy_to_slice(current_chunk);
```
It might be possible to write a scalar implementation here that rustc would be able to vectorize.
```rust
let chunks = current.len() / STRIDE_BYTES;

let (simd_current, remainder_current) = current.split_at_mut(chunks * STRIDE_BYTES);
let (simd_previous, remainder_previous) = previous.split_at(chunks * STRIDE_BYTES);

let current_iter = simd_current.chunks_exact_mut(STRIDE_BYTES);
let previous_iter = simd_previous.chunks_exact(STRIDE_BYTES);
let combined_iter = current_iter.zip(previous_iter);
```
I think you should be able to do something like:

```rust
let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);
for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
    ...
}
let remainder_current = current_iter.into_remainder();
let remainder_previous = previous_iter.remainder();
...
```
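Filling in the elided loop body with the per-byte wrapping add gives a complete, stable-Rust sketch of that shape. `STRIDE_BYTES = 64` is assumed here for illustration, and in the PR the chunk body would be the `Simd` load/add/store instead:

```rust
// Assumed chunk width for illustration; the PR's STRIDE_BYTES may differ.
const STRIDE_BYTES: usize = 64;

// Up unfilter in the shape suggested above: process full STRIDE_BYTES
// chunks, then handle the tail via the iterators' remainders.
// The chunk body here is a plain wrapping add; in the PR it would be
// the Simd version instead.
fn unfilter_up(current: &mut [u8], previous: &[u8]) {
    let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
    let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);
    for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
        for (curr, &above) in current_chunk.iter_mut().zip(previous_chunk) {
            *curr = curr.wrapping_add(above);
        }
    }
    // ChunksExactMut hands back its tail via into_remainder().
    for (curr, &above) in current_iter
        .into_remainder()
        .iter_mut()
        .zip(previous_iter.remainder())
    {
        *curr = curr.wrapping_add(above);
    }
}

fn main() {
    let previous = [200u8; 70];
    let mut current = [100u8; 70];
    unfilter_up(&mut current, &previous);
    // 100 + 200 wraps to 44 in every byte, including the 6-byte tail.
    assert!(current.iter().all(|&b| b == 44));
}
```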
Looking at this again, I think we should first try a scalar version that operates on 64-byte blocks. If LLVM isn't able to suitably auto-vectorize it, then that really starts to feel like an optimization bug that should be reported/fixed there. The entire operation is just a wrapping add. Something like this should be all we need:

```rust
let (prev_chunks, prev_remainder) = previous.as_chunks::<64>();
let (curr_chunks, curr_remainder) = current.as_chunks_mut::<64>();
for (curr_chunk, prev_chunk) in curr_chunks.iter_mut().zip(prev_chunks) {
    for (curr, &above) in curr_chunk.iter_mut().zip(prev_chunk) {
        *curr = curr.wrapping_add(above);
    }
}
for (curr, &above) in curr_remainder.iter_mut().zip(prev_remainder) {
    *curr = curr.wrapping_add(above);
}
```
Implements a `portable_simd` version of the Up filter, which merely adds the previous and current row's bytes, 64 at a time. This filter is ludicrously fast already on a big out-of-order CPU, but does seem to help the Cortex-A520.

Results from Silicon (3bpp):
The same implementation is shared across all pixel depths.
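That sharing is possible because, per the PNG specification, Up reconstructs each byte from the byte directly above it, never reaching back by bytes-per-pixel the way Sub, Average, and Paeth do. A naive per-byte sketch (illustrative, not the PR's code) makes the depth independence visible:

```rust
// Naive Up unfilter: each output byte depends only on the byte directly
// above it, so pixel depth (bpp) never enters the computation.
fn unfilter_up_naive(current: &mut [u8], previous: &[u8]) {
    for (curr, &above) in current.iter_mut().zip(previous) {
        *curr = curr.wrapping_add(above);
    }
}

fn main() {
    // The same call works whether these bytes encode 3bpp or 4bpp pixels.
    let mut row = [10u8, 20, 250];
    unfilter_up_naive(&mut row, &[1, 2, 10]);
    assert_eq!(row, [11, 22, 4]); // 250 + 10 wraps to 4
}
```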
Draft until the scaffolding in #632 is merged.