
portable_simd version of Up #643

Draft
Sentimentron wants to merge 3 commits into image-rs:master from Sentimentron:portable_simd-up

Conversation

@Sentimentron
Contributor

Implements a portable_simd version of the Up filter: it simply adds 64 bytes of the previous row to the current row at a time. This filter is already ludicrously fast on big out-of-order CPUs, but the change does seem to help the Cortex-A520.

Results from Silicon (3bpp)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 4211.8 MiB/s | 7726.3 MiB/s | 83.44% |
| Arm Cortex X4 | 64949.0 MiB/s | 67270.0 MiB/s | 3.57% |
| Apple Silicon M2 | 65217.0 MiB/s | 65269.0 MiB/s | 0.08% |
| AMD EPYC 7B13 | 37272.0 MiB/s | 37536.0 MiB/s | 0.71% |

The same implementation is shared across all pixel depths.
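For reference, the Up (de)filter's semantics are just a byte-wise wrapping addition of the row above; a minimal scalar sketch (the function name `up_defilter_scalar` is illustrative, not from this PR):

```rust
/// Scalar reference for the PNG Up defilter (filter type 2): each
/// reconstructed byte is the filtered byte plus the byte directly above
/// it, with wrapping (mod-256) addition. Illustrative sketch only; the
/// PR's SIMD version performs this same addition 64 bytes at a time.
pub fn up_defilter_scalar(previous: &[u8], current: &mut [u8]) {
    debug_assert_eq!(previous.len(), current.len());
    for (curr, &above) in current.iter_mut().zip(previous) {
        *curr = curr.wrapping_add(above);
    }
}
```

Because the addition is purely element-wise, the same routine works for every pixel depth, which is why the PR can share one implementation across all bpp values.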

Draft until the scaffolding in #632 is merged.

Cortex-A520 is the big winner here, going from around 4.21GiB/s to
around 7.73GiB/s (83.44% improvement), but this filter is already
stupidly fast.
@Sentimentron
Contributor Author

AI disclosure: I wrote a sliding-window portable_simd implementation of the Paeth filter (3bpp). I then used the Gemini family of LLMs provided by my employer to automatically adapt this code to the Up filter and optimize it to achieve the best possible code-generation and performance across all other micro-architectures in simulation. This PR is derived from that output, but includes documentation and other cleanups.

Comment on lines +309 to +312
```rust
let mut x: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(current_chunk);
let b: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(previous_chunk);
x = x + b; // Wrapping addition
x.copy_to_slice(current_chunk);
```

It might be possible to write a scalar implementation here that rustc would be able to vectorize

Comment on lines +299 to +306
```rust
let chunks = current.len() / STRIDE_BYTES;

let (simd_current, remainder_current) = current.split_at_mut(chunks * STRIDE_BYTES);
let (simd_previous, remainder_previous) = previous.split_at(chunks * STRIDE_BYTES);

let current_iter = simd_current.chunks_exact_mut(STRIDE_BYTES);
let previous_iter = simd_previous.chunks_exact(STRIDE_BYTES);
let combined_iter = current_iter.zip(previous_iter);
```

I think you should be able to do something like:

```rust
let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);

for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
    ...
}

// Note: ChunksExactMut exposes its tail via into_remainder(), not remainder().
let remainder_current = current_iter.into_remainder();
let remainder_previous = previous_iter.remainder();
...
```
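For completeness, here is one way that suggestion could look with the elided bodies filled in. The wrapping addition is taken from the diff; `STRIDE_BYTES = 64` is assumed to match the PR's block size, and scalar `wrapping_add` stands in for the SIMD body so the sketch compiles on stable Rust:

```rust
const STRIDE_BYTES: usize = 64; // assumed, matching the PR's 64-byte blocks

pub fn up_defilter_chunked(previous: &[u8], current: &mut [u8]) {
    let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
    let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);

    // Full 64-byte blocks; the PR would load these with Simd::from_slice.
    for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
        for (curr, &above) in current_chunk.iter_mut().zip(previous_chunk) {
            *curr = curr.wrapping_add(above);
        }
    }

    // chunks_exact keeps the tail reachable, so no manual split_at_mut
    // bookkeeping is needed.
    let remainder_current = current_iter.into_remainder();
    let remainder_previous = previous_iter.remainder();
    for (curr, &above) in remainder_current.iter_mut().zip(remainder_previous) {
        *curr = curr.wrapping_add(above);
    }
}
```

The iterators own the split, so the `chunks * STRIDE_BYTES` arithmetic and the explicit `split_at_mut`/`split_at` calls from the diff disappear entirely.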

@fintelia
Contributor

fintelia commented Dec 2, 2025

Looking at this again, I think we should first try a scalar version that operates on 64-byte blocks. If LLVM isn't able to suitably auto-vectorize it, then that really starts to feel like an optimization bug that should be reported/fixed there. The entire operation is just a += b with two non-overlapping slices!

Something like this should be all we need:

```rust
let (prev_chunks, prev_remainder) = previous.as_chunks::<64>();
let (curr_chunks, curr_remainder) = current.as_chunks_mut::<64>();

for (curr_chunk, prev_chunk) in curr_chunks.iter_mut().zip(prev_chunks) {
    for (curr, &above) in curr_chunk.iter_mut().zip(prev_chunk) {
        *curr = curr.wrapping_add(above);
    }
}

for (curr, &above) in curr_remainder.iter_mut().zip(prev_remainder) {
    *curr = curr.wrapping_add(above);
}
```
