
portable_simd version of Up #643

Draft
Sentimentron wants to merge 3 commits into image-rs:master from Sentimentron:portable_simd-up

Conversation

@Sentimentron
Contributor

Implements a portable_simd version of the Up filter: it simply adds 64 bytes of the previous row to the current row at a time. This filter is already ludicrously fast on big out-of-order CPUs, but the change does seem to help the Cortex-A520.

Results from Silicon (3bpp)

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 4211.8 MiB/s | 7726.3 MiB/s | 83.44% |
| Arm Cortex X4 | 64949.0 MiB/s | 67270.0 MiB/s | 3.57% |
| Apple Silicon M2 | 65217.0 MiB/s | 65269.0 MiB/s | 0.08% |
| AMD EPYC 7B13 | 37272.0 MiB/s | 37536.0 MiB/s | 0.71% |

The same implementation is shared across all pixel depths.
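For reference, the Up (de)filter's semantics are just a byte-wise wrapping addition of the row above; a minimal scalar sketch (the function name `up_defilter_scalar` is illustrative, not from this PR):

```rust
/// Scalar reference for the PNG Up defilter (filter type 2): each
/// reconstructed byte is the filtered byte plus the byte directly above
/// it, with wrapping (mod-256) addition. Illustrative sketch only; the
/// PR's SIMD version performs this same addition 64 bytes at a time.
pub fn up_defilter_scalar(previous: &[u8], current: &mut [u8]) {
    debug_assert_eq!(previous.len(), current.len());
    for (curr, &above) in current.iter_mut().zip(previous) {
        *curr = curr.wrapping_add(above);
    }
}
```

Because the addition is purely element-wise, the same routine works for every pixel depth, which is why the PR can share one implementation across all bpp values.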

Draft until the scaffolding in #632 is merged.

Cortex-A520 is the big winner here, going from around 4.21GiB/s to
around 7.73GiB/s (83.44% improvement), but this filter is already
stupidly fast.
@Sentimentron
Contributor Author

AI disclosure: I wrote a sliding-window portable_simd implementation of the Paeth filter (3bpp). I then used the Gemini family of LLMs provided by my employer to automatically adapt this code to the Up filter and optimize it to achieve the best possible code-generation and performance across all other micro-architectures in simulation. This PR is derived from that output, but includes documentation and other cleanups.

Comment on lines +309 to +312
```rust
let mut x: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(current_chunk);
let b: Simd<u8, STRIDE_BYTES> = Simd::<u8, STRIDE_BYTES>::from_slice(previous_chunk);
x = x + b; // Wrapping addition
x.copy_to_slice(current_chunk);
```

It might be possible to write a scalar implementation here that rustc would be able to vectorize

Comment on lines +299 to +306
```rust
let chunks = current.len() / STRIDE_BYTES;

let (simd_current, remainder_current) = current.split_at_mut(chunks * STRIDE_BYTES);
let (simd_previous, remainder_previous) = previous.split_at(chunks * STRIDE_BYTES);

let current_iter = simd_current.chunks_exact_mut(STRIDE_BYTES);
let previous_iter = simd_previous.chunks_exact(STRIDE_BYTES);
let combined_iter = current_iter.zip(previous_iter);
```

I think you should be able to do something like:

```rust
let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);

for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
    ...
}

// Note: ChunksExactMut exposes its tail via into_remainder(), not remainder().
let remainder_current = current_iter.into_remainder();
let remainder_previous = previous_iter.remainder();
...
```
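For completeness, here is one way that suggestion could look with the elided bodies filled in. The wrapping addition is taken from the diff; `STRIDE_BYTES = 64` is assumed to match the PR's block size, and scalar `wrapping_add` stands in for the SIMD body so the sketch compiles on stable Rust:

```rust
const STRIDE_BYTES: usize = 64; // assumed, matching the PR's 64-byte blocks

pub fn up_defilter_chunked(previous: &[u8], current: &mut [u8]) {
    let mut current_iter = current.chunks_exact_mut(STRIDE_BYTES);
    let mut previous_iter = previous.chunks_exact(STRIDE_BYTES);

    // Full 64-byte blocks; the PR would load these with Simd::from_slice.
    for (current_chunk, previous_chunk) in (&mut current_iter).zip(&mut previous_iter) {
        for (curr, &above) in current_chunk.iter_mut().zip(previous_chunk) {
            *curr = curr.wrapping_add(above);
        }
    }

    // chunks_exact keeps the tail reachable, so no manual split_at_mut
    // bookkeeping is needed.
    let remainder_current = current_iter.into_remainder();
    let remainder_previous = previous_iter.remainder();
    for (curr, &above) in remainder_current.iter_mut().zip(remainder_previous) {
        *curr = curr.wrapping_add(above);
    }
}
```

The iterators own the split, so the `chunks * STRIDE_BYTES` arithmetic and the explicit `split_at_mut`/`split_at` calls from the diff disappear entirely.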

@fintelia
Contributor

fintelia commented Dec 2, 2025

Looking at this again, I think we should first try a scalar version that operates on 64-byte blocks. If LLVM isn't able to suitably auto-vectorize it, then that really starts to feel like an optimization bug that should be reported/fixed there. The entire operation is just a += b with two non-overlapping slices!

Something like this should be all we need:

```rust
let (prev_chunks, prev_remainder) = previous.as_chunks::<64>();
let (curr_chunks, curr_remainder) = current.as_chunks_mut::<64>();

for (curr_chunk, prev_chunk) in curr_chunks.iter_mut().zip(prev_chunks) {
    for (curr, &above) in curr_chunk.iter_mut().zip(prev_chunk) {
        *curr = curr.wrapping_add(above);
    }
}

for (curr, &above) in curr_remainder.iter_mut().zip(prev_remainder) {
    *curr = curr.wrapping_add(above);
}
```
