portable_simd version of Avg (4bpp)#641

Open
Sentimentron wants to merge 1 commit into image-rs:master from Sentimentron:portable_simd-avg-bpp4

@Sentimentron
Contributor

Implements an RGBA (4 bytes per pixel) version of the Avg filter using portable_simd.

| CPU | Baseline | Result | Speedup |
|---|---|---|---|
| Arm Cortex A520 | 415.9 MiB/s | 707.4 MiB/s | 70.08% |
| Arm Cortex X4 | 2053.9 MiB/s | 2334.9 MiB/s | 13.68% |
| Apple Silicon M2 | 2053.9 MiB/s | 2173.5 MiB/s | 3.62% |
| AMD EPYC 7B13 | 2425.8 MiB/s | 2150.8 MiB/s | -11.34% |

Marked as draft until #632 is completed.
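
For readers unfamiliar with the filter, here is a minimal scalar sketch of what undoing Avg on 4-byte pixels computes when a previous row exists: each byte is reconstructed as the filtered byte plus the floor of the average of the byte to the left and the byte above. The function name `unfilter_avg_bpp4` is hypothetical and this is not the code in this PR; it is only a reference for the arithmetic that a SIMD version has to reproduce.

```rust
/// Hypothetical scalar reference: undo the PNG Avg filter for 4-byte pixels.
/// `previous` is the already reconstructed row above; `current` holds the
/// filtered bytes and is reconstructed in place.
fn unfilter_avg_bpp4(previous: &[u8], current: &mut [u8]) {
    assert_eq!(previous.len(), current.len());
    // The pixel to the left of the first pixel is treated as all zeros.
    let mut left = [0u8; 4];
    for (cur, above) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4)) {
        for i in 0..4 {
            // Widen to u16 so that left + above cannot overflow before halving.
            let avg = ((left[i] as u16 + above[i] as u16) / 2) as u8;
            // PNG filter arithmetic is modulo 256.
            cur[i] = cur[i].wrapping_add(avg);
            left[i] = cur[i];
        }
    }
}
```

Each pixel depends on the reconstructed pixel to its left, which is why this loop carries a serial dependency from one pixel to the next.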

@Sentimentron
Contributor Author

AI disclosure: I wrote an original sliding-window portable_simd implementation of the Paeth filter (3bpp) and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs provided by my employer to automatically adapt this code to the Avg filter from a written description, and then to optimize it for the best possible code generation and performance across all other micro-architectures in simulation. This PR is derived from that output, but includes documentation and other cleanups.

@okaneco
Contributor

okaneco commented Sep 12, 2025

There's another 4bpp case for the first row, where previous.is_empty(). Not sure if you've tried that already.

image-png/src/filter.rs

Lines 612 to 624 in f33b850

BytesPerPixel::Four => {
    let mut prev = [0; 4];
    for chunk in current.chunks_exact_mut(4) {
        let new_chunk = [
            chunk[0].wrapping_add(prev[0] / 2),
            chunk[1].wrapping_add(prev[1] / 2),
            chunk[2].wrapping_add(prev[2] / 2),
            chunk[3].wrapping_add(prev[3] / 2),
        ];
        *TryInto::<&mut [u8; 4]>::try_into(chunk).unwrap() = new_chunk;
        prev = new_chunk;
    }
}
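
As a self-contained illustration of the first-row case quoted above, the following sketch wraps the same arithmetic in a standalone function. The name `unfilter_avg_first_row_bpp4` is hypothetical and does not exist in the crate. With no previous row, the "above" byte is zero, so the average reduces to half of the reconstructed left byte.

```rust
/// Hypothetical standalone version of the first-row Avg unfilter for 4bpp.
/// With no previous row, each byte gains half of the reconstructed byte
/// one pixel to its left (modulo 256).
fn unfilter_avg_first_row_bpp4(current: &mut [u8]) {
    // The pixel to the left of the first pixel is treated as all zeros.
    let mut prev = [0u8; 4];
    for chunk in current.chunks_exact_mut(4) {
        for (byte, left) in chunk.iter_mut().zip(prev.iter()) {
            *byte = byte.wrapping_add(*left / 2);
        }
        // Remember the reconstructed pixel for the next iteration.
        prev.copy_from_slice(chunk);
    }
}
```

A small test built on a function like this could exercise the edge case directly, independent of whether the unfilter benchmark reaches it.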

@Sentimentron
Contributor Author

I hadn't tried it. I wrote some quick code for it, but it seems that the unfilter benchmark doesn't test this edge case... 🤔

@Sentimentron
Contributor Author

Also, if any contributors have access to Intel hardware, could they give this portable_simd version a try? Otherwise, I'll cfg-gate it off in a subsequent version to avoid the AMD EPYC 7B13 regression.
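
One possible shape for such a cfg-gate, as a rough sketch only: select the implementation at compile time by target architecture. The function name `avg_bpp4_path` and the choice to gate on `target_arch = "aarch64"` are assumptions for illustration, not the gating this PR uses.

```rust
/// Hypothetical helper: reports which Avg (4bpp) path this build selects.
/// On aarch64 the SIMD path would be compiled in.
#[cfg(target_arch = "aarch64")]
fn avg_bpp4_path() -> &'static str {
    "portable_simd"
}

/// On every other architecture the existing scalar path would be kept.
#[cfg(not(target_arch = "aarch64"))]
fn avg_bpp4_path() -> &'static str {
    "scalar"
}
```

Gating at compile time keeps the x86-64 build on the scalar code without any runtime dispatch cost.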

@Sentimentron
Contributor Author

Sentimentron commented Dec 3, 2025

Rebaselining to rustc/cargo 1.93.0-nightly (2a7c49606 2025-11-25):

| CPU | Baseline | Result | Speedup |
|---|---|---|---|
| Arm Cortex A520 | 434.84 MiB/s | 740.59 MiB/s | 70.58% |
| Arm Cortex X4 | 2052.5 MiB/s | 2330.8 MiB/s | 13.56% |
| Apple Silicon M2 | 2094.4 MiB/s | 2167.4 MiB/s | 3.51% |
| Apple Silicon M4 Pro | 2771.9 MiB/s | 2808.5 MiB/s | 1.16% (insignificant) |
| AMD EPYC 7B13 | 2716.2 MiB/s | 2335.8 MiB/s | -13.98% |

Overall, I'd say it's still probably worth it for aarch64 systems; the A520 gain is particularly nice to have for low-end devices.
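
For anyone cross-checking the Speedup column, it appears to be the relative change in throughput between the Baseline and Result columns. A minimal sketch of that calculation follows; the helper name `speedup_percent` is hypothetical, and the published percentages may come from the benchmark tool's own estimate rather than from the rounded throughput figures, so small differences are expected.

```rust
/// Hypothetical helper: relative throughput change in percent.
/// Positive values mean the new result is faster than the baseline.
fn speedup_percent(baseline_mib_s: f64, result_mib_s: f64) -> f64 {
    (result_mib_s / baseline_mib_s - 1.0) * 100.0
}
```

For example, `speedup_percent(2716.2, 2335.8)` yields roughly -14, in line with the AMD EPYC 7B13 row.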

Sentimentron force-pushed the portable_simd-avg-bpp4 branch from 096960b to d221c8f on December 3, 2025 20:55
Sentimentron marked this pull request as ready for review on December 9, 2025 19:48
Again, Cortex-A520 seems the big winner here, going from 434 MiB/s to
about 740 MiB/s (70% faster), X4 benefits less (about 13%).
Sentimentron force-pushed the portable_simd-avg-bpp4 branch from d221c8f to 94039f0 on March 14, 2026 16:29