portable_simd version of Sub (3bpp/4bpp) #642
Draft
Sentimentron wants to merge 2 commits into image-rs:master
Contributor
I see mixed results on my Zen 5 CPU. In the bpp=3 case, throughput regresses 13%, from 9.2 GB/s to 8.0 GB/s, with the default target-cpu, and drops further to 6.4 GB/s with target-cpu=native (?). For bpp=4, however, I see nearly double the performance: 10.4 GB/s to 19.4 GB/s. I also found that I could get an additional 30% improvement, to 25.1 GB/s, by removing the data dependency between the two unrolled loop iterations. (There's probably a better way to splat the low 4 bytes of the SIMD vector to all the elements.)
// Process chunk 2
let mut x_vec2: SimdVector = SimdVector::from_slice(chunk2_slice);
- let carry_in_vec2 = prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8);
- x_vec2 = x_vec2 + carry_in_vec2;
x_vec2 = x_vec2 + x_vec2.shift_elements_right::<BPP>(0u8);
x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 2 * BPP }>(0u8);
x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 4 * BPP }>(0u8);
x_vec2 = x_vec2 + x_vec2.shift_elements_right::<{ 8 * BPP }>(0u8);
+
+ x_vec2 = x_vec2
+ + std::simd::simd_swizzle!(
+ prev_pixel_val_for_chunk2.resize::<STRIDE_BYTES>(0u8),
+ [
+ 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0,
+ 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
+ 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
+ ]
+ );
+
prev_pixel_val = x_vec2.extract::<{ STRIDE_BYTES - BPP }, BPP>();
x_vec2.copy_to_slice(chunk2_slice);
}
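Why the reordering in the diff above is safe can be checked with a small scalar model. The sketch below is not the PR's code: it emulates the 64-byte vector with a plain `[u8; 64]` array so it builds on stable Rust, and the helper names (`shift_right`, `prefix_sum`, `unfilter_carry_first`, `unfilter_carry_last`, `unfilter_scalar`) are hypothetical. It assumes BPP = 4 and STRIDE_BYTES = 64, which matches the 64-entry index list in the diff. Because lane-wise addition wraps and is associative, adding the previous pixel to the first pixel before the prefix sum gives the same bytes as running the prefix sum first and then adding the previous pixel, replicated to every pixel, at the end.

```rust
const BPP: usize = 4; // bytes per pixel (assumed)
const STRIDE_BYTES: usize = 64; // emulated vector width in bytes (assumed)

type Chunk = [u8; STRIDE_BYTES];

/// Scalar stand-in for a lane shift: moves every byte `n` lanes towards
/// higher indices and fills the vacated low lanes with zero.
fn shift_right(v: &Chunk, n: usize) -> Chunk {
    let mut out = [0u8; STRIDE_BYTES];
    for i in n..STRIDE_BYTES {
        out[i] = v[i - n];
    }
    out
}

/// Lane-wise wrapping addition.
fn add(a: &Chunk, b: &Chunk) -> Chunk {
    let mut out = [0u8; STRIDE_BYTES];
    for i in 0..STRIDE_BYTES {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

/// Log-step inclusive prefix sum over pixels: shifts of BPP, 2*BPP,
/// 4*BPP and 8*BPP bytes cover all 16 pixels of the chunk.
fn prefix_sum(mut v: Chunk) -> Chunk {
    let mut step = BPP;
    while step < STRIDE_BYTES {
        v = add(&v, &shift_right(&v, step));
        step *= 2;
    }
    v
}

/// Old ordering: add the previous pixel into the first pixel, then scan.
fn unfilter_carry_first(chunk: Chunk, prev: [u8; BPP]) -> Chunk {
    let mut carry = [0u8; STRIDE_BYTES];
    carry[..BPP].copy_from_slice(&prev);
    prefix_sum(add(&chunk, &carry))
}

/// New ordering: scan first, then add the previous pixel to every pixel.
fn unfilter_carry_last(chunk: Chunk, prev: [u8; BPP]) -> Chunk {
    let mut splat = [0u8; STRIDE_BYTES];
    for i in 0..STRIDE_BYTES {
        splat[i] = prev[i % BPP];
    }
    add(&prefix_sum(chunk), &splat)
}

/// Byte-at-a-time reference for reversing the Sub filter.
fn unfilter_scalar(mut chunk: Chunk, prev: [u8; BPP]) -> Chunk {
    for i in 0..STRIDE_BYTES {
        let left = if i < BPP { prev[i] } else { chunk[i - BPP] };
        chunk[i] = chunk[i].wrapping_add(left);
    }
    chunk
}
```

In the second ordering the scan of a chunk no longer waits for the previous chunk's result; only the final add does. That is one plausible reading of where the extra throughput comes from, not something this sketch measures.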
Prefix sum improves performance on A520 and Epyc by 122.40% and 42.87% respectively.
a807fe4 to
56f73e3
Contributor
Author
Checking in with current status:
[Figure: Results from silicon (bpp=3)]
[Figure: Results from silicon (bpp=4)]
(Both are from the Rust default CPU setting.) I'll try the new simd_swizzle approach next.
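For the swizzle approach, the hand-written 64-entry index list could be generated at compile time instead. The sketch below builds on stable Rust and uses hypothetical names (`splat_pixel_indices`, `SPLAT_IDX`, `swizzle`); the `swizzle` function is a scalar stand-in, not the nightly `simd_swizzle!` macro. Whether the macro accepts a `const` table like this in place of an array literal is an assumption that has not been checked against the nightly API.

```rust
const BPP: usize = 4; // bytes per pixel (assumed)
const STRIDE_BYTES: usize = 64; // vector width in bytes (assumed)

/// Builds the index table `[0, 1, .., bpp-1, 0, 1, ..]` at compile time.
const fn splat_pixel_indices<const N: usize>(bpp: usize) -> [usize; N] {
    let mut idx = [0usize; N];
    let mut i = 0;
    while i < N {
        idx[i] = i % bpp;
        i += 1;
    }
    idx
}

/// Index table equivalent to the literal list in the diff.
const SPLAT_IDX: [usize; STRIDE_BYTES] = splat_pixel_indices::<STRIDE_BYTES>(BPP);

/// Scalar stand-in for a swizzle: `out[i] = v[idx[i]]`.
fn swizzle(v: &[u8; STRIDE_BYTES], idx: &[usize; STRIDE_BYTES]) -> [u8; STRIDE_BYTES] {
    let mut out = [0u8; STRIDE_BYTES];
    for i in 0..STRIDE_BYTES {
        out[i] = v[idx[i]];
    }
    out
}
```

Applying `swizzle` with `SPLAT_IDX` to a vector whose low `BPP` bytes hold the previous pixel replicates that pixel across all 16 pixel slots, which is the splat the diff spells out by hand.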
Improves performance by around 40% on the Epyc system, 431% on the Cortex-A520.
56f73e3 to
7acabce
Opened as a draft until #632 is resolved.