fix: use saturating arithmetic in SIMD FDCT to prevent overflow (#444)#453
Open
lilith wants to merge 2 commits intomozilla:masterfrom
Open
fix: use saturating arithmetic in SIMD FDCT to prevent overflow (#444)#453lilith wants to merge 2 commits intomozilla:masterfrom
lilith wants to merge 2 commits intomozilla:masterfrom
Conversation
…lla#444) Overshoot deringing can push level-shifted sample values to +-158 (vs normal +-128). The SIMD forward DCT column pass final butterfly (tmp10+tmp11, tmp10-tmp11) sums 8 such row-pass outputs, reaching +-40448 — exceeding the signed 16-bit range. The wrapping paddw causes catastrophic sign flips in DC/AC[1] coefficients, producing visible inverted-brightness 8x8 block artifacts. The C ISLOW DCT is immune (uses 32-bit JLONG intermediates). Only the final even-part butterfly can overflow; all earlier stages have >=12543 margin within 16-bit range. Fix: change paddw/psubw to paddsw/psubsw (saturating) for the final butterfly in the column pass of all 8 SIMD implementations: x86_64 SSE2/AVX2, i386 SSE2/AVX2/MMX, ARM NEON, MIPS64 MMI, PowerPC AltiVec Validated on 71 images x 9 quality levels (639 encodes): 637 byte- identical, 2 corrected (imdb.com screenshot Q25/Q50, both 9 bytes smaller with fixed coefficients). Fixes mozilla#444
Add an 8x8 half-black/half-white test image that triggers the overshoot deringing overflow. Encode at Q25 without -revert (deringing active), decode, and verify MD5 of decoded pixels. Without the fix, decoded pixels are completely inverted (left=255 instead of 0, right=0 instead of 255).
lilith
added a commit
to imazen/codec-corpus
that referenced
this pull request
Jan 31, 2026
Synthetic 8x8 block patterns that trigger 16-bit SIMD forward DCT overflow when overshoot deringing is enabled. Vertical half-black/ half-white splits produce row-pass intermediates of ±5056; the column pass sums 8 identical values (40,448 > i16 max 32,767), causing catastrophic sign flips. Includes triggering patterns (vertical splits, single block) and controls (horizontal split, checkerboard) that do NOT trigger.
lilith
added a commit
to imazen/mozjpeg-rs
that referenced
this pull request
Jan 31, 2026
Add SimdOps::avx2_i16() to select the experimental 16-bit packed AVX2 DCT path, and Encoder::simd_ops() to inject it into the encoder. This allows reproducing the deringing + i16 SIMD overflow bug where the column-pass butterfly sum (8 × 5056 = 40,448) exceeds i16::MAX. Production paths use i32 intermediates and are immune. Cross-references: - PR: mozilla/mozjpeg#453 - Test patterns: imazen/codec-corpus imageflow/test_inputs/dct_overflow_patterns/ - Regression tests: tests/encode_tests.rs test_issue444_*
lilith
added a commit
to imazen/mozjpeg-rs
that referenced
this pull request
Jan 31, 2026
When overshoot deringing pushes inputs to ±158, the column-pass butterfly accumulates values that can exceed i16::MAX (e.g., 8 × 5056 = 40,448). Wrapping add caused catastrophic sign flips in decoded pixels. All butterfly adds/subs in dodct() now use _mm256_adds_epi16 / _mm256_subs_epi16 (saturating). The column-pass descale rounding constant add is also saturating to prevent 32767 + 2 wrapping to -32767. Row pass uses wrapping _mm256_slli_epi16 (values naturally in safe range). Worst-case saturation clamps coefficient [0,1] from 8294 to 8191 (1.2%), which is at most 1 quantization step on extreme deringing patterns. Verified: all 6 synthetic overflow patterns pass (max_diff ≤ 2 vs i32 ref). See mozilla/mozjpeg#453.
ziemek99
reviewed
Mar 29, 2026
Comment on lines
463
to
464
| paddw xmm7, [GOTOFF(ebx,PW_DESCALE_P2X)] | ||
| paddw xmm5, [GOTOFF(ebx,PW_DESCALE_P2X)] |
There was a problem hiding this comment.
Initial "lost" fix cdb6c34 changed these ones to paddsw as well.
ziemek99
reviewed
Mar 29, 2026
Comment on lines
455
to
456
| paddw xmm7, [rel PW_DESCALE_P2X] | ||
| paddw xmm5, [rel PW_DESCALE_P2X] |
There was a problem hiding this comment.
Initial "lost" fix cdb6c34 changed these ones to paddsw as well.
ziemek99
reviewed
Mar 29, 2026
Comment on lines
444
to
445
| paddw mm5, [GOTOFF(ebx,PW_DESCALE_P2X)] | ||
| paddw mm7, [GOTOFF(ebx,PW_DESCALE_P2X)] |
There was a problem hiding this comment.
Initial "lost" fix cdb6c34 changed these ones to paddsw as well.
lilith
added a commit
to imazen/mozjpeg-rs
that referenced
this pull request
Apr 12, 2026
preprocess_deringing computed fslope and lslope in bare i16 arithmetic:
let mut fslope = (f1 - f2).max(MAX_SAMPLE - f1);
let mut lslope = (l1 - l2).max(MAX_SAMPLE - l1);
For the current in-contract use (level-shifted samples in -128..=127,
with overshoot values up to MAX_SAMPLE + 31 = 158 written back into the
block by a previous run), the largest magnitudes are well within i16
range: 158 - (-128) = 286 fits easily. But the subtractions are i16 -
i16, so pathological callers — or a future switch to wider sample types
— would wrap in release builds (panic in debug). That class of bug is
exactly what the mozilla/mozjpeg#453 i16-SIMD-DCT overflow documented in
CLAUDE.md warns about, except here in the scalar preprocessing step.
Widen the subtractions to i32 and saturate back to i16 via clamp. The
pattern matches what catmull_rom() already does at src/deringing.rs:121
for its tangent computation. For in-contract data the clamp is a no-op;
for out-of-contract data it clamps instead of wrapping.
zenjpeg sidesteps the whole issue by operating on f32 samples
(zenjpeg/zenjpeg/src/encode/deringing.rs:113). Moving mozjpeg-rs to
f32 would be the architecturally clean answer but would break the
byte-parity property we depend on with C mozjpeg, so widen-to-i32 is
the minimal-churn defensive fix.
Verified bit-exact with C mozjpeg via tests/parity_benchmark after the
change:
Baseline Q55-Q95: 0.00% delta, 0.00% max dev (all 6 levels)
Progressive Q55-Q95: 0.00% delta, 0.00% max dev (all 6 levels)
Trellis modes: -0.05% to -0.80% (unchanged from pre-fix numbers)
Max Compression: -0.72% to +0.21% (unchanged from pre-fix numbers)
All 7 deringing unit tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
paddw/psubwtopaddsw/psubsw(saturating) for the final even-part butterfly (tmp10+tmp11,tmp10-tmp11) in all 8 SIMD implementationsProblem
Overshoot deringing pushes level-shifted sample values to ±158 (vs normal ±128). The SIMD ISLOW forward DCT uses 16-bit packed arithmetic. After the row pass produces intermediate values up to ±5056, the column pass final butterfly sums 8 identical row outputs: 8 × 5056 = 40448, exceeding the signed 16-bit maximum of 32767.
The wrapping
paddwcauses sign flips in DC and low-frequency AC coefficients, producing visible inverted-brightness 8x8 block artifacts. The C ISLOW DCT is immune because it uses 32-bitJLONGintermediates.Trigger: Any 8x8 block with a hard black/white edge (e.g.,
[0,0,0,0, 255,255,255,255]) at quality ≤ Q57.Fix
Only the final even-part butterfly can overflow. All earlier butterfly stages have ≥12543 margin within the 16-bit range. The fix changes exactly 1-2 instructions per architecture:
paddw/psubw→paddsw/psubsw(2 ops)vpaddw→vpaddsw(1 op, shared macro)paddw/psubw→paddsw/psubsw(2 ops)vpaddw→vpaddsw(1 op, shared macro)paddw/psubw→paddsw/psubsw(2 ops)vaddq/vsubq→vqaddq/vqsubq(2 ops)_mm_add_pi16/_mm_sub_pi16→_mm_adds/_mm_subs(2 ops)vec_add/vec_sub→vec_adds/vec_subs(2 ops)Validation
Tested across 71 images (Kodak + CLIC 2025 + web screenshots) at 9 quality levels (639 total encodes):
Test plan
Fixes #444