fix: use saturating arithmetic in SIMD FDCT to prevent overflow (#444) by lilith · Pull Request #453 · mozilla/mozjpeg

lilith · 2026-01-31T03:12:19Z

Summary

- Fix 16-bit integer overflow in the SIMD forward DCT column pass that causes catastrophic sign flips when overshoot deringing is active.
- Change paddw/psubw to paddsw/psubsw (saturating) for the final even-part butterfly (tmp10+tmp11, tmp10-tmp11) in all 8 SIMD implementations.
- Add a regression test with a minimal 8x8 reproduction image.

Problem

Overshoot deringing pushes level-shifted sample values to ±158 (vs normal ±128). The SIMD ISLOW forward DCT uses 16-bit packed arithmetic. After the row pass produces intermediate values up to ±5056, the column pass final butterfly sums 8 identical row outputs: 8 × 5056 = 40448, exceeding the signed 16-bit maximum of 32767.

The wrapping paddw causes sign flips in DC and low-frequency AC coefficients, producing visible inverted-brightness 8x8 block artifacts. The C ISLOW DCT is immune because it uses 32-bit JLONG intermediates.
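The arithmetic above can be checked with scalar 16-bit integers. This is a minimal sketch, not code from the PR: the helper names are hypothetical, and it sums the eight values sequentially rather than as the butterfly tree the SIMD code uses. It compares wrapping addition (as `paddw` does per lane), saturating addition (as `paddsw` does), and a 32-bit reference.

```rust
/// Sum 16-bit values with wrap-around on overflow, like `paddw` per lane.
fn sum_wrapping(values: &[i16]) -> i16 {
    values.iter().fold(0i16, |acc, &v| acc.wrapping_add(v))
}

/// Sum 16-bit values with clamping to the i16 range, like `paddsw` per lane.
fn sum_saturating(values: &[i16]) -> i16 {
    values.iter().fold(0i16, |acc, &v| acc.saturating_add(v))
}

/// Reference sum in 32 bits, where 8 x 5056 cannot overflow.
fn sum_wide(values: &[i16]) -> i32 {
    values.iter().map(|&v| i32::from(v)).sum()
}

fn main() {
    // Eight identical row-pass outputs, the worst case described above.
    let column = [5056i16; 8];
    println!("i32 reference: {}", sum_wide(&column));
    println!("wrapping:      {}", sum_wrapping(&column));
    println!("saturating:    {}", sum_saturating(&column));
}
```

The wrapped result is negative although every input is positive, which is the sign flip described above. Saturation does not recover the exact value; it only keeps the sign and clamps the magnitude.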

Trigger: Any 8x8 block with a hard black/white edge (e.g., [0,0,0,0, 255,255,255,255]) at quality ≤ Q57.

Fix

Only the final even-part butterfly can overflow. All earlier butterfly stages have ≥12543 margin within the 16-bit range. The fix changes exactly 1-2 instructions per architecture:

| Architecture | Instructions changed |
|---|---|
| x86_64 SSE2 | `paddw`/`psubw` → `paddsw`/`psubsw` (2 ops) |
| x86_64 AVX2 | `vpaddw` → `vpaddsw` (1 op, shared macro) |
| i386 SSE2 | `paddw`/`psubw` → `paddsw`/`psubsw` (2 ops) |
| i386 AVX2 | `vpaddw` → `vpaddsw` (1 op, shared macro) |
| i386 MMX | `paddw`/`psubw` → `paddsw`/`psubsw` (2 ops) |
| ARM NEON | `vaddq`/`vsubq` → `vqaddq`/`vqsubq` (2 ops) |
| MIPS64 MMI | `_mm_add_pi16`/`_mm_sub_pi16` → `_mm_adds`/`_mm_subs` (2 ops) |
| PowerPC AltiVec | `vec_add`/`vec_sub` → `vec_adds`/`vec_subs` (2 ops) |
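The margin claim can be reproduced with a scalar model of the even-part butterfly. The sketch below is an illustration under stated assumptions, not the SIMD code: it follows the tmp0..tmp3 / tmp10..tmp13 naming of the ISLOW DCT, evaluates everything in 32 bits, and reports the largest magnitude reached at each stage for one 8-sample column. The function name `even_part_peaks` is hypothetical, and the odd part is not modelled.

```rust
/// Largest magnitude at each even-part stage for one 8-sample column.
/// Everything is computed in i32 so the model itself cannot wrap.
/// Returns [stage-1 peak, stage-2 peak, final-butterfly peak].
fn even_part_peaks(col: [i16; 8]) -> [i32; 3] {
    fn peak(xs: &[i32]) -> i32 {
        xs.iter().map(|x| x.abs()).max().unwrap_or(0)
    }

    let d: [i32; 8] = col.map(i32::from);

    // Stage 1: pair up mirrored samples.
    let tmp0 = d[0] + d[7];
    let tmp1 = d[1] + d[6];
    let tmp2 = d[2] + d[5];
    let tmp3 = d[3] + d[4];

    // Stage 2: combine the pairs.
    let tmp10 = tmp0 + tmp3;
    let tmp13 = tmp0 - tmp3;
    let tmp11 = tmp1 + tmp2;
    let tmp12 = tmp1 - tmp2;

    // Final butterfly: the only stage this PR switches to saturating ops.
    let out0 = tmp10 + tmp11;
    let out4 = tmp10 - tmp11;

    [
        peak(&[tmp0, tmp1, tmp2, tmp3]),
        peak(&[tmp10, tmp11, tmp12, tmp13]),
        peak(&[out0, out4]),
    ]
}

fn main() {
    let peaks = even_part_peaks([5056; 8]);
    let limit = i32::from(i16::MAX);
    for (stage, p) in peaks.iter().enumerate() {
        println!("stage {}: peak {}, margin {}", stage + 1, p, limit - p);
    }
}
```

For eight inputs of 5056 the second stage peaks at 20224, which leaves 32767 - 20224 = 12543 of headroom, matching the margin quoted above; only the final stage goes past the 16-bit limit.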

Validation

Tested across 71 images (Kodak + CLIC 2025 + web screenshots) at 9 quality levels (639 total encodes):

- 637/639 byte-identical to the unfixed build (99.7%)
- 2 corrected: `imdb.com` screenshot at Q25 and Q50, both 9 bytes smaller with fixed coefficients, max pixel difference 255 (complete inversion fixed)
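For readers who want to repeat the pixel comparison, a maximum per-sample difference can be computed as below. This is a generic sketch with a hypothetical helper name, not the tooling used for the numbers above; it assumes both decodes are available as equally sized 8-bit sample buffers.

```rust
/// Maximum absolute difference between two 8-bit sample buffers.
/// Returns None if the buffers differ in length or are empty.
fn max_pixel_diff(a: &[u8], b: &[u8]) -> Option<u8> {
    if a.len() != b.len() {
        return None;
    }
    a.iter().zip(b).map(|(&x, &y)| x.abs_diff(y)).max()
}

fn main() {
    // One row of the half-black/half-white block and its full inversion.
    let expected = [0u8, 0, 0, 0, 255, 255, 255, 255];
    let inverted = [255u8, 255, 255, 255, 0, 0, 0, 0];
    println!("{:?}", max_pixel_diff(&expected, &inverted));
}
```

A result of 255 means at least one sample went from full black to full white or the reverse, which is what a complete inversion of a hard edge produces.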

Test plan

- CTest regression test: 8x8 repro image → encode Q25 → decode → MD5 verify
- Verified test fails with unfixed code (MD5 mismatch)
- Verified test passes with fix
- Full corpus comparison confirms no unintended changes to photographic images

Fixes #444

…lla#444)

Overshoot deringing can push level-shifted sample values to +-158 (vs normal +-128). The SIMD forward DCT column pass final butterfly (tmp10+tmp11, tmp10-tmp11) sums 8 such row-pass outputs, reaching +-40448, which exceeds the signed 16-bit range.

The wrapping paddw causes catastrophic sign flips in DC/AC[1] coefficients, producing visible inverted-brightness 8x8 block artifacts. The C ISLOW DCT is immune (uses 32-bit JLONG intermediates).

Only the final even-part butterfly can overflow; all earlier stages have >=12543 margin within 16-bit range.

Fix: change paddw/psubw to paddsw/psubsw (saturating) for the final butterfly in the column pass of all 8 SIMD implementations: x86_64 SSE2/AVX2, i386 SSE2/AVX2/MMX, ARM NEON, MIPS64 MMI, PowerPC AltiVec.

Validated on 71 images x 9 quality levels (639 encodes): 637 byte-identical, 2 corrected (imdb.com screenshot Q25/Q50, both 9 bytes smaller with fixed coefficients).

Fixes mozilla#444

Add an 8x8 half-black/half-white test image that triggers the overshoot deringing overflow. Encode at Q25 without -revert (deringing active), decode, and verify MD5 of decoded pixels. Without the fix, decoded pixels are completely inverted (left=255 instead of 0, right=0 instead of 255).

Synthetic 8x8 block patterns that trigger 16-bit SIMD forward DCT overflow when overshoot deringing is enabled. Vertical half-black/half-white splits produce row-pass intermediates of ±5056; the column pass sums 8 identical values (40,448 > i16 max 32,767), causing catastrophic sign flips. Includes triggering patterns (vertical splits, single block) and controls (horizontal split, checkerboard) that do NOT trigger.
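Patterns of this kind are easy to generate. The sketch below is illustrative only and is not the corpus generator: the function names are hypothetical, "vertical split" is taken to mean black on the left and white on the right, and the orientation of the horizontal split, the one-pixel checkerboard cell size, and the binary PGM output format are assumptions.

```rust
type Block = [[u8; 8]; 8];

/// Left half black, right half white: every row is [0,0,0,0,255,255,255,255].
fn vertical_split() -> Block {
    let mut b = [[0u8; 8]; 8];
    for row in b.iter_mut() {
        for x in 4..8 {
            row[x] = 255;
        }
    }
    b
}

/// Top half black, bottom half white (control pattern; orientation assumed).
fn horizontal_split() -> Block {
    let mut b = [[0u8; 8]; 8];
    for y in 4..8 {
        b[y] = [255; 8];
    }
    b
}

/// One-pixel checkerboard (control pattern; cell size assumed).
fn checkerboard() -> Block {
    let mut b = [[0u8; 8]; 8];
    for y in 0..8 {
        for x in 0..8 {
            if (x + y) % 2 == 1 {
                b[y][x] = 255;
            }
        }
    }
    b
}

/// Serialise a block as a binary PGM (P5) image.
fn to_pgm(block: &Block) -> Vec<u8> {
    let mut out = b"P5\n8 8\n255\n".to_vec();
    for row in block {
        out.extend_from_slice(row);
    }
    out
}

fn main() {
    for (name, block) in [
        ("vertical_split", vertical_split()),
        ("horizontal_split", horizontal_split()),
        ("checkerboard", checkerboard()),
    ] {
        println!("{}: {} bytes as PGM", name, to_pgm(&block).len());
    }
}
```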

Add SimdOps::avx2_i16() to select the experimental 16-bit packed AVX2 DCT path, and Encoder::simd_ops() to inject it into the encoder. This allows reproducing the deringing + i16 SIMD overflow bug where the column-pass butterfly sum (8 × 5056 = 40,448) exceeds i16::MAX. Production paths use i32 intermediates and are immune.

Cross-references:
- PR: mozilla/mozjpeg#453
- Test patterns: imazen/codec-corpus imageflow/test_inputs/dct_overflow_patterns/
- Regression tests: tests/encode_tests.rs test_issue444_*

When overshoot deringing pushes inputs to ±158, the column-pass butterfly accumulates values that can exceed i16::MAX (e.g., 8 × 5056 = 40,448). Wrapping add caused catastrophic sign flips in decoded pixels.

All butterfly adds/subs in dodct() now use _mm256_adds_epi16 / _mm256_subs_epi16 (saturating). The column-pass descale rounding constant add is also saturating to prevent 32767 + 2 wrapping to -32767. Row pass uses wrapping _mm256_slli_epi16 (values naturally in safe range).

Worst-case saturation clamps coefficient [0,1] from 8294 to 8191 (1.2%), which is at most 1 quantization step on extreme deringing patterns.

Verified: all 6 synthetic overflow patterns pass (max_diff ≤ 2 vs i32 ref). See mozilla/mozjpeg#453.
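The descale step can be modelled in scalar form to see why the rounding add also needs to saturate. This is a sketch under assumptions, not the AVX2 code: the rounding constant of 2 comes from the "32767 + 2" example above, and the right shift by 2 is assumed because it reproduces the quoted clamp value (32767 >> 2 = 8191). The function names are hypothetical.

```rust
const ROUND: i16 = 2; // rounding constant, from the "32767 + 2" example
const SHIFT: u32 = 2; // assumed shift; 32767 >> 2 == 8191 matches the text

/// Descale with a wrapping rounding add (the behaviour being fixed).
fn descale_wrapping(x: i16) -> i16 {
    x.wrapping_add(ROUND) >> SHIFT
}

/// Descale with a saturating rounding add.
fn descale_saturating(x: i16) -> i16 {
    x.saturating_add(ROUND) >> SHIFT
}

/// Relative loss when an exact coefficient is replaced by a clamped one.
fn relative_loss(exact: i32, clamped: i32) -> f64 {
    f64::from(exact - clamped) / f64::from(exact)
}

fn main() {
    // An input that an earlier saturating add has already clamped.
    let x = i16::MAX;
    println!("wrapping:   {}", descale_wrapping(x));
    println!("saturating: {}", descale_saturating(x));
    println!("loss: {:.1}%", 100.0 * relative_loss(8294, 8191));
}
```

With a wrapping add, an input already clamped to 32767 flips to a large negative coefficient; with a saturating add it stays at 8191, and the remaining error against 8294 is about 1.2%.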

ziemek99 · 2026-03-29T10:23:33Z

simd/i386/jfdctint-sse2.asm

    paddw       xmm7, [GOTOFF(ebx,PW_DESCALE_P2X)]
    paddw       xmm5, [GOTOFF(ebx,PW_DESCALE_P2X)]


The initial "lost" fix (cdb6c34) also changed these to paddsw.

ziemek99 · 2026-03-29T10:23:51Z

simd/x86_64/jfdctint-sse2.asm

    paddw       xmm7, [rel PW_DESCALE_P2X]
    paddw       xmm5, [rel PW_DESCALE_P2X]


The initial "lost" fix (cdb6c34) also changed these to paddsw.

ziemek99 · 2026-03-29T10:24:17Z

simd/i386/jfdctint-mmx.asm

    paddw       mm5, [GOTOFF(ebx,PW_DESCALE_P2X)]
    paddw       mm7, [GOTOFF(ebx,PW_DESCALE_P2X)]


The initial "lost" fix (cdb6c34) also changed these to paddsw.

preprocess_deringing computed fslope and lslope in bare i16 arithmetic:

    let mut fslope = (f1 - f2).max(MAX_SAMPLE - f1);
    let mut lslope = (l1 - l2).max(MAX_SAMPLE - l1);

For the current in-contract use (level-shifted samples in -128..=127, with overshoot values up to MAX_SAMPLE + 31 = 158 written back into the block by a previous run), the largest magnitudes are well within i16 range: 158 - (-128) = 286 fits easily. But the subtractions are i16 - i16, so pathological callers, or a future switch to wider sample types, would wrap in release builds (panic in debug). That class of bug is exactly what the mozilla/mozjpeg#453 i16-SIMD-DCT overflow documented in CLAUDE.md warns about, except here in the scalar preprocessing step.

Widen the subtractions to i32 and saturate back to i16 via clamp. The pattern matches what catmull_rom() already does at src/deringing.rs:121 for its tangent computation. For in-contract data the clamp is a no-op; for out-of-contract data it clamps instead of wrapping.

zenjpeg sidesteps the whole issue by operating on f32 samples (zenjpeg/zenjpeg/src/encode/deringing.rs:113). Moving mozjpeg-rs to f32 would be the architecturally clean answer but would break the byte-parity property we depend on with C mozjpeg, so widen-to-i32 is the minimal-churn defensive fix.

Verified bit-exact with C mozjpeg via tests/parity_benchmark after the change:

- Baseline Q55-Q95: 0.00% delta, 0.00% max dev (all 6 levels)
- Progressive Q55-Q95: 0.00% delta, 0.00% max dev (all 6 levels)
- Trellis modes: -0.05% to -0.80% (unchanged from pre-fix numbers)
- Max Compression: -0.72% to +0.21% (unchanged from pre-fix numbers)

All 7 deringing unit tests pass.
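The widen-and-clamp pattern described in this commit message can be sketched as follows. This is not the mozjpeg-rs source: `widened_slope` is a hypothetical helper, and `MAX_SAMPLE = 127` is inferred from "MAX_SAMPLE + 31 = 158" above.

```rust
const MAX_SAMPLE: i16 = 127; // inferred from MAX_SAMPLE + 31 = 158

/// Compute max(s1 - s2, MAX_SAMPLE - s1) without i16 overflow:
/// do the subtractions in i32, then clamp the result back into i16.
fn widened_slope(s1: i16, s2: i16) -> i16 {
    let diff = i32::from(s1) - i32::from(s2);
    let headroom = i32::from(MAX_SAMPLE) - i32::from(s1);
    diff.max(headroom)
        .clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16
}

fn main() {
    // In-contract worst case from the text: 158 - (-128) = 286.
    println!("{}", widened_slope(158, -128));
    // Out-of-contract input: clamps instead of wrapping or panicking.
    println!("{}", widened_slope(i16::MAX, i16::MIN));
}
```

For in-contract samples the clamp never changes the value, so the result should be the same as the original i16 expression; only out-of-range inputs take the clamped path.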

lilith added 2 commits January 30, 2026 19:33

ziemek99 reviewed Mar 29, 2026

lilith wants to merge 2 commits into mozilla:master from imazen:fix/overshoot-deringing-overflow
