
QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated) #989

Open
alexanderaperry-arch wants to merge 1 commit into openai:main from alexanderaperry-arch:qat-swa-ablation

alexanderaperry-arch commented Mar 27, 2026

Leaderboard-relevant ablation: SWA and QAT are antagonistic

Systematic 2×2 factorial (QAT on/off × SWA on/off) on the PR #180 stack. Each configuration is validated with 3 seeds on 8xH100; all runs finish in under 10 min and fit under 16MB.

Result

| Config | QAT | SWA | Mean BPB (3 seeds) | Delta vs. control |
|---|---|---|---|---|
| no_swa_qat | Yes | No | 1.14018 | -3.64 mBPB |
| control | No | Yes | 1.14382 | baseline |
| qat_snap70 | Yes | Yes | 1.14468 | +0.86 mBPB |
| no_swa | No | No | 1.14486 | +1.04 mBPB |
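
The deltas in the table, and the size of the QAT × SWA interaction, can be re-derived from the four means. The following is a minimal sketch that uses only the numbers reported above; the variable names and the `mbpb` helper are illustrative and not part of the PR.

```python
# Mean BPB per config, copied from the table above.
# Keys are (qat_enabled, swa_enabled).
mean_bpb = {
    (True, False): 1.14018,   # no_swa_qat
    (False, True): 1.14382,   # control
    (True, True): 1.14468,    # qat_snap70
    (False, False): 1.14486,  # no_swa
}


def mbpb(delta_bpb):
    """Convert a BPB difference to milli-BPB, rounded to 2 decimals."""
    return round(delta_bpb * 1000, 2)


# Delta of every config against the control (SWA on, QAT off).
control = mean_bpb[(False, True)]
deltas = {cfg: mbpb(v - control) for cfg, v in mean_bpb.items()}

# Simple effect of turning SWA on at each QAT level (negative = SWA helps).
swa_effect_without_qat = mbpb(mean_bpb[(False, True)] - mean_bpb[(False, False)])
swa_effect_with_qat = mbpb(mean_bpb[(True, True)] - mean_bpb[(True, False)])

# Interaction contrast: how much the SWA effect changes once QAT is on.
interaction = round(swa_effect_with_qat - swa_effect_without_qat, 2)

print(deltas)
print(swa_effect_without_qat, swa_effect_with_qat, interaction)
```

The sign flip between the two simple effects is what "antagonistic" means here: SWA lowers BPB when QAT is off and raises it when QAT is on.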

Why this matters

Every QAT submission in this competition (#117, #139, smeargate_ortho) also used SWA, and every one underperformed non-QAT entries. Our ablation suggests why: SWA's checkpoint averaging dilutes the quantization-boundary alignment that QAT works to achieve. Combining them (1.14468) is worse than either QAT alone (1.14018) or SWA alone (1.14382).

The fix is simple: remove SWA when using QAT. This alone yields ~3.6 mBPB over the SWA-only control, and ~4.5 mBPB over the QAT + SWA run.
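
A toy illustration of the proposed mechanism, not the PR's training code: if each checkpoint's weights sit on the quantization grid, their average generally does not, so quantizing the averaged weights reintroduces rounding error. The step size, the two checkpoints, and the uniform round-to-nearest quantizer are all hypothetical choices made for this sketch.

```python
def quantize(w, step=0.1):
    """Round-to-nearest uniform quantizer (hypothetical step size)."""
    return round(w / step) * step


def quant_error(weights, step=0.1):
    """Mean absolute distance between weights and their quantized values."""
    return sum(abs(w - quantize(w, step)) for w in weights) / len(weights)


# Two hypothetical checkpoints whose weights already sit on the grid,
# but land in different bins for some entries.
ckpt_a = [0.1, -0.3, 0.5, 0.2]
ckpt_b = [0.2, -0.3, 0.4, 0.2]

# SWA-style uniform average of the two checkpoints.
averaged = [(a + b) / 2 for a, b in zip(ckpt_a, ckpt_b)]

print(quant_error(ckpt_a), quant_error(ckpt_b), quant_error(averaged))
```

Each checkpoint has essentially zero quantization error on its own, while the averaged weights land halfway between grid points wherever the checkpoints disagree.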

Actionable for competitors

  • If you're using SWA + QAT together, drop SWA
  • Measured against the no-QAT, no-SWA run, QAT alone gains 4.68 mBPB and SWA alone gains 1.04 mBPB, so QAT alone is about 4.5x more effective for quantization quality
  • Training val_bpb is misleading for QAT; post-quantization BPB is the metric that matters
  • QAT weights compress worse: they need ~10% magnitude pruning (vs. 3% without QAT) to fit under 16MB
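
The pruning fractions in the last bullet can be read as follows. This is a minimal sketch of magnitude pruning with a single global threshold; the PR does not say whether pruning is global or per-tensor, so that choice, the function name, and the example weights are assumptions.

```python
def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights.

    Assumes one global ranking over a flat list of weights.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    k = int(len(weights) * fraction)  # number of weights to zero
    if k == 0:
        return list(weights)
    # Indices sorted by magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical flat weight vector with 100 nonzero entries.
weights = [((-1) ** i) * (i + 1) / 100 for i in range(100)]

pruned_3 = magnitude_prune(weights, 0.03)
pruned_10 = magnitude_prune(weights, 0.10)

print(pruned_3.count(0.0), pruned_10.count(0.0))
```

Zeroed weights form repeated values that a general-purpose compressor can encode cheaply, which is why a higher pruning fraction helps an artifact fit under a size limit.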

Open question

Top entries now use EMA instead of SWA. Nagel et al. (2022) propose EMA to stabilize QAT, but our short-horizon results suggest that averaging mechanisms in general may conflict with QAT under tight wallclock constraints. The EMA × QAT interaction is untested here.
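
For reference, the two averaging schemes differ only in how they weight past checkpoints. The sketch below shows both update rules on plain Python lists; the decay value and function names are hypothetical, and it says nothing about how EMA interacts with QAT, which remains the open question.

```python
def swa_average(checkpoints):
    """SWA-style uniform average over a list of weight vectors."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]


def ema_update(ema, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Hypothetical two-weight model observed at two checkpoints.
ckpts = [[0.0, 2.0], [2.0, 4.0]]

swa = swa_average(ckpts)

# EMA starts from the first checkpoint and folds in the second.
ema = ema_update(ckpts[0], ckpts[1], decay=0.9)

print(swa, ema)
```

SWA weights every averaged checkpoint equally, while EMA discounts older weights geometrically, so the EMA result stays closer to one end of the trajectory depending on the decay.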

Validation

Non-record research submission. 2x2 factorial ablation of QAT x SWA
interaction on PR openai#180 stack (10L/512d/MLP3x).

Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed
mean) beats SWA alone (1.14382) by 3.64 mBPB. Combining them is worse
than either alone. This explains why prior QAT entries underperformed
non-QAT submissions in the competition.

3-seed validation (seeds 42, 1337, 2024), artifact under 16MB limit.
alexanderaperry-arch changed the title from "QAT x SWA Ablation: antagonistic interaction finding" to "QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated)" Mar 27, 2026
aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 28, 2026
slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
No SWA with QAT (PR openai#989)
QAT from 50% + range fix [-31,31]
mHC 22-param residual mixing (PR openai#928)
VE128 + no gated_attn + no value_residual (PR openai#549)
LZMA preset 7 compression (PR openai#999)
Muon TTT with NS3 (PR openai#999)
Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
Per-layer TTT LR (PR openai#995)
TTT momentum 0.95 (PR openai#995)