QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated) #989
Open
alexanderaperry-arch wants to merge 1 commit into openai:main
Non-record research submission: a 2x2 factorial ablation of the QAT x SWA interaction on the PR openai#180 stack (10L/512d/MLP3x). Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed mean) beats SWA alone (1.14382) by 3.64 mBPB, and combining them is worse than either alone. This explains why prior QAT entries underperformed non-QAT submissions in the competition. Validated on 3 seeds (42, 1337, 2024); the artifact is under the 16MB limit.
aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request on Mar 28, 2026
- slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
- No SWA with QAT (PR openai#989)
- QAT from 50% + range fix [-31,31]
- mHC 22-param residual mixing (PR openai#928)
- VE128 + no gated_attn + no value_residual (PR openai#549)
- LZMA preset 7 compression (PR openai#999)
- Muon TTT with NS3 (PR openai#999)
- Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
- Per-layer TTT LR (PR openai#995)
- TTT momentum 0.95 (PR openai#995)
Leaderboard-relevant ablation: SWA and QAT are antagonistic
Systematic 2×2 factorial (QAT on/off × SWA on/off) on the PR #180 stack. Validated with 3 seeds on 8xH100; all runs finish in under 10 min with artifacts under 16MB.
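As a rough sketch of the experimental grid described above, the snippet below enumerates the cells of a 2×2 factorial crossed with the three seeds named in the PR summary. It assumes every cell is run with every seed; the dictionary keys and the `factorial_runs` helper are hypothetical names for illustration, not the PR's actual launcher.

```python
from itertools import product

SEEDS = (42, 1337, 2024)  # seeds listed in the PR summary


def factorial_runs(seeds=SEEDS):
    """Enumerate every (qat, swa, seed) combination of the 2x2 factorial."""
    return [
        {"qat": qat, "swa": swa, "seed": seed}
        for qat, swa in product((False, True), repeat=2)  # 4 configurations
        for seed in seeds
    ]


runs = factorial_runs()
print(len(runs))  # 4 configurations x 3 seeds = 12 runs
```

Crossing both factors fully is what lets the interaction (QAT with SWA versus QAT without) be separated from the two main effects.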
Result
Why this matters
Every QAT submission in this competition (#117, #139, smeargate_ortho) also used SWA — and every one underperformed non-QAT entries. Our ablation shows why: SWA's checkpoint averaging dilutes the quantization-boundary alignment that QAT works to achieve. Combining them is worse than either alone.
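A toy illustration of the dilution argument, under loud assumptions: two checkpoints whose weights sit exactly on a uniform quantization grid are averaged, and the average lands between grid points. The `[-31, 31]` integer range is taken from the referencing commit's message; the step size, the weight values, and the helper names are invented for this sketch and are not taken from the PR's code. Real QAT weights are only approximately grid-aligned, so this exaggerates the effect for clarity.

```python
def quantize(w, scale, qmin=-31, qmax=31):
    """Round-to-nearest uniform quantizer; returns the dequantized value."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def quant_error(weights, scale):
    """Mean absolute error introduced by quantizing the given weights."""
    return sum(abs(w - quantize(w, scale)) for w in weights) / len(weights)


scale = 0.25  # hypothetical step size (exactly representable in binary floats)

# Two hypothetical checkpoints whose weights lie exactly on the grid.
ckpt_a = [0.25, -0.50, 0.75, 1.00]
ckpt_b = [0.50, -0.75, 0.50, 1.00]

# SWA-style uniform average of the two checkpoints.
averaged = [(a + b) / 2 for a, b in zip(ckpt_a, ckpt_b)]

print(quant_error(ckpt_a, scale))    # 0.0: already on the grid
print(quant_error(ckpt_b, scale))    # 0.0: already on the grid
print(quant_error(averaged, scale))  # 0.09375: the average falls between grid points
```

Wherever the two checkpoints disagree by one quantization step, the average sits half a step off the grid, which is the worst case for round-to-nearest.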
The fix is simple: remove SWA when using QAT. This alone yields an improvement of roughly 3.6 mBPB.
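For reference, the roughly 3.6 mBPB figure matches the gap between the two 3-seed means quoted in the PR summary (QAT alone 1.14018, SWA alone 1.14382). The arithmetic is sketched below, assuming mBPB means one thousandth of a bit per byte and that lower BPB is better; `gap_mbpb` is a hypothetical helper name.

```python
# 3-seed means quoted in the PR summary (lower is better).
QAT_ONLY_BPB = 1.14018
SWA_ONLY_BPB = 1.14382


def gap_mbpb(baseline_bpb, improved_bpb):
    """Improvement of `improved_bpb` over `baseline_bpb`, in milli-BPB."""
    return (baseline_bpb - improved_bpb) * 1000.0


print(round(gap_mbpb(SWA_ONLY_BPB, QAT_ONLY_BPB), 2))  # 3.64
```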
Actionable for competitors
Open question
Top entries now use EMA instead of SWA. Nagel et al. (2022) propose EMA to stabilize QAT, but our short-horizon results suggest averaging mechanisms in general may conflict with QAT under tight wallclock constraints. The EMA × QAT interaction is untested.
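To make the distinction concrete, here is a minimal sketch of the two averaging rules in their textbook form: a uniform running mean over checkpoints (SWA-style) and an exponential moving average. The function names, the decay value, and the toy checkpoints are assumptions for illustration; the actual submissions may apply averaging at different granularities or schedules.

```python
def swa_update(avg, new, n_averaged):
    """Uniform running mean: every checkpoint seen so far gets equal weight."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


def ema_update(avg, new, decay):
    """Exponential moving average: recent checkpoints get more weight."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]


# Hypothetical two-weight checkpoints from three points in training.
checkpoints = [[0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]

swa = list(checkpoints[0])
ema = list(checkpoints[0])
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)          # n checkpoints averaged so far
    ema = ema_update(ema, ckpt, decay=0.5)  # hypothetical decay value

print(swa)  # [2.0, 2.0]: plain mean of the three checkpoints
print(ema)  # [2.5, 1.5]: pulled toward the most recent checkpoint
```

Both rules produce blends of checkpoints rather than any single trained checkpoint, which is why the open question applies to EMA as well; whether EMA's bias toward recent weights is enough to preserve QAT's alignment is exactly what remains untested.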
Validation