Non-record: Sliding Patch Attentions + MoE (2-layer compact run) #981
Open
BurguerJohn wants to merge 1 commit into openai:main from
Conversation
Summary
This PR adds a non_record_16mb submission under:
records/track_non_record_16mb/2026-03-27_SlidingPatchAttentions_Plus_Moe

Included files:
- train_gpt.py
- train.log
- README.md
- submission.json
Final metrics
- val_bpb: 1.48926280
- val_loss: 2.51455785
- bytes_total: 3938328
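For context on how these three numbers relate: a common bits-per-byte definition divides the total validation cross-entropy (in nats) by ln(2) times the raw byte count. The token count is not reported in this PR, so the helper below is a hedged sketch of that relationship, not code from train_gpt.py:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    # Total nats = mean per-token loss * token count; divide by ln(2) to
    # convert nats to bits, then by the raw byte count so that runs with
    # different tokenizers/vocab sizes stay comparable.
    return mean_loss_nats * num_tokens / (math.log(2) * num_bytes)

# Under this definition, the reported val_loss, val_bpb, and bytes_total
# together imply roughly 0.41 tokens per byte (~2.4 bytes per token):
# 1.48926280 ≈ 2.51455785 * num_tokens / (ln(2) * 3938328)
```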
Run details
- 1x NVIDIA H100 80GB HBM3
- 600s
- 2869/20000 steps
- 4,198,928
- VOCAB_SIZE=1024 NUM_LAYERS=2 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2 TIE_EMBEDDINGS=1
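The last line reads like environment-variable overrides for the training script. A minimal sketch of how such knobs might be wired up, assuming integer env vars with defaults (the env_int helper and the defaults here are illustrative; how train_gpt.py actually reads its config may differ):

```python
import os

def env_int(name: str, default: int) -> int:
    # Hypothetical helper: read an integer hyperparameter from the
    # environment, falling back to the default when the variable is unset.
    return int(os.environ.get(name, default))

VOCAB_SIZE     = env_int("VOCAB_SIZE", 1024)
NUM_LAYERS     = env_int("NUM_LAYERS", 2)
MODEL_DIM      = env_int("MODEL_DIM", 512)
NUM_HEADS      = env_int("NUM_HEADS", 8)
NUM_KV_HEADS   = env_int("NUM_KV_HEADS", 4)   # grouped-query attention: 8 query heads share 4 KV heads
MLP_MULT       = env_int("MLP_MULT", 2)       # FFN hidden width = MLP_MULT * MODEL_DIM
TIE_EMBEDDINGS = bool(env_int("TIE_EMBEDDINGS", 1))  # share input/output embedding weights

# Basic sanity checks implied by the head counts above.
assert MODEL_DIM % NUM_HEADS == 0 and NUM_HEADS % NUM_KV_HEADS == 0
```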
Notes
This submission comes from an experimental branch that includes sliding-patch attention and MoE/router code paths in train_gpt.py. For the exact scored run in train.log, the active configuration is the compact 2-layer setup above, and the log reports moe_layers: 0/2, so the measured result should be treated as a compact non-record baseline from this branch rather than a full MoE-enabled run.
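Since moe_layers: 0/2 is what makes this a dense baseline, a short hedged sketch of that kind of per-layer gating may help readers skimming the log. DenseMLP, MoEMLP, and the selection loop below are illustrative stand-ins, not the actual classes in this branch's train_gpt.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMLP(nn.Module):
    """Ordinary feed-forward sublayer, used when a block is not MoE-enabled."""
    def __init__(self, dim: int, mult: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(dim, mult * dim)
        self.fc2 = nn.Linear(mult * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

class MoEMLP(nn.Module):
    """Tiny top-1 routed mixture of DenseMLP experts."""
    def __init__(self, dim: int, num_experts: int = 4, mult: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(DenseMLP(dim, mult) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, idx = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                      # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        return gate.unsqueeze(-1) * out

# With the log's moe_layers: 0/2, no block selects the MoE path, so the
# scored run behaves as an ordinary dense 2-layer transformer even though
# the router/MoE code is present in the file.
NUM_LAYERS, MOE_LAYERS = 2, 0
mlps = nn.ModuleList(
    MoEMLP(512) if layer < MOE_LAYERS else DenseMLP(512)
    for layer in range(NUM_LAYERS)
)
```

The only point of the sketch is that with zero MoE-enabled layers, no router is ever constructed or used, which is why the result should be read as a dense baseline rather than an MoE measurement.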