Non-record: Sliding Patch Attentions + MoE (2-layer compact run)#981

Open
BurguerJohn wants to merge 1 commit into openai:main from BurguerJohn:nonrecord

Conversation

@BurguerJohn

Summary

This PR adds a non_record_16mb submission under:

records/track_non_record_16mb/2026-03-27_SlidingPatchAttentions_Plus_Moe

Included files:

  • train_gpt.py
  • train.log
  • README.md
  • submission.json

Final metrics

  • val_bpb: 1.48926280
  • val_loss: 2.51455785
  • bytes_total: 3938328
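As a sanity check on these numbers, the usual bits-per-byte convention is bpb = (mean nats per token) × tokens / (ln 2 × bytes). The sketch below back-solves the implied validation token count from the reported figures; it is an assumption about the scoring convention, not something read from train.log, and the implied token count is derived, not logged.

```python
import math

# Hedged sketch: the bits-per-byte convention assumed to relate the
# reported metrics. Reported values from the PR:
val_loss = 2.51455785      # mean cross-entropy, nats per token
val_bpb = 1.48926280       # bits per byte
bytes_total = 3938328      # validation bytes

# Total bits needed to encode the validation set at this bpb.
total_bits = val_bpb * bytes_total

# Implied token count: bits per token is val_loss / ln(2).
implied_tokens = total_bits * math.log(2) / val_loss

bytes_per_token = bytes_total / implied_tokens
print(round(implied_tokens), round(bytes_per_token, 2))
```

With a 1024-entry vocabulary this works out to roughly 2.4 bytes per token, which is a plausible compression ratio for a small BPE-style tokenizer.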

Run details

  • Hardware: 1x NVIDIA H100 80GB HBM3
  • Wallclock cap: 600s
  • Timed stop: 2869/20000 steps
  • Parameters: 4,198,928
  • Layout: VOCAB_SIZE=1024 NUM_LAYERS=2 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2 TIE_EMBEDDINGS=1
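The layout above roughly accounts for the reported parameter count. The sketch below is an approximation under assumed architectural details (un-gated 2× MLP, grouped-query K/V projections, per-layer RMSNorms, no biases) that are not read from train_gpt.py; it lands within about 0.05% of the reported 4,198,928.

```python
# Hedged sketch: approximate parameter count for the reported layout.
# MLP gating, norm placement, and bias-free projections are assumptions,
# not taken from train_gpt.py; the goal is only a sanity check.
VOCAB_SIZE, NUM_LAYERS, MODEL_DIM = 1024, 2, 512
NUM_HEADS, NUM_KV_HEADS, MLP_MULT = 8, 4, 2
head_dim = MODEL_DIM // NUM_HEADS  # 64

emb = VOCAB_SIZE * MODEL_DIM  # counted once: TIE_EMBEDDINGS=1
attn = (MODEL_DIM * MODEL_DIM                       # Wq
        + 2 * MODEL_DIM * NUM_KV_HEADS * head_dim   # Wk, Wv (GQA)
        + MODEL_DIM * MODEL_DIM)                    # Wo
mlp = 2 * MODEL_DIM * (MLP_MULT * MODEL_DIM)        # up + down, un-gated
norms = 2 * MODEL_DIM                               # two RMSNorms per layer

total = emb + NUM_LAYERS * (attn + mlp + norms) + MODEL_DIM  # + final norm
print(total)  # 4196864, ~0.05% below the reported 4,198,928
```

The small remainder is consistent with a few extra norm or scale vectors not modeled here.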

Notes

This submission comes from an experimental branch that includes sliding-patch attention and MoE/router code paths in
train_gpt.py.

For the exact scored run recorded in train.log, the active configuration is the compact 2-layer setup above, and the log reports moe_layers: 0/2. The measured result should therefore be treated as a compact non-record baseline from this branch rather than a full MoE-enabled run.
