Non-record: Sliding Patch Attentions + MoE (2-layer compact run) #981
Open
BurguerJohn wants to merge 1 commit into openai:main from
Conversation
Summary
This PR adds a non_record_16mb submission under:
records/track_non_record_16mb/2026-03-27_SlidingPatchAttentions_Plus_Moe

Included files:
- train_gpt.py
- train.log
- README.md
- submission.json
Final metrics
- val_bpb: 1.48926280
- val_loss: 2.51455785
- bytes_total: 3938328
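For context on how these three numbers relate: a common bits-per-byte definition divides the total validation cross-entropy (in nats) by ln(2) times the raw byte count. The token count is not reported in this PR, so the helper below is a hedged sketch of that relationship, not code from train_gpt.py:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    # Total nats = mean per-token loss * token count; divide by ln(2) to
    # convert nats to bits, then by the raw byte count so that runs with
    # different tokenizers/vocab sizes stay comparable.
    return mean_loss_nats * num_tokens / (math.log(2) * num_bytes)

# Under this definition, the reported val_loss, val_bpb, and bytes_total
# together imply roughly 0.41 tokens per byte (~2.4 bytes per token):
# 1.48926280 ≈ 2.51455785 * num_tokens / (ln(2) * 3938328)
```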
Run details
- 1x NVIDIA H100 80GB HBM3
- 600s
- 2869/20000 steps
- 4,198,928
- VOCAB_SIZE=1024 NUM_LAYERS=2 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2 TIE_EMBEDDINGS=1
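The last line reads like environment-variable overrides for the training script. A minimal sketch of how such knobs might be wired up, assuming integer env vars with defaults (the env_int helper and the defaults here are illustrative; how train_gpt.py actually reads its config may differ):

```python
import os

def env_int(name: str, default: int) -> int:
    # Hypothetical helper: read an integer hyperparameter from the
    # environment, falling back to the default when the variable is unset.
    return int(os.environ.get(name, default))

VOCAB_SIZE     = env_int("VOCAB_SIZE", 1024)
NUM_LAYERS     = env_int("NUM_LAYERS", 2)
MODEL_DIM      = env_int("MODEL_DIM", 512)
NUM_HEADS      = env_int("NUM_HEADS", 8)
NUM_KV_HEADS   = env_int("NUM_KV_HEADS", 4)   # grouped-query attention: 8 query heads share 4 KV heads
MLP_MULT       = env_int("MLP_MULT", 2)       # FFN hidden width = MLP_MULT * MODEL_DIM
TIE_EMBEDDINGS = bool(env_int("TIE_EMBEDDINGS", 1))  # share input/output embedding weights

# Basic sanity checks implied by the head counts above.
assert MODEL_DIM % NUM_HEADS == 0 and NUM_HEADS % NUM_KV_HEADS == 0
```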
Notes
This submission comes from an experimental branch that includes sliding-patch attention and MoE/router code paths in train_gpt.py. For the exact scored run in train.log, the active configuration is the compact 2-layer setup above, and the log reports moe_layers: 0/2, so the measured result should be treated as a compact non-record baseline from this branch rather than a full MoE-enabled run.
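Since moe_layers: 0/2 is what makes this a dense baseline, a short hedged sketch of that kind of per-layer gating may help readers skimming the log. DenseMLP, MoEMLP, and the selection loop below are illustrative stand-ins, not the actual classes in this branch's train_gpt.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMLP(nn.Module):
    """Ordinary feed-forward sublayer, used when a block is not MoE-enabled."""
    def __init__(self, dim: int, mult: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(dim, mult * dim)
        self.fc2 = nn.Linear(mult * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

class MoEMLP(nn.Module):
    """Tiny top-1 routed mixture of DenseMLP experts."""
    def __init__(self, dim: int, num_experts: int = 4, mult: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(DenseMLP(dim, mult) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, idx = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                      # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        return gate.unsqueeze(-1) * out

# With the log's moe_layers: 0/2, no block selects the MoE path, so the
# scored run behaves as an ordinary dense 2-layer transformer even though
# the router/MoE code is present in the file.
NUM_LAYERS, MOE_LAYERS = 2, 0
mlps = nn.ModuleList(
    MoEMLP(512) if layer < MOE_LAYERS else DenseMLP(512)
    for layer in range(NUM_LAYERS)
)
```

The only point of the sketch is that with zero MoE-enabled layers, no router is ever constructed or used, which is why the result should be read as a dense baseline rather than an MoE measurement.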