Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB #970

Open
dnldsz wants to merge 2 commits into openai:main from dnldsz:submission-clean

Conversation


@dnldsz dnldsz commented Mar 27, 2026

Summary

  • Gated DeltaNet selective state space model using production Triton kernels from the flash-linear-attention (fla) library
  • 12 layers, 384d, ~13.7M params, 15.79 MB int8+zlib (under 16MB limit)
  • val_bpb: 1.2907 (int8+zlib roundtrip), 1.2781 pre-quant
  • Trained on 8×H100 for 10 minutes (~4,962 steps at 121ms/step)
  • Non-record unlimited compute track

Architecture

Replaces attention with Gated DeltaNet (delta-rule SSM):

  • State update: S_t = α_t · S_{t-1} · (I − β_t · k_t kᵀ_t) + β_t · v_t · kᵀ_t
  • Chunk-parallel scan via fla's fused Triton kernels (chunk_size=64)
  • U-Net skip connections, LeakyReLU(0.5)² MLP, BigramHash embedding, z-loss, polynomial softcap
  • Muon optimizer for 2D weights; delta-rule params (a_proj, b_proj, A_log, dt_bias) explicitly routed to Adam
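The state update above can be sanity-checked with a naive per-timestep recurrence. The sketch below is an illustrative numpy reference of the gated delta rule, not the chunk-parallel fused Triton path from fla; the function name and argument layout are hypothetical:

```python
import numpy as np

def gated_delta_net_scan(alpha, beta, k, v):
    """Naive sequential gated delta-rule scan (reference only).

    alpha: (T,) decay gates in (0, 1]
    beta:  (T,) write strengths in (0, 1]
    k:     (T, d_k) keys (assumed unit-norm)
    v:     (T, d_v) values
    Returns the final state S of shape (d_v, d_k).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    I = np.eye(d_k)
    for t in range(T):
        # S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
        erase = I - beta[t] * np.outer(k[t], k[t])
        S = alpha[t] * (S @ erase) + beta[t] * np.outer(v[t], k[t])
    return S
```

With `alpha = beta = 1` and orthogonal unit keys, the rank-1 erase term removes any prior content along `k_t` before writing, so each key retrieves exactly its associated value from the final state.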

Setup

```
pip install flash-linear-attention einops sentencepiece
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Notes

Submitted as a non-record SSM baseline. GDN weights compress less efficiently than transformer weights (~2.8× vs ~3.7×), limiting model width to 384d at 16MB. Future work: QAT, hybrid SSM+attention, longer context.
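The int8+zlib size measurement referenced above can be reproduced along these lines. This is a minimal sketch assuming per-tensor symmetric quantization (the function name is hypothetical, and the submission's actual quantization scheme may differ):

```python
import zlib
import numpy as np

def int8_zlib_size(tensors, level=9):
    """Quantize each float tensor to int8 with a per-tensor symmetric
    scale, zlib-compress the concatenated bytes, and return
    (compressed_size_in_bytes, dequantized_tensors)."""
    payload = bytearray()
    deq = []
    for w in tensors:
        scale = float(np.abs(w).max()) / 127.0
        if scale == 0.0:
            scale = 1.0  # all-zero tensor: any scale works
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        payload += q.tobytes()
        deq.append(q.astype(np.float32) * scale)
    return len(zlib.compress(bytes(payload), level)), deq
```

Running the dequantized weights back through evaluation is what gives the round-trip val_bpb; the compressed byte count is what must stay under the 16MB budget.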
