
Non-record: 24.7M params · int6 · Binary U-Net/SmearGate/BigramHash · 1.5hr · RTX 5060 Ti 16GB #997

Open
randy06122001-boop wants to merge 1 commit into openai:main from randy06122001-boop:main

Conversation


@randy06122001-boop commented Mar 28, 2026

24.7M params · int6 · Binary U-Net/SmearGate/BigramHash · 1.5hr · RTX 5060 Ti 16GB

Non-record submission - trained on RTX 5060 Ti 16GB for 1.5 hours.

Approach

  • Binary U-Net / SmearGate / BigramHash
  • Quantized to 6 bits (int6)
  • 24.7M parameters

Int6 Quantization + 10L (5 Encoder / 5 Decoder) + Muon + 3x ReLU² MLP + SmearGate + BigramHash + SWA + RTX 5060 Ti

val_bpb: 1.4182 (roundtrip, seed=1337) | 11.63 MB artifact | NVIDIA RTX 5060 Ti, 2469 steps (~1.5h)

Results (seed=1337, RTX 5060 Ti)

| Metric              | Value                       |
| ------------------- | --------------------------- |
| val_bpb (roundtrip) | 1.4182                      |
| val_loss            | 2.3946                      |
| Steps               | 2469                        |
| ms/step             | 2187.4                      |
| Training time       | 5,400 s (~1.5 h)            |
| Artifact size       | 11,633,008 bytes (11.63 MB) |
| Parameters          | 24,730,704                  |

Hardware & Environment

This run was executed on a local NVIDIA GeForce RTX 5060 Ti (16 GB). The run was done on Windows, where Triton is unavailable, so torch.compile was disabled. It demonstrates that architectural features like SmearGate and int6 quantization can meaningfully improve quality-per-byte even on consumer-grade hardware.

Architecture

  • 10 transformer layers (U-Net style: 5 encoder, 5 decoder blocks)
  • Model Dimension: 512, 8 heads, 4 KV heads (GQA)
  • Quantization: Int6 per-row for block weights, Int8/FP16 for others.
  • 3x MLP expansion: (hidden=1536) with ReLU² activation.
  • SmearGate: causal blending of token embeddings with previous context.
  • BigramHash: 4096-bucket hash embedding for consecutive token pairs.
  • ResidMix: Learned residual blending across blocks.
  • Embedding: Tied 1024-vocab embedding.
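SmearGate and BigramHash are each described in a single line above; a minimal PyTorch sketch of one plausible reading (the PR's exact formulation may differ, and all module/parameter names here are illustrative) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Causally blends each token embedding with the previous token's
    embedding via a learned per-channel gate (sketch, not the PR's code)."""
    def __init__(self, dim: int):
        super().__init__()
        # Init at -4 so sigmoid(gate) ~ 0.02 and the module starts near identity.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); previous-token embedding, zero-padded at t=0
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev  # convex causal blend

class BigramHash(nn.Module):
    """Hashes each (previous, current) token pair into 4096 buckets and
    looks up a learned bucket embedding -- a cheap bigram feature."""
    def __init__(self, dim: int, n_buckets: int = 4096):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids; pad with id 0 at t=0
        prev = F.pad(ids, (1, 0))[:, :-1]
        h = (prev * 1000003 + ids) % self.n_buckets  # simple multiplicative hash
        return self.emb(h)
```

The output of `BigramHash` would typically be added to the (tied) token embeddings before the first block, giving the model direct access to pair statistics without a full bigram table.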

Optimization

  • Muon Optimizer: Newton-Schulz iteration (5 steps) for all matrix parameters with 0.04 weight decay.
  • AdamW: Used for scalar parameters and embeddings with 0.04 weight decay.
  • SWA (Stochastic Weight Averaging): Averaged 20 checkpoints during the warmdown phase for better generalization.
  • Orthogonal Initialization: Used for all matrix weights.

Compression

  • Int6 + Zstd-22: Int6 quantization for block weights allows for significantly more parameters/layers within the 16MB limit, achieving a 3.73x payload compression ratio.
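Per-row symmetric int6 quantization can be sketched as follows (a minimal NumPy version; the PR's actual rounding, bit-packing, and Zstd-22 step are not shown here, and the helper names are illustrative):

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization: each row gets its own scale so
    values map into [-31, 31] (6 bits, one sign bit)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Stored at 6 bits per weight (four values packed into three bytes) plus per-row scales, and with Zstd on top, this is how substantially more parameters fit under the 16 MB artifact limit than an fp16 export would allow.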

Evaluation

  • Roundtrip Validation: Final metrics verified after dequantization to ensure performance persists in the exported artifact.
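Roundtrip validation means evaluating the model after its weights have gone through the same quantize/dequantize cycle as the exported artifact. A hedged sketch of that loop (all function and parameter names here are hypothetical; the PR's evaluation script is not shown):

```python
import torch

@torch.no_grad()
def roundtrip_val_loss(model, val_loader, quantize_fn, dequantize_fn):
    """Replaces every Linear weight with its quantize->dequantize roundtrip,
    then evaluates, so the metric reflects the artifact rather than the
    full-precision training weights."""
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            q, scale = quantize_fn(m.weight)
            m.weight.copy_(dequantize_fn(q, scale))
    model.eval()
    total, n = 0.0, 0
    for x, y in val_loader:
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1))
        total += loss.item() * y.numel()
        n += y.numel()
    return total / n
```

Reporting val_bpb this way (rather than on the pre-quantization weights) is what guarantees the 1.4182 figure survives in the 11.63 MB export.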

Files

    • quantized model
    • training script
    • training logs
