Skip to content

ClownCar: Frugendorff compression baseline + canonical DeltaNet integration#990

Open
newjordan wants to merge 39 commits intoopenai:mainfrom
newjordan:test
Open

ClownCar: Frugendorff compression baseline + canonical DeltaNet integration#990
newjordan wants to merge 39 commits intoopenai:mainfrom
newjordan:test

Conversation

@newjordan
Copy link
Copy Markdown

@newjordan newjordan commented Mar 28, 2026

image

The Frugendorff Architecture — Origin and Signal Analysis

The Frugendorff (F-Wing) crawler is a weight-shared recurrent architecture designed around one observation: parameter golf rewards compression, and the most efficient way to compress a transformer is to share weights across depth while running them repeatedly.

Thesis: The Frugendorff (F-Wing) crawler explores a simple idea;

Increase effective depth per stored byte by reusing weights across loops, then reallocate saved parameters into width.

Structure

Input → [2 flat encoder layers] → [1 shared crawler block × 4 loops] → [2 flat decoder layers] → Output
         unique weights             shared weights, looped                unique weights

The crawler block is one set of weights executed 4 times. The flat layers are unique. Total params: ~13.4M at our standard config (dim=512, 4 flat + 1 crawler × 4 loops). For comparison: a standard 11-layer transformer at the same dim runs 27M params. We get 8 effective depth passes at 13.4M params vs 11 unique passes at 27M.

Flow Instructions (FX-Wing innovation): instead of static per-loop positional offsets, each loop projects the current hidden state x through a shared bottleneck to produce a loop-specific perturbation:

inst_k = loop_inst_up[loop](loop_inst_proj(x))  # recomputed from current x each loop
x_loop = x + inst_k

This makes each loop's instruction respond to what the previous loop produced — genuine iterative refinement rather than a fixed plan.


What We Actually Found: Crawler Signal Analysis

We ran a comprehensive signal analysis across 8 micro-crawler TSVs, 175 Frugendorff sweep configs, production cadence ablations, and pod logs. Seven statistical tests applied.

Finding 1: Width is the primary lever (~0.033 BPB)

Config dim BPB Delta vs flat
3f+1cx2 608 2.157 −0.033
4f+1cx2 544 2.191 +0.001
6flat ctrl 496 2.190 baseline

The 0.033 BPB advantage tracks dim, not sharing. Fewer unique layers → fixed params spread across more channels per layer → wider dim → better quality. When dim is equalized, the advantage vanishes.

Finding 2: Recursion provides zero per-step benefit

C-steps (crawler fires) vs N-steps (no crawler) produce identical training loss at every phase. No momentum, no gradient interference difference, no variance difference. The second-through-fourth passes through the shared block don't "refine" in any measurable per-step sense.

Finding 3: More looping = early boost that decays

Config s50 delta s500 delta
3f+1cx2 −0.086 −0.033 (stable)
3f+1cx3 −0.092 −0.017 (decayed 81%)

Additional loops give early gains that fade over training.

Finding 4: ~0.01 BPB from implicit regularization

The 4f+1cx2 (no trigram, dim=544, 16.8M params) beats 6flat (trigram, dim=496, 17.9M params) by 0.009 BPB. With 6% fewer params and no trigram assist, this gap can't be entirely explained by width — approximately 0.01 BPB is from weight sharing as implicit regularization.

Finding 5: Post-processing is hostile to shared weights

SWA + quantization widen the gap with more recursion: quant gap 0.136 (4x2 cad1) vs 0.059 (4x2 cad4). More looping = worse quantization resilience. The crawler's int8 quantization mode (CRAWLER_QUANT_INT8=1) was specifically added to mitigate this.


The Compression Story

We ran a control experiment: a misconfigured ClownCar (missing USE_CRAWLER=1) ran a plain 27M-param GPT against our 13.4M crawler head-to-head.

Model Params int6+zstd size BPB Submittable
Plain GPT (control) 27M 16.6MB 1.1230 ❌ over limit
A-Wing SOTA ~26M ~15.5MB 1.1129
ClownCar (crawler) 13.4M ~9.1MB 1.1813 ± 0.0002
ClownCar_II (+ canonical DeltaNet [#875]) ~13.5M ~9.5MB pending

The 27M GPT cannot submit — 16.6MB exceeds the 16MB cap. Our crawler at 9.1MB has 6.9MB of headroom. At 1.1813 BPB vs the 27M GPT's 1.1230, we are within 0.058 BPB of a model twice our size that can't even enter.

Half the size. 94% of SOTA quality. Confirmed across 3 seeds.


ClownCar — Confirmed Results (3 Seeds)

experiments/ClownCar/ — FX_Wing_Delta stripped to its legal core. No ngram eval (ruled illegal). Sliding window only.

Seed BPB (sliding window) Size (int6+zstd) Post-EMA BPB Steps
1337 1.1812 9.05MB 1.1999 7,321
42 1.1815 9.15MB 1.1999 7,320
123 1.1813 8.98MB 1.2003 7,317
Mean 1.18133 9.06MB
Std dev 0.00015

Variance: 0.00015 BPB across 3 seeds. The number is 1.1813.

Config: 4 flat + 1 crawler × 4 loops, INST_DIM=32, CRAWLER_QUANT_INT8=1, DELTA_NET_HEADS=0, WARMDOWN_ITERS=2000.


Where We're Headed: Ludicrous Speed

The ClownCar_II kernel injection is the first step in a broader direction: replace every Python-level overhead loop in the module with native compiled kernels.

The overhead map

Component Current Bottleneck Fix
DeltaNet (#875, arxiv 2406.06484) Python loop over T @torch.compiler.disable, 1000× slower chunk_delta_rule (FLA Triton)
GDN (Cambrian) Python loop over 32 chunks × 64 tokens 19× slower (3.87s/step vs 0.2s) chunk_delta_rule or vectorized form
Short convolutions Pure PyTorch Conv1d JIT-able but not fused Fuse with projection in custom kernel
Crawler loop itself Python for loop in range(4) Unrollable but PyTorch graph overhead CUDA graph capture across loops

The thesis

Every for t in range(T): in a neural network is a promise that someone will eventually replace it with a Triton kernel. The FLA library by Songlin Yang & Yu Zhang (#875) is the reference implementation of that promise for linear attention. chunk_delta_rule parallelizes the delta rule over sequence chunks exactly the way flash_attn parallelized softmax attention.

The next frontier: capture the entire [encoder → crawler×4 → decoder] computation graph as a single CUDA graph, eliminating Python dispatch overhead between loops entirely.

ClownCar_III

Reserved for the next injection experiment — CUDA graph capture across crawler loops and/or vectorized GDN chunk form. Target TBD based on ClownCar_II results.


Files

  • experiments/ClownCar/run.sh — sliding window baseline, DELTA_NET_HEADS=0
  • experiments/ClownCar/train_gpt.py — FX_Wing_Delta engine (3283 lines)
  • experiments/ClownCar_II/run.sh — canonical DeltaNet (#875), DELTA_NET_HEADS=4
  • experiments/ClownCar_II/train_gpt.py — adds CanonicalDeltaNet + chunk_delta_rule import

Other parts of this will probably go towards this PR thiniking - #875 (comment)

frugendorff

clown_car_1

Octavian and others added 30 commits March 26, 2026 00:23
3D cubric pattern recognizer (54 warm-started adaptive multipliers)
+ complementary training. Seeds: 1337=0.4818, 300=0.4821, 58=0.4821.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Three variants targeting the 0.187 BPB gap to openai#1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all openai#809 techniques + fixed order mults (fire first)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Cubric 3D back online (CADENCE=32, warm-start)
- Per-order entropy center shift from openai#809
- Alpha 0.05-0.60, clip 0.95
- Our sliding-window TTT spliced in (1 epoch, SGD, freeze 2 blocks)
- TTT runs BEFORE n-gram eval → adapted model feeds n-gram

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Port openai#809 LoRA TTT: rank-8 adapters on Q/V/LM head, AdamW, Polyak
- Add LoRA injection to CausalSelfAttention, Block, GPT forward paths
- 53s vs our old 410s TTT, 6x better BPB gain
- Cubric 3D ON + entropy shift + alpha 0.05-0.60 clip 0.95

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Fixed mults + entropy shift + alpha 0.05-0.60 clip 0.95 (no cubric).
Base sliding: 1.1194, n-gram9: 0.4512. Delta from X-WING: -0.031.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Deleted LoRA TTT abomination. bwing_III is now a clean copy of our
best scoring variant for further iteration.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
bwing_IV: Prime fix only — adds primes 283721, 347237 to eliminate
XOR hash collisions for orders 8-9 (the 2.0x multiplier orders).
With 7 primes, prime[7] wrapped to prime[0], causing context tokens
at positions j-8 and j-1 to cancel when equal.

bwing_V: Prime fix + cubric 3D stacked on top of fixed mults.
Cubric warm-starts at 1.0 (neutral) and refines per (order × entropy
× count) on top of the fixed order multiplier scaling.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Adapted from old setup.sh. Fixes FA3 detection (old one skipped FA3
when FA2 was present), uses sp1024 dataset, adds zstandard install.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Standalone eval script loads final_model.int6.ptz once, then sweeps:
- alpha_max: [0.50, 0.60, 0.70, 0.80]
- entropy_center: [2.0, 2.5, 3.0]
- high_order_mult: [1.5, 2.0, 2.5, 3.0]
- min_count: [1, 2]
- cubric: [on, off]
= 192 configs, ~3 min each, sorted by aggressiveness (best-first).
Results to sweep_results.csv.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
openai#809 uses INT5 — more aggressive quantization creates more entropy in
the post-quant model, letting n-gram eval rescue harder. Their quant
loss is 0.019 vs our 0.006 (INT6), but n-gram extracts 0.869 vs 0.668.

Changes from bwing_IV:
- clip_range: 31 → 15 in gptq_quantize_weight, quantize_int6_per_row,
  and _find_best_row_scales
- No cubric (it hurt in bwing_V)
- 9 hash primes (from bwing_IV)
- All openai#809 n-gram params (fixed mults, entropy shift, alpha curve)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Clean submission-ready code. 2140 → 1936 lines (-204).
Removed all dead code paths that aren't used in our config.
INT5 GPTQ + 9-prime hash fix remain as the key changes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
A-Wing Green (INT5 GPTQ + 9-prime):
  - Post-quant sliding: 1.1410 (vs 1.1194 INT6)
  - N-gram reduction: 0.683 (vs 0.668 INT6 — +0.015 more)
  - Final: 0.4576 BPB — worse than SOTA by 0.006
  - Conclusion: INT5 quant noise hurts more than n-gram gains

bwing_V (9-prime + cubric stacked on fixed mults):
  - Final: 0.4601 BPB — cubric on top of fixed mults HURTS by 0.009
  - Cubric over-corrected (orders 2-3 suppressed to 0.62x on top of 0.3x)

SOTA remains bwing_full_port at 0.4512 BPB (INT6, fixed mults, no cubric).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Instead of entropy-adaptive alpha (blind proxy), compare actual model_p
vs ngram_p per token. Soft sigmoid on log-ratio:
  alpha = 0.95 * sigmoid(8 * log(ngram_p / model_p))

When ngram_p > model_p: alpha → 0.95 (trust n-gram)
When ngram_p < model_p: alpha → 0.0 (trust model)
No wasted mixing on tokens where n-gram is worse.

Base: SOTA bwing_full_port + 9-prime hash fix. INT6, no cubric.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
openai#809 trains for 525s, leaving 75s for GPTQ. We were using the full
600s default. 570s leaves 30s for GPTQ calibrate (3.4s) + quantize
(~25s) with headroom.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- run.sh now checks zstandard + flash_attn BEFORE training starts
- Fails fast if zstandard missing (prevents 17MB zlib artifacts)
- Shows FA version for debugging
- train_gpt.py warns loudly if falling back to zlib

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Green_1 scored 0.3200 BPB with oracle alpha alone. Green_2 adds LoRA TTT
to close the remaining 0.025 gap to openai#809 (0.2952).

TTT flow (score-first legal):
1. Sliding window eval scores all val tokens (frozen model)
2. LoRA rank-8 adapters injected on Q, V projections
3. Single pass over val tokens: score then adapt (AdamW, lr=3e-4)
4. Polyak averaging (decay=0.998) for stability
5. N-gram eval with oracle alpha on adapted model

Coarse stride (16x) keeps TTT under 60s. Total eval budget: ~290s.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Rewrote setup_runpod.sh to install FA3 + zstandard directly into the
default system env instead of creating a separate conda environment
that conflicts with torchrun and per-test scripts.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
A-Wing Green_1 seed 1337 = 0.3200 BPB (was 0.4512).
Oracle alpha = sigmoid(8 * log(ngram_p/model_p)) * 0.95.
Copies: red, purple for parallel experimentation.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Adds Linear(512→12) alpha_head trained jointly with model to predict
per-token expert weights (neural + 11 n-gram orders 2-12).
Training oracle prefilled from training data, eval uses backward-looking
val-data cache. Targets sub-0.15 BPB on our 1.1195 neural baseline.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Usage on fresh pod:
  bash experiments/pod_launch.sh experiments/A_wing/purple/run.sh

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Add pod_setup.sh: one file, zero args, sets up pod environment
- Move stale root dirs to experiments/archive/ organized by type
- Update pod_launch.sh default branch to test
- Gitignore checkpoints (too large for GitHub)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
New experiment: test whether weight-shared Frugendorff architecture
compresses model artifact while maintaining BPB when paired with the
full X-WING N-gram eval stack (3D cubric, shared tables, CT, orders 2-9).

- train_gpt.py: adds CrawlerGPT class alongside existing GPT; USE_CRAWLER=1
  switches to 4 flat + 1 shared×2 architecture; build_model() factory handles
  both; all N-gram/GPTQ/CT machinery unchanged and legal
- Green/run.sh: 0.25 scale validator (1 GPU, 150s, dim=384)
- Red/run.sh: full scale production (8×H100, 600s, USE_CRAWLER=1)
- Purple/run.sh: U-Net control (8×H100, 600s, USE_CRAWLER=0) for clean A/B

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Octavian and others added 9 commits March 26, 2026 18:03
…upport)

- Copy RED's train_gpt.py as base (3D Cubric, entropy shift, learned mixer, CT)
- Add CrawlerGPT class: flat U-Net blocks + shared crawler blocks looped K times
- CrawlerGPT includes alpha_head for learned mixer compatibility
- Add build_model() factory and _get_block_named_params() helper
- Wire base_model/teacher_model/eval_model through build_model()
- USE_CRAWLER=1 activates Frugendorff path, =0 is clean A/B control (Purple)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Convert loop_pos from 2D parameter to ParameterList to avoid
sympy NaN comparison in torch.compile value range analysis.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Copy of green_1 SOTA baseline with MODEL_DIM=640 (up from 512).
Calibration run to test if wider model fits in 16MB int6+zstd.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Changes from green_1:
- XSA on all 11 layers (was last 4) — -0.0016 BPB per PR#609 ablation
- BigramHash 2048 (was 1536)
- GPTQ: descending col order, damping 0.01, block_size 128
- lzma compression (was zstd)
- Selective ±1 magnitude pruning for exact size targeting
- Oracle alpha REMOVED — entropy-adaptive only (submission-legal)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Binary search now uses zstd level-1 (~50x faster than lzma) for size
estimation, with a calibrated ratio to predict final lzma size. Only
one lzma compress at the end. Also vectorized candidate collection.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… testing

PR#609 Parallel Muon engine with B-WING n-gram eval. Removed all GPTQ/INT8
quantization (~660 lines), complementary training off, full 600s wallclock.
Focus: max base model quality.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@shalyhinpavel
Copy link
Copy Markdown

Man, the sheer velocity and rigor of your R&D is terrifying (in a good way). Brilliant breakdown on the Frugendorff crawler.

I completely agree with your Finding 1: under the 16MB constraint, reallocating saved parameters from depth into width (Dmodel) is the most dominant lever. It fundamentally changes the capacity bottleneck.

However, regarding Finding 2 (recursion provides zero per-step benefit) and Finding 3 (early boost that decays) — this is exactly the "Catastrophic Interference" wall I’ve been hitting in my own models. When you force the exact same shared weights to process sequential loops without a mechanism to 'anchor' the state or break the symmetry, the latent representation just turns into mush (noise) after the 2nd or 3rd pass. The network literally overwrites its own thoughts. You might need something at the very end of the loop to 'translate' that abstract recursive state back into distinct logits, rather than forcing the shared block to do both grammar and vocabulary mapping.
Your roadmap towards Ludicrous Speed (Triton / chunk_delta_rule / CUDA graphs) is 100% the future of this leaderboard. The Python dispatch overhead is killing us all. If you get that FLA kernel stable in ClownCar_III, it’s going to change the meta.

Keep pushing these limits. I'll be watching your CUDA graph experiments very closely while I finalize my next purely architectural swing!

@newjordan
Copy link
Copy Markdown
Author

newjordan commented Mar 28, 2026

I got it stable dude... I am hitting a .97 and a 10mb file size. But it's fucjing nuts, I'm getting to .37 9' THE MODEL BUILD!!!! But I can't stabilize the signal, spent all night working it down but couldn't get it stabilized to get a submission it. It screams. I had this built before I saw your delta net and it was the key to about 4 things I was working on. This is the very tip.

It the crawler needs reverse gradients and a system to re-combine. Like a polarized gradient funnel, with easing

Rust scaffold built and smoke testing all day today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants