
Non-record: BESE Novel Tokenizer — 38-Token Structured Alphabet + BPE, 288 Vocab, 12.9MB #973

Open

mrbese wants to merge 2 commits into openai:main from mrbese:bese-nonrecord-submission

Conversation


@mrbese mrbese commented Mar 27, 2026

Summary

BESE (Base-Efficient Subword Encoding) is a novel two-layer tokenizer that replaces the standard 1,024-token BPE vocabulary with a 38-token linguistically structured alphabet, then applies 250 BPE merges on top (total vocab: 288). This is the first submission to explore tokenizer-level optimization for the parameter golf challenge.

  • 73% vocabulary reduction saves ~295KB of embedding parameters — enough for 2-3 extra transformer layers
  • Model size: 12.9 MB (vs 13.6 MB baseline) — the savings are real
  • val_bpb: 3.9143 — high due to 80x data starvation (12.6M vs 1B tokens), not tokenizer failure
  • Train loss drops normally: 5.69 → 1.22 (healthy learning curve)
  • Byte accounting verified: 100/100 documents pass (token bytes == UTF-8 bytes)

Why This Approach is Different

Every submission on the leaderboard optimizes architecture, quantization, or training — all using the default 1,024-token vocabulary. Nobody has rethought the tokenizer. BESE explores a fundamentally different axis: trading vocabulary size for model depth.

Layer 1: Structured Alphabet (38 tokens)

  • 8 most frequent English letters (e,t,a,o,i,n,s,r) → single-token encoding
  • 18 remaining letters → 5 context-disambiguated groups (selected via bigram frequency analysis)
  • Case-insensitive (model learns capitalization from context; BPB byte count preserved exactly)
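The two bullets above can be sketched as follows. This is a minimal illustration of the group/position idea only: the group assignments and token IDs below are assumptions for illustration, not the actual BESE tables (see `tokenizer/` in the submission for those).

```python
# Hypothetical Layer-1 sketch: frequent letters get single tokens; the
# remaining letters are encoded as a (group token, position token) pair.
# All IDs and group contents here are illustrative assumptions.

SINGLE = {c: i for i, c in enumerate("etaoinsr")}   # tokens 0-7, 1 byte each
GROUPS = ["bcdf", "ghjk", "lmpq", "uvwx", "yz"]     # 5 assumed letter groups
GROUP_BASE = 8      # group tokens 8-12 (0 bytes each)
POS_BASE = 13       # position tokens 13-16 (1 byte each)

def encode_char(c: str) -> list[int]:
    c = c.lower()   # case-insensitive layer; capitalization left to the model
    if c in SINGLE:
        return [SINGLE[c]]              # one token accounting for one byte
    for g, letters in enumerate(GROUPS):
        if c in letters:
            # the group token carries 0 bytes; the position token carries the byte,
            # so each letter still accounts for exactly 1 UTF-8 byte
            return [GROUP_BASE + g, POS_BASE + letters.index(c)]
    raise ValueError(f"char outside the sketch alphabet: {c!r}")
```

Note how the 0-byte group token plus 1-byte position token keeps per-letter byte accounting exact even though a letter may cost two tokens.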

Layer 2: BPE on Structured Tokens (250 merges)

  • BPE trained on the BESE token stream (not raw bytes) captures more meaningful merges
  • Compresses sequences to lengths comparable to the baseline, with 73% fewer vocab entries
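A single merge step of this layer can be sketched as below. This is a simplified O(N)-per-merge version for clarity, not the submission's fast trainer; the key detail it shows is the transitive byte bookkeeping that keeps BPB accounting exact.

```python
from collections import Counter

def bpe_merge_once(stream: list[int], byte_len: dict[int, int], next_id: int):
    """One greedy BPE merge over a BESE token stream (not raw bytes).
    byte_len maps token id -> UTF-8 bytes it accounts for; the merged
    token's byte count is the sum of its parts (transitive)."""
    pairs = Counter(zip(stream, stream[1:]))
    if not pairs:
        return stream, next_id
    (a, b), _ = pairs.most_common(1)[0]          # most frequent adjacent pair
    byte_len[next_id] = byte_len[a] + byte_len[b]
    out, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and (stream[i], stream[i + 1]) == (a, b):
            out.append(next_id)                  # replace the pair
            i += 2
        else:
            out.append(stream[i])
            i += 1
    return out, next_id + 1
```

Running this 250 times over the Layer-1 stream yields the 38 + 250 = 288-entry vocabulary.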

BPB Byte Accounting (Critical for Tokenizer Submissions)

Every BESE token maps to a known number of UTF-8 bytes. Group tokens = 0 bytes, position tokens = 1 byte, single-letter tokens = 1 byte. BPE merged tokens = sum of constituent bytes (transitive). Verified correct across ASCII, multi-byte Unicode, and emoji. See README.md in the submission folder for full details.
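The 100/100-document check described above amounts to the following invariant. This sketch is tokenizer-agnostic: `encode` and `byte_len` are placeholders for any tokenizer that declares per-token byte counts, not the actual BESE implementation.

```python
def verify_byte_accounting(docs, encode, byte_len):
    """For each document, the summed declared byte lengths of its tokens
    must equal the document's actual UTF-8 byte count. Returns a list of
    failing documents (empty list == all pass)."""
    failures = []
    for doc in docs:
        token_bytes = sum(byte_len[t] for t in encode(doc))
        utf8_bytes = len(doc.encode("utf-8"))
        if token_bytes != utf8_bytes:
            failures.append((doc, token_bytes, utf8_bytes))
    return failures
```

Multi-byte Unicode and emoji are covered automatically because the check compares against `len(doc.encode("utf-8"))`, not character counts.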

Why the BPB is High

|                 | Baseline       | BESE               |
|-----------------|----------------|--------------------|
| Training tokens | ~1,000,000,000 | ~12,604,981        |
| Data ratio      | 1x             | 0.013x (80x less)  |
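For context, bits-per-byte is derived from the mean per-token loss (in nats) and the token-to-byte ratio, which is why a healthy train loss can coexist with a high val_bpb when the tokenizer or data regime changes. This is the standard bpb conversion; the numbers in any call are illustrative, not measurements from this run.

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token negative log-likelihood (nats) into bits
    per UTF-8 byte: total nats / ln(2) gives total bits, divided by bytes."""
    return (mean_nll_nats * n_tokens) / (math.log(2) * n_bytes)
```

A val_bpb comparison between tokenizers is fair precisely because both sides are normalized by the same UTF-8 byte count, not by their (different) token counts.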

Our initial pipeline could only encode 10K documents due to a Python BPE bottleneck. The model exhausted its training data immediately and cycled through the same small corpus. We've since built a fast O(N log N) BPE trainer and fixed critical training bugs (node-0 merge count corruption, stale position drift). A full-data run is pending.

What's Next

  1. Full-data retrain with all 10 shards (~1B tokens)
  2. Architecture tuning: extra layers (13L vs 9L baseline) with parameter savings
  3. Merge count optimization (200-300 range)
  4. 8xH100 scaling for leaderboard-eligible timing

Files

Only adds records/track_non_record_16mb/2026-03-27_BESE_NovelTokenizer_38Base_250BPE_288Vocab/ containing:

  • README.md — Detailed write-up with approach, results, byte accounting proof
  • submission.json — Metadata
  • train_gpt.py — Modified training script with BESE LUT support
  • tokenizer/ — BESE constants, fast BPE trainer, base tokenizer
  • scripts/ — RunPod pipeline, shard export
  • requirements.txt — Dependencies
  • train_log_1xH100_data_starved.log — Training log from initial run

Full source: https://github.com/mrbese/parameter-golf/tree/experiment-results

@mrbese force-pushed the bese-nonrecord-submission branch from 37a8ee1 to 3be7151 on March 27, 2026 18:49
- Move heapq import to module top-level (was re-imported per call)
- Fix merge ID display (new_id not new_id-1)
- Remove stale None check in result collection
- Add error handling for missing tokenizer in runpod_v2
- Add all upstream-required fields to submission.json (hardware,
  training_time, compressed_model_bytes, code_bytes, etc.)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
