
Non-record: BESE Novel Tokenizer — 38-Token Structured Alphabet + BPE, 288 Vocab, 12.9MB #973

Open

mrbese wants to merge 2 commits into openai:main from mrbese:bese-nonrecord-submission

Conversation


@mrbese mrbese commented Mar 27, 2026

Summary

BESE (Base-Efficient Subword Encoding) is a novel two-layer tokenizer that replaces the standard 1,024-token BPE vocabulary with a 38-token linguistically structured alphabet, then applies 250 BPE merges on top (total vocab: 288). This is the first submission to explore tokenizer-level optimization for the parameter golf challenge.

  • 73% vocabulary reduction saves ~295KB of embedding parameters — enough for 2-3 extra transformer layers
  • Model size: 12.9 MB (vs 13.6 MB baseline) — the savings are real
  • val_bpb: 3.9143 — high due to 80x data starvation (12.6M vs 1B tokens), not tokenizer failure
  • Train loss drops normally: 5.69 → 1.22 (healthy learning curve)
  • Byte accounting verified: 100/100 documents pass (token bytes == UTF-8 bytes)

Why This Approach is Different

Every submission on the leaderboard optimizes architecture, quantization, or training — all using the default 1,024-token vocabulary. Nobody has rethought the tokenizer. BESE explores a fundamentally different axis: trading vocabulary size for model depth.

Layer 1: Structured Alphabet (38 tokens)

  • 8 most frequent English letters (e,t,a,o,i,n,s,r) → single-token encoding
  • 18 remaining letters → 5 context-disambiguated groups (selected via bigram frequency analysis)
  • Case-insensitive (model learns capitalization from context; BPB byte count preserved exactly)
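The two bullets above can be sketched as follows. This is a minimal illustration of the group/position idea only: the group assignments and token IDs below are assumptions for illustration, not the actual BESE tables (see `tokenizer/` in the submission for those).

```python
# Hypothetical Layer-1 sketch: frequent letters get single tokens; the
# remaining letters are encoded as a (group token, position token) pair.
# All IDs and group contents here are illustrative assumptions.

SINGLE = {c: i for i, c in enumerate("etaoinsr")}   # tokens 0-7, 1 byte each
GROUPS = ["bcdf", "ghjk", "lmpq", "uvwx", "yz"]     # 5 assumed letter groups
GROUP_BASE = 8      # group tokens 8-12 (0 bytes each)
POS_BASE = 13       # position tokens 13-16 (1 byte each)

def encode_char(c: str) -> list[int]:
    c = c.lower()   # case-insensitive layer; capitalization left to the model
    if c in SINGLE:
        return [SINGLE[c]]              # one token accounting for one byte
    for g, letters in enumerate(GROUPS):
        if c in letters:
            # the group token carries 0 bytes; the position token carries the byte,
            # so each letter still accounts for exactly 1 UTF-8 byte
            return [GROUP_BASE + g, POS_BASE + letters.index(c)]
    raise ValueError(f"char outside the sketch alphabet: {c!r}")
```

Note how the 0-byte group token plus 1-byte position token keeps per-letter byte accounting exact even though a letter may cost two tokens.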

Layer 2: BPE on Structured Tokens (250 merges)

  • BPE trained on the BESE token stream (not raw bytes) captures more meaningful merges
  • Compresses sequences to lengths comparable to the baseline, with 73% fewer vocab entries
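A single merge step of this layer can be sketched as below. This is a simplified O(N)-per-merge version for clarity, not the submission's fast trainer; the key detail it shows is the transitive byte bookkeeping that keeps BPB accounting exact.

```python
from collections import Counter

def bpe_merge_once(stream: list[int], byte_len: dict[int, int], next_id: int):
    """One greedy BPE merge over a BESE token stream (not raw bytes).
    byte_len maps token id -> UTF-8 bytes it accounts for; the merged
    token's byte count is the sum of its parts (transitive)."""
    pairs = Counter(zip(stream, stream[1:]))
    if not pairs:
        return stream, next_id
    (a, b), _ = pairs.most_common(1)[0]          # most frequent adjacent pair
    byte_len[next_id] = byte_len[a] + byte_len[b]
    out, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and (stream[i], stream[i + 1]) == (a, b):
            out.append(next_id)                  # replace the pair
            i += 2
        else:
            out.append(stream[i])
            i += 1
    return out, next_id + 1
```

Running this 250 times over the Layer-1 stream yields the 38 + 250 = 288-entry vocabulary.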

BPB Byte Accounting (Critical for Tokenizer Submissions)

Every BESE token maps to a known number of UTF-8 bytes. Group tokens = 0 bytes, position tokens = 1 byte, single-letter tokens = 1 byte. BPE merged tokens = sum of constituent bytes (transitive). Verified correct across ASCII, multi-byte Unicode, and emoji. See README.md in the submission folder for full details.
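The 100/100-document check described above amounts to the following invariant. This sketch is tokenizer-agnostic: `encode` and `byte_len` are placeholders for any tokenizer that declares per-token byte counts, not the actual BESE implementation.

```python
def verify_byte_accounting(docs, encode, byte_len):
    """For each document, the summed declared byte lengths of its tokens
    must equal the document's actual UTF-8 byte count. Returns a list of
    failing documents (empty list == all pass)."""
    failures = []
    for doc in docs:
        token_bytes = sum(byte_len[t] for t in encode(doc))
        utf8_bytes = len(doc.encode("utf-8"))
        if token_bytes != utf8_bytes:
            failures.append((doc, token_bytes, utf8_bytes))
    return failures
```

Multi-byte Unicode and emoji are covered automatically because the check compares against `len(doc.encode("utf-8"))`, not character counts.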

Why the BPB is High

|                 | Baseline       | BESE               |
|-----------------|----------------|--------------------|
| Training tokens | ~1,000,000,000 | ~12,604,981        |
| Data ratio      | 1x             | 0.013x (80x less)  |
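For context, bits-per-byte is derived from the mean per-token loss (in nats) and the token-to-byte ratio, which is why a healthy train loss can coexist with a high val_bpb when the tokenizer or data regime changes. This is the standard bpb conversion; the numbers in any call are illustrative, not measurements from this run.

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token negative log-likelihood (nats) into bits
    per UTF-8 byte: total nats / ln(2) gives total bits, divided by bytes."""
    return (mean_nll_nats * n_tokens) / (math.log(2) * n_bytes)
```

A val_bpb comparison between tokenizers is fair precisely because both sides are normalized by the same UTF-8 byte count, not by their (different) token counts.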

Our initial pipeline could only encode 10K documents due to a Python BPE bottleneck. The model exhausted its training data immediately and cycled through the same small corpus. We've since built a fast O(N log N) BPE trainer and fixed critical training bugs (node-0 merge count corruption, stale position drift). A full-data run is pending.

What's Next

  1. Full-data retrain with all 10 shards (~1B tokens)
  2. Architecture tuning: extra layers (13L vs 9L baseline) with parameter savings
  3. Merge count optimization (200-300 range)
  4. 8xH100 scaling for leaderboard-eligible timing

Files

Only adds records/track_non_record_16mb/2026-03-27_BESE_NovelTokenizer_38Base_250BPE_288Vocab/ containing:

  • README.md — Detailed write-up with approach, results, byte accounting proof
  • submission.json — Metadata
  • train_gpt.py — Modified training script with BESE LUT support
  • tokenizer/ — BESE constants, fast BPE trainer, base tokenizer
  • scripts/ — RunPod pipeline, shard export
  • requirements.txt — Dependencies
  • train_log_1xH100_data_starved.log — Training log from initial run

Full source: https://github.com/mrbese/parameter-golf/tree/experiment-results

@mrbese force-pushed the bese-nonrecord-submission branch from 37a8ee1 to 3be7151 on March 27, 2026 18:49
- Move heapq import to module top-level (was re-imported per call)
- Fix merge ID display (new_id not new_id-1)
- Remove stale None check in result collection
- Add error handling for missing tokenizer in runpod_v2
- Add all upstream-required fields to submission.json (hardware,
  training_time, compressed_model_bytes, code_bytes, etc.)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
