Review: Rerun of #972 with actual full-vocab normalization#978
Open
AnirudhRahul wants to merge 2 commits into openai:main from
Conversation
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass, and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing that the earlier 0.3922 result came from the flawed normalization path. Made-with: Cursor
Thanks for the review. I'd already identified the denominator issue and reran with `pair_c.sum(dim=1) + beta`; the corrected run degrades to ~1.23-1.51 BPB, confirming the entire n-gram gain is a collision artifact. Closing this PR.
Summary
- Fix `eval_val_sliding()` so the n-gram score is normalized by the summed hashed-vocab mass `pair_counts.sum(...) + beta` instead of `ctx_count + beta` from #972
- Update the README and submission metadata to reflect the honest rerun rather than the earlier `0.3922` claim
- The corrected run lands at `1.51343368` BPB and loses to the neural sliding baseline

Issue in #972
#972 claimed to compute a full-vocab normalized n-gram distribution, but the evaluation code never actually applied that normalization. What the code effectively did was score the gold-token count against `ctx_count + beta`.
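A minimal NumPy sketch of the flawed path, using hypothetical shapes and the tensor names from the diff (`pair_c`, `ctx_count`, `beta`); the real code operates on PyTorch tensors, but the arithmetic is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1024                                                   # hashed candidate vocab size
pair_c = rng.integers(0, 50, size=(4, V)).astype(float)    # (batch, V) n-gram counts
ctx_count = rng.integers(100, 200, size=4).astype(float)   # (batch,) context-only counts
gold = rng.integers(0, V, size=4)                          # gold-token indices
beta = 2.0

# What the evaluation effectively scored: the gold count against
# ctx_count + beta, which is not the mass of the 1024-way distribution.
flawed_score = (pair_c[np.arange(4), gold] + beta) / (ctx_count + beta)

# The implied "distribution" does not sum to 1 per row, so the
# resulting log-scores (and hence BPB) are not honest probabilities:
implied_mass = ((pair_c + beta) / (ctx_count + beta)[:, None]).sum(axis=1)
print(implied_mass)   # nowhere near 1.0
```

The shapes and count magnitudes here are illustrative; only the choice of denominator is the point.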
The problem is that `ctx_count + beta` is not the normalization constant for the 1024-way hashed candidate distribution. Once `pair_c` has been materialized, the correct denominator for a full-vocab posterior is the total mass over the candidate tokens, `pair_c.sum(dim=1) + beta`. So the previous PR was still scoring a gold-token scalar against a non-normalized denominator, even though it had already gathered counts for all 1024 tokens.
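A self-contained NumPy sketch of the corrected normalization, mirroring the PR's `pair_c.sum(dim=1) + beta` (shapes and counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1024                                                  # hashed candidate vocab size
pair_c = rng.integers(0, 50, size=(4, V)).astype(float)   # (batch, V) hashed-vocab counts
beta = 2.0

# Correct normalizer: the total mass over the 1024 candidate tokens,
# computed per row, matching the fix pair_c.sum(dim=1) + beta.
denom = pair_c.sum(axis=1, keepdims=True) + beta
probs = pair_c / denom

# Each row is now (almost) a proper distribution; beta reserves only a
# sliver of mass, so the rows sum to just under 1.0.
row_mass = probs.sum(axis=1)
print(row_mass)
```

With this denominator the gold-token score is an actual posterior probability, which is why the rerun's BPB rises to an honest 1.51.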
Test plan
`torchrun --standalone --nproc_per_node=8 train_gpt.py` on 8xH100 with:

- `DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024`
- `TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model`
- `CTW_BETA=2.0`
- `CTW_BLEND=0.5`

Results:

- `final_int8_zlib_roundtrip_exact` val_loss: `1.97320202` val_bpb: `1.16864138`
- `final_sliding_window_exact` sliding_bpb: `1.14740867` val_bpb: `1.51343368`