Description
Problem
cached_challenge_fineweb.py lists byte260 as a valid --variant option (line 17), but the HuggingFace repo (willdepueoai/parameter-golf) only contains fineweb10B_sp1024 in both the manifest and the dataset folders.
Running the official download script fails:
python data/cached_challenge_fineweb.py --variant byte260 --train-shards 1
# ValueError: dataset fineweb10B_byte260 not found in datasets/manifest.json
The datasets/datasets/ folder on HuggingFace only has fineweb10B_sp1024; there is no fineweb10B_byte260 directory.
Impact
Multiple byte-level submissions (PRs #832, #705, #708, #696) depend on byte260 data. Until it is published, participants must manually convert the sp1024 shards back to bytes via a SentencePiece decode pipeline, which is time-consuming and undocumented in the official workflow.
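For context, "byte260" presumably means a vocabulary of the 256 raw byte values plus a few special tokens; the exact layout (which specials, which offsets) is an assumption here, not confirmed by the manifest. A minimal sketch of such a byte-level tokenizer under that assumption:

```python
# Hypothetical byte260 layout: 4 special tokens followed by the 256
# raw byte values. This is an illustrative assumption, not the
# documented format of the fineweb10B_byte260 shards.
SPECIAL = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
OFFSET = len(SPECIAL)  # byte value b maps to token id b + OFFSET

VOCAB_SIZE = 256 + OFFSET  # = 260

def encode(text: str) -> list[int]:
    """UTF-8 bytes -> token ids, wrapped in <bos>/<eos>."""
    ids = [b + OFFSET for b in text.encode("utf-8")]
    return [SPECIAL["<bos>"]] + ids + [SPECIAL["<eos>"]]

def decode(ids: list[int]) -> str:
    """Token ids -> text, skipping any special tokens."""
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8")
```

A round trip (`decode(encode(s)) == s`) holds for any string, which is the main appeal of byte-level vocabularies: no out-of-vocabulary tokens and no tokenizer training step.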
Request
Could the pre-built fineweb10B_byte260 shards (train + val) be uploaded to the HuggingFace repo and added to manifest.json? This would make byte-level experiments much more accessible to participants — especially given that SSMs and byte-level models are explicitly listed in the "Requests for PRs" section of the README.