byte260 variant listed in download script but shards are not on HuggingFace #899

@yashverms

Problem

cached_challenge_fineweb.py lists byte260 as a valid --variant option (line 17), but the HuggingFace repo (willdepueoai/parameter-golf) only contains fineweb10B_sp1024 in both the manifest and the dataset folders.

Running the official download script fails:

python data/cached_challenge_fineweb.py --variant byte260 --train-shards 1
# ValueError: dataset fineweb10B_byte260 not found in datasets/manifest.json

The datasets/datasets/ folder on HuggingFace only has fineweb10B_sp1024. There is no fineweb10B_byte260 directory.
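For anyone who wants to confirm this locally before starting a download, here is a minimal sketch of a manifest check. The manifest layout is an assumption (a top-level `datasets` list whose entries carry a `name` field), inferred only from the error message above; `list_dataset_names` and `has_dataset` are hypothetical helpers, not part of cached_challenge_fineweb.py.

```python
import json
from typing import Any, Dict, List


def list_dataset_names(
    manifest: Dict[str, Any],
    list_key: str = "datasets",  # assumed key; adjust to the real manifest layout
    name_key: str = "name",      # assumed key; adjust to the real manifest layout
) -> List[str]:
    """Return the dataset names found in an already-parsed manifest."""
    entries = manifest.get(list_key, [])
    return [e[name_key] for e in entries if isinstance(e, dict) and name_key in e]


def has_dataset(manifest: Dict[str, Any], name: str) -> bool:
    """True if `name` appears among the manifest's dataset entries."""
    return name in list_dataset_names(manifest)


def load_manifest(path: str) -> Dict[str, Any]:
    """Parse a manifest.json that was downloaded to a local path."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With a locally downloaded manifest, `has_dataset(load_manifest("manifest.json"), "fineweb10B_byte260")` would be the check, provided the assumed keys match the real file.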

Impact

Multiple byte-level submissions (PRs #832, #705, #708, #696) rely on byte260 data. Currently, participants must manually convert from sp1024 using a SentencePiece decode pipeline, which is time-consuming and undocumented in the official workflow.
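For reference, the manual workaround looks roughly like the sketch below: decode each document's sp1024 ids back to text, then re-encode the text as UTF-8 bytes. The id layout is an assumption (256 byte values shifted past 4 reserved special ids, which is one way to arrive at 260); the layout the official byte260 shards would use may differ. Shard headers and document boundaries are not handled here, and the function names are hypothetical.

```python
from typing import Callable, Iterable, List

# Assumption: byte260 = 256 raw byte values + 4 reserved special ids (0..3).
# The real vocabulary layout of the challenge's byte260 variant may differ.
NUM_SPECIAL = 4
VOCAB_SIZE = 256 + NUM_SPECIAL


def text_to_byte_ids(text: str, offset: int = NUM_SPECIAL) -> List[int]:
    """Encode text as UTF-8 and shift every byte past the reserved ids."""
    return [b + offset for b in text.encode("utf-8")]


def convert_document(
    sp_ids: Iterable[int],
    decode: Callable[[List[int]], str],
    offset: int = NUM_SPECIAL,
) -> List[int]:
    """Decode one document's SentencePiece ids to text, then re-encode as byte ids.

    `decode` is any callable mapping a list of token ids to a string, for
    example the `decode` method of a loaded SentencePieceProcessor.
    """
    return text_to_byte_ids(decode(list(sp_ids)), offset)
```

With the `sentencepiece` package installed, `decode` could be `sentencepiece.SentencePieceProcessor(model_file=...).decode`, pointed at the sp1024 tokenizer model. Because SentencePiece may normalize text during encoding, the decoded text is not guaranteed to match the original FineWeb bytes exactly, which is another reason official shards would be preferable.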

Request

Could the pre-built fineweb10B_byte260 shards (train + val) be uploaded to the HuggingFace repo and added to manifest.json? This would make byte-level experiments much more accessible to participants, especially since SSMs and byte-level models are explicitly listed in the "Requests for PRs" section of the README.
