Description
The FAQ states that the 16MB artifact is computed as "code bytes plus compressed model bytes," and that no external downloads or network calls are allowed during evaluation.
I'd like to clarify whether the following pattern is permissible:
- The submission folder contains `train_gpt.py` plus a supplementary data file (e.g. a compact, pre-optimised training dataset)
- At the start of training, the data file is loaded into RAM and deleted from disk
- Training proceeds using this data (alongside FineWeb) and produces model weights
- The final artifact (code + compressed model) fits within the 16MB cap
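Concretely, the transient-file step would look something like the sketch below (the filename is hypothetical; this is only meant to pin down the pattern I'm asking about):

```python
import os

# "aux_train_data.bin" is a placeholder name, not part of any actual submission.

def load_and_remove(path: str) -> bytes:
    """Read the supplementary file fully into RAM, then delete it from disk,
    so it exists neither on disk during training nor in the final artifact."""
    with open(path, "rb") as f:
        blob = f.read()
    os.remove(path)  # from here on, the data lives only in RAM
    return blob
```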
The supplementary data file would be generated entirely from the public FineWeb training split using a reproducible, open-source script included in the submission README. It would contain no validation data whatsoever.
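To make "reproducibly generated" concrete, the generation script would be deterministic in roughly this shape (the function name, hash-bucket sampling rule, and compression choice are all illustrative, not the real pipeline):

```python
import hashlib
import zlib

def build_supplementary_file(docs, out_path: str) -> int:
    """Deterministically select a subset of `docs` and write it compressed.

    `docs` stands in for an iterator over the public FineWeb training split.
    Returns the size in bytes of the file written.
    """
    # Stable hash bucket => identical selection on every run, on every machine.
    selected = [d for d in docs
                if hashlib.sha256(d.encode("utf-8")).digest()[0] < 32]
    blob = zlib.compress("\n".join(selected).encode("utf-8"), 9)
    with open(out_path, "wb") as f:
        f.write(blob)
    return len(blob)
```

Running the script twice produces byte-identical output, which is what would let the organisers verify the file contains only training-derived data.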
This seems distinct from the banned "paid prefix" approach (#168), which involved embedding validation data in the artifact. Here, the transient file contains only training-derived data and does not exist in the final artifact.
However, I can see arguments either way:
- For: The final artefact respects the 16MB cap AND the training + eval run never goes over it, the data is reproducibly generated from permitted training data, and the generation script is fully transparent
- Against: It could be seen as smuggling external compute in via the data file, and the delete-after-load step could be read as discarding part of the submission mid-training to stay under the size cap
Would appreciate a ruling on whether this pattern is valid for the record track. Happy to provide more detail if helpful!