Description
The FAQ states that the 16MB artifact is computed as "code bytes plus compressed model bytes," and that no external downloads or network calls are allowed during evaluation.
I'd like to clarify whether the following pattern is permissible:
- The submission folder contains `train_gpt.py` plus a supplementary data file (e.g. a compact, pre-optimised training dataset)
- At the start of training, the data file is loaded into RAM and deleted from disk
- Training proceeds using this data (alongside FineWeb) and produces model weights
- The final artifact (code + compressed model) fits within the 16MB cap
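Concretely, the transient-file step would look something like the sketch below (the filename is hypothetical; this is only meant to pin down the pattern I'm asking about):

```python
import os

# "aux_train_data.bin" is a placeholder name, not part of any actual submission.

def load_and_remove(path: str) -> bytes:
    """Read the supplementary file fully into RAM, then delete it from disk,
    so it exists neither on disk during training nor in the final artifact."""
    with open(path, "rb") as f:
        blob = f.read()
    os.remove(path)  # from here on, the data lives only in RAM
    return blob
```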
The supplementary data file would be generated entirely from the public FineWeb training split using a reproducible, open-source script included in the submission README. It would contain no validation data whatsoever.
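To make "reproducibly generated" concrete, the generation script would be deterministic in roughly this shape (the function name, hash-bucket sampling rule, and compression choice are all illustrative, not the real pipeline):

```python
import hashlib
import zlib

def build_supplementary_file(docs, out_path: str) -> int:
    """Deterministically select a subset of `docs` and write it compressed.

    `docs` stands in for an iterator over the public FineWeb training split.
    Returns the size in bytes of the file written.
    """
    # Stable hash bucket => identical selection on every run, on every machine.
    selected = [d for d in docs
                if hashlib.sha256(d.encode("utf-8")).digest()[0] < 32]
    blob = zlib.compress("\n".join(selected).encode("utf-8"), 9)
    with open(out_path, "wb") as f:
        f.write(blob)
    return len(blob)
```

Running the script twice produces byte-identical output, which is what would let the organisers verify the file contains only training-derived data.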
This seems distinct from the banned "paid prefix" approach (#168), which involved embedding validation data in the artifact. Here, the transient file contains only training-derived data and does not exist in the final artifact.
However, I can see arguments either way:
- For: The final artefact respects the 16MB cap AND the training + eval run never goes over it, the data is reproducibly generated from permitted training data, and the generation script is fully transparent
- Against: It could be seen as smuggling external compute in via the data file, and the delete-after-load step could be read as discarding part of the submission mid-training to stay under the size cap
Would appreciate a ruling on whether this pattern is valid for the record track. Happy to provide more detail if helpful!