Clarification: Transient non-model files in submission artefact consumed during training #847

@Fraser-Greenlee

The FAQ states that the artefact size, capped at 16MB, is computed as "code bytes plus compressed model bytes," and that no external downloads or network calls are allowed during evaluation.
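For concreteness, here is a rough sketch of how I read that size rule. The compressor (zlib), the decimal reading of "16MB", and the helper names `artifact_size` and `fits_cap` are all my own assumptions for illustration. They are not taken from the FAQ or the official evaluation code.

```python
import zlib

# Assumption: "16MB" read as 16,000,000 bytes; the official definition may differ.
CAP_BYTES = 16_000_000


def artifact_size(code_sources, raw_model):
    """Return code bytes plus compressed model bytes.

    code_sources: list of bytes objects, one per code file (hypothetical input shape).
    raw_model: serialized model weights as bytes.
    """
    code_bytes = sum(len(src) for src in code_sources)
    # Assumption: zlib at level 9 stands in for whatever compressor the rules use.
    model_bytes = len(zlib.compress(raw_model, 9))
    return code_bytes + model_bytes


def fits_cap(size, cap=CAP_BYTES):
    """True if the computed artefact size is within the cap."""
    return size <= cap
```

Under this reading, a supplementary data file that is absent from the final artefact would not be counted by `artifact_size` at all, which is the gap I am asking about.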

I'd like to clarify whether the following pattern is permissible:

  1. The submission folder contains train_gpt.py + a supplementary data file (e.g. a compact, pre-optimised training dataset)
  2. At the start of training, the data file is loaded into RAM and deleted from disk
  3. Training proceeds using this data (alongside FineWeb) and produces model weights
  4. The final artifact (code + compressed model) fits within the 16MB cap
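Step 2 could look roughly like the sketch below. The function name `load_then_delete` and the file name in the usage comment are hypothetical, and this is not code from an actual submission.

```python
from pathlib import Path


def load_then_delete(path):
    """Read the whole supplementary file into RAM, then remove it from disk."""
    p = Path(path)
    data = p.read_bytes()  # entire file contents now live in memory
    p.unlink()             # the file no longer exists on disk after this call
    return data


# Hypothetical usage at the top of train_gpt.py:
# extra_data = load_then_delete("extra_train_data.bin")
```

After this call the data only exists in the training process's memory, so it cannot appear in the final artefact unless the code writes it back out.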

The supplementary data file would be generated entirely from the public FineWeb training split using a reproducible, open-source script included in the submission README. It would contain no validation data whatsoever.

This seems distinct from the banned "paid prefix" approach (#168), which involved embedding validation data in the artifact. Here, the transient file contains only training-derived data and does not exist in the final artifact.

However, I can see arguments either way:

  • For: The final artefact respects the 16MB cap, and the training + eval run never exceeds it; the data is reproducibly generated from permitted training data; and the generation script is fully transparent.
  • Against: It could be seen as smuggling in external compute via the data file. It could also be seen as effectively deleting part of the code during training to save on memory.

Would appreciate a ruling on whether this pattern is valid for the record track. Happy to provide more detail if helpful!
