Problem
Evaluation rollouts are computed and scored during training, but only aggregate metrics (avg@k, pass@k) are logged. Individual eval traces are discarded.
Training logs 8 samples per step via `monitor.log_samples()`, but there's no equivalent call in `evaluate_env()`; it only calls `monitor.log(eval_metrics, ...)`.
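A minimal sketch of the asymmetry. The `Monitor` class and loop bodies below are hypothetical stand-ins, not prime-rl's actual implementation; only the call names `monitor.log`, `monitor.log_samples`, `evaluate_env`, and the `min(8, len(train_rollouts))` slice come from the issue itself:

```python
from dataclasses import dataclass, field

@dataclass
class Monitor:
    """Hypothetical stand-in for the orchestrator's monitor."""
    metrics: list = field(default_factory=list)
    samples: list = field(default_factory=list)

    def log(self, metrics, step=0):
        self.metrics.append((step, metrics))

    def log_samples(self, samples, step=0):
        self.samples.append((step, samples))

def train_step(monitor, train_rollouts, step):
    # Training path: aggregate metrics AND a hardcoded 8-sample subset.
    avg = sum(r["reward"] for r in train_rollouts) / len(train_rollouts)
    monitor.log({"reward": avg}, step)
    monitor.log_samples(train_rollouts[: min(8, len(train_rollouts))], step)

def evaluate_env(monitor, eval_rollouts, step):
    # Eval path today: aggregate metrics only; individual traces are dropped.
    avg = sum(r["reward"] for r in eval_rollouts) / len(eval_rollouts)
    monitor.log({"avg@k": avg}, step)
    # No monitor.log_samples(...) call here -- that is the gap.

monitor = Monitor()
rollouts = [{"reward": float(i % 2)} for i in range(16)]
train_step(monitor, rollouts, step=1)
evaluate_env(monitor, rollouts, step=1)
```

After one step of each, `monitor.samples` contains only the training subset; nothing from eval survives.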
Why this matters
Debugging a degrading run (reward collapse, truncation spikes, content bloat) requires inspecting what the model actually generated, not just aggregate numbers. Eval sets are small by design (30-100 examples) and run infrequently, so storage cost is minimal. The data is already being generated — it's just thrown away.
There is currently no way to access individual eval traces — not via CLI, API, or the platform UI.
Proposal
- Eval samples: Call `monitor.log_samples()` for eval rollouts in `evaluate_env()`, the same way training does. Since eval sets are small, all samples could be logged (not a subset). A config option like `log_samples = true` under `[eval]` would work if there are storage concerns.
- Training samples: It would also be nice to make the training rollout sample count configurable rather than hardcoded to 8 (`min(8, len(train_rollouts))` in the orchestrator). For debugging, being able to download a larger subset, or all rollouts, from a training step would be very helpful.
Both would make traces available via `prime rl rollouts` and W&B, enabling proper post-hoc debugging of hosted runs.
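For concreteness, here is one way both proposals could be wired up. This is a sketch under assumptions, not a patch: the config field names (`eval_log_samples`, `train_sample_count`) and helper functions are invented for illustration; only the `min(8, len(train_rollouts))` behavior being replaced comes from the codebase as described above.

```python
from dataclasses import dataclass

@dataclass
class SampleLoggingConfig:
    # Hypothetical knobs; real key names and config locations are up to the maintainers.
    eval_log_samples: bool = True   # proposal 1: log all eval rollouts
    train_sample_count: int = 8     # proposal 2: replace the hardcoded 8; -1 = log everything

def select_train_samples(train_rollouts, config):
    """Configurable version of the orchestrator's min(8, len(train_rollouts)) slice."""
    if config.train_sample_count < 0:
        return list(train_rollouts)
    return train_rollouts[: min(config.train_sample_count, len(train_rollouts))]

def select_eval_samples(eval_rollouts, config):
    """Eval sets are small by design, so log all of them when enabled."""
    return list(eval_rollouts) if config.eval_log_samples else []

cfg = SampleLoggingConfig(train_sample_count=32)
rollouts = list(range(100))
```

With `cfg` as above, a training step would log 32 rollouts instead of 8, and an eval pass would log all 100; setting `eval_log_samples = false` restores today's metrics-only behavior.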