Add eval sample logging to monitor backends #1995

@bhavishya-pohani

Description

Problem

Evaluation rollouts are computed and scored during training, but only aggregate metrics (avg@k, pass@k) are logged. Individual eval traces are discarded.

Training logs 8 samples per step via monitor.log_samples(), but there's no equivalent call in evaluate_env() — it only calls monitor.log(eval_metrics, ...).
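For illustration, the asymmetry looks roughly like this. The method names (`log`, `log_samples`), the hardcoded 8, and the `evaluate_env()` behavior come from the description above; the `Monitor` stub and the surrounding structure are simplified assumptions, not the actual orchestrator code:

```python
# Simplified sketch of current behavior (assumed structure, not actual prime-rl code).

class Monitor:
    """Minimal stand-in for a monitor backend (e.g. W&B)."""

    def __init__(self):
        self.samples = []
        self.metrics = []

    def log(self, metrics, step=None):
        self.metrics.append(metrics)

    def log_samples(self, samples, step=None):
        self.samples.extend(samples)


def train_step(monitor, train_rollouts, step):
    # Training logs a hardcoded subset of 8 rollouts per step.
    monitor.log_samples(train_rollouts[: min(8, len(train_rollouts))], step=step)
    monitor.log({"reward/mean": 0.0}, step=step)


def evaluate_env(monitor, eval_rollouts, step):
    # Eval logs only aggregate metrics; individual traces are discarded.
    eval_metrics = {"eval/avg@k": 0.0, "eval/pass@k": 0.0}
    monitor.log(eval_metrics, step=step)


monitor = Monitor()
train_step(monitor, [f"train-{i}" for i in range(32)], step=1)
evaluate_env(monitor, [f"eval-{i}" for i in range(30)], step=1)
# monitor.samples now holds 8 training rollouts and zero eval rollouts.
```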

Why this matters

Debugging a degrading run (reward collapse, truncation spikes, content bloat) requires inspecting what the model actually generated, not just aggregate numbers. Eval sets are small by design (30-100 examples) and run infrequently, so storage cost is minimal. The data is already being generated — it's just thrown away.

There is currently no way to access individual eval traces — not via CLI, API, or the platform UI.

Proposal

  1. Eval samples: Call monitor.log_samples() for eval rollouts in evaluate_env(), the same way training does. Since eval sets are small, all samples could be logged (not a subset). A config option like log_samples = true under [eval] would work if there are storage concerns.

  2. Training samples: It would also be nice to make the training rollout sample count configurable rather than hardcoded to 8 (min(8, len(train_rollouts)) in the orchestrator). For debugging, being able to download a larger subset — or all rollouts — from a training step would be very helpful.
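One possible shape for both changes, sketched below. The config keys, dataclasses, and `Monitor` stub are illustrative assumptions, not the project's actual schema; only `monitor.log_samples()`, `evaluate_env()`, and the `min(8, len(train_rollouts))` default come from the issue:

```python
# Hypothetical sketch of the proposal; config keys and classes are assumptions.
from dataclasses import dataclass


@dataclass
class EvalConfig:
    log_samples: bool = True      # would map to [eval] log_samples = true


@dataclass
class TrainConfig:
    num_logged_samples: int = 8   # replaces the hardcoded min(8, len(...))


class Monitor:
    """Minimal stand-in for a monitor backend."""

    def __init__(self):
        self.samples = []

    def log(self, metrics, step=None):
        pass  # aggregate-metric logging elided

    def log_samples(self, samples, step=None):
        self.samples.extend(samples)


def evaluate_env(monitor, eval_rollouts, eval_metrics, step, cfg: EvalConfig):
    monitor.log(eval_metrics, step=step)
    if cfg.log_samples:
        # Eval sets are small (30-100 examples), so log every rollout.
        monitor.log_samples(eval_rollouts, step=step)


def log_train_samples(monitor, train_rollouts, step, cfg: TrainConfig):
    # Configurable subset size instead of the hardcoded 8.
    n = min(cfg.num_logged_samples, len(train_rollouts))
    monitor.log_samples(train_rollouts[:n], step=step)
```

With `log_samples = false` the eval path degrades to today's metrics-only behavior, so the default preserves backward compatibility for anyone with storage concerns.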

Both would make traces available via prime rl rollouts and W&B, enabling proper post-hoc debugging of hosted runs.
