Problem
Evaluation rollouts are computed and scored during training, but only aggregate metrics (avg@k, pass@k) are logged. Individual eval traces are discarded.
Training logs 8 samples per step via `monitor.log_samples()`, but there's no equivalent call in `evaluate_env()`; it only calls `monitor.log(eval_metrics, ...)`.
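A minimal sketch of the asymmetry. The `Monitor` class and loop bodies below are hypothetical stand-ins, not prime-rl's actual implementation; only the call names `monitor.log`, `monitor.log_samples`, `evaluate_env`, and the `min(8, len(train_rollouts))` slice come from the issue itself:

```python
from dataclasses import dataclass, field

@dataclass
class Monitor:
    """Hypothetical stand-in for the orchestrator's monitor."""
    metrics: list = field(default_factory=list)
    samples: list = field(default_factory=list)

    def log(self, metrics, step=0):
        self.metrics.append((step, metrics))

    def log_samples(self, samples, step=0):
        self.samples.append((step, samples))

def train_step(monitor, train_rollouts, step):
    # Training path: aggregate metrics AND a hardcoded 8-sample subset.
    avg = sum(r["reward"] for r in train_rollouts) / len(train_rollouts)
    monitor.log({"reward": avg}, step)
    monitor.log_samples(train_rollouts[: min(8, len(train_rollouts))], step)

def evaluate_env(monitor, eval_rollouts, step):
    # Eval path today: aggregate metrics only; individual traces are dropped.
    avg = sum(r["reward"] for r in eval_rollouts) / len(eval_rollouts)
    monitor.log({"avg@k": avg}, step)
    # No monitor.log_samples(...) call here -- that is the gap.

monitor = Monitor()
rollouts = [{"reward": float(i % 2)} for i in range(16)]
train_step(monitor, rollouts, step=1)
evaluate_env(monitor, rollouts, step=1)
```

After one step of each, `monitor.samples` contains only the training subset; nothing from eval survives.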
Why this matters
Debugging a degrading run (reward collapse, truncation spikes, content bloat) requires inspecting what the model actually generated, not just aggregate numbers. Eval sets are small by design (30-100 examples) and run infrequently, so storage cost is minimal. The data is already being generated — it's just thrown away.
There is currently no way to access individual eval traces — not via CLI, API, or the platform UI.
Proposal
- Eval samples: Call `monitor.log_samples()` for eval rollouts in `evaluate_env()`, the same way training does. Since eval sets are small, all samples could be logged (not a subset). A config option like `log_samples = true` under `[eval]` would work if there are storage concerns.
- Training samples: It would also be nice to make the training rollout sample count configurable rather than hardcoded to 8 (`min(8, len(train_rollouts))` in the orchestrator). For debugging, being able to download a larger subset, or all rollouts, from a training step would be very helpful.
Both would make traces available via `prime rl rollouts` and W&B, enabling proper post-hoc debugging of hosted runs.
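For concreteness, here is one way both proposals could be wired up. This is a sketch under assumptions, not a patch: the config field names (`eval_log_samples`, `train_sample_count`) and helper functions are invented for illustration; only the `min(8, len(train_rollouts))` behavior being replaced comes from the codebase as described above.

```python
from dataclasses import dataclass

@dataclass
class SampleLoggingConfig:
    # Hypothetical knobs; real key names and config locations are up to the maintainers.
    eval_log_samples: bool = True   # proposal 1: log all eval rollouts
    train_sample_count: int = 8     # proposal 2: replace the hardcoded 8; -1 = log everything

def select_train_samples(train_rollouts, config):
    """Configurable version of the orchestrator's min(8, len(train_rollouts)) slice."""
    if config.train_sample_count < 0:
        return list(train_rollouts)
    return train_rollouts[: min(config.train_sample_count, len(train_rollouts))]

def select_eval_samples(eval_rollouts, config):
    """Eval sets are small by design, so log all of them when enabled."""
    return list(eval_rollouts) if config.eval_log_samples else []

cfg = SampleLoggingConfig(train_sample_count=32)
rollouts = list(range(100))
```

With `cfg` as above, a training step would log 32 rollouts instead of 8, and an eval pass would log all 100; setting `eval_log_samples = false` restores today's metrics-only behavior.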