test(evals): anthropic-key support, improved CLI, and a runs-based dashboard #17025

Merged
AlessioGr merged 6 commits into main from chore/evals-claude on Jun 17, 2026
@AlessioGr AlessioGr commented Jun 16, 2026

Four changes, all under test/evals/:

  1. Run the direct LLM evals with an Anthropic API key (previously, OpenAI was required).
  2. Replace the ~28 test:eval:* scripts and the EVAL_VARIANT flag with one pnpm test:eval command, similar to what we did with pnpm docker:start.
  3. Let the Claude Code agent runner work on machines whose org requires a login (not an API key).
  4. Reorganize the dashboard around individual runs instead of one confusing pile.

1. Anthropic API key support

The codegen evals used to run on OpenAI only.

  • Added @ai-sdk/anthropic + Anthropic model presets, and bumped ai / @ai-sdk/openai.
  • The eval project never loaded .env (it had no setup file); it now does, so the API key environment variable is picked up.
  • Anthropic's structured output rejects min/max on numbers, so the score schemas drop them and clamp the values in code instead.
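
A minimal sketch of the "clamp in code" idea from the last bullet. The helper names, the score fields, and the 0..1 range are assumptions for illustration, not the actual code in test/evals/:

```typescript
// Hypothetical helper: the name and the default 0..1 range are assumptions.
function clampScore(value: number, min = 0, max = 1): number {
  // Treat NaN as the lowest score instead of letting it propagate.
  if (Number.isNaN(value)) return min
  return Math.min(max, Math.max(min, value))
}

// Hypothetical score shape. The schema would declare plain numbers with no
// min/max constraints; the bounds are enforced here, after parsing.
type RawScores = { correctness: number; completeness: number }

function normalizeScores(raw: RawScores): RawScores {
  return {
    correctness: clampScore(raw.correctness),
    completeness: clampScore(raw.completeness),
  }
}
```

With this shape, an out-of-range model output such as 1.4 is pulled back to the upper bound rather than being rejected by the schema.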

2. One pnpm test:eval command

  • EVAL_VARIANT (which jammed "which harness", "model" and "skill on/off" into one value) is replaced by three independent options: EVAL_RUNNER (llm | claude-code), EVAL_SKILL (on | off), EVAL_MODEL.
  • The per-suite/per-variant scripts collapse into one launcher (test/evals/cli.ts, modeled on docker:start): run pnpm test:eval for an interactive picker, or pass --runner / --skill / --model / --suite to skip the prompts.
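
To make the flag handling concrete, here is a rough sketch of how a launcher could parse these options and decide what still needs prompting. It is not the contents of test/evals/cli.ts: the function names are hypothetical, and it assumes that flag values match the EVAL_* values and that the launcher forwards them as environment variables.

```typescript
type Runner = 'llm' | 'claude-code'
type Skill = 'on' | 'off'

interface EvalOptions {
  runner?: Runner
  skill?: Skill
  model?: string
  suite?: string
}

const FLAG_NAMES = ['runner', 'skill', 'model', 'suite'] as const
type FlagName = (typeof FLAG_NAMES)[number]

const isFlagName = (n: string): n is FlagName => (FLAG_NAMES as readonly string[]).indexOf(n) !== -1
const isRunner = (v: string): v is Runner => v === 'llm' || v === 'claude-code'
const isSkill = (v: string): v is Skill => v === 'on' || v === 'off'

// Accepts both "--flag value" and "--flag=value"; unknown flags are ignored.
function parseEvalArgs(argv: string[]): EvalOptions {
  const raw: Partial<Record<FlagName, string>> = {}
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i] ?? ''
    if (!arg.startsWith('--')) continue
    const body = arg.slice(2)
    const eq = body.indexOf('=')
    const name = eq === -1 ? body : body.slice(0, eq)
    if (!isFlagName(name)) continue
    if (eq !== -1) {
      raw[name] = body.slice(eq + 1)
      continue
    }
    const next = argv[i + 1]
    if (next !== undefined && !next.startsWith('--')) {
      raw[name] = next
      i++ // the value was consumed, skip it
    }
  }

  const opts: EvalOptions = { model: raw.model, suite: raw.suite }
  if (raw.runner !== undefined) {
    if (!isRunner(raw.runner)) throw new Error(`Unknown runner: ${raw.runner}`)
    opts.runner = raw.runner
  }
  if (raw.skill !== undefined) {
    if (!isSkill(raw.skill)) throw new Error(`Unknown skill setting: ${raw.skill}`)
    opts.skill = raw.skill
  }
  return opts
}

// Options not given on the command line are the ones a picker would ask for.
function missingOptions(opts: EvalOptions): FlagName[] {
  return FLAG_NAMES.filter((name) => opts[name] === undefined)
}

// Assumed mapping from flags to the three independent environment options.
function toEnv(opts: EvalOptions): Record<string, string> {
  const env: Record<string, string> = {}
  if (opts.runner !== undefined) env['EVAL_RUNNER'] = opts.runner
  if (opts.skill !== undefined) env['EVAL_SKILL'] = opts.skill
  if (opts.model !== undefined) env['EVAL_MODEL'] = opts.model
  return env
}
```

Keeping the three options independent means each one can be validated and prompted for on its own, which a single combined EVAL_VARIANT value could not do.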

3. Agent auth that works behind an org login

The Claude Code agent runner authenticated with ANTHROPIC_API_KEY. That can't work on machines whose org requires a first-party login and rejects API keys. The fix is an automatic fallback; the original path is untouched:

  • Primary (unchanged): API key in a fresh temp sandbox. Personal API-key setups behave exactly as before.
  • Fallback: if that's rejected, the runner uses a small login dir in your user config (~/.config/payload-evals/claude-code - outside the repo, so it can't be committed or wiped with eval output) and drops the API key so it uses that dir's login instead. You log in once: CLAUDE_CONFIG_DIR=… claude auth login.
  • Either way, the sandbox stays clean: your personal ~/.claude (CLAUDE.md, skills, settings) is never used, so results aren't skewed.
  • The auth check runs once before any test, so a missing login stops the run immediately with the exact command to fix it, instead of failing every case.
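
The primary/fallback split above can be sketched roughly as follows. This is an illustration under assumptions, not the runner's actual code: the function names are hypothetical, and it assumes the primary path also points the CLI at its temp sandbox via CLAUDE_CONFIG_DIR.

```typescript
type AuthMode = 'api-key' | 'login'
type Env = Record<string, string | undefined>

// Login dir location as described above: ~/.config/payload-evals/claude-code
function loginConfigDir(homeDir: string): string {
  return `${homeDir}/.config/payload-evals/claude-code`
}

// Decide once, before any test runs, which auth path to use.
// 'none' means the run should stop immediately with a fix-it message.
function resolveAuthMode(apiKeyAccepted: boolean, loginAvailable: boolean): AuthMode | 'none' {
  if (apiKeyAccepted) return 'api-key'
  if (loginAvailable) return 'login'
  return 'none'
}

// Build the environment for the agent process without mutating the input.
function buildAgentEnv(baseEnv: Env, mode: AuthMode, sandboxDir: string, homeDir: string): Env {
  if (mode === 'api-key') {
    // Primary path: keep the API key and use a fresh temp sandbox.
    return { ...baseEnv, CLAUDE_CONFIG_DIR: sandboxDir }
  }
  // Fallback path: drop the API key so the CLI uses the login stored in the dir.
  const env: Env = { ...baseEnv }
  delete env['ANTHROPIC_API_KEY']
  env['CLAUDE_CONFIG_DIR'] = loginConfigDir(homeDir)
  return env
}
```

In both branches the config dir is something other than the personal ~/.claude, which is what keeps personal settings out of the results.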

4. Runs-based dashboard

The dashboard used to lump every cached result together, mixing models and skill settings, so the headline "pass rate" wasn't meaningful and "runs" were confusing.

  • Every eval run now gets a runId, so results group into real runs.
  • Overview lists runs, newest first. Results shows one run at a time (now with a Model column). Compare diffs any two runs, by category or by case.
  • Cancelled runs are hidden. The launcher tags a run "finished" only when it exits cleanly; a run you Ctrl+C never gets tagged, so the dashboard skips it instead of showing a half-finished run.
  • Fixed the "2 runs in the header vs 19 in Compare" mismatch - there's now one idea of a "run" everywhere.
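
As a rough illustration of the grouping described above, the sketch below builds per-run summaries from a flat list of results, skips runs that were never tagged "finished", and sorts newest first. The types, field names, and the way finished runs are tracked are all assumptions, not the dashboard's actual data model.

```typescript
// Hypothetical result shape; only runId is taken from the description above.
interface EvalResult {
  runId: string
  caseName: string
  model: string
  passed: boolean
  timestamp: number // ms since epoch
}

interface RunSummary {
  runId: string
  model: string
  startedAt: number
  total: number
  passed: number
  passRate: number
}

function summarizeRuns(results: EvalResult[], finishedRunIds: Set<string>): RunSummary[] {
  const runs = new Map<string, RunSummary>()
  for (const r of results) {
    // Cancelled runs never get the "finished" tag, so they are skipped here.
    if (!finishedRunIds.has(r.runId)) continue
    const run = runs.get(r.runId) ?? {
      runId: r.runId,
      model: r.model,
      startedAt: r.timestamp,
      total: 0,
      passed: 0,
      passRate: 0,
    }
    run.total += 1
    if (r.passed) run.passed += 1
    run.startedAt = Math.min(run.startedAt, r.timestamp)
    run.passRate = run.passed / run.total
    runs.set(r.runId, run)
  }
  // Newest run first, as in the Overview list.
  return Array.from(runs.values()).sort((a, b) => b.startedAt - a.startedAt)
}
```

Because every view would read from the same list of summaries, the header count and the Compare picker cannot disagree about how many runs exist.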

Before: [2 screenshots, 2026-06-16]

After: [3 screenshots, 2026-06-16]

@AlessioGr AlessioGr requested a review from denolfe as a code owner June 16, 2026 18:59
@AlessioGr AlessioGr enabled auto-merge (squash) June 16, 2026 19:01
@github-actions

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

@AlessioGr AlessioGr merged commit a798940 into main Jun 17, 2026
180 checks passed
@AlessioGr AlessioGr deleted the chore/evals-claude branch June 17, 2026 20:00