test(evals): anthropic-key support, improved CLI, and a runs-based dashboard by AlessioGr · Pull Request #17025 · payloadcms/payload

AlessioGr · 2026-06-16T18:59:42Z

Four changes, all under test/evals/:

Run the direct LLM evals with an Anthropic API key (previously, OpenAI was required).
Replace the ~28 test:eval:* scripts and the EVAL_VARIANT flag with one pnpm test:eval command, similar to what we did with pnpm docker:start.
Reorganize the dashboard around individual runs instead of one confusing pile.
Let the Claude Code agent runner work on machines whose org requires a login (not an API key).

1. Anthropic API key support

The codegen evals used to run on OpenAI only.

Added @ai-sdk/anthropic + Anthropic model presets, and bumped ai / @ai-sdk/openai.
The eval project never loaded .env (it had no setup file); it does now, so the API key environment variable is picked up.
Anthropic's structured output rejects min/max on numbers, so the score schemas drop them and clamp the values in code instead.
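
A minimal sketch of the "drop min/max from the schema, clamp in code" idea. The `RawScore` shape, the `normalizeScore` helper and the 0 to 1 range are assumptions for illustration, not the actual score schemas in test/evals/:

```typescript
// Hypothetical shape of a score returned by the model (field names are illustrative).
type RawScore = { score: number; reasoning: string }

// Clamp a number into [min, max]. NaN falls back to min so a malformed
// score never propagates into aggregates.
function clamp(value: number, min: number, max: number): number {
  if (Number.isNaN(value)) return min
  return Math.min(max, Math.max(min, value))
}

// The schema no longer enforces the range, so enforce it after parsing.
// The default range of 0..1 is an assumption.
function normalizeScore(raw: RawScore, min = 0, max = 1): RawScore {
  return { ...raw, score: clamp(raw.score, min, max) }
}
```

With this shape, an out-of-range value such as 1.7 is pulled back to the upper bound instead of failing schema validation.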

2. One `pnpm test:eval` command

EVAL_VARIANT (which jammed "which harness", "model" and "skill on/off" into one value) is replaced by three independent options: EVAL_RUNNER (llm | claude-code), EVAL_SKILL (on | off), EVAL_MODEL.
The per-suite/per-variant scripts collapse into one launcher (test/evals/cli.ts, modeled on docker:start): run pnpm test:eval for an interactive picker, or pass --runner / --skill / --model / --suite to skip the prompts.
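
A rough sketch of how flags could map onto the three independent options. The function names (`parseEvalArgs`, `toEnv`, `missingOptions`) are hypothetical and this is not the code in test/evals/cli.ts. The description names no environment variable for --suite, so it is left out of the env mapping here:

```typescript
type Runner = 'llm' | 'claude-code'
type Skill = 'on' | 'off'

interface EvalOptions {
  runner?: Runner
  skill?: Skill
  model?: string
  suite?: string
}

const RUNNERS: readonly string[] = ['llm', 'claude-code']
const SKILLS: readonly string[] = ['on', 'off']

// Parse `--flag value` and `--flag=value` pairs; unknown flags and invalid values throw.
function parseEvalArgs(argv: string[]): EvalOptions {
  const options: EvalOptions = {}
  for (let i = 0; i < argv.length; i++) {
    const arg: string = argv[i] ?? ''
    if (!arg.startsWith('--')) throw new Error(`Unexpected argument: ${arg}`)
    const eq = arg.indexOf('=')
    const name = eq === -1 ? arg.slice(2) : arg.slice(2, eq)
    const value: string | undefined = eq === -1 ? argv[++i] : arg.slice(eq + 1)
    if (value === undefined || value === '') throw new Error(`Missing value for --${name}`)
    switch (name) {
      case 'runner':
        if (!RUNNERS.includes(value)) throw new Error(`Invalid --runner: ${value}`)
        options.runner = value as Runner
        break
      case 'skill':
        if (!SKILLS.includes(value)) throw new Error(`Invalid --skill: ${value}`)
        options.skill = value as Skill
        break
      case 'model':
        options.model = value
        break
      case 'suite':
        options.suite = value
        break
      default:
        throw new Error(`Unknown flag: --${name}`)
    }
  }
  return options
}

// Translate parsed options into the env variables named in the description.
function toEnv(options: EvalOptions): Record<string, string> {
  const env: Record<string, string> = {}
  if (options.runner) env['EVAL_RUNNER'] = options.runner
  if (options.skill) env['EVAL_SKILL'] = options.skill
  if (options.model) env['EVAL_MODEL'] = options.model
  return env
}

// Options not supplied as flags; an interactive picker would ask for these.
function missingOptions(options: EvalOptions): (keyof EvalOptions)[] {
  const keys: (keyof EvalOptions)[] = ['runner', 'skill', 'model', 'suite']
  return keys.filter((key) => options[key] === undefined)
}
```

Keeping the three options separate means an invalid combination can be rejected per flag, rather than by decoding one packed value.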

3. Agent auth that works behind an org login

The Claude Code agent runner authenticated with ANTHROPIC_API_KEY. That can't work on machines whose org requires a first-party login and rejects API keys. Fixed as an automatic fallback - the original path is untouched:

Primary (unchanged): API key in a fresh temp sandbox. Personal API-key setups behave exactly as before.
Fallback: if that's rejected, the runner uses a small login dir in your user config (~/.config/payload-evals/claude-code - outside the repo, so it can't be committed or wiped with eval output) and drops the API key so it uses that dir's login instead. You log in once: CLAUDE_CONFIG_DIR=… claude auth login.
Either way it stays a clean sandbox — your personal ~/.claude (CLAUDE.md, skills, settings) is never used, so results aren't skewed.
The auth check runs once before any test, so a missing login stops the run immediately with the exact command to fix it, instead of failing every case.
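
The fallback order can be sketched roughly as below. `resolveAuth` and the injected `checkAuth` callback are hypothetical, and pointing the primary path's temp sandbox at CLAUDE_CONFIG_DIR is an assumption. Only the login dir path, the two env variable names and the login command come from the description above. A real check would likely be async; it is synchronous here to keep the sketch short:

```typescript
type Env = Record<string, string | undefined>

interface AuthPlan {
  mode: 'api-key' | 'login'
  env: Env
}

// Login dir from the description; homeDir is passed in so the sketch needs no Node imports.
function loginConfigDir(homeDir: string): string {
  return `${homeDir}/.config/payload-evals/claude-code`
}

// checkAuth is a hypothetical probe: true when the agent accepts the given environment.
function resolveAuth(
  baseEnv: Env,
  homeDir: string,
  tempSandboxDir: string,
  checkAuth: (env: Env) => boolean,
): AuthPlan {
  // Primary (unchanged behavior): API key inside a fresh temp sandbox.
  if (baseEnv['ANTHROPIC_API_KEY']) {
    const env: Env = { ...baseEnv, CLAUDE_CONFIG_DIR: tempSandboxDir }
    if (checkAuth(env)) return { mode: 'api-key', env }
  }

  // Fallback: drop the API key so the agent uses the login stored in the dedicated dir.
  const dir = loginConfigDir(homeDir)
  const env: Env = { ...baseEnv, CLAUDE_CONFIG_DIR: dir }
  delete env['ANTHROPIC_API_KEY']
  if (checkAuth(env)) return { mode: 'login', env }

  // Fail once, before any test runs, with the command that fixes it.
  throw new Error(`Claude Code is not logged in. Run: CLAUDE_CONFIG_DIR=${dir} claude auth login`)
}
```

Neither branch ever points at the personal ~/.claude directory, which is what keeps both paths a clean sandbox.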

4. Runs-based dashboard

The dashboard used to lump every cached result together, mixing models and skill settings, so the headline "pass rate" wasn't meaningful and "runs" were confusing.

Every eval run now gets a runId, so results group into real runs.
Overview lists runs, newest first. Results shows one run at a time (now with a Model column). Compare diffs any two runs, by category or by case.
Cancelled runs are hidden. The launcher tags a run "finished" only when it exits cleanly; a run you Ctrl+C never gets tagged, so the dashboard skips it instead of showing a half-finished run.
Fixed the "2 runs in the header vs 19 in Compare" mismatch - there's now one idea of a "run" everywhere.
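
A sketch of the grouping described above: bucket results by runId, skip runs that were never tagged "finished", and list the rest newest first. The `EvalResult` and `RunSummary` shapes and the `summarizeRuns` name are assumptions for illustration, not the dashboard's actual data model:

```typescript
// Hypothetical cached result record; only runId and model are named in the description.
interface EvalResult {
  runId: string
  caseName: string
  model: string
  passed: boolean
  finishedAt: number // epoch milliseconds
}

interface RunSummary {
  runId: string
  models: string[]
  total: number
  passed: number
  passRate: number
  latest: number
}

function summarizeRuns(results: EvalResult[], finishedRunIds: Set<string>): RunSummary[] {
  // Group by runId, dropping results from runs that never got the "finished" tag.
  const byRun = new Map<string, EvalResult[]>()
  for (const result of results) {
    if (!finishedRunIds.has(result.runId)) continue
    const bucket = byRun.get(result.runId) ?? []
    bucket.push(result)
    byRun.set(result.runId, bucket)
  }

  // One summary per run, so the pass rate is always scoped to a single run.
  const summaries = Array.from(byRun.entries()).map(([runId, items]): RunSummary => {
    const passed = items.filter((item) => item.passed).length
    return {
      runId,
      models: Array.from(new Set(items.map((item) => item.model))),
      total: items.length,
      passed,
      passRate: passed / items.length,
      latest: Math.max(...items.map((item) => item.finishedAt)),
    }
  })

  // Newest first, as in the Overview.
  return summaries.sort((a, b) => b.latest - a.latest)
}
```

Because every view would derive from the same grouped list, the header count and the Compare picker cannot disagree about how many runs exist.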

before

after

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1215781760309745

github-actions · 2026-06-16T19:05:30Z

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

AlessioGr added 6 commits June 15, 2026 18:23

load env

9519b47

refactor: cleaner cli, more logical separation of evals to run, skill…

08d15be

…s on or off, llm and harness. Add support for anthropic api

test: improve evals dashboard, show every run by id

949d5c3

fix: new authentication fallback for claude code if claude code does …

909147d

…not allow auth via provided api key

fix: do not show cancelled or partial runs in evals dashboard

a4cf50f

fix

040001c

AlessioGr requested a review from denolfe as a code owner June 16, 2026 18:59

github-actions Bot added the created-by: Payload team label Jun 16, 2026

AlessioGr enabled auto-merge (squash) June 16, 2026 19:01

denolfe approved these changes Jun 17, 2026

View reviewed changes

AlessioGr merged commit a798940 into main Jun 17, 2026
180 checks passed

AlessioGr deleted the chore/evals-claude branch June 17, 2026 20:00

AlessioGr merged 6 commits into main from chore/evals-claude
