test(evals): anthropic-key support, improved CLI, and a runs-based dashboard #17025
Merged
…s on or off, llm and harness. Add support for anthropic api
…not allow auth via provided api key
denolfe
approved these changes
Jun 17, 2026
Four changes, all under `test/evals/`:

- Replaces the `test:eval:*` scripts and the `EVAL_VARIANT` flag with one `pnpm test:eval` command, similar to what we did with `pnpm docker:start`

1. Anthropic API key support
The codegen evals used to run on OpenAI only.

- Added `@ai-sdk/anthropic` + Anthropic model presets, and bumped `ai` / `@ai-sdk/openai`
- The `eval` project never loaded `.env` (it had no setup file); it does now, to load the API key env variable
- … `min`/`max` on numbers, so the score schemas drop them and clamp the values in code instead.

2. One `pnpm test:eval` command

- `EVAL_VARIANT` (which jammed "which harness", "model" and "skill on/off" into one value) is replaced by three independent options: `EVAL_RUNNER` (`llm` | `claude-code`), `EVAL_SKILL` (`on` | `off`), `EVAL_MODEL`.
- CLI (`test/evals/cli.ts`, modeled on `docker:start`): run `pnpm test:eval` for an interactive picker, or pass `--runner` / `--skill` / `--model` / `--suite` to skip the prompts

3. Agent auth that works behind an org login
The Claude Code agent runner authenticated with `ANTHROPIC_API_KEY`. That can't work on machines whose org requires a first-party login and rejects API keys. Fixed as an automatic fallback - the original path is untouched:

- … (`~/.config/payload-evals/claude-code` - outside the repo, so it can't be committed or wiped with eval output) and drops the API key so it uses that dir's login instead. You log in once: `CLAUDE_CONFIG_DIR=… claude auth login`.
- `~/.claude` (CLAUDE.md, skills, settings) is never used, so results aren't skewed.

4. Runs-based dashboard
The dashboard used to lump every cached result together, mixing models and skill settings, so the headline "pass rate" wasn't meaningful and "runs" were confusing.

- … `runId`, so results group into real runs.

Before / after (screenshots)
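
The "clamp the values in code" step from section 1 could look roughly like the sketch below. The `RawScore` shape, the `clampScore` and `normalizeScore` helpers, and the 0 to 1 range are all hypothetical; the PR's actual score schemas are not shown in this description.

```typescript
// Hypothetical shape of a model-produced score; the real schema is not shown in the PR text.
interface RawScore {
  score: number
  reasoning: string
}

// Assumed range, for illustration only.
const MIN_SCORE = 0
const MAX_SCORE = 1

// Clamp in code instead of declaring min/max on the schema's number type.
function clampScore(value: number, min: number = MIN_SCORE, max: number = MAX_SCORE): number {
  if (isNaN(value)) {
    return min // treat a non-numeric score as the lowest value
  }
  return Math.min(max, Math.max(min, value))
}

// Return a copy of the parsed result with its score forced into range.
function normalizeScore(raw: RawScore): RawScore {
  return { ...raw, score: clampScore(raw.score) }
}
```

Clamping after parsing keeps the schema free of numeric bounds while still guaranteeing that downstream code only sees in-range scores.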
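
Splitting `EVAL_VARIANT` into `EVAL_RUNNER`, `EVAL_SKILL`, and `EVAL_MODEL` (section 2) means each value can be validated on its own. A minimal sketch, assuming hypothetical `oneOf` and `resolveEvalOptions` helpers that leave a missing value `undefined` so an interactive picker could ask for it; the real `test/evals/cli.ts` may work differently.

```typescript
type Runner = 'llm' | 'claude-code'
type Skill = 'on' | 'off'

// Hypothetical result type: undefined means "not provided, ask interactively".
interface EvalOptions {
  runner: Runner | undefined
  skill: Skill | undefined
  model: string | undefined
}

// Validate one option against its allowed values; empty or missing stays undefined.
function oneOf<T extends string>(
  name: string,
  value: string | undefined,
  allowed: readonly T[],
): T | undefined {
  if (value === undefined || value === '') {
    return undefined
  }
  if ((allowed as readonly string[]).indexOf(value) === -1) {
    throw new Error(`${name} must be one of ${allowed.join(', ')}; got "${value}"`)
  }
  return value as T
}

// Read the three independent options from an env-like record.
function resolveEvalOptions(env: Record<string, string | undefined>): EvalOptions {
  return {
    runner: oneOf<Runner>('EVAL_RUNNER', env['EVAL_RUNNER'], ['llm', 'claude-code']),
    skill: oneOf<Skill>('EVAL_SKILL', env['EVAL_SKILL'], ['on', 'off']),
    model: env['EVAL_MODEL'] || undefined,
  }
}
```

In a Node script this could be called as `resolveEvalOptions(process.env)`; any option that comes back `undefined` would then be filled in by a prompt or a flag.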
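
Grouping results by `runId` (section 4) is what makes a per-run pass rate meaningful. The sketch below illustrates the idea only: the `EvalResult` and `RunSummary` shapes and the `summarizeRuns` helper are hypothetical, and the dashboard's real data model is not shown in this description.

```typescript
// Hypothetical cached result; only runId is taken from the PR text.
interface EvalResult {
  runId: string
  model: string
  skill: 'on' | 'off'
  passed: boolean
}

interface RunSummary {
  runId: string
  total: number
  passed: number
  passRate: number
}

// Bucket results by runId, then compute one pass rate per run
// instead of a single rate across every cached result.
function summarizeRuns(results: EvalResult[]): RunSummary[] {
  const byRun = new Map<string, EvalResult[]>()
  for (const result of results) {
    const bucket = byRun.get(result.runId) ?? []
    bucket.push(result)
    byRun.set(result.runId, bucket)
  }

  const summaries: RunSummary[] = []
  byRun.forEach((items, runId) => {
    const passed = items.filter((item) => item.passed).length
    summaries.push({ runId, total: items.length, passed, passRate: passed / items.length })
  })
  return summaries
}
```

Because every bucket holds at least one result, the division never hits zero, and runs come back in the order their first result appeared.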