fix(eval): clean up eval configs #540

Merged: shivammittal274 merged 1 commit into main from fix/eval-config-cleanup on Mar 23, 2026
@shivammittal274

Summary

  • Consolidate 13 eval configs down to 7 with uniform settings
  • 3 weekly configs (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado
  • 4 test configs (local dev): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web
  • All configs now have: headless: false, captcha block, full browseros ports, restart_server_per_task: true, graders
  • Add browseros-oe-agent-weekly.json — orchestrator-executor with AI SDK agent on both sides
  • Add test-clado-api.ts script for testing Clado API endpoints
  • Delete 6 redundant/unused configs
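
The uniform settings listed above can be spot-checked with a small helper. This is a minimal sketch, not the repository's `EvalConfigSchema`: only the field names `headless`, `restart_server_per_task`, `captcha`, and `graders` come from this PR, while the `EvalConfigLike` type, the `uniformSettingIssues` function, and the assumption that `graders` is a non-empty array are hypothetical. The BrowserOS ports block is left out because its shape is not shown here.

```typescript
// Hypothetical, loosely typed view of an eval config. The nested shapes are
// assumptions; only the top-level field names appear in the PR description.
interface EvalConfigLike {
  headless?: boolean
  restart_server_per_task?: boolean
  captcha?: unknown
  graders?: unknown
}

// Returns a list of human-readable problems; an empty list means the config
// follows the uniform settings described in the summary.
function uniformSettingIssues(config: EvalConfigLike): string[] {
  const issues: string[] = []
  if (config.headless !== false) {
    issues.push('headless must be false')
  }
  if (config.restart_server_per_task !== true) {
    issues.push('restart_server_per_task must be true')
  }
  if (config.captcha === undefined) {
    issues.push('captcha block is missing')
  }
  // Assumption: graders is a non-empty array of grader names.
  if (!Array.isArray(config.graders) || config.graders.length === 0) {
    issues.push('graders must list at least one grader')
  }
  return issues
}
```

A check like this would complement schema validation rather than replace it, since a schema that marks fields as optional cannot by itself enforce that every config uses the same values.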

Test plan

  • All configs valid against EvalConfigSchema
  • Run weekly eval with each config to verify

🤖 Generated with Claude Code

Consolidate 13 configs down to 7 with uniform settings:
- 3 weekly (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado
- 4 test (local): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web
- All configs: headless=false, captcha block, full browseros ports, restart_server_per_task

Deleted: debug-test, mind2web-test, tool-loop-test, orchestrator-executor-test,
orchestrator-executor-clado-test, fireworks-minimax-m2, webvoyager-test

Added: test-clado-api.ts script, browseros-oe-agent-weekly.json (OE with AI SDK executor)
@github-actions github-actions bot added the fix label Mar 23, 2026
@shivammittal274 shivammittal274 merged commit 65547c6 into main Mar 23, 2026
9 of 10 checks passed
@greptile-apps

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR consolidates 13 eval configs into 7 with a uniform structure — full BrowserOS port block, captcha, graders, headless: false, and restart_server_per_task: true — and introduces a new browseros-oe-agent-weekly.json for orchestrator-executor CI evals plus a developer utility script (test-clado-api.ts) for smoke-testing Clado endpoints.

Key changes:

  • 6 redundant configs deleted (debug-test, fireworks-minimax-m2, mind2web-test, orchestrator-executor-clado-test, tool-loop-test, webvoyager-test), reducing dead config sprawl
  • 4 configs renamed with test_ prefix to clearly distinguish local-dev from CI-weekly configs
  • browseros-oe-clado-weekly.json updated: headless: true → false, captcha block added; verify CI runner has a virtual display before merging
  • All renamed configs drop output_dir (now optional per EvalConfigSchema); results directory will use the runner's default
  • test-clado-api.ts is a clean, self-contained health + generate tester with parallel health checks, AbortSignal timeouts, and MCP fallback screenshot capture
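
The parallel health checks with `AbortSignal` timeouts mentioned in the last bullet could look roughly like the sketch below. It is not the code from `test-clado-api.ts`, which is not shown on this page: the `HealthResult` and `Fetcher` types, the `checkHealth` and `checkAll` names, and the 5-second default timeout are all assumptions made for illustration. The fetcher is injected so the logic can be exercised without a network.

```typescript
// Hypothetical result shape for one endpoint check.
type HealthResult = { name: string; ok: boolean; detail: string }

// Minimal subset of the fetch signature that this sketch relies on.
type Fetcher = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number }>

async function checkHealth(
  name: string,
  url: string,
  fetcher: Fetcher,
  timeoutMs = 5_000, // assumed default, not taken from the PR
): Promise<HealthResult> {
  try {
    // AbortSignal.timeout aborts the request once timeoutMs has elapsed.
    const res = await fetcher(url, { signal: AbortSignal.timeout(timeoutMs) })
    return { name, ok: res.ok, detail: `HTTP ${res.status}` }
  } catch (err) {
    // Network errors and timeouts are reported instead of thrown, so one
    // failing endpoint does not hide the results of the others.
    return { name, ok: false, detail: err instanceof Error ? err.message : String(err) }
  }
}

async function checkAll(
  endpoints: Record<string, string>,
  fetcher: Fetcher,
): Promise<HealthResult[]> {
  // Every check is started before any is awaited, so they run concurrently.
  return Promise.all(
    Object.entries(endpoints).map(([name, url]) => checkHealth(name, url, fetcher)),
  )
}
```

In a real script the global `fetch` could be passed as the `fetcher` argument. Because `checkHealth` never rejects, `Promise.all` is sufficient here and always resolves with one result per endpoint.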

Confidence Score: 4/5

  • Safe to merge after confirming CI runners have a virtual display for headed Chrome sessions.
  • All schema validations pass (output_dir is optional, all required fields present). The consolidation is straightforward and the new script is well-structured. The one item to verify before merging is whether the weekly CI environment supports headed Chrome (headless: false) — if it already does, this is a clean 5.
  • browseros-oe-clado-weekly.json — headless changed from true to false for a CI weekly run; needs Xvfb or equivalent on the runner.
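
One way to surface the headed-Chrome concern early is a pre-flight check that fails fast when no display is available. The sketch below is an illustration only: `hasDisplay` is a hypothetical helper, not part of the eval runner, and it assumes that on Linux a headed browser needs an X11 or Wayland display advertised through the `DISPLAY` or `WAYLAND_DISPLAY` environment variables.

```typescript
// Sketch: decide whether a headed browser has a display to attach to.
// `platform` and `env` are passed in (rather than read from `process`) so the
// check stays pure and easy to test.
function hasDisplay(
  platform: string,
  env: Record<string, string | undefined>,
): boolean {
  // Assumption: only Linux is checked; other platforms are treated as
  // providing a window server.
  if (platform !== 'linux') return true
  // X11 sessions (including Xvfb) set DISPLAY; Wayland sessions set WAYLAND_DISPLAY.
  return Boolean(env['DISPLAY'] || env['WAYLAND_DISPLAY'])
}
```

A runner script could call `hasDisplay(process.platform, process.env)` before launching a config with `headless: false`. On a Linux runner without a display, wrapping the eval command in `xvfb-run -a` is one common way to provide a virtual one.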

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/browseros-agent/apps/eval/configs/browseros-oe-agent-weekly.json | New weekly CI config for orchestrator-executor (both sides AI SDK agent); standardized with full ports, captcha, graders, headless: false, restart_server_per_task; num_workers bumped to 10 (will trigger a "many browser windows" warning in config-validator). |
| packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json | Existing weekly CI config updated: headless changed from true → false and captcha block added; headless: false in a headless CI environment may require Xvfb. |
| packages/browseros-agent/apps/eval/scripts/test-clado-api.ts | New developer utility script for smoke-testing Clado action/grounding endpoints; well-structured with parallel health checks, AbortSignal timeouts, graceful fallbacks, and multi-turn simulation. |
| packages/browseros-agent/apps/eval/configs/test_gemini-computer-use.json | Renamed from gemini-computer-use.json; output_dir removed (now optional per schema), full browseros ports added, captcha and graders standardized. |
| packages/browseros-agent/apps/eval/configs/test_mind2web.json | Renamed from mind2web-full.json; standardized ports, headless: false, captcha, graders, restart_server_per_task added; retains 5-minute timeout (300 000 ms) which was pre-existing. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph CI_Weekly["Weekly CI Configs (3)"]
        W1[browseros-agent-weekly.json]
        W2[browseros-oe-agent-weekly.json\n NEW - OE + kimi-k2p5 both sides]
        W3[browseros-oe-clado-weekly.json\n headless: true → false]
    end

    subgraph LocalDev["Local Dev / Test Configs (4)"]
        T1[test_gemini-computer-use.json]
        T2[test_yutori-navigator.json]
        T3[test_webvoyager.json]
        T4[test_mind2web.json]
    end

    subgraph Deleted["Deleted (6 redundant)"]
        D1[debug-test.json]
        D2[fireworks-minimax-m2.json]
        D3[mind2web-test.json]
        D4[orchestrator-executor-clado-test.json]
        D5[tool-loop-test.json]
        D6[webvoyager-test.json]
    end

    subgraph Uniform["Uniform settings applied to all 7"]
        U1[headless: false]
        U2[captcha block]
        U3[full browseros ports]
        U4[restart_server_per_task: true]
        U5[graders: performance_grader]
    end

    CI_Weekly --> Uniform
    LocalDev --> Uniform
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json
Line: 26

Comment:
**`headless: false` in weekly CI configs**

Both `browseros-oe-clado-weekly.json` and `browseros-oe-agent-weekly.json` now run with `headless: false`. The `browseros-oe-clado-weekly.json` was explicitly changed from `headless: true`.

Running a headed browser in a typical CI environment (without a virtual framebuffer like Xvfb) will cause Chrome to fail to start. If the CI runner that executes weekly evals already has Xvfb or a display configured, this is fine — but worth confirming to avoid silent weekly run failures.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/test-clado-api.ts
Line: 195-209

Comment:
**Multi-turn test only runs on `click` actions**

The step-2 history test only executes when the action model returns `action === 'click'`. For other action types (`type`, `scroll`, `key`, etc.) the multi-turn path is silently skipped, which may leave coverage gaps during manual smoke testing. Consider broadening the condition or logging a "skipping step 2" message when the action type doesn't match:

```ts
    if (result) {
      const historyEntry =
        result.action === 'click'
          ? `click(${result.x}, ${result.y})`
          : result.action === 'type'
            ? `type("${result.text}")`
            : String(result.action)

      await testGenerate('Action Model (step 2, with history)', ACTION_URL, {
        instruction: 'Type "hello world" in the search bar',
        image_base64: imageBase64,
        history: historyEntry,
      })
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(eval): clean up eval configs and add..."
