fix(eval): clean up eval configs by shivammittal274 · Pull Request #540 · browseros-ai/BrowserOS

shivammittal274 · 2026-03-23T19:55:43Z

Summary

- Consolidate 13 eval configs down to 7 with uniform settings
  - 3 weekly configs (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado
  - 4 test configs (local dev): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web
- All configs now have: headless: false, captcha block, full browseros ports, restart_server_per_task: true, graders
- Add browseros-oe-agent-weekly.json: orchestrator-executor with AI SDK agent on both sides
- Add test-clado-api.ts script for testing Clado API endpoints
- Delete 6 redundant/unused configs
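
The uniform settings listed above lend themselves to a mechanical check. The following is a minimal sketch, assuming each config is a JSON object with top-level keys named as in this summary (`headless`, `captcha`, `restart_server_per_task`, `graders`). The real `EvalConfigSchema` is not shown in this PR page and may name or nest these fields differently; `EvalConfigLike` and `findUniformityViolations` are hypothetical names, not part of the repo. The port block is left out because its key names are not given here.

```typescript
// Hypothetical shape: only the fields named in the PR summary.
// The real EvalConfigSchema may differ.
interface EvalConfigLike {
  headless?: boolean
  captcha?: unknown
  restart_server_per_task?: boolean
  graders?: unknown[]
}

// Returns a human-readable list of deviations from the uniform settings.
// An empty array means the config matches all four checks.
function findUniformityViolations(config: EvalConfigLike): string[] {
  const violations: string[] = []
  if (config.headless !== false) {
    violations.push('headless must be false')
  }
  if (config.captcha === undefined || config.captcha === null) {
    violations.push('captcha block is missing')
  }
  if (config.restart_server_per_task !== true) {
    violations.push('restart_server_per_task must be true')
  }
  if (!Array.isArray(config.graders) || config.graders.length === 0) {
    violations.push('graders must be a non-empty array')
  }
  return violations
}
```

A check like this could run over every file in the configs directory before the weekly eval, alongside the schema validation mentioned in the test plan.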

Test plan

- All configs valid against EvalConfigSchema
- Run weekly eval with each config to verify

🤖 Generated with Claude Code

Consolidate 13 configs down to 7 with uniform settings:
- 3 weekly (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado
- 4 test (local): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web
- All configs: headless=false, captcha block, full browseros ports, restart_server_per_task

Deleted: debug-test, mind2web-test, tool-loop-test, orchestrator-executor-test, orchestrator-executor-clado-test, fireworks-minimax-m2, webvoyager-test

Added: test-clado-api.ts script, browseros-oe-agent-weekly.json (OE with AI SDK executor)

greptile-apps · 2026-03-23T19:58:48Z

Greptile Summary

This PR consolidates 13 eval configs into 7 with a uniform structure — full BrowserOS port block, captcha, graders, headless: false, and restart_server_per_task: true — and introduces a new browseros-oe-agent-weekly.json for orchestrator-executor CI evals plus a developer utility script (test-clado-api.ts) for smoke-testing Clado endpoints.

Key changes:

- 6 redundant configs deleted (debug-test, fireworks-minimax-m2, mind2web-test, orchestrator-executor-clado-test, tool-loop-test, webvoyager-test), reducing dead config sprawl
- 4 configs renamed with test_ prefix to clearly distinguish local-dev from CI-weekly configs
- browseros-oe-clado-weekly.json updated: headless: true → false, captcha block added; verify CI runner has a virtual display before merging
- All renamed configs drop output_dir (now optional per EvalConfigSchema); results directory will use the runner's default
- test-clado-api.ts is a clean, self-contained health + generate tester with parallel health checks, AbortSignal timeouts, and MCP fallback screenshot capture
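
The "parallel health checks with AbortSignal timeouts" pattern mentioned for test-clado-api.ts can be sketched roughly as below. This is not the script's actual code: `checkHealth`, `HealthResult`, and `Fetcher` are hypothetical names, and the fetch function is injected as a parameter so the sketch can be exercised without a network. It assumes a runtime that provides `AbortSignal.timeout`.

```typescript
interface HealthResult {
  name: string
  ok: boolean
  detail: string
}

// Minimal subset of the fetch signature this sketch relies on (an assumption).
type Fetcher = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number }>

// Probes every endpoint concurrently. Each request gets its own timeout
// signal, and failures are captured per endpoint instead of rejecting the
// whole batch.
async function checkHealth(
  endpoints: Record<string, string>,
  timeoutMs: number,
  fetcher: Fetcher,
): Promise<HealthResult[]> {
  const checks = Object.entries(endpoints).map(
    async ([name, url]): Promise<HealthResult> => {
      try {
        const res = await fetcher(url, { signal: AbortSignal.timeout(timeoutMs) })
        return { name, ok: res.ok, detail: `HTTP ${res.status}` }
      } catch (err) {
        // Timeouts and network errors both land here.
        return { name, ok: false, detail: err instanceof Error ? err.message : String(err) }
      }
    },
  )
  return Promise.all(checks)
}
```

In a real script the global `fetch` could be passed as the `fetcher` argument. Because each check catches its own errors, one unreachable endpoint does not hide the status of the others.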

Confidence Score: 4/5

Safe to merge after confirming CI runners have a virtual display for headed Chrome sessions.
All schema validations pass (output_dir is optional, all required fields present). The consolidation is straightforward and the new script is well-structured. The one item to verify before merging is whether the weekly CI environment supports headed Chrome (headless: false) — if it already does, this is a clean 5.
browseros-oe-clado-weekly.json — headless changed from true to false for a CI weekly run; needs Xvfb or equivalent on the runner.
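
One way to surface the virtual-display concern early is a preflight check before launching a headed browser. The sketch below is an illustration only: `canRunHeaded` is a hypothetical helper, not part of the eval runner, and it assumes that a Linux runner signals an available display through the `DISPLAY` or `WAYLAND_DISPLAY` environment variable and that non-Linux runners already have a display.

```typescript
// Environment variables are passed in explicitly so the check is easy to test.
type Env = Record<string, string | undefined>

// Returns true when a headed browser plausibly has a display to attach to.
function canRunHeaded(platform: string, env: Env): boolean {
  // Assumption: only Linux runners need a virtual display such as Xvfb.
  if (platform !== 'linux') return true
  return Boolean(env['DISPLAY'] || env['WAYLAND_DISPLAY'])
}

// Typical call site in Node (shown as a comment to keep the sketch dependency-free):
//   if (!canRunHeaded(process.platform, process.env)) {
//     throw new Error('headless: false requires a display; start Xvfb first')
//   }
```

Failing fast with a clear message is preferable to a weekly run that dies silently at browser startup. On a runner without a display, wrapping the eval command in `xvfb-run` is one common option.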

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/browseros-agent/apps/eval/configs/browseros-oe-agent-weekly.json | New weekly CI config for orchestrator-executor (both sides AI SDK agent); standardized with full ports, captcha, graders, headless: false, restart_server_per_task; num_workers bumped to 10 (will trigger a "many browser windows" warning in config-validator). |
| packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json | Existing weekly CI config updated: headless changed from true → false and captcha block added; headless: false in a headless CI environment may require Xvfb. |
| packages/browseros-agent/apps/eval/scripts/test-clado-api.ts | New developer utility script for smoke-testing Clado action/grounding endpoints; well-structured with parallel health checks, AbortSignal timeouts, graceful fallbacks, and multi-turn simulation. |
| packages/browseros-agent/apps/eval/configs/test_gemini-computer-use.json | Renamed from gemini-computer-use.json; output_dir removed (now optional per schema), full browseros ports added, captcha and graders standardized. |
| packages/browseros-agent/apps/eval/configs/test_mind2web.json | Renamed from mind2web-full.json; standardized ports, headless: false, captcha, graders, restart_server_per_task added; retains 5-minute timeout (300 000 ms) which was pre-existing. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph CI_Weekly["Weekly CI Configs (3)"]
        W1[browseros-agent-weekly.json]
        W2[browseros-oe-agent-weekly.json\n NEW - OE + kimi-k2p5 both sides]
        W3[browseros-oe-clado-weekly.json\n headless: true → false]
    end

    subgraph LocalDev["Local Dev / Test Configs (4)"]
        T1[test_gemini-computer-use.json]
        T2[test_yutori-navigator.json]
        T3[test_webvoyager.json]
        T4[test_mind2web.json]
    end

    subgraph Deleted["Deleted (6 redundant)"]
        D1[debug-test.json]
        D2[fireworks-minimax-m2.json]
        D3[mind2web-test.json]
        D4[orchestrator-executor-clado-test.json]
        D5[tool-loop-test.json]
        D6[webvoyager-test.json]
    end

    subgraph Uniform["Uniform settings applied to all 7"]
        U1[headless: false]
        U2[captcha block]
        U3[full browseros ports]
        U4[restart_server_per_task: true]
        U5[graders: performance_grader]
    end

    CI_Weekly --> Uniform
    LocalDev --> Uniform

Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json
Line: 26

Comment:
**`headless: false` in weekly CI configs**

Both `browseros-oe-clado-weekly.json` and `browseros-oe-agent-weekly.json` now run with `headless: false`. The `browseros-oe-clado-weekly.json` was explicitly changed from `headless: true`.

Running a headed browser in a typical CI environment (without a virtual framebuffer like Xvfb) will cause Chrome to fail to start. If the CI runner that executes weekly evals already has Xvfb or a display configured, this is fine — but worth confirming to avoid silent weekly run failures.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/test-clado-api.ts
Line: 195-209

Comment:
**Multi-turn test only runs on `click` actions**

The step-2 history test only executes when the action model returns `action === 'click'`. For other action types (`type`, `scroll`, `key`, etc.) the multi-turn path is silently skipped, which may leave coverage gaps during manual smoke testing. Consider broadening the condition or logging a "skipping step 2" message when the action type doesn't match:

```ts
    if (result) {
      const historyEntry =
        result.action === 'click'
          ? `click(${result.x}, ${result.y})`
          : result.action === 'type'
            ? `type("${result.text}")`
            : String(result.action)

      await testGenerate('Action Model (step 2, with history)', ACTION_URL, {
        instruction: 'Type "hello world" in the search bar',
        image_base64: imageBase64,
        history: historyEntry,
      })
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(eval): clean up eval configs and add..."

github-actions bot added the fix label Mar 23, 2026

shivammittal274 merged commit 65547c6 into main Mar 23, 2026
9 of 10 checks passed

shivammittal274 merged 1 commit into main from fix/eval-config-cleanup
