Merged
Consolidate 13 configs down to 7 with uniform settings:
- 3 weekly (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado
- 4 test (local): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web
- All configs: headless=false, captcha block, full browseros ports, restart_server_per_task

Deleted: debug-test, mind2web-test, tool-loop-test, orchestrator-executor-test, orchestrator-executor-clado-test, fireworks-minimax-m2, webvoyager-test

Added: test-clado-api.ts script, browseros-oe-agent-weekly.json (OE with AI SDK executor)
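As a rough illustration of what "uniform settings" could mean in practice, the sketch below checks a config object against the four settings listed above. The `EvalConfigSketch` interface, its field names, and `findUniformityProblems` are hypothetical and only mirror the wording of this description; the real `EvalConfigSchema` may be shaped differently.

```typescript
// Hypothetical shape: field names mirror the settings named in the PR
// description, not the actual EvalConfigSchema.
interface EvalConfigSketch {
  name: string;
  headless: boolean;
  restart_server_per_task: boolean;
  captcha?: Record<string, unknown>;
  browseros?: Record<string, unknown>;
}

// Returns one human-readable problem per uniform setting the config violates.
function findUniformityProblems(config: EvalConfigSketch): string[] {
  const problems: string[] = [];
  if (config.headless !== false) {
    problems.push(`${config.name}: headless must be false`);
  }
  if (config.restart_server_per_task !== true) {
    problems.push(`${config.name}: restart_server_per_task must be true`);
  }
  if (config.captcha === undefined) {
    problems.push(`${config.name}: missing captcha block`);
  }
  if (config.browseros === undefined) {
    problems.push(`${config.name}: missing browseros ports block`);
  }
  return problems;
}

// Illustrative inputs only; these are not the real config files.
const sampleConfigs: EvalConfigSketch[] = [
  {
    name: "browseros-agent-weekly",
    headless: false,
    restart_server_per_task: true,
    captcha: {},
    browseros: {},
  },
  { name: "stale-config", headless: true, restart_server_per_task: true, captcha: {} },
];

for (const config of sampleConfigs) {
  for (const problem of findUniformityProblems(config)) {
    console.log(problem);
  }
}
```

A check like this could run over every file in the configs directory so that a config drifting from the shared settings is caught before a weekly run.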
Greptile Summary
This PR consolidates 13 eval configs into 7 with a uniform structure: full BrowserOS port block,
Key changes:
Confidence Score: 4/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph CI_Weekly["Weekly CI Configs (3)"]
W1[browseros-agent-weekly.json]
W2[browseros-oe-agent-weekly.json\n NEW - OE + kimi-k2p5 both sides]
W3[browseros-oe-clado-weekly.json\n headless: true → false]
end
subgraph LocalDev["Local Dev / Test Configs (4)"]
T1[test_gemini-computer-use.json]
T2[test_yutori-navigator.json]
T3[test_webvoyager.json]
T4[test_mind2web.json]
end
subgraph Deleted["Deleted (6 redundant)"]
D1[debug-test.json]
D2[fireworks-minimax-m2.json]
D3[mind2web-test.json]
D4[orchestrator-executor-clado-test.json]
D5[tool-loop-test.json]
D6[webvoyager-test.json]
end
subgraph Uniform["Uniform settings applied to all 7"]
U1[headless: false]
U2[captcha block]
U3[full browseros ports]
U4[restart_server_per_task: true]
U5[graders: performance_grader]
end
CI_Weekly --> Uniform
LocalDev --> Uniform
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/configs/browseros-oe-clado-weekly.json
Line: 26
Comment:
**`headless: false` in weekly CI configs**
Both `browseros-oe-clado-weekly.json` and `browseros-oe-agent-weekly.json` now run with `headless: false`. The `browseros-oe-clado-weekly.json` was explicitly changed from `headless: true`.
Running a headed browser in a typical CI environment (without a virtual framebuffer like Xvfb) will cause Chrome to fail to start. If the CI runner that executes weekly evals already has Xvfb or a display configured, this is fine — but worth confirming to avoid silent weekly run failures.
How can I resolve this? If you propose a fix, please make it concise.
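One way to surface the display concern before a weekly run is a small preflight check. The sketch below is an assumption, not code from this PR: `canRunHeaded` and `describeHeadedPreflight` are hypothetical helpers, and the rule that non-Linux platforms always have a display is a simplification. On Linux it relies on the fact that an X11 session (including one started through `xvfb-run`) exposes `DISPLAY`, and a Wayland session exposes `WAYLAND_DISPLAY`.

```typescript
// Environment map passed in explicitly so the check stays pure; in Node this
// would typically be process.env.
type EnvSketch = Record<string, string | undefined>;

// Returns true when a headed browser plausibly has a display to attach to.
// Assumption: only Linux runners need an explicit display variable.
function canRunHeaded(platform: string, env: EnvSketch): boolean {
  if (platform !== "linux") {
    return true;
  }
  const x11Display = env["DISPLAY"];
  const waylandDisplay = env["WAYLAND_DISPLAY"];
  const hasX11 = x11Display !== undefined && x11Display.length > 0;
  const hasWayland = waylandDisplay !== undefined && waylandDisplay.length > 0;
  return hasX11 || hasWayland;
}

// Produces a message that can be logged, or thrown, before launching Chrome.
function describeHeadedPreflight(platform: string, env: EnvSketch): string {
  return canRunHeaded(platform, env)
    ? "display available: headed run can proceed"
    : "no DISPLAY or WAYLAND_DISPLAY set: start Xvfb or use headless mode";
}

console.log(describeHeadedPreflight("linux", { DISPLAY: ":99" }));
console.log(describeHeadedPreflight("linux", {}));
```

Failing fast with a message like this would turn a silent weekly failure into an explicit one, whichever way the headless question is settled.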
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/test-clado-api.ts
Line: 195-209
Comment:
**Multi-turn test only runs on `click` actions**
The step-2 history test only executes when the action model returns `action === 'click'`. For other action types (`type`, `scroll`, `key`, etc.) the multi-turn path is silently skipped, which may leave coverage gaps during manual smoke testing. Consider broadening the condition or logging a "skipping step 2" message when the action type doesn't match:
```ts
if (result) {
  // Build a history entry for whichever action type step 1 returned.
  const historyEntry =
    result.action === 'click'
      ? `click(${result.x}, ${result.y})`
      : result.action === 'type'
        ? `type("${result.text}")`
        : String(result.action)

  await testGenerate('Action Model (step 2, with history)', ACTION_URL, {
    instruction: 'Type "hello world" in the search bar',
    image_base64: imageBase64,
    history: historyEntry,
  })
}
```
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "fix(eval): clean up eval configs and add..."
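The other option raised in the second comment, logging when step 2 is skipped, could look roughly like the sketch below. `ActionResultSketch`, `toHistoryEntry`, and `planStepTwo` are hypothetical names, and the result shape is assumed from the fields used in the suggestion (`action`, `x`, `y`, `text`); the actual response type in `test-clado-api.ts` may differ.

```typescript
// Assumed shape of the action model result; not taken from the real script.
interface ActionResultSketch {
  action: string;
  x?: number;
  y?: number;
  text?: string;
}

// Returns a history string for action types with enough data to replay,
// or null when the entry cannot be built.
function toHistoryEntry(result: ActionResultSketch): string | null {
  if (result.action === "click" && result.x !== undefined && result.y !== undefined) {
    return `click(${result.x}, ${result.y})`;
  }
  if (result.action === "type" && result.text !== undefined) {
    // JSON.stringify quotes the text and escapes any embedded quotes.
    return `type(${JSON.stringify(result.text)})`;
  }
  return null;
}

// Decides whether the multi-turn test should run and says why when it does not.
function planStepTwo(result: ActionResultSketch): { run: boolean; message: string } {
  const history = toHistoryEntry(result);
  if (history === null) {
    return { run: false, message: `skipping step 2: no history format for "${result.action}"` };
  }
  return { run: true, message: `running step 2 with history ${history}` };
}

console.log(planStepTwo({ action: "click", x: 10, y: 20 }).message);
console.log(planStepTwo({ action: "scroll" }).message);
```

Compared with falling back to `String(result.action)`, this variant keeps the history format strict and makes a skipped multi-turn path visible in the smoke-test output.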
Summary
- 3 weekly (CI): `browseros-agent`, `browseros-oe-agent`, `browseros-oe-clado`
- 4 test (local): `test_gemini-computer-use`, `test_yutori-navigator`, `test_webvoyager`, `test_mind2web`
- All configs: `headless: false`, `captcha` block, full browseros ports, `restart_server_per_task: true`, `graders`
- `browseros-oe-agent-weekly.json`: orchestrator-executor with AI SDK agent on both sides
- `test-clado-api.ts` script for testing Clado API endpoints

Test plan
`EvalConfigSchema`

🤖 Generated with Claude Code