Is this still alive? I asked Claude to analyze this repo and it shocked me.
Hey team — love the work on Agent-S. The S1 paper (ICLR 2025 Best Paper!) was brilliant, especially the dual-mode approach with AT-SPI2 accessibility trees + vision. But I just did a deep code-level analysis of S3 and... I have questions.
S3 dropped the accessibility tree entirely?
S1's LinuxACI used pyatspi to capture AT-SPI2 trees, linearize them as TSV, augment with OCR, and click by element index. That was elegant — structured targeting, token-efficient, fast.
S3's OSWorldACI has zero accessibility tree usage. Everything goes through the VLM grounding model. generate_coords() sends a natural language description + screenshot to UI-TARS and parses (x, y) pixel coordinates from the response.
Was this a deliberate architectural decision? It reads like a regression: pixel coordinates are brittle to resolution changes, DPI scaling, and window resizing, while the structured tree gave you robust element references for free.
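To make the contrast concrete, here's a toy sketch of the tree-first pipeline (illustrative only, not Agent-S code; the `Node` class is a hypothetical stand-in for an AT-SPI2 node): flatten the tree to TSV rows, let the LLM pick an element index, and read the coordinates straight off the tree, with no grounding model in the loop.

```python
# Illustrative sketch (not Agent-S code): linearize a toy accessibility
# tree into TSV rows and resolve a click by element index, the way an
# AT-SPI2-style pipeline can skip the grounding VLM entirely.
from dataclasses import dataclass, field

@dataclass
class Node:  # hypothetical stand-in for an AT-SPI2 node
    role: str
    name: str
    x: int
    y: int
    children: list = field(default_factory=list)

def linearize(node, rows=None):
    """Depth-first flatten to (index, role, name, x, y) rows."""
    if rows is None:
        rows = []
    rows.append((len(rows), node.role, node.name, node.x, node.y))
    for child in node.children:
        linearize(child, rows)
    return rows

def click_by_index(rows, idx):
    """Return coordinates the tree already provides -- no VLM call needed."""
    _, _role, _name, x, y = rows[idx]
    return (x, y)  # in S1 this would feed pyautogui.click(x, y)

tree = Node("frame", "Editor", 0, 0, [
    Node("push button", "Save", 542, 312),
    Node("push button", "Cancel", 620, 312),
])
rows = linearize(tree)
tsv = "\n".join("\t".join(map(str, r)) for r in rows)  # token-efficient prompt form
assert click_by_index(rows, 1) == (542, 312)
```

The point is that the element index is resolution-independent: if the window moves, you re-read the tree and the same index still names the same widget.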
2-4 LLM calls per single GUI action is a lot
For a simple click("Save button"), the execution path is:
- Reflection LLM call — reviews trajectory for cycles (~2-5s)
- Worker/Generator LLM call — decides what action to take (~3-8s)
- UI-TARS grounding VLM call — converts description to (x, y) coords (~1-3s)
That's ~6-16 seconds of pure LLM thinking for one mouse click. A drag_and_drop needs 3-4 calls (two grounding calls). highlight_text_span needs OCR + text span agent calls.
For comparison, an accessibility-tree-first approach (like Windows-MCP) needs 1 LLM call — the tree already provides element coordinates, so no grounding model is needed. That's 3-8x faster per action.
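The latency envelope above is just the sum of the per-call ranges, which are my own rough estimates rather than measurements:

```python
# Back-of-envelope check of the per-action latency ranges quoted above.
# The (low, high) seconds per call are this issue's estimates, not measurements.
calls = {
    "reflection": (2, 5),
    "worker/generator": (3, 8),
    "UI-TARS grounding": (1, 3),
}
low = sum(lo for lo, _ in calls.values())
high = sum(hi for _, hi in calls.values())
print(f"click(): {len(calls)} LLM calls, ~{low}-{high}s")  # ~6-16s

# drag_and_drop adds a second grounding call for the drop target:
drag_low, drag_high = low + 1, high + 3
print(f"drag_and_drop(): {len(calls) + 1} calls, ~{drag_low}-{drag_high}s")
```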
Has the team measured the actual wall-clock time per action step? Is there a breakdown of where the latency goes?
The 72.6% number uses bBoN — what's the real cost?
The README says "Agent S3 alone reaches 66% in the 100-step setting" and "With the addition of Behavior Best-of-N, performance climbs even higher to 72.6%."
bBoN runs multiple rollouts and picks the best. So the headline 72.6% costs 3x+ the compute of the base 66%. For anyone trying to use this in production (not benchmarks), the relevant number is 66%.
Could you clarify:
- How many rollouts does bBoN use for the 72.6% number?
- What's the average cost per task (LLM API $ + GPU hours for UI-TARS)?
- Is there a way to get closer to 72% without bBoN?
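For intuition on the rollout-count question: if rollouts were independent with per-rollout success p and the selector were perfect, the best-of-N success would be 1 - (1 - p)^N. That's an upper bound only (the real behavior judge is imperfect, and independence is an assumption), but it shows how far 72.6% sits below the oracle ceiling:

```python
# Oracle upper bound for best-of-N, assuming independent rollouts and a
# perfect selector. Both assumptions are idealizations; the gap between
# this bound and the reported 72.6% suggests the judge is far from oracle.
p = 0.66  # base Agent S3 success rate from the README
for n in (1, 2, 3):
    print(f"N={n}: oracle upper bound = {1 - (1 - p) ** n:.1%}")
# N=3 gives ~96.1%, so the selector, not the rollout budget, is the bottleneck.
```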
exec() of raw pyautogui strings — any plans for a safer interface?
Actions are Python strings like "import pyautogui; pyautogui.click(542, 312)" that get exec()'d directly. This works for benchmarks but is a security concern for any real deployment.
Are there plans to:
- Expose actions via MCP (Model Context Protocol) tools?
- Add a structured action interface instead of raw code execution?
- Support sandboxed execution?
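A structured interface could be as simple as validating a dict against a whitelist before anything touches the OS. This is a hypothetical sketch, not an existing Agent-S API; the handlers here return the call they *would* make, so the sketch is side-effect free:

```python
# Hypothetical structured-action sketch: the model emits a dict that is
# validated against a whitelist and translated into one concrete call,
# instead of exec()'ing arbitrary Python strings.
ALLOWED = {
    "click": {"x": int, "y": int},
    "type":  {"text": str},
}

def validate(action: dict) -> dict:
    """Fail closed unless the action names a whitelisted tool with typed args."""
    schema = ALLOWED.get(action.get("name"))
    if schema is None:
        raise ValueError(f"action not allowed: {action!r}")
    for key, typ in schema.items():
        if not isinstance(action.get(key), typ):
            raise ValueError(f"bad argument {key!r} in {action!r}")
    return action

def to_call(action: dict) -> str:
    """The single point where a validated action becomes a pyautogui call."""
    a = validate(action)
    if a["name"] == "click":
        return f"pyautogui.click({a['x']}, {a['y']})"
    return f"pyautogui.typewrite({a['text']!r})"

print(to_call({"name": "click", "x": 542, "y": 312}))
# Anything outside the whitelist fails closed:
# to_call({"name": "os.system", "cmd": "rm -rf /"}) -> ValueError
```

The same schema-per-tool shape maps almost one-to-one onto MCP tool definitions, which is why the two asks above go together.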
Format validation retries can triple LLM calls
call_llm_formatted() retries up to 3x if the Worker output doesn't pass format checks (single agent.XXX() call in a code block). Combined with the reflection + grounding calls, a single action step could trigger 6-12 LLM calls in the worst case.
Is there data on how often format retries happen in practice?
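The retry mechanics are easy to sketch (names and the format regex are my guesses at the behavior, not the actual implementation): re-invoke the model until the reply is a single `agent.XXX()` call in a code block, up to the retry cap, and count what that costs.

```python
# Minimal sketch of a call_llm_formatted-style retry loop (hypothetical,
# not the actual Agent-S code): retry until the reply matches the expected
# single agent.XXX() call in a code block, and count the LLM calls spent.
import re

FORMAT = re.compile(r"```(?:python)?\s*agent\.\w+\([^`]*\)\s*```", re.S)

def call_llm_formatted(llm, prompt, max_retries=3):
    calls = 0
    for _ in range(max_retries):
        reply = llm(prompt)
        calls += 1
        if FORMAT.search(reply):
            return reply, calls
    raise RuntimeError(f"no well-formatted reply after {calls} calls")

# A flaky model that only formats correctly on its third try:
replies = iter([
    "save the file",
    "click Save",
    "```python\nagent.click('Save button')\n```",
])
reply, calls = call_llm_formatted(lambda p: next(replies), "click Save")
print(calls)  # 3 -- one badly formatted Worker can triple its share of the step
```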
Don't get me wrong — 66% base on OSWorld is impressive, and the code agent for programmatic tasks is a smart addition. But the shift from S1's hybrid (accessibility + vision) to S3's pure vision feels like trading efficiency for benchmark optimization. Would love to hear the team's thinking on this.
Great work overall. Just want to understand the roadmap.