Is this still alive? I asked Claude to analyze this repo and it shocked me. #181

@kundeng

Description

Hey team — love the work on Agent-S. The S1 paper (ICLR 2025 Best Paper!) was brilliant, especially the dual-mode approach with AT-SPI2 accessibility trees + vision. But I just did a deep code-level analysis of S3 and... I have questions.

S3 dropped the accessibility tree entirely?

S1's LinuxACI used pyatspi to capture AT-SPI2 trees, linearize them as TSV, augment with OCR, and click by element index. That was elegant — structured targeting, token-efficient, fast.
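For concreteness, here is a minimal sketch of that linearization idea, using a plain-dict mock tree rather than real pyatspi objects (the helper names and TSV columns are my own, not LinuxACI's actual code):

```python
# Mock of S1-style tree linearization. The real LinuxACI walks the live
# AT-SPI2 tree via pyatspi; here a nested dict stands in for that tree.

def linearize(node, rows=None):
    """Depth-first flatten of a nested dict tree into (role, name, x, y, w, h) rows."""
    if rows is None:
        rows = []
    rows.append((node["role"], node["name"], *node["extents"]))
    for child in node.get("children", []):
        linearize(child, rows)
    return rows

def to_tsv(rows):
    """Index each row so the LLM can target 'element 1' instead of pixels."""
    header = "idx\trole\tname\tx\ty\tw\th"
    lines = [header] + [
        f"{i}\t{role}\t{name}\t{x}\t{y}\t{w}\t{h}"
        for i, (role, name, x, y, w, h) in enumerate(rows)
    ]
    return "\n".join(lines)

def click_target(rows, idx):
    """Resolve an element index to a click point at the element's center."""
    _, _, x, y, w, h = rows[idx]
    return (x + w // 2, y + h // 2)

tree = {
    "role": "frame", "name": "Editor", "extents": (0, 0, 800, 600),
    "children": [
        {"role": "push button", "name": "Save", "extents": (700, 10, 60, 24)},
    ],
}
rows = linearize(tree)
print(to_tsv(rows))
print(click_target(rows, 1))  # -> (730, 22), center of the Save button
```

The point is that the coordinates come from the toolkit, not a grounding model, so they survive resolution and DPI changes as long as the tree is re-queried.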

S3's OSWorldACI has zero accessibility tree usage. Everything goes through the VLM grounding model. generate_coords() sends a natural language description + screenshot to UI-TARS and parses (x, y) pixel coordinates from the response.
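The parse step is presumably something like the following sketch (the regex and the response format are assumptions on my part, not Agent-S3's actual generate_coords() code):

```python
# Hedged sketch of the grounding-parse step: extract a pixel pair from a
# grounding VLM's free-text reply. The response format is an assumption.
import re

def parse_coords(response: str) -> tuple[int, int]:
    """Extract the first (x, y) integer pair from a grounding model's reply."""
    m = re.search(r"\((\d+),\s*(\d+)\)", response)
    if m is None:
        raise ValueError(f"no coordinate pair in response: {response!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_coords("The Save button is at (542, 312)."))  # -> (542, 312)
```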

Was this a deliberate architectural decision? Why was the accessibility tree removed? It seems like a regression — pixel coordinates are brittle to resolution changes, DPI scaling, and window resizing. The structured tree gave you robust element refs for free.

2-4 LLM calls for a single GUI action is a lot

For a simple click("Save button"), the execution path is:

  1. Reflection LLM call — reviews trajectory for cycles (~2-5s)
  2. Worker/Generator LLM call — decides what action to take (~3-8s)
  3. UI-TARS grounding VLM call — converts description to (x, y) coords (~1-3s)

That's ~6-16 seconds of pure LLM thinking for one mouse click. A drag_and_drop needs 3-4 calls (two grounding calls). highlight_text_span needs OCR + text span agent calls.
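Summing the quoted per-call latencies (illustrative arithmetic only, using the ranges above):

```python
# Per-click latency bounds from the three-call pipeline described above.
pipeline = {"reflection": (2, 5), "worker": (3, 8), "grounding": (1, 3)}

lo = sum(a for a, _ in pipeline.values())
hi = sum(b for _, b in pipeline.values())
print(f"per-click LLM time: ~{lo}-{hi}s")  # matches the ~6-16s estimate above
```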

For comparison, an accessibility-tree-first approach (like Windows-MCP) needs 1 LLM call — the tree already provides element coordinates, so no grounding model is needed. That's 3-8x faster per action.

Has the team measured the actual wall-clock time per action step? Is there a breakdown of where the latency goes?

The 72.6% number uses bBoN — what's the real cost?

The README says "Agent S3 alone reaches 66% in the 100-step setting" and "With the addition of Behavior Best-of-N, performance climbs even higher to 72.6%."

bBoN runs multiple rollouts and picks the best. So the headline 72.6% costs 3x+ the compute of the base 66%. For anyone trying to use this in production (not benchmarks), the relevant number is 66%.

Could you clarify:

  • How many rollouts does bBoN use for the 72.6% number?
  • What's the average cost per task (LLM API $ + GPU hours for UI-TARS)?
  • Is there a way to get closer to 72% without bBoN?
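Back-of-envelope, assuming N independent rollouts (N=3 here is just a guess from the "3x+" above, and `judge_overhead` is a hypothetical parameter for whatever selection step bBoN runs):

```python
# Hypothetical bBoN cost model: N full rollouts plus any selection overhead.
def bbon_cost(base_cost_per_task: float, n_rollouts: int,
              judge_overhead: float = 0.0) -> float:
    """Total compute relative to a single rollout's cost."""
    return base_cost_per_task * n_rollouts + judge_overhead

print(bbon_cost(1.0, 3))  # -> 3.0: at least 3x the base compute for +6.6 points
```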

exec() of raw pyautogui strings — any plans for a safer interface?

Actions are Python strings like "import pyautogui; pyautogui.click(542, 312)" that get exec()'d directly. This works for benchmarks but is a security concern for any real deployment.

Are there plans to:

  • Expose actions via MCP (Model Context Protocol) tools?
  • Add a structured action interface instead of raw code execution?
  • Support sandboxed execution?
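For illustration, a whitelist-based parser along these lines would avoid exec() entirely (the names, regex, and allowed set are hypothetical, not a proposed Agent-S design):

```python
# Hypothetical structured action interface: validate action strings against a
# whitelist instead of exec()ing arbitrary Python.
import re

ALLOWED = {"click", "type", "scroll"}  # explicit whitelist, no arbitrary code

def parse_action(action_str: str):
    """Validate 'click(542, 312)'-style strings without exec()."""
    m = re.fullmatch(r"(\w+)\(([^)]*)\)", action_str.strip())
    if m is None or m.group(1) not in ALLOWED:
        raise ValueError(f"rejected action: {action_str!r}")
    name = m.group(1)
    args = [a.strip() for a in m.group(2).split(",") if a.strip()]
    return name, args

print(parse_action("click(542, 312)"))  # -> ('click', ['542', '312'])
```

Anything outside the whitelist, including `__import__('os')`-style payloads, is rejected before it ever reaches an interpreter; the same parsed form would also map naturally onto MCP tool calls.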

Format validation retries can triple LLM calls

call_llm_formatted() retries up to 3x if the Worker output doesn't pass format checks (single agent.XXX() call in a code block). Combined with the reflection + grounding calls, a single action step could trigger 6-12 LLM calls in the worst case.

Is there data on how often format retries happen in practice?
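The retry pattern as I understand it, sketched with a simulated LLM (the validator here is a simplification of the real single-agent-call format check):

```python
# Sketch of retry-on-format-failure: each failed validation burns a full
# extra LLM call. The validator is simplified from the described check.
import re

def is_valid(output: str) -> bool:
    """Simplified check: exactly one agent.XXX(...) call in the output."""
    calls = re.findall(r"agent\.\w+\([^)]*\)", output)
    return len(calls) == 1

def call_llm_formatted(llm, prompt: str, max_retries: int = 3) -> str:
    """Retry the same prompt until the output passes format validation."""
    for _ in range(max_retries):
        output = llm(prompt)
        if is_valid(output):
            return output
    raise RuntimeError(f"format validation failed after {max_retries} calls")

# Simulated LLM that fails once before producing a valid single-call output.
replies = iter(["no action here", "agent.click('Save button')"])
print(call_llm_formatted(lambda p: next(replies), "click save"))
```

With 3 format retries wrapping the Worker call, plus reflection and grounding, the worst case for one step is indeed several times the happy-path call count.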


Don't get me wrong — 66% base on OSWorld is impressive, and the code agent for programmatic tasks is a smart addition. But the shift from S1's hybrid (accessibility + vision) to S3's pure vision feels like trading efficiency for benchmark optimization. Would love to hear the team's thinking on this.

Great work overall. Just want to understand the roadmap.
