Is this still alive? I asked Claude to analyze this repo and it shocked me. #181

@kundeng

Description

Hey team — love the work on Agent-S. The S1 paper (ICLR 2025 Best Paper!) was brilliant, especially the dual-mode approach with AT-SPI2 accessibility trees + vision. But I just did a deep code-level analysis of S3 and... I have questions.

S3 dropped the accessibility tree entirely?

S1's LinuxACI used pyatspi to capture AT-SPI2 trees, linearize them as TSV, augment with OCR, and click by element index. That was elegant — structured targeting, token-efficient, fast.
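For concreteness, here is a minimal sketch of that linearization idea, using a plain-dict mock tree rather than real pyatspi objects (the helper names and TSV columns are my own, not LinuxACI's actual code):

```python
# Mock of S1-style tree linearization. The real LinuxACI walks the live
# AT-SPI2 tree via pyatspi; here a nested dict stands in for that tree.

def linearize(node, rows=None):
    """Depth-first flatten of a nested dict tree into (role, name, x, y, w, h) rows."""
    if rows is None:
        rows = []
    rows.append((node["role"], node["name"], *node["extents"]))
    for child in node.get("children", []):
        linearize(child, rows)
    return rows

def to_tsv(rows):
    """Index each row so the LLM can target 'element 1' instead of pixels."""
    header = "idx\trole\tname\tx\ty\tw\th"
    lines = [header] + [
        f"{i}\t{role}\t{name}\t{x}\t{y}\t{w}\t{h}"
        for i, (role, name, x, y, w, h) in enumerate(rows)
    ]
    return "\n".join(lines)

def click_target(rows, idx):
    """Resolve an element index to a click point at the element's center."""
    _, _, x, y, w, h = rows[idx]
    return (x + w // 2, y + h // 2)

tree = {
    "role": "frame", "name": "Editor", "extents": (0, 0, 800, 600),
    "children": [
        {"role": "push button", "name": "Save", "extents": (700, 10, 60, 24)},
    ],
}
rows = linearize(tree)
print(to_tsv(rows))
print(click_target(rows, 1))  # -> (730, 22), center of the Save button
```

The point is that the coordinates come from the toolkit, not a grounding model, so they survive resolution and DPI changes as long as the tree is re-queried.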

S3's OSWorldACI has zero accessibility tree usage. Everything goes through the VLM grounding model. generate_coords() sends a natural language description + screenshot to UI-TARS and parses (x, y) pixel coordinates from the response.
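The parse step is presumably something like the following sketch (the regex and the response format are assumptions on my part, not Agent-S3's actual generate_coords() code):

```python
# Hedged sketch of the grounding-parse step: extract a pixel pair from a
# grounding VLM's free-text reply. The response format is an assumption.
import re

def parse_coords(response: str) -> tuple[int, int]:
    """Extract the first (x, y) integer pair from a grounding model's reply."""
    m = re.search(r"\((\d+),\s*(\d+)\)", response)
    if m is None:
        raise ValueError(f"no coordinate pair in response: {response!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_coords("The Save button is at (542, 312)."))  # -> (542, 312)
```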

Was this a deliberate architectural decision? Why was the accessibility tree removed? It seems like a regression — pixel coordinates are brittle to resolution changes, DPI scaling, and window resizing. The structured tree gave you robust element refs for free.

2-4 LLM calls for a single GUI action is a lot

For a simple click("Save button"), the execution path is:

  1. Reflection LLM call — reviews trajectory for cycles (~2-5s)
  2. Worker/Generator LLM call — decides what action to take (~3-8s)
  3. UI-TARS grounding VLM call — converts description to (x, y) coords (~1-3s)

That's ~6-16 seconds of pure LLM thinking for one mouse click. A drag_and_drop needs 3-4 calls (two grounding calls). highlight_text_span needs OCR + text span agent calls.
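Summing the quoted per-call latencies (illustrative arithmetic only, using the ranges above):

```python
# Per-click latency bounds from the three-call pipeline described above.
pipeline = {"reflection": (2, 5), "worker": (3, 8), "grounding": (1, 3)}

lo = sum(a for a, _ in pipeline.values())
hi = sum(b for _, b in pipeline.values())
print(f"per-click LLM time: ~{lo}-{hi}s")  # matches the ~6-16s estimate above
```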

For comparison, an accessibility-tree-first approach (like Windows-MCP) needs 1 LLM call — the tree already provides element coordinates, so no grounding model is needed. That's 3-8x faster per action.

Has the team measured the actual wall-clock time per action step? Is there a breakdown of where the latency goes?

The 72.6% number uses bBoN — what's the real cost?

The README says "Agent S3 alone reaches 66% in the 100-step setting" and "With the addition of Behavior Best-of-N, performance climbs even higher to 72.6%."

bBoN runs multiple rollouts and picks the best. So the headline 72.6% costs 3x+ the compute of the base 66%. For anyone trying to use this in production (not benchmarks), the relevant number is 66%.

Could you clarify:

  • How many rollouts does bBoN use for the 72.6% number?
  • What's the average cost per task (LLM API $ + GPU hours for UI-TARS)?
  • Is there a way to get closer to 72% without bBoN?
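Back-of-envelope, assuming N independent rollouts (N=3 here is just a guess from the "3x+" above, and `judge_overhead` is a hypothetical parameter for whatever selection step bBoN runs):

```python
# Hypothetical bBoN cost model: N full rollouts plus any selection overhead.
def bbon_cost(base_cost_per_task: float, n_rollouts: int,
              judge_overhead: float = 0.0) -> float:
    """Total compute relative to a single rollout's cost."""
    return base_cost_per_task * n_rollouts + judge_overhead

print(bbon_cost(1.0, 3))  # -> 3.0: at least 3x the base compute for +6.6 points
```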

exec() of raw pyautogui strings — any plans for a safer interface?

Actions are Python strings like "import pyautogui; pyautogui.click(542, 312)" that get exec()'d directly. This works for benchmarks but is a security concern for any real deployment.

Are there plans to:

  • Expose actions via MCP (Model Context Protocol) tools?
  • Add a structured action interface instead of raw code execution?
  • Support sandboxed execution?
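For illustration, a whitelist-based parser along these lines would avoid exec() entirely (the names, regex, and allowed set are hypothetical, not a proposed Agent-S design):

```python
# Hypothetical structured action interface: validate action strings against a
# whitelist instead of exec()ing arbitrary Python.
import re

ALLOWED = {"click", "type", "scroll"}  # explicit whitelist, no arbitrary code

def parse_action(action_str: str):
    """Validate 'click(542, 312)'-style strings without exec()."""
    m = re.fullmatch(r"(\w+)\(([^)]*)\)", action_str.strip())
    if m is None or m.group(1) not in ALLOWED:
        raise ValueError(f"rejected action: {action_str!r}")
    name = m.group(1)
    args = [a.strip() for a in m.group(2).split(",") if a.strip()]
    return name, args

print(parse_action("click(542, 312)"))  # -> ('click', ['542', '312'])
```

Anything outside the whitelist, including `__import__('os')`-style payloads, is rejected before it ever reaches an interpreter; the same parsed form would also map naturally onto MCP tool calls.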

Format validation retries can triple LLM calls

call_llm_formatted() retries up to 3x if the Worker output doesn't pass format checks (single agent.XXX() call in a code block). Combined with the reflection + grounding calls, a single action step could trigger 6-12 LLM calls in the worst case.

Is there data on how often format retries happen in practice?
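The retry pattern as I understand it, sketched with a simulated LLM (the validator here is a simplification of the real single-agent-call format check):

```python
# Sketch of retry-on-format-failure: each failed validation burns a full
# extra LLM call. The validator is simplified from the described check.
import re

def is_valid(output: str) -> bool:
    """Simplified check: exactly one agent.XXX(...) call in the output."""
    calls = re.findall(r"agent\.\w+\([^)]*\)", output)
    return len(calls) == 1

def call_llm_formatted(llm, prompt: str, max_retries: int = 3) -> str:
    """Retry the same prompt until the output passes format validation."""
    for _ in range(max_retries):
        output = llm(prompt)
        if is_valid(output):
            return output
    raise RuntimeError(f"format validation failed after {max_retries} calls")

# Simulated LLM that fails once before producing a valid single-call output.
replies = iter(["no action here", "agent.click('Save button')"])
print(call_llm_formatted(lambda p: next(replies), "click save"))
```

With 3 format retries wrapping the Worker call, plus reflection and grounding, the worst case for one step is indeed several times the happy-path call count.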


Don't get me wrong — 66% base on OSWorld is impressive, and the code agent for programmatic tasks is a smart addition. But the shift from S1's hybrid (accessibility + vision) to S3's pure vision feels like trading efficiency for benchmark optimization. Would love to hear the team's thinking on this.

Great work overall. Just want to understand the roadmap.
