feat: add Yutori Navigator n1.5 as a computer-use (CUA) agent provider #2194
lawrencechen98 wants to merge 4 commits
🦋 Changeset detected. Latest commit: fcfbdb1. The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
Integrates Yutori's Navigator n1.5 computer-use model as a Stagehand CUA provider, mirroring the existing OpenAI/Anthropic/Google/Microsoft CUA clients.

Navigator is OpenAI-compatible Chat Completions at https://api.yutori.com/v1 (screenshot in, coordinate tool_calls out in a normalized 1000x1000 space), so this reuses the existing `openai` dependency and the provider-agnostic CUA handler, with no new dependencies.

Usage: stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" }). Auth via YUTORI_API_KEY or clientOptions (apiKey/baseURL). Core tool set.

- YutoriCUAClient: screenshot-per-turn loop; tool_call -> AgentAction with 1000x1000 coordinate denormalization; role:"tool" results with a current-URL suffix; payload trimming; completion when no tool calls; stop-and-summarize on max steps. Faithful to the Yutori Python SDK reference loop.
- Provider registration (AgentProvider, AVAILABLE_CUA_MODELS, AgentType, providerEnvVarMap) and Navigator ClientOptions (toolSet/disableTools/jsonSchema/userTimezone/userLocation), incl. the cloud API + OpenAPI schema.
- Keyboard modifiers via a general page.click/scroll `modifiers` option (sets the CDP mouse-event modifiers bitmask); hold-key; refresh via page.reload with a faithful agent-replay step.
- Evals: skip building an AI-SDK text client for CUA-only models in the local bench harness path (also unblocks microsoft/fara-7b).
- Unit tests + usage example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
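The payload trimming mentioned in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `Message` shape, the placeholder text, and the use of JSON string length as a byte estimate are all assumptions. Only the idea (drop the oldest screenshots first, always keep the latest, target roughly 9.5 MB) comes from the PR.

```typescript
// Hypothetical message shape: only the fields this sketch needs.
interface ImagePart { type: "image_url"; image_url: { url: string } }
interface TextPart { type: "text"; text: string }
type Part = ImagePart | TextPart;
interface Message { role: string; content: string | Part[] }

// Assumed reading of the "~9.5 MB" budget mentioned in the PR.
const MAX_PAYLOAD_BYTES = 9.5 * 1024 * 1024;

// Approximate size: base64 data URLs are ASCII, so string length is close to byte length.
function payloadSize(messages: Message[]): number {
  return JSON.stringify(messages).length;
}

// Returns a trimmed copy; the input array is left untouched.
function trimImagesToFit(messages: Message[], maxBytes: number = MAX_PAYLOAD_BYTES): Message[] {
  const out: Message[] = messages.map((m) => ({
    role: m.role,
    content: typeof m.content === "string" ? m.content : [...m.content],
  }));

  // Collect the position of every screenshot, oldest first.
  const images: Array<{ msg: number; part: number }> = [];
  out.forEach((m, msg) => {
    if (typeof m.content === "string") return;
    m.content.forEach((p, part) => {
      if (p.type === "image_url") images.push({ msg, part });
    });
  });

  // Replace old screenshots with a short placeholder until the payload fits.
  // The loop stops before the last image, so the latest screenshot is always kept.
  for (let i = 0; i < images.length - 1 && payloadSize(out) > maxBytes; i++) {
    const { msg, part } = images[i];
    (out[msg].content as Part[])[part] = { type: "text", text: "[screenshot omitted]" };
  }
  return out;
}
```

Replacing a screenshot in place (rather than deleting the message) keeps the turn structure of the conversation intact, which is one plausible design choice among several.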
931e0b7 to ce6c06e
2 issues found across 23 files
Confidence score: 3/5
- There is some real merge risk: `stopAndSummarize()` in `packages/core/lib/v3/agent/YutoriCUAClient.ts` bypasses expected `flowLogger` instrumentation, which can reduce traceability and make agent behavior/debugging less reliable.
- `packages/core/lib/v3/handlers/v3CuaAgentHandler.ts` has a concrete replay correctness concern: modifiers used during CUA action execution are not persisted, so replay can diverge for modifier-dependent actions.
- Given a high-confidence medium/high-severity instrumentation gap plus a replay-divergence bug, this looks mergeable only with caution rather than a low-risk merge.
- Pay close attention to `packages/core/lib/v3/agent/YutoriCUAClient.ts` and `packages/core/lib/v3/handlers/v3CuaAgentHandler.ts`: missing flow logging and non-persisted modifiers can cause observability and replay consistency issues.
Architecture diagram
sequenceDiagram
participant User as User Code
participant Stagehand as Stagehand Instance
participant Agent as Agent Provider
participant Client as YutoriCUAClient
participant Handler as V3CuaAgentHandler
participant Page as Understudy Page
participant CDP as Chrome DevTools Protocol
participant Navigator as Yutori Navigator API
Note over User,Navigator: NEW: Yutori Navigator n1.5 CUA Agent Provider
User->>Stagehand: stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" })
Stagehand->>Agent: create provider (modelToAgentProviderMap)
Agent->>Client: new YutoriCUAClient(type, model, instructions, clientOptions)
alt API key missing
Client-->>Agent: throw Error
end
Client->>Client: configure toolSet, disableTools, jsonSchema, userTimezone, userLocation
Stagehand-->>User: agent instance
User->>Stagehand: agent.execute({ instruction, maxSteps })
Stagehand->>Handler: new V3CuaAgentHandler(..., client=YutoriCUAClient)
Handler->>Client: setScreenshotProvider()
Handler->>Client: setActionHandler()
Handler->>Handler: capture screenshot (page.screenshot)
Handler->>Client: setCurrentUrl(page.url())
Client->>Client: build message history (system prompt + user instruction with location/timezone context)
loop Step Loop (maxSteps)
Client->>Client: clone messages for request
Client->>Client: trimImagesToFit (drop old screenshots under ~9.5 MB, keep latest)
Client->>Navigator: POST /v1/chat/completions (OpenAI-compatible)
Note over Client,Navigator: Extra params: tool_set, disable_tools, json_schema
Navigator-->>Client: response with tool_calls (1000x1000 normalized coordinates)
Client->>Client: parse tool_calls from assistant message
alt No tool_calls
Client-->>Handler: return final result (completed)
else Has tool_calls
par For each tool_call
Client->>Client: denormalizeCoordinates (1000→viewport pixels)
Client->>Client: mapNavigatorKeyToPlaywright (Navigator keys→Playwright keys)
Client->>Handler: actionHandler(action)
alt Action type: click (with possible modifier)
Handler->>Page: click(x, y, { modifiers })
Page->>CDP: dispatchMouseEvent(..., modifiers bitmask)
else Action type: scroll (with possible modifier)
Handler->>Page: scroll(x, y, deltaX, deltaY, { modifiers })
Page->>CDP: dispatchMouseEvent(..., modifiers bitmask)
else Action type: keypress (with optional holdMs delay)
Handler->>Page: keyPress(key, { delay })
else Action type: refresh
Handler->>Page: reload({ waitUntil: "load" })
Page->>CDP: Page.reload
else Action type: type, goto, back, forward, wait, drag
Handler->>Page: execute action
end
alt Action succeeded
Page-->>Handler: success
else Action threw error
Page-->>Handler: error
Handler->>Page: still update client URL (page.url())
Handler-->>Client: action result with [ERROR]
end
end
Client->>Client: append tool result (role:"tool" + "Current URL:" suffix)
Client->>Client: captureScreenshot() for next turn
end
alt Payload size > max bytes
Client->>Client: trimImagesToFit (strip old screenshots, keep recent)
end
end
alt Max steps reached (no completion)
Client->>Client: formatStopAndSummarize(task)
Client->>Navigator: final request with summarize prompt (no json_schema)
Navigator-->>Client: summary text
Client-->>Handler: return result (completed=false, summary message)
end
Handler-->>Stagehand: AgentResult (output may include parsed_json)
Stagehand-->>User: execution result
Address review: stopAndSummarize() made a direct chat.completions.create call without FlowLogger instrumentation. Wrap it with FlowLogger.logLlmRequest/logLlmResponse, mirroring predict(), so every direct Navigator LLM call is flow-logged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rolls

Address review (P2): modifiers applied during CUA action execution were not captured in recorded agent-replay steps, so a cached replay re-ran a chorded click/scroll as a plain one. Thread modifiers through the selector-based path symmetrically with the coordinate path:

- Add a shared `cdpModifierMask` helper (understudy/modifiers.ts); reuse it in Page (dedupes the previous private copy) and add modifier support to Locator.click via the CDP mouse-event modifiers bitmask.
- Action gains optional `modifiers`; performUnderstudyMethod forwards them to the locator click; takeDeterministicAction passes action.modifiers.
- The CUA handler records modifiers on the replay step for click (Action) and scroll (AgentReplayScrollStep); AgentCache re-applies them on replay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
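For readers unfamiliar with the CDP modifiers bitmask, a helper like `cdpModifierMask` might look something like the sketch below. The helper name comes from the commit message, but its signature and the `Modifier` type are assumptions; the bit values (Alt=1, Ctrl=2, Meta=4, Shift=8) are the ones documented for CDP's `Input.dispatchMouseEvent`.

```typescript
// Assumed modifier names; the PR's actual type may differ.
type Modifier = "Alt" | "Control" | "Meta" | "Shift";

// Bit values from the CDP Input domain: Alt=1, Ctrl=2, Meta/Command=4, Shift=8.
const CDP_MODIFIER_BITS: Record<Modifier, number> = {
  Alt: 1,
  Control: 2,
  Meta: 4,
  Shift: 8,
};

// OR the bits together; duplicates are harmless and an empty list yields 0.
function cdpModifierMask(modifiers: Modifier[] = []): number {
  return modifiers.reduce((mask, m) => mask | CDP_MODIFIER_BITS[m], 0);
}
```

A Ctrl+Shift click would therefore send `modifiers: 10` on the dispatched mouse events, and recording the same modifier list on the replay step lets a cached replay rebuild the identical mask.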
…est)
Enable the expanded Navigator tool set (browser_tools_expanded-20260403) and
make it the default for yutori/n1.5-latest, on top of the core coordinate tools.
- DOM tools backed by Stagehand's a11y snapshot + deepLocator:
- extract_elements / find render the hybrid accessibility tree in Navigator's
format, minting stable ref_N tokens (NavigatorRefRegistry).
- set_element_value resolves a ref to its xpath and fills via deepLocator.
- execute_js evaluates JS in the page (expression-first, body fallback).
- ref-targeted coordinate tools: click/scroll/etc. may carry a `ref` instead of
coordinates; it resolves to the element's on-screen center (deepLocator
centroid, scroll-into-view), taking priority over model coordinates and
falling back to them. A ref'd scroll scrolls the element into view.
- The CUA handler supplies a generic page bridge (a11y snapshot + evaluate +
elementCenter); all Navigator-specific logic stays in the client.
- Unknown/stale refs return a recoverable error so the model re-extracts.
Unit tests cover rendering/find/ref resolution + the four-tool dispatch,
tool-set selection, scroll-into-view, and stale-ref handling; a
YUTORI_API_KEY-gated integration spec exercises the tools live.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
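The ref handling described in this commit (stable `ref_N` tokens, recoverable errors for stale refs) could be sketched along these lines. This is an illustrative guess at the idea, not the PR's `NavigatorRefRegistry`: the class name `RefRegistry`, its methods, and the error wording are all hypothetical.

```typescript
// Hypothetical result type: a failed lookup is a value, not an exception,
// so the caller can hand the message back to the model as a tool result.
type ResolveResult =
  | { ok: true; xpath: string }
  | { ok: false; error: string };

class RefRegistry {
  private nextId = 1;
  private refByXpath = new Map<string, string>();
  private xpathByRef = new Map<string, string>();
  private liveRefs = new Set<string>();

  // Call before rendering a new snapshot; refs not re-minted become stale.
  beginSnapshot(): void {
    this.liveRefs.clear();
  }

  // The same xpath always maps to the same token, which keeps refs stable across snapshots.
  mint(xpath: string): string {
    let ref = this.refByXpath.get(xpath);
    if (ref === undefined) {
      ref = `ref_${this.nextId++}`;
      this.refByXpath.set(xpath, ref);
      this.xpathByRef.set(ref, xpath);
    }
    this.liveRefs.add(ref);
    return ref;
  }

  // Unknown or stale refs return a recoverable error instead of throwing.
  resolve(ref: string): ResolveResult {
    const xpath = this.xpathByRef.get(ref);
    if (xpath === undefined || !this.liveRefs.has(ref)) {
      return { ok: false, error: `Unknown or stale ref "${ref}"; re-extract elements and retry.` };
    }
    return { ok: true, xpath };
  }
}
```

Returning the error as data is what makes it recoverable: the tool result tells the model to re-extract rather than aborting the step loop.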
Summary

Adds Yutori Navigator n1.5 as a computer-use agent provider, alongside the existing OpenAI / Anthropic / Google / Microsoft CUA clients. Navigator is a computer-use model (screenshot in, coordinate-based `tool_calls` out in a normalized 1000×1000 space) served via an OpenAI-compatible Chat Completions API at `https://api.yutori.com/v1`. Because it's OpenAI-compatible, this reuses Stagehand's existing `openai` dependency and the provider-agnostic `V3CuaAgentHandler`: no new dependencies, no handler-shape changes for other providers.

Auth via `YUTORI_API_KEY` (or `clientOptions.apiKey` / `baseURL`). Ships the core tool set (`browser_tools_core-20260403`).

What's included

- `YutoriCUAClient`: screenshot-per-turn loop; `tool_call` → `AgentAction` with 1000×1000 → viewport coordinate denormalization; `role:"tool"` results with a `Current URL:` suffix; request payload trimming (keep recent screenshots under ~9.5 MB); completion when no `tool_calls`; stop-and-summarize on max steps. Mirrors the Yutori Python SDK reference loop.
- Provider registration: `AgentProvider`, `AVAILABLE_CUA_MODELS`, `AgentType`, and `providerEnvVarMap` (`yutori` → `YUTORI_API_KEY`); Navigator-specific `ClientOptions` (`toolSet`, `disableTools`, `jsonSchema`, `userTimezone`, `userLocation`) + the cloud API / OpenAPI schema.
- Keyboard modifiers: a general `modifiers` option on the understudy `page.click()` / `page.scroll()` that sets the CDP mouse-event modifier bitmask (reusable by any provider). Plus `hold_key` and `refresh` (via `page.reload`, with a faithful agent-replay step).
- Evals: the local bench harness skips building an AI-SDK text client for CUA-only models (`getAISDKLanguageModel` has no provider for them and `initV3` ignores it in CUA mode). General fix; also unblocks local evals for `microsoft/fara-7b`.

Testing

- `pnpm build` (typecheck) + the new unit suites pass; prettier/eslint clean.

Notes for reviewers

- The expanded tool set (`extract_elements`, `find`, `set_element_value`, `execute_js`) is intentionally a follow-up.
- `mouse_down` / `mouse_up` are disabled by default (no equivalent in the shared action handler; `drag` covers press-move-release).

Maintained by the Yutori team.
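As a rough illustration of the 1000×1000 → viewport denormalization described above, something like the following would do the mapping. The function name matches the diagram's `denormalizeCoordinates` step, but the signature, the rounding, and the clamping to the last valid pixel are assumptions rather than the PR's actual behavior.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Navigator reports coordinates in a normalized 1000x1000 space (per the PR description).
const NAVIGATOR_SPACE = 1000;

// Scale each axis independently, then round and clamp into the viewport.
// Rounding and clamping are assumptions made for this sketch.
function denormalizeCoordinates(
  x: number,
  y: number,
  viewport: Viewport,
): { x: number; y: number } {
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(Math.round((x / NAVIGATOR_SPACE) * viewport.width), viewport.width - 1),
    y: clamp(Math.round((y / NAVIGATOR_SPACE) * viewport.height), viewport.height - 1),
  };
}
```

For a 1280×720 viewport, the normalized point (500, 500) maps to pixel (640, 360). Because the two axes scale separately, the model's output stays independent of the viewport's aspect ratio.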
Summary by cubic

Adds the Yutori Navigator n1.5 computer-use model as a new provider via an OpenAI-compatible Chat Completions API, now defaulting to the expanded DOM tool set for richer page interaction.

New Features

- `yutori/n1.5-latest` CUA model with `YUTORI_API_KEY` auth and options (`toolSet`, `disableTools`, `jsonSchema`, `userTimezone`, `userLocation`, `temperature`).
- `extract_elements`, `find`, `set_element_value`, `execute_js` built on the a11y snapshot + `deepLocator`; coordinate tools can target a `ref` (resolved to on-screen center with scroll-into-view) and recover on stale refs.
- `YutoriCUAClient`: screenshot-per-turn loop, 1000×1000 coordinate mapping, payload trimming, per-tool results with current URL, stop-and-summarize on max steps (fully flow-logged), and structured `parsed_json` on `AgentResult.output`.
- `hold_key` delay and `refresh` with a recorded replay step; API/OpenAPI exposes provider `yutori` and Navigator options; `.env.example` includes `YUTORI_API_KEY`; local eval harness skips AI-SDK text clients for CUA-only models; example and tests included.

Migration

- Set `YUTORI_API_KEY` (and optional `baseURL`), then use: `stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" })`.
- Available options: `toolSet`, `disableTools`, `jsonSchema`, `userTimezone`, `userLocation`, `temperature` (use `toolSet: "browser_tools_core-20260403"` for coordinate-only).

Written for commit fcfbdb1. Summary will update on new commits.
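The auth setup in the migration notes might resolve along these lines. This is a sketch under assumptions: `resolveYutoriConfig` and `YutoriClientOptions` are hypothetical names, and the precedence (explicit `clientOptions` over the environment variable) is a guess. The default base URL and the throw on a missing key follow the PR description and diagram.

```typescript
// Hypothetical subset of the client options relevant to auth.
interface YutoriClientOptions {
  apiKey?: string;
  baseURL?: string;
}

// Default endpoint stated in the PR description.
const DEFAULT_YUTORI_BASE_URL = "https://api.yutori.com/v1";

// The environment is passed in explicitly so the sketch has no Node-specific dependency;
// a real caller would likely pass process.env.
function resolveYutoriConfig(
  options: YutoriClientOptions = {},
  env: Record<string, string | undefined> = {},
): { apiKey: string; baseURL: string } {
  // Assumed precedence: explicit option first, then YUTORI_API_KEY.
  const apiKey = options.apiKey ?? env.YUTORI_API_KEY;
  if (!apiKey) {
    throw new Error("Yutori API key missing: set YUTORI_API_KEY or pass clientOptions.apiKey.");
  }
  return { apiKey, baseURL: options.baseURL ?? DEFAULT_YUTORI_BASE_URL };
}
```

The resulting `apiKey` and `baseURL` are the two values an OpenAI-compatible client needs in order to be pointed at Navigator.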