feat: add Yutori Navigator n1.5 as a computer-use (CUA) agent provider #2194

Open
lawrencechen98 wants to merge 4 commits into browserbase:main from yutori-ai:yutori-navigator-cua-upstream

Conversation

@lawrencechen98 lawrencechen98 commented Jun 5, 2026

Draft — opening early for visibility/feedback. Happy to adjust scope or split (see "Notes for reviewers").

Summary

Adds Yutori Navigator n1.5 as a computer-use agent provider, alongside the existing OpenAI / Anthropic / Google / Microsoft CUA clients. Navigator is a computer-use model (screenshot in, coordinate-based tool_calls out in a normalized 1000×1000 space) served via an OpenAI-compatible Chat Completions API at https://api.yutori.com/v1. Because it's OpenAI-compatible, this reuses Stagehand's existing openai dependency and the provider-agnostic V3CuaAgentHandler: no new dependencies, no handler-shape changes for other providers.

const agent = stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" });
await agent.execute({ instruction: "...", maxSteps: 30 });

Auth via YUTORI_API_KEY (or clientOptions.apiKey / baseURL). Ships the core tool set (browser_tools_core-20260403).

What's included

  • YutoriCUAClient: screenshot-per-turn loop; tool_call → AgentAction mapping with 1000×1000 → viewport coordinate denormalization; role:"tool" results with a Current URL: suffix; request payload trimming (drops older screenshots to stay under ~9.5 MB, keeping the most recent); completion when the response has no tool_calls; stop-and-summarize on max steps. Mirrors the Yutori Python SDK reference loop.
  • Provider registration: AgentProvider, AVAILABLE_CUA_MODELS, AgentType, and providerEnvVarMap (yutori → YUTORI_API_KEY); Navigator-specific ClientOptions (toolSet, disableTools, jsonSchema, userTimezone, userLocation), plus the cloud API / OpenAPI schema.
  • Keyboard modifiers, implemented generically: a new optional modifiers option on the understudy page.click() / page.scroll() sets the CDP mouse-event modifier bitmask (reusable by any provider). Also adds hold_key and refresh (via page.reload, with a faithful agent-replay step).
  • Evals — the local bench harness no longer builds an AI-SDK text client for CUA-only models (getAISDKLanguageModel has no provider for them and initV3 ignores it in CUA mode). General fix; also unblocks local evals for microsoft/fara-7b.
  • Tests — unit coverage for the client (action mapping, message/trajectory shape, structured output, error recovery, stop-and-summarize), the helpers (coordinate denorm/validation, key mapping, payload trimming), the handler (modifiers/hold/refresh + URL freshness), and API serialization; plus a usage example.
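
For readers unfamiliar with the normalized coordinate space, the 1000×1000 → viewport mapping mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `denormalizePoint`, the `Viewport` and `Point` types, and the rounding and clamping behavior are all assumptions.

```typescript
// Hypothetical helper (not the PR's implementation): maps a point from the
// model's normalized 1000x1000 space to viewport pixels.
interface Viewport {
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

const NORMALIZED_SIZE = 1000;

function denormalizePoint(p: Point, viewport: Viewport): Point {
  // Reject NaN/Infinity early so a malformed tool_call fails loudly.
  if (!Number.isFinite(p.x) || !Number.isFinite(p.y)) {
    throw new Error(`Invalid normalized coordinate: (${p.x}, ${p.y})`);
  }
  // Assumption: out-of-range values are clamped to the last addressable pixel.
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(Math.round((p.x / NORMALIZED_SIZE) * viewport.width), viewport.width - 1),
    y: clamp(Math.round((p.y / NORMALIZED_SIZE) * viewport.height), viewport.height - 1),
  };
}
```

Under these assumptions, the center of the normalized space (500, 500) lands on the center of a 1280×720 viewport, and the far corner (1000, 1000) is clamped to the last pixel rather than falling outside the viewport.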

Testing

  • pnpm build (typecheck) + the new unit suites pass; prettier/eslint clean.
  • Verified live end-to-end against the real Navigator API (local headless Chrome): multi-step click/type/keypress tasks complete with correct DOM end-state.

Notes for reviewers

  • Core tool set only. The expanded/DOM tool set (extract_elements, find, set_element_value, execute_js) is intentionally a follow-up.
  • The cloud-API/OpenAPI additions expose Navigator config through the hosted API schema for consistency; happy to drop or split these (and/or the evals harness fix) into separate PRs if you'd prefer a smaller first PR.
  • mouse_down/mouse_up are disabled by default (no equivalent in the shared action handler; drag covers press-move-release).

Maintained by the Yutori team.


Summary by cubic

Adds the Yutori Navigator n1.5 computer-use model as a new provider via an OpenAI-compatible Chat Completions API, now defaulting to the expanded DOM tool set for richer page interaction.

  • New Features

    • New yutori/n1.5-latest CUA model with YUTORI_API_KEY auth and options (toolSet, disableTools, jsonSchema, userTimezone, userLocation, temperature).
    • Expanded tools (default): extract_elements, find, set_element_value, execute_js built on the a11y snapshot + deepLocator; coordinate tools can target a ref (resolved to on-screen center with scroll-into-view) and recover on stale refs.
    • YutoriCUAClient: screenshot-per-turn loop, 1000×1000 coordinate mapping, payload trimming, per-tool results with current URL, stop-and-summarize on max steps (fully flow-logged), and structured parsed_json on AgentResult.output.
    • Generic click/scroll modifiers via CDP bitmask (captured and replayed), hold_key delay, and refresh with a recorded replay step; API/OpenAPI exposes provider yutori and Navigator options; .env.example includes YUTORI_API_KEY; local eval harness skips AI-SDK text clients for CUA-only models; example and tests included.
  • Migration

    • Set YUTORI_API_KEY (and optional baseURL), then use: stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" }).
    • Optional model options: toolSet, disableTools, jsonSchema, userTimezone, userLocation, temperature (use toolSet: "browser_tools_core-20260403" for coordinate-only).

Written for commit fcfbdb1. Summary will update on new commits.

changeset-bot Bot commented Jun 5, 2026

🦋 Changeset detected

Latest commit: fcfbdb1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@browserbasehq/stagehand Patch
@browserbasehq/stagehand-evals Patch
@browserbasehq/stagehand-server-v3 Patch

github-actions Bot commented Jun 5, 2026


This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
Approving the latest commit mirrors it into an internal PR owned by the approver.
If new commits are pushed later, the internal PR stays open but is marked stale until someone approves the latest external commit and refreshes it.

github-actions Bot added labels Jun 5, 2026: external-contributor (tracks PRs mirrored from external contributor forks) and external-contributor:awaiting-approval (waiting for a stagehand team member to approve the latest external commit).

Integrates Yutori's Navigator n1.5 computer-use model as a Stagehand CUA
provider, mirroring the existing OpenAI/Anthropic/Google/Microsoft CUA clients.
Navigator is OpenAI-compatible Chat Completions at https://api.yutori.com/v1
(screenshot in, coordinate tool_calls out in a normalized 1000x1000 space), so
this reuses the existing `openai` dependency and the provider-agnostic CUA
handler — no new dependencies.

Usage: stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" }).
Auth via YUTORI_API_KEY or clientOptions (apiKey/baseURL). Core tool set.

- YutoriCUAClient: screenshot-per-turn loop; tool_call -> AgentAction with
  1000x1000 coordinate denormalization; role:"tool" results with a current-URL
  suffix; payload trimming; completion when no tool calls; stop-and-summarize on
  max steps. Faithful to the Yutori Python SDK reference loop.
- Provider registration (AgentProvider, AVAILABLE_CUA_MODELS, AgentType,
  providerEnvVarMap) and Navigator ClientOptions (toolSet/disableTools/
  jsonSchema/userTimezone/userLocation), incl. the cloud API + OpenAPI schema.
- Keyboard modifiers via a general page.click/scroll `modifiers` option (sets
  the CDP mouse-event modifiers bitmask); hold-key; refresh via page.reload with
  a faithful agent-replay step.
- Evals: skip building an AI-SDK text client for CUA-only models in the local
  bench harness path (also unblocks microsoft/fara-7b).
- Unit tests + usage example.
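
The payload-trimming item above can be pictured with a small sketch: replace the oldest screenshots with a text placeholder until the serialized request fits a byte budget, and never touch the newest one. This is not the PR's implementation; `trimOldScreenshots`, `byteSize`, the message types, and the placeholder text are assumptions made for illustration.

```typescript
// Hypothetical sketch of screenshot trimming over an OpenAI-style message list.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string | ContentPart[];
}

// Size of the JSON-serialized messages in UTF-8 bytes.
const byteSize = (messages: ChatMessage[]): number =>
  new TextEncoder().encode(JSON.stringify(messages)).length;

function trimOldScreenshots(messages: ChatMessage[], maxBytes: number): ChatMessage[] {
  // Deep-copy so the caller's full history is left untouched.
  const trimmed: ChatMessage[] = JSON.parse(JSON.stringify(messages));

  // Collect the position of every image part, oldest first.
  const images: Array<{ msg: number; part: number }> = [];
  trimmed.forEach((m, msg) => {
    if (Array.isArray(m.content)) {
      m.content.forEach((p, part) => {
        if (p.type === "image_url") images.push({ msg, part });
      });
    }
  });

  // Swap the oldest images for a placeholder until the payload fits.
  // The loop stops before the last entry, so the latest screenshot survives.
  for (let i = 0; i < images.length - 1 && byteSize(trimmed) > maxBytes; i++) {
    const { msg, part } = images[i];
    (trimmed[msg].content as ContentPart[])[part] = {
      type: "text",
      text: "[screenshot omitted]",
    };
  }
  return trimmed;
}
```

One consequence of this design is that the result can still exceed the budget when the latest screenshot alone is too large; the sketch accepts that rather than sending a turn with no image.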

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lawrencechen98 force-pushed the yutori-navigator-cua-upstream branch from 931e0b7 to ce6c06e June 5, 2026 22:15
lawrencechen98 marked this pull request as ready for review June 5, 2026 22:27

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 23 files

Confidence score: 3/5

  • There is some real merge risk: stopAndSummarize() in packages/core/lib/v3/agent/YutoriCUAClient.ts bypasses expected flowLogger instrumentation, which can reduce traceability and make agent behavior/debugging less reliable.
  • packages/core/lib/v3/handlers/v3CuaAgentHandler.ts has a concrete replay correctness concern—modifiers used during CUA action execution are not persisted, so replay can diverge for modifier-dependent actions.
  • Given a high-confidence medium/high-severity instrumentation gap plus a replay-divergence bug, this looks mergeable only with caution rather than a low-risk merge.
  • Pay close attention to packages/core/lib/v3/agent/YutoriCUAClient.ts and packages/core/lib/v3/handlers/v3CuaAgentHandler.ts - missing flow logging and non-persisted modifiers can cause observability and replay consistency issues.
Architecture diagram
sequenceDiagram
    participant User as User Code
    participant Stagehand as Stagehand Instance
    participant Agent as Agent Provider
    participant Client as YutoriCUAClient
    participant Handler as V3CuaAgentHandler
    participant Page as Understudy Page
    participant CDP as Chrome DevTools Protocol
    participant Navigator as Yutori Navigator API

    Note over User,Navigator: NEW: Yutori Navigator n1.5 CUA Agent Provider

    User->>Stagehand: stagehand.agent({ mode: "cua", model: "yutori/n1.5-latest" })
    Stagehand->>Agent: create provider (modelToAgentProviderMap)
    Agent->>Client: new YutoriCUAClient(type, model, instructions, clientOptions)
    alt API key missing
        Client-->>Agent: throw Error
    end
    Client->>Client: configure toolSet, disableTools, jsonSchema, userTimezone, userLocation
    Stagehand-->>User: agent instance

    User->>Stagehand: agent.execute({ instruction, maxSteps })
    Stagehand->>Handler: new V3CuaAgentHandler(..., client=YutoriCUAClient)

    Handler->>Client: setScreenshotProvider()
    Handler->>Client: setActionHandler()
    Handler->>Handler: capture screenshot (page.screenshot)
    Handler->>Client: setCurrentUrl(page.url())
    Client->>Client: build message history (system prompt + user instruction with location/timezone context)

    loop Step Loop (maxSteps)
        Client->>Client: clone messages for request
        Client->>Client: trimImagesToFit (drop old screenshots under ~9.5 MB, keep latest)
        Client->>Navigator: POST /v1/chat/completions (OpenAI-compatible)
        Note over Client,Navigator: Extra params: tool_set, disable_tools, json_schema
        Navigator-->>Client: response with tool_calls (1000x1000 normalized coordinates)
        Client->>Client: parse tool_calls from assistant message

        alt No tool_calls
            Client-->>Handler: return final result (completed)
        else Has tool_calls
            par For each tool_call
                Client->>Client: denormalizeCoordinates (1000→viewport pixels)
                Client->>Client: mapNavigatorKeyToPlaywright (Navigator keys→Playwright keys)
                Client->>Handler: actionHandler(action)
                alt Action type: click (with possible modifier)
                    Handler->>Page: click(x, y, { modifiers })
                    Page->>CDP: dispatchMouseEvent(..., modifiers bitmask)
                else Action type: scroll (with possible modifier)
                    Handler->>Page: scroll(x, y, deltaX, deltaY, { modifiers })
                    Page->>CDP: dispatchMouseEvent(..., modifiers bitmask)
                else Action type: keypress (with optional holdMs delay)
                    Handler->>Page: keyPress(key, { delay })
                else Action type: refresh
                    Handler->>Page: reload({ waitUntil: "load" })
                    Page->>CDP: Page.reload
                else Action type: type, goto, back, forward, wait, drag
                    Handler->>Page: execute action
                end
                alt Action succeeded
                    Page-->>Handler: success
                else Action threw error
                    Page-->>Handler: error
                    Handler->>Page: still update client URL (page.url())
                    Handler-->>Client: action result with [ERROR]
                end
            end
            Client->>Client: append tool result (role:"tool" + "Current URL:" suffix)
            Client->>Client: captureScreenshot() for next turn
        end

        alt Payload size > max bytes
            Client->>Client: trimImagesToFit (strip old screenshots, keep recent)
        end
    end

    alt Max steps reached (no completion)
        Client->>Client: formatStopAndSummarize(task)
        Client->>Navigator: final request with summarize prompt (no json_schema)
        Navigator-->>Client: summary text
        Client-->>Handler: return result (completed=false, summary message)
    end

    Handler-->>Stagehand: AgentResult (output may include parsed_json)
    Stagehand-->>User: execution result

Comment thread packages/core/lib/v3/agent/YutoriCUAClient.ts
Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts Outdated
lawrencechen98 and others added 3 commits June 9, 2026 10:36
Address review: stopAndSummarize() made a direct chat.completions.create
call without FlowLogger instrumentation. Wrap it with
FlowLogger.logLlmRequest/logLlmResponse, mirroring predict(), so every
direct Navigator LLM call is flow-logged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
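
The pattern in the commit above, pairing every direct model call with a request log and a response log, can be sketched generically. The real FlowLogger signatures are not shown in this PR page, so everything here is hypothetical: the `LlmFlowLogger` interface, `loggedCompletion`, and the entry fields are stand-ins.

```typescript
// Hypothetical logger shape; not Stagehand's FlowLogger API.
interface LlmFlowLogger {
  logRequest(entry: { requestId: string; model: string; prompt: string }): void;
  logResponse(entry: { requestId: string; model: string; output: string }): void;
}

// Wraps one model call so request and response are logged under one id.
async function loggedCompletion(
  logger: LlmFlowLogger,
  model: string,
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  const requestId = `req-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
  logger.logRequest({ requestId, model, prompt });
  // Assumption: a failed call propagates to the caller and is not logged
  // as a response.
  const output = await callModel(prompt);
  logger.logResponse({ requestId, model, output });
  return output;
}
```

Routing both the main prediction path and the summarize path through one wrapper like this is one way to make sure no direct call skips instrumentation.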
…rolls

Address review (P2): modifiers applied during CUA action execution were not
captured in recorded agent-replay steps, so a cached replay re-ran a chorded
click/scroll as a plain one.

Thread modifiers through the selector-based path symmetrically with the
coordinate path:
- Add a shared `cdpModifierMask` helper (understudy/modifiers.ts); reuse it in
  Page (dedupes the previous private copy) and add modifier support to
  Locator.click via the CDP mouse-event modifiers bitmask.
- Action gains optional `modifiers`; performUnderstudyMethod forwards them to
  the locator click; takeDeterministicAction passes action.modifiers.
- The CUA handler records modifiers on the replay step for click (Action) and
  scroll (AgentReplayScrollStep); AgentCache re-applies them on replay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
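
For context on the bitmask this commit threads through click and scroll: the `modifiers` field of CDP's Input.dispatchMouseEvent is a bit field with Alt=1, Ctrl=2, Meta/Command=4, Shift=8. A minimal sketch of such a helper follows; it is not the `cdpModifierMask` added in this PR, and the `Modifier` type and `modifierMask` name are assumptions.

```typescript
// Hypothetical helper: folds modifier names into the CDP bit field.
type Modifier = "Alt" | "Control" | "Meta" | "Shift";

// Bit values for the `modifiers` field of CDP Input.dispatchMouseEvent.
const CDP_MODIFIER_BITS: Record<Modifier, number> = {
  Alt: 1,
  Control: 2,
  Meta: 4,
  Shift: 8,
};

function modifierMask(modifiers: readonly Modifier[] = []): number {
  // Bitwise OR makes duplicates harmless and the order irrelevant.
  return modifiers.reduce((mask, m) => mask | CDP_MODIFIER_BITS[m], 0);
}
```

With these values, a Ctrl+Shift click carries a mask of 10, and an empty list yields 0, which matches a plain click. Recording the modifier names (rather than the mask) on a replay step keeps the cached step readable and lets replay recompute the same mask.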
…est)

Enable the expanded Navigator tool set (browser_tools_expanded-20260403) and
make it the default for yutori/n1.5-latest, on top of the core coordinate tools.

- DOM tools backed by Stagehand's a11y snapshot + deepLocator:
  - extract_elements / find render the hybrid accessibility tree in Navigator's
    format, minting stable ref_N tokens (NavigatorRefRegistry).
  - set_element_value resolves a ref to its xpath and fills via deepLocator.
  - execute_js evaluates JS in the page (expression-first, body fallback).
- ref-targeted coordinate tools: click/scroll/etc. may carry a `ref` instead of
  coordinates; it resolves to the element's on-screen center (deepLocator
  centroid, scroll-into-view), taking priority over model coordinates and
  falling back to them. A ref'd scroll scrolls the element into view.
- The CUA handler supplies a generic page bridge (a11y snapshot + evaluate +
  elementCenter); all Navigator-specific logic stays in the client.
- Unknown/stale refs return a recoverable error so the model re-extracts.

Unit tests cover rendering/find/ref resolution + the four-tool dispatch,
tool-set selection, scroll-into-view, and stale-ref handling; a
YUTORI_API_KEY-gated integration spec exercises the tools live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
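
The ref handling described in this commit (minting ref_N tokens, resolving a ref to an xpath, and returning a recoverable error for unknown or stale refs) can be sketched as a small registry. This is not the PR's NavigatorRefRegistry: the `RefRegistry` class, its method names, the result shape, and the choice to reuse a token for a repeated xpath are assumptions.

```typescript
// Hypothetical registry mapping ref_N tokens to xpaths.
type ResolveResult =
  | { ok: true; xpath: string }
  | { ok: false; error: string };

class RefRegistry {
  private nextId = 1;
  private readonly refToXpath = new Map<string, string>();
  private readonly xpathToRef = new Map<string, string>();

  // Returns the existing token for an xpath, or mints a new ref_N.
  // Assumption: "stable" means the same xpath keeps the same token.
  mint(xpath: string): string {
    const existing = this.xpathToRef.get(xpath);
    if (existing !== undefined) return existing;
    const ref = `ref_${this.nextId++}`;
    this.refToXpath.set(ref, xpath);
    this.xpathToRef.set(xpath, ref);
    return ref;
  }

  // Unknown refs produce an error value instead of throwing, so the caller
  // can hand the message back to the model as a tool result.
  resolve(ref: string): ResolveResult {
    const xpath = this.refToXpath.get(ref);
    return xpath !== undefined
      ? { ok: true, xpath }
      : { ok: false, error: `Unknown or stale ref "${ref}"; call extract_elements again.` };
  }
}
```

Returning an error value rather than throwing keeps the agent loop alive: the model sees the message in the tool result and can re-extract elements to get fresh refs.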