Commit 5f1d70f

feat(sparsekernel): tighten browser actionability checks
1 parent 2717e7c commit 5f1d70f

4 files changed

Lines changed: 57 additions & 10 deletions

docs/architecture/browser-broker.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Set `OPENCLAW_RUNTIME_BROWSER_BROKER=native` to let SparseKernel launch and supe

 Set `OPENCLAW_RUNTIME_BROWSER_REQUIRE_PROXY=1` when a trust zone must use a proxy-backed browser egress path. The trust zone's network policy must contain a loopback `proxy_ref`, and native browser pools launch Chromium with `--proxy-server=<proxy_ref>`. Static or externally managed CDP endpoints are rejected in this mode unless `OPENCLAW_RUNTIME_BROWSER_EXTERNAL_PROXY_OK=1` asserts that the external browser process is already proxy-controlled. This protects the SparseKernel-owned browser process path; it is not host-level egress enforcement for arbitrary host processes.

-Supported v0 actions (`status`, `doctor`, `profiles`, `tabs`, `open`, `navigate`, `focus`, `close`, `snapshot`, `console`, `screenshot`, `pdf`, direct file-input `upload`, `dialog`, and brokered `act`) operate against broker-owned targets inside the leased CDP context. Brokered `act` covers the OpenClaw action contract for click, coordinate click, type, press, hover, scroll, drag, select, fill, resize, wait, evaluate, close, and batch using CDP input events plus bounded DOM evaluation. Selector-backed actions retry inside the leased page until their action timeout, and `wait --load networkidle` uses per-target CDP Network events plus a quiet window rather than only checking `document.readyState`. Actions that can change page state are followed by a broker-side navigation check: same-target navigations are accepted only when the resulting URL stays inside the context's allowed-origin policy, same-policy popups are attached as broker-owned targets, and disallowed popups are closed. When an allowed-origin policy is configured, the broker also enables CDP Fetch interception and fails requests outside that policy while recording `browser_network.blocked` observations; this is request control, not host isolation. Before opening or navigating, the ToolBroker checks the trust-zone network policy and denies unsupported schemes, private-network destinations when disallowed, literal denied CIDRs, and, when runtime policy enforcement is enabled, hostnames that resolve to denied/private addresses. Proxy-backed egress control remains future work. Snapshots use a bounded CDP `Runtime.evaluate` DOM read, actions resolve refs from the latest brokered snapshot where needed, console output is captured from CDP runtime/log events per target, and screenshot/PDF output is captured as SparseKernel artifacts, read back through artifact access, and converted to existing tool result formats for compatibility. Closing a broker-owned target now closes that target; the full browser context is released only when the last target closes or broker cleanup runs.
+Supported v0 actions (`status`, `doctor`, `profiles`, `tabs`, `open`, `navigate`, `focus`, `close`, `snapshot`, `console`, `screenshot`, `pdf`, direct file-input `upload`, `dialog`, and brokered `act`) operate against broker-owned targets inside the leased CDP context. Brokered `act` covers the OpenClaw action contract for click, coordinate click, type, press, hover, scroll, drag, select, fill, resize, wait, evaluate, close, and batch using CDP input events plus bounded DOM evaluation. Selector-backed actions retry inside the leased page until their action timeout and now require basic actionability before dispatch: visible connected target, stable bounding box, enabled form state where relevant, editable target for typing, and center-point hit testing. `wait --load networkidle` uses per-target CDP Network events plus a quiet window rather than only checking `document.readyState`. Actions that can change page state are followed by a broker-side navigation check: same-target navigations are accepted only when the resulting URL stays inside the context's allowed-origin policy, same-policy popups are attached as broker-owned targets, and disallowed popups are closed. When an allowed-origin policy is configured, the broker also enables CDP Fetch interception and fails requests outside that policy while recording `browser_network.blocked` observations; this is request control, not host isolation. Before opening or navigating, the ToolBroker checks the trust-zone network policy and denies unsupported schemes, private-network destinations when disallowed, literal denied CIDRs, and, when runtime policy enforcement is enabled, hostnames that resolve to denied/private addresses. Proxy-backed egress control remains future work. Snapshots use a bounded CDP `Runtime.evaluate` DOM read, actions resolve refs from the latest brokered snapshot where needed, console output is captured from CDP runtime/log events per target, and screenshot/PDF output is captured as SparseKernel artifacts, read back through artifact access, and converted to existing tool result formats for compatibility. Closing a broker-owned target now closes that target; the full browser context is released only when the last target closes or broker cleanup runs.

 Use `openclaw sparsekernel browser-pools` to inspect durable ledger pools and currently materialized native browser process pools. Native pool snapshots include trust zone, profile, active refs, max context slots, idle timeout, endpoint, PID when available, last activity, start count, clean stop count, and crash count.

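The "stable bounding box" and "center-point hit testing" requirements in the added text reduce to a little arithmetic on rectangles. The following is a minimal sketch of that arithmetic on plain data, with no DOM involved; the names `Rect`, `Viewport`, `rectsStable`, and `clampedCenter` are hypothetical and are not part of the broker's API. The 0.5 px tolerance and the clamping into the viewport are taken from the `index.ts` change in this commit.

```typescript
// Hypothetical plain-data mirror of the in-page geometry checks (not the broker's API).
interface Rect { x: number; y: number; width: number; height: number; }
interface Viewport { width: number; height: number; }

// Two snapshots count as stable when every component moved by less than `tolerance` px.
function rectsStable(a: Rect, b: Rect, tolerance = 0.5): boolean {
  return Math.abs(a.x - b.x) < tolerance &&
    Math.abs(a.y - b.y) < tolerance &&
    Math.abs(a.width - b.width) < tolerance &&
    Math.abs(a.height - b.height) < tolerance;
}

// Center of the rect, clamped into [0, size - 1] on each axis so the probe
// point always lies inside the viewport.
function clampedCenter(rect: Rect, viewport: Viewport): { x: number; y: number } {
  const clamp = (value: number, size: number): number =>
    Math.min(Math.max(value, 0), Math.max(size - 1, 0));
  return {
    x: clamp(rect.x + rect.width / 2, viewport.width),
    y: clamp(rect.y + rect.height / 2, viewport.height),
  };
}
```

In the page, the two snapshots would come from `getBoundingClientRect()` calls taken a short delay apart, and the clamped point would be passed to `document.elementFromPoint`.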

docs/architecture/local-agent-kernel.md

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ The browser broker model is:

 Important boundary: BrowserContext isolation is session isolation, not host isolation. Playwright route blocking and SSRF guards are useful controls, but they are not hard security boundaries.

-The broker applies configured trust-zone network policy to explicit allowed origins before allocating a context. It also denies unsupported URL schemes, private-network destinations when the policy disallows them, literal IPs matching denied CIDRs, and, when `OPENCLAW_RUNTIME_BROWSER_POLICY_ENFORCE=1` is set, hostnames that resolve to denied/private addresses. Set `OPENCLAW_RUNTIME_BROWSER_POLICY_DNS=0` only when a caller intentionally wants literal-host checks without DNS resolution. Set `OPENCLAW_RUNTIME_BROWSER_REQUIRE_PROXY=1` to require a valid loopback `network_policies.proxy_ref`; native browser pools then launch Chromium with `--proxy-server=<proxy_ref>` and reject static or externally managed CDP endpoints unless `OPENCLAW_RUNTIME_BROWSER_EXTERNAL_PROXY_OK=1` accepts that external process as already proxy-controlled. This is a concrete proxy-backed browser egress path, but it is still process configuration, not a kernel or VM boundary. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=cdp` and `OPENCLAW_SPARSEKERNEL_BROWSER_CDP_ENDPOINT=<loopback endpoint>` to make the OpenClaw browser tool acquire a real SparseKernel CDP context for the active run. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=managed` to use the existing OpenClaw browser control service as the managed process owner and let SparseKernel lease CDP contexts from its reported endpoint. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=native` to let SparseKernel launch and supervise a local Chromium-compatible process pool keyed by trust zone/profile, with process lifetime tied to brokered context leases and idle timeout. `OPENCLAW_SPARSEKERNEL_BROWSER_MAX_CONTEXTS` caps active contexts per native pool. Use `openclaw sparsekernel browser-pools` to inspect ledger pools plus in-process native pool refcounts, limits, start counts, clean stops, and crash counts. The runtime injects an internal browser proxy for supported navigation, tab, snapshot, console, screenshot, PDF, direct file-input upload, dialog, and action routes instead of exposing raw CDP to the agent. Brokered actions cover the OpenClaw action contract with CDP input events, bounded DOM evaluation, selector retry, per-target CDP-backed network-idle waiting, and post-action navigation checks. Same-target action navigations must stay inside the context's allowed origins when a policy is configured; same-policy popups are attached as broker-owned targets and disallowed popups are closed. Broker-owned targets and per-target console/network/artifact observations are persisted in first-class ledger tables and mirrored to audit events. CDP Fetch interception blocks out-of-policy requests when an allowed-origin policy is configured, but this remains request control rather than a hard security boundary. Closing a target releases the whole context only when no broker-owned targets remain. Screenshot and PDF outputs go through the artifact store.
+The broker applies configured trust-zone network policy to explicit allowed origins before allocating a context. It also denies unsupported URL schemes, private-network destinations when the policy disallows them, literal IPs matching denied CIDRs, and, when `OPENCLAW_RUNTIME_BROWSER_POLICY_ENFORCE=1` is set, hostnames that resolve to denied/private addresses. Set `OPENCLAW_RUNTIME_BROWSER_POLICY_DNS=0` only when a caller intentionally wants literal-host checks without DNS resolution. Set `OPENCLAW_RUNTIME_BROWSER_REQUIRE_PROXY=1` to require a valid loopback `network_policies.proxy_ref`; native browser pools then launch Chromium with `--proxy-server=<proxy_ref>` and reject static or externally managed CDP endpoints unless `OPENCLAW_RUNTIME_BROWSER_EXTERNAL_PROXY_OK=1` accepts that external process as already proxy-controlled. This is a concrete proxy-backed browser egress path, but it is still process configuration, not a kernel or VM boundary. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=cdp` and `OPENCLAW_SPARSEKERNEL_BROWSER_CDP_ENDPOINT=<loopback endpoint>` to make the OpenClaw browser tool acquire a real SparseKernel CDP context for the active run. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=managed` to use the existing OpenClaw browser control service as the managed process owner and let SparseKernel lease CDP contexts from its reported endpoint. Set `OPENCLAW_RUNTIME_BROWSER_BROKER=native` to let SparseKernel launch and supervise a local Chromium-compatible process pool keyed by trust zone/profile, with process lifetime tied to brokered context leases and idle timeout. `OPENCLAW_SPARSEKERNEL_BROWSER_MAX_CONTEXTS` caps active contexts per native pool. Use `openclaw sparsekernel browser-pools` to inspect ledger pools plus in-process native pool refcounts, limits, start counts, clean stops, and crash counts. The runtime injects an internal browser proxy for supported navigation, tab, snapshot, console, screenshot, PDF, direct file-input upload, dialog, and action routes instead of exposing raw CDP to the agent. Brokered actions cover the OpenClaw action contract with CDP input events, bounded DOM evaluation, selector retry with basic actionability checks, per-target CDP-backed network-idle waiting, and post-action navigation checks. Same-target action navigations must stay inside the context's allowed origins when a policy is configured; same-policy popups are attached as broker-owned targets and disallowed popups are closed. Broker-owned targets and per-target console/network/artifact observations are persisted in first-class ledger tables and mirrored to audit events. CDP Fetch interception blocks out-of-policy requests when an allowed-origin policy is configured, but this remains request control rather than a hard security boundary. Closing a target releases the whole context only when no broker-owned targets remain. Screenshot and PDF outputs go through the artifact store.

 ## Sandbox broker

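The rule that same-target navigations "must stay inside the context's allowed origins" can be pictured as an origin comparison. The sketch below is one possible shape for such a check, assuming a policy is configured and is expressed as a list of origin strings; `isUrlInAllowedOrigins` is a hypothetical name, and the broker's real policy matching is not shown in this commit and may differ (for example in how it treats wildcards or an empty policy).

```typescript
// Hypothetical helper, not the broker's implementation: compares the serialized
// origin (scheme + host + port) of a URL against a configured allow-list.
function isUrlInAllowedOrigins(rawUrl: string, allowedOrigins: readonly string[]): boolean {
  let origin: string;
  try {
    origin = new URL(rawUrl).origin;
  } catch {
    return false; // Unparseable URLs are treated as out of policy.
  }
  // Opaque origins (about:blank, data: URLs) serialize to "null" and never match.
  if (origin === "null") return false;
  return allowedOrigins.some((allowed) => {
    try {
      return new URL(allowed).origin === origin;
    } catch {
      return false; // Ignore malformed allow-list entries.
    }
  });
}
```

Comparing serialized origins rather than raw string prefixes means `https://example.com:443/x` matches an allowed `https://example.com`, while `http://example.com/` does not. As the surrounding text stresses, a check like this is request control, not a hard security boundary.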

packages/browser-broker/src/index.test.ts

Lines changed: 3 additions & 0 deletions
@@ -1234,6 +1234,9 @@ describe("@openclaw/sparsekernel-browser-broker", () => {
 expect(actionExpression).toContain("waitForActionTarget");
 expect(actionExpression).toContain("const timeoutMs = 1234");
 expect(actionExpression).toContain("document.querySelector(selector)");
+expect(actionExpression).toContain("getBoundingClientRect");
+expect(actionExpression).toContain("aria-disabled");
+expect(actionExpression).toContain("elementFromPoint");
 });

 it("waits for CDP network idle instead of treating document load as enough", async () => {
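These assertions inspect the generated expression as a string, which works because the builder embeds its inputs with `JSON.stringify`. The sketch below illustrates that embedding pattern in isolation; `buildPreamble` is a hypothetical stand-in for the real `buildActionTargetHelpers`, and the selector value is made up for the example.

```typescript
// Hypothetical stand-in for the broker's helper builder: embeds caller-supplied
// values into generated JavaScript source as JSON literals.
function buildPreamble(selector: string, timeoutMs: number, kind: string): string {
  return [
    `const selector = ${JSON.stringify(selector)};`,
    `const timeoutMs = ${JSON.stringify(timeoutMs)};`,
    `const actionKind = ${JSON.stringify(kind)};`,
  ].join("\n");
}

// A selector containing double quotes would break naive string concatenation.
const source = buildPreamble('button[data-id="save"]', 1234, "click");

// Evaluate the generated source to confirm the values round-trip unchanged.
const roundTrip = new Function(
  `${source}\nreturn { selector, timeoutMs, actionKind };`,
)() as { selector: string; timeoutMs: number; actionKind: string };
```

Because the values are serialized rather than concatenated, quotes inside a selector cannot terminate the string literal early, and a substring assertion such as `toContain("const timeoutMs = 1234")` stays meaningful.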

packages/browser-broker/src/index.ts

Lines changed: 52 additions & 8 deletions
@@ -2404,7 +2404,7 @@ function buildActionExpression(
 normalizeMouseButton(request.kind === "click" ? request.button : undefined),
 );
 return `(async () => {
-${buildActionTargetHelpers(selectorJson, timeoutJson)}
+${buildActionTargetHelpers(selectorJson, timeoutJson, request.kind)}
 const node = await waitForActionTarget();
 node.scrollIntoView({ block: "center", inline: "center" });
 if (${JSON.stringify(request.kind)} === "scrollIntoView") return { ok: true };
@@ -2426,7 +2426,7 @@ function buildActionExpression(
 const submit = request.submit === true;
 const slowly = request.slowly === true;
 return `(async () => {
-${buildActionTargetHelpers(selectorJson, timeoutJson)}
+${buildActionTargetHelpers(selectorJson, timeoutJson, request.kind)}
 const node = await waitForActionTarget();
 node.scrollIntoView({ block: "center", inline: "center" });
 node.focus?.();
@@ -2460,7 +2460,7 @@ function buildActionExpression(
 if (request.kind === "select") {
 const values = JSON.stringify(request.values);
 return `(async () => {
-${buildActionTargetHelpers(selectorJson, timeoutJson)}
+${buildActionTargetHelpers(selectorJson, timeoutJson, request.kind)}
 const node = await waitForActionTarget();
 node.scrollIntoView({ block: "center", inline: "center" });
 const values = ${values};
@@ -2574,24 +2574,68 @@ function buildEvaluateExpression(
 })()`;
 }

-function buildActionTargetHelpers(selectorJson: string, timeoutJson: string): string {
+function buildActionTargetHelpers(selectorJson: string, timeoutJson: string, kind: string): string {
 return `const selector = ${selectorJson};
 const timeoutMs = ${timeoutJson};
+const actionKind = ${JSON.stringify(kind)};
 const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
+const rectSnapshot = (node) => {
+  const rect = node.getBoundingClientRect();
+  return { x: rect.x, y: rect.y, width: rect.width, height: rect.height };
+};
 const isVisible = (node) => {
   if (!node?.isConnected) return false;
   const style = getComputedStyle(node);
-  if (style.visibility === "hidden" || style.display === "none") return false;
-  return node.getClientRects().length > 0;
+  if (style.visibility === "hidden" || style.display === "none" || style.pointerEvents === "none") return false;
+  const rect = rectSnapshot(node);
+  return rect.width > 0 && rect.height > 0 && node.getClientRects().length > 0;
+};
+const isEnabled = (node) => {
+  if (node.disabled === true) return false;
+  if (node.getAttribute?.("aria-disabled") === "true") return false;
+  return true;
+};
+const isEditable = (node) => {
+  if (node.isContentEditable) return true;
+  if (node instanceof HTMLTextAreaElement) return !node.readOnly && !node.disabled;
+  if (node instanceof HTMLInputElement) {
+    const type = String(node.type || "text").toLowerCase();
+    return !node.readOnly && !node.disabled && !["button", "checkbox", "color", "file", "hidden", "image", "radio", "range", "reset", "submit"].includes(type);
+  }
+  return false;
+};
+const receivesCenterHit = (node) => {
+  const rect = node.getBoundingClientRect();
+  const x = Math.min(Math.max(rect.left + rect.width / 2, 0), Math.max(window.innerWidth - 1, 0));
+  const y = Math.min(Math.max(rect.top + rect.height / 2, 0), Math.max(window.innerHeight - 1, 0));
+  const hit = document.elementFromPoint(x, y);
+  return !hit || hit === node || node.contains(hit);
+};
+const isStable = async (node) => {
+  const first = rectSnapshot(node);
+  await delay(50);
+  const second = rectSnapshot(node);
+  return Math.abs(first.x - second.x) < 0.5 &&
+    Math.abs(first.y - second.y) < 0.5 &&
+    Math.abs(first.width - second.width) < 0.5 &&
+    Math.abs(first.height - second.height) < 0.5;
+};
+const isActionable = async (node) => {
+  if (!isVisible(node)) return false;
+  if ((actionKind === "click" || actionKind === "type" || actionKind === "select") && !isEnabled(node)) return false;
+  if (actionKind === "type" && !isEditable(node)) return false;
+  if (!(await isStable(node))) return false;
+  if (actionKind !== "scrollIntoView" && !receivesCenterHit(node)) return false;
+  return true;
 };
 const waitForActionTarget = async () => {
   const deadline = Date.now() + timeoutMs;
   while (Date.now() <= deadline) {
     const node = document.querySelector(selector);
-    if (node && isVisible(node)) return node;
+    if (node && await isActionable(node)) return node;
     await delay(100);
   }
-  throw new Error("SparseKernel browser action target not found");
+  throw new Error("SparseKernel browser action target not actionable");
 };`;
 }

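The gating order inside `isActionable` is easier to see when the individual checks are reduced to booleans. The sketch below restates that order as a pure function that reports which check failed first. The `ActionKind` union, the `TargetChecks` shape, and `firstFailedCheck` are assumptions made for illustration and are not part of this commit; unlike the real helper, the sketch takes precomputed results instead of running the 50 ms stability wait lazily.

```typescript
// Hypothetical types: the commit passes `kind` as a plain string.
type ActionKind = "click" | "type" | "select" | "hover" | "scrollIntoView";

interface TargetChecks {
  visible: boolean;
  enabled: boolean;
  editable: boolean;
  stable: boolean;
  receivesCenterHit: boolean;
}

// Returns the name of the first failing check in the same order the diff
// applies them, or null when the target would be considered actionable.
function firstFailedCheck(kind: ActionKind, target: TargetChecks): keyof TargetChecks | null {
  if (!target.visible) return "visible";
  const needsEnabled = kind === "click" || kind === "type" || kind === "select";
  if (needsEnabled && !target.enabled) return "enabled";
  if (kind === "type" && !target.editable) return "editable";
  if (!target.stable) return "stable";
  if (kind !== "scrollIntoView" && !target.receivesCenterHit) return "receivesCenterHit";
  return null;
}
```

Read this way, a disabled element can still be hovered or scrolled into view but not clicked, typed into, or selected, and `scrollIntoView` is the only kind that skips the hit test. Returning the failing check's name is a design choice of the sketch; the committed code reports a single "not actionable" error after the timeout.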
