feat: Screenshot tool with DXCam backend reporting and UIAutomation hang fix #104

Open
yasuhirofujii-medley wants to merge 4 commits into CursorTouch:main from yasuhirofujii-medley:feat/fast-snapshot-no-tree

@yasuhirofujii-medley

Summary

This PR adds a dedicated Screenshot tool for fast screenshot-only capture, reports the capture backend (DXCam/Pillow) in the response, and skips expensive UIAutomation window enumeration in the Screenshot fast path.

These changes build on top of the use_ui_tree=False fast path introduced in PR #98.


Why this is needed

1. Screenshot tool — dedicated fast capture endpoint (65a9ed3)

Problem: The existing Snapshot tool, even with use_ui_tree=False, still carries the overhead of being a general-purpose tool. Callers who only need a screenshot have to specify multiple flags (use_vision=True, use_annotation=False, use_ui_tree=False). More importantly, there was no way to invoke a screenshot-only path with a simple, discoverable tool name.

Solution: Added a new Screenshot tool that is purpose-built for fast screenshot capture:

  • Fixed to use_vision=True, use_annotation=False, use_ui_tree=False
  • Accepts display parameter (list of display indices) for multi-monitor selection
  • Single-purpose tool with a clear name that agents can discover easily
  • DXCam (DirectX) hardware capture is used when display is specified (requires capture_rect)

Also added:

  • Desktop.parse_display_selection() for robust display parameter handling
  • Desktop.get_display_union_rect() for computing the capture region from display indices
  • Shared _capture_desktop_state() helper to deduplicate Snapshot/Screenshot implementation
  • WINDOWS_MCP_PROFILE_SNAPSHOT env var for per-stage timing instrumentation
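To illustrate what the two display helpers listed above are responsible for, here is a minimal standalone sketch. It is not the code from this PR: the accepted input forms, the error handling, and the (left, top, right, bottom) rectangle layout are assumptions, and the functions are written as free functions rather than Desktop methods.

```python
from typing import Iterable, Optional, Sequence

# Assumed rectangle layout: (left, top, right, bottom) in virtual-screen pixels.
Rect = tuple[int, int, int, int]


def parse_display_selection(
    display: Optional[Iterable[int]], display_count: int
) -> Optional[list[int]]:
    """Return sorted, de-duplicated display indices, or None for "all displays"."""
    if display is None:
        return None
    indices = []
    for item in display:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, int):
            raise ValueError(f"display index must be an integer, got {item!r}")
        if not 0 <= item < display_count:
            raise ValueError(
                f"display index {item} out of range (0..{display_count - 1})"
            )
        indices.append(item)
    if not indices:
        raise ValueError("display list must not be empty")
    return sorted(set(indices))


def get_display_union_rect(monitors: Sequence[Rect], indices: Sequence[int]) -> Rect:
    """Return the smallest rectangle that covers every selected monitor."""
    lefts, tops, rights, bottoms = zip(*(monitors[i] for i in indices))
    return (min(lefts), min(tops), max(rights), max(bottoms))
```

With two side-by-side 1920x1080 monitors, selecting both indices would give a union rectangle spanning the full 3840x1080 area, which is the kind of region a capture_rect would describe.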

2. Capture backend reporting (5484e46)

Problem: When debugging screenshot performance, there was no way to tell from the tool response whether DXCam (DirectX, ~10ms) or Pillow (GDI, ~100ms) was used for capture. This made it difficult to confirm that DXCam was actually being activated.

Solution: The get_screenshot() method now tracks the backend used (self._last_screenshot_backend), and the response includes a Screenshot Backend: dxcam or Screenshot Backend: pillow line. The DesktopState dataclass carries a screenshot_backend field.
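The backend tracking described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the ScreenshotSource class and backend_line helper are hypothetical, and the two capture callables stand in for the real DXCam and Pillow code paths so the example runs on any platform.

```python
from typing import Any, Callable, Optional


class ScreenshotSource:
    """Hypothetical stand-in for the part of Desktop that takes screenshots."""

    def __init__(
        self,
        fast_capture: Callable[[], Any],
        fallback_capture: Callable[[], Any],
    ) -> None:
        self._fast_capture = fast_capture          # stands in for a DXCam grab
        self._fallback_capture = fallback_capture  # stands in for a Pillow grab
        self._last_screenshot_backend: Optional[str] = None

    def get_screenshot(self) -> Any:
        # Try the fast path first. An exception or a None result is treated
        # as a miss (assumption), and the fallback is used instead.
        try:
            image = self._fast_capture()
        except Exception:
            image = None
        if image is not None:
            self._last_screenshot_backend = "dxcam"
            return image
        image = self._fallback_capture()
        self._last_screenshot_backend = "pillow"
        return image


def backend_line(backend: Optional[str]) -> str:
    """Format the response line, or return an empty string if unknown."""
    return f"Screenshot Backend: {backend}" if backend else ""
```

Recording the backend at the point where the capture succeeds keeps the reported value tied to the image that was actually returned, rather than to what was configured.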

3. Skip UIAutomation window enumeration for Screenshot tool (5b22d1b, 3d751df)

Problem: Desktop.get_state() unconditionally called get_controls_handles(), get_windows(), and get_active_window() — even when use_ui_tree=False (Screenshot tool). These are UIAutomation API calls that enumerate windows via COM/WM messages. When an application is launching and not responding to window messages (e.g., showing a splash screen), these calls hang for tens of seconds (observed: 47 seconds for a single screenshot).

This is the same class of problem that PR #98 addressed for tree capture, but the window enumeration calls were left in place because the Snapshot response includes window metadata. For the Screenshot tool, however, this metadata is not needed — the purpose is strictly to capture the screen image as fast as possible.

Solution: When use_ui_tree=False, get_state() now skips all three UIAutomation window enumeration calls and returns empty window lists. This eliminates the hang entirely for the Screenshot path.
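The control flow of this change can be sketched as below. The StateSketch dataclass and the injected capture and enumerate_windows callables are hypothetical placeholders for the real DesktopState and UIAutomation calls; only the branching on use_ui_tree reflects what the PR describes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StateSketch:
    """Hypothetical, reduced stand-in for DesktopState."""

    windows: list = field(default_factory=list)
    active_window: Optional[Any] = None
    screenshot: Optional[Any] = None


def get_state(
    use_ui_tree: bool,
    capture: Callable[[], Any],
    enumerate_windows: Callable[[], tuple[list, Any]],
) -> StateSketch:
    if use_ui_tree:
        # Full path: window enumeration may block if an application
        # is not responding to window messages.
        windows, active = enumerate_windows()
    else:
        # Fast path: never touch window enumeration, return empty metadata.
        windows, active = [], None
    return StateSketch(windows=windows, active_window=active, screenshot=capture())
```

Because the fast path never calls enumerate_windows, a hung application cannot delay the screenshot, at the cost of an empty window list in the result.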

The comment explaining this was initially written in Japanese, which caused an encoding corruption issue when uv fetched the package from GitHub — multi-byte characters were mangled, newlines were swallowed, and an if statement was merged into a comment line, producing an IndentationError on startup. The comment was rewritten in English to avoid this.


Changes

src/windows_mcp/__main__.py

  • Added Screenshot tool with display parameter
  • Extracted _capture_desktop_state() shared helper (used by both Snapshot and Screenshot)
  • Added _snapshot_profile_enabled() and _as_bool() helpers
  • Added _build_snapshot_response() to deduplicate response construction
  • Response includes Screenshot Backend: line when available
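A caller that wants to confirm which backend was used could pull the Screenshot Backend: line out of the response text. The sketch below assumes the line appears on its own line in the form shown in this PR; the parse_screenshot_backend function name is hypothetical and the rest of the response layout is not assumed.

```python
import re
from typing import Optional

# Matches a line such as "Screenshot Backend: dxcam" anywhere in the text.
_BACKEND_RE = re.compile(r"^Screenshot Backend:\s*(\w+)\s*$", re.MULTILINE)


def parse_screenshot_backend(response_text: str) -> Optional[str]:
    """Return the reported backend in lowercase, or None if the line is absent."""
    match = _BACKEND_RE.search(response_text)
    return match.group(1).lower() if match else None
```

Returning None when the line is missing keeps the parser compatible with responses that do not report a backend.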

src/windows_mcp/desktop/service.py

  • get_state(): Skip get_controls_handles/get_windows/get_active_window when use_ui_tree=False
  • get_screenshot(): Track _last_screenshot_backend (dxcam/pillow)
  • Added parse_display_selection() for display parameter validation
  • Added get_display_union_rect() for computing display capture region
  • Added per-stage profiling when WINDOWS_MCP_PROFILE_SNAPSHOT=1

src/windows_mcp/desktop/views.py

  • Added screenshot_backend: str | None field to DesktopState

src/windows_mcp/tree/service.py

  • Added screen_box property (used as fallback root box when UI tree is skipped)

tests/test_snapshot_display_filter.py

  • Added tests for parse_display_selection()
  • Added tests for display-filtered screenshot dimensions
  • Added tests for use_ui_tree=False tree skip + use_dom validation

Behavior

Default behavior (no breaking changes)

  • Snapshot tool continues to work exactly as before
  • All existing parameters and defaults are preserved

New Screenshot tool

{
  "tool": "Screenshot",
  "display": [0]
}

Returns a fast screenshot with DXCam backend (when available), no UI tree, no window enumeration.

Performance impact

| Scenario | Before | After |
| --- | --- | --- |
| Screenshot during app launch (UIAutomation hang) | ~50s | <1s |
| Normal Screenshot with DXCam | ~200ms | ~200ms |
| Snapshot (use_ui_tree=True) | unchanged | unchanged |

Testing

python -m pytest -q tests/test_snapshot_display_filter.py
# 11 passed

yasuhirofujii-medley added 4 commits March 13, 2026 09:25
Track the backend (dxcam/pillow) used by get_screenshot() and
store it in DesktopState.screenshot_backend.
Add a 'Screenshot Backend: dxcam/pillow' line to the response text.

The Control Node can parse this information and show it in its logs,
which makes it possible to confirm remotely whether DirectX capture is enabled.

When use_ui_tree=False (Screenshot tool), skip get_controls_handles /
get_windows / get_active_window.
These UIAutomation APIs can hang while an application is launching,
and there were cases where Screenshot was blocked for 47 seconds or more.

uv cache fetch corrupted multi-byte (Japanese) characters in comments,
causing newlines to be swallowed and merging the if-statement into
the comment line, resulting in IndentationError on startup.