feat: add 9 advanced automation tools with DPI coordinate system support #93

Open
Vaibhav-api-code wants to merge 10 commits into CursorTouch:main from Vaibhav-api-code:feature/advanced-vision-tools

@Vaibhav-api-code

Summary

Adds 9 new tools to Windows-MCP that expand its desktop automation capabilities, bringing it closer to feature parity with macOS automation solutions. All coordinate-accepting tools support a coordinate_system parameter ("physical" or "logical") for DPI-aware operation.

New Tools

| Tool | Description | Key Features |
| --- | --- | --- |
| CursorPosition | Get current mouse (x, y) coordinates | Read-only, no deps |
| PixelColor | Get RGB color at screen coordinates | Hex code + nearest named color (20-color palette) |
| KeyHold | Press/release keys independently (down/up) | 40+ key names (shift, ctrl, alt, f1-f12, arrows, etc.) |
| ScreenInfo | Get screen dimensions and DPI scaling | Virtual screen size + scale factor |
| ScreenHighlight | Highlight a screen region with colored rectangle | GDI overlay with auto-cleanup, 4 colors |
| MousePath | Move mouse along a multi-point path | Bezier smoothing, configurable duration |
| ScreenReader | OCR: read text from screen region | Windows OCR (built-in) + pytesseract fallback |
| WaitForChange | Wait until screen region visually changes | Pixel-by-pixel comparison, configurable threshold |
| FindImage | Template matching: find image on screen | OpenCV-based, returns center coords + confidence |

DPI Coordinate System

All 6 coordinate-accepting tools support a coordinate_system parameter:

  • "physical" (default) — raw pixel coordinates, no conversion
  • "logical" — coordinates are multiplied by the system DPI scale factor

Three internal helpers handle the conversion:

  • _to_physical(loc, system) — for [x, y] coordinates
  • _region_to_physical(region, system) — for [x, y, w, h] regions
  • _path_to_physical(path, system) — for [[x,y], ...] waypoint lists

Dependencies

No new required dependencies. All tools work with the existing pillow + pywin32 stack.

Optional dependency groups added to pyproject.toml:

pip install 'windows-mcp[vision]'  # opencv-python-headless, numpy
pip install 'windows-mcp[ocr]'     # pytesseract
pip install 'windows-mcp[all]'     # everything

Tools gracefully degrade with clear install instructions when optional deps are missing.

Code Changes

This PR is purely additive — no existing code is modified except one import line in service.py (adding _approximate_color_name to the utils import).

| File | Changes |
| --- | --- |
| src/windows_mcp/__main__.py | +288 lines: 3 DPI helpers + 9 tool registrations |
| src/windows_mcp/desktop/service.py | +384 lines: 9 implementation methods + 2 constant dicts |
| src/windows_mcp/desktop/utils.py | +36 lines: color name lookup table + helper |
| pyproject.toml | +12 lines: optional dependency groups |
| README.md | +9 lines: new tools in tools table |
| tests/ | +10 new test files (899 lines total) |

Testing

10 comprehensive test files with 80+ test cases covering:

  • All 9 tools with success paths and error handling
  • DPI coordinate conversion (physical passthrough, logical scaling at 100%/150%/200%)
  • Edge cases: unknown keys, invalid coordinates, missing optional deps
  • GDI handle validation, BMP format handling, template matching

All tests use unittest.mock to avoid Windows-specific runtime dependencies.

Test Plan

  • All new tests pass with pytest tests/
  • No modifications to existing tests
  • Diff is purely additive (1528 insertions, 1 deletion for import update)
  • Code follows existing project patterns (@mcp.tool(), @with_analytics, Desktop class methods)
  • 3 rounds of code review completed (addressed 2 CRITICAL, 2 HIGH, 4 MEDIUM issues)

🤖 Generated with Claude Code

Add cursor position, pixel color, key hold/release, screen info,
highlight region, mouse path, OCR screen reader, wait-for-change,
and find-image (template matching) tools.

All coordinate-accepting tools support a `coordinate_system` parameter
("physical" or "logical") for DPI-aware operation.

New optional dependencies for vision (opencv) and OCR (pytesseract)
in pyproject.toml. Includes 10 comprehensive test files (80+ tests).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copilot AI review requested due to automatic review settings March 8, 2026 04:53

Copilot AI left a comment

Pull request overview

This PR expands windows-mcp desktop automation by adding 9 new MCP tools and related Desktop service implementations, plus DPI-aware coordinate conversion helpers and optional dependency extras for OCR/vision features.

Changes:

  • Added 3 DPI coordinate conversion helpers (_to_physical, _region_to_physical, _path_to_physical) and registered 9 new MCP tools in __main__.py.
  • Implemented the 9 new tool backends in Desktop (cursor position, pixel color, key hold, screen info, highlight, mouse path, OCR, change detection, template matching).
  • Added color-name approximation utility, optional dependency groups (vision, ocr, all), README tool list updates, and a new test suite covering the new tools/helpers.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| src/windows_mcp/__main__.py | Adds DPI coordinate conversion helpers and registers new MCP tools that wrap Desktop methods. |
| src/windows_mcp/desktop/service.py | Adds tool implementations in Desktop, plus VK/color constants for KeyHold/Highlight. |
| src/windows_mcp/desktop/utils.py | Adds named-color palette and _approximate_color_name helper for PixelColor output. |
| pyproject.toml | Adds optional dependency extras for vision (OpenCV/numpy) and OCR (pytesseract). |
| README.md | Documents the newly added tools in the README tool list. |
| tests/test_cursor_position.py | Tests CursorPosition behavior via mocked UIA cursor coordinates. |
| tests/test_pixel_color.py | Tests PixelColor output formatting and color name approximation helper. |
| tests/test_key_hold.py | Tests key hold/release behavior and VK map essentials. |
| tests/test_screen_info.py | Tests monitor parsing and fallback behavior for ScreenInfo. |
| tests/test_highlight.py | Tests ScreenHighlight input validation and color map presence. |
| tests/test_mouse_path.py | Tests mouse path validation and endpoint visitation behavior. |
| tests/test_screen_reader.py | Tests OCR flow, region capture bbox behavior, and error handling. |
| tests/test_wait_for_change.py | Tests change detection, timeout, invalid input, and baseline capture errors. |
| tests/test_find_image.py | Tests missing deps messaging, path/extension validation, and match/no-match flows. |
| tests/test_coordinate_system.py | Tests DPI conversion helper behavior for physical/logical coordinate modes. |

Comment on lines +75 to +110
def _to_physical(loc: list[int], coordinate_system: str) -> list[int]:
    """Convert coordinates to physical space if needed.

    Args:
        loc: [x, y] coordinates.
        coordinate_system: "physical" (no conversion) or "logical" (multiply by DPI scale).

    Returns:
        [x, y] in physical coordinates ready for pyautogui.
    """
    if coordinate_system == "logical":
        if desktop is None:
            raise RuntimeError("Desktop service is not initialized.")
        scale = desktop.get_dpi_scaling()
        return [round(loc[0] * scale), round(loc[1] * scale)]
    return loc


def _region_to_physical(region: list[int], coordinate_system: str) -> list[int]:
    """Convert a region [x, y, width, height] to physical space if needed."""
    if coordinate_system == "logical":
        if desktop is None:
            raise RuntimeError("Desktop service is not initialized.")
        scale = desktop.get_dpi_scaling()
        return [round(v * scale) for v in region]
    return region


def _path_to_physical(path: list[list[int]], coordinate_system: str) -> list[list[int]]:
    """Convert a list of [x, y] waypoints to physical space if needed."""
    if coordinate_system == "logical":
        if desktop is None:
            raise RuntimeError("Desktop service is not initialized.")
        scale = desktop.get_dpi_scaling()
        return [[round(p[0] * scale), round(p[1] * scale)] for p in path]
    return path

Copilot AI Mar 8, 2026

In logical mode _to_physical and _path_to_physical index into loc / each waypoint without validating shape first. For invalid inputs (e.g., loc=[100] or a malformed waypoint), this raises IndexError and the tool wrapper returns a generic Error: list index out of range instead of the intended "loc must be [x, y]" / "waypoint must be [x, y]" messages. Consider validating lengths in the helpers (or before calling them) and raising a clear ValueError/returning an unchanged value so the downstream validation runs.
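
One way the suggested shape check could look, as a sketch rather than the PR's eventual fix. The function name `to_physical` and the explicit `scale` parameter are assumptions made so the example runs on its own; the PR's helper reads the scale from `desktop.get_dpi_scaling()`.

```python
def to_physical(loc: list[int], coordinate_system: str, scale: float) -> list[int]:
    """Validate the shape first, then scale logical coordinates.

    `scale` is passed in explicitly here (an assumption for illustration);
    the PR's helper obtains it from the Desktop service.
    """
    if len(loc) != 2:
        # Raised before any indexing, so callers never see an IndexError.
        raise ValueError("loc must be [x, y]")
    if coordinate_system == "logical":
        return [round(loc[0] * scale), round(loc[1] * scale)]
    return loc


print(to_physical([100, 200], "logical", 1.5))   # logical coordinates are scaled
print(to_physical([100, 200], "physical", 1.5))  # physical coordinates pass through
```

Checking the length before indexing means a malformed `loc` such as `[100]` produces the descriptive message in both coordinate modes.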

        return "Error: loc must be [x, y]"
    x, y = loc[0], loc[1]
    try:
        img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))

Copilot AI Mar 8, 2026

get_pixel_color captures via ImageGrab.grab(bbox=...) without all_screens=True, while get_screenshot() uses all_screens=True for the virtual screen. This likely breaks pixel sampling on multi-monitor setups (especially with negative coordinates / non-primary monitors). Consider passing all_screens=True here (and in the other new coordinate-based capture helpers) for consistent coordinate semantics.

Suggested change
-img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))
+img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1), all_screens=True)

Comment on lines +1347 to +1350
    timeout: float = 30.0,
    threshold: float = 0.05,
    poll_interval: float = 0.5,
) -> str:

Copilot AI Mar 8, 2026

wait_for_change accepts threshold as a ratio but doesn't validate its range. For example, a negative threshold will always trigger immediate "Change detected", and a threshold > 1.0 can never be met. Consider clamping/validating threshold to [0.0, 1.0] (and returning a clear error when invalid).
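
A rough sketch of ratio-based change detection with the range check applied, under stated assumptions: the function names are hypothetical, pixels are plain sequences such as those returned by `getdata()`, and the comparison `ratio >= threshold` is a guess at the semantics, not something the PR confirms.

```python
def changed_ratio(baseline: list, current: list) -> float:
    """Fraction of pixels that differ between two equally sized captures."""
    if len(baseline) != len(current):
        raise ValueError("captures must have the same number of pixels")
    if not baseline:
        return 0.0
    differing = sum(1 for a, b in zip(baseline, current) if a != b)
    return differing / len(baseline)


def has_changed(baseline: list, current: list, threshold: float = 0.05) -> bool:
    """Reject out-of-range thresholds, then compare the changed ratio."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0 (inclusive)")
    # Assumption: a change is reported once the ratio reaches the threshold.
    return changed_ratio(baseline, current) >= threshold


before = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0)]
after = [(0, 0, 0), (255, 255, 255), (0, 0, 0), (0, 0, 0)]
print(changed_ratio(before, after))  # one of four pixels differs
```

Since the ratio always lies in [0.0, 1.0], a negative threshold would match on every poll and a threshold above 1.0 could never match, which is why rejecting those values up front gives clearer behavior.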

Comment on lines +1360 to +1364
try:
    baseline = list(ImageGrab.grab(bbox=bbox).getdata())
except Exception as e:
    return f"Error capturing baseline: {str(e)}"

Copilot AI Mar 8, 2026

wait_for_change captures baseline/current pixels using ImageGrab.grab(bbox=...) without all_screens=True. Given the rest of the code uses virtual-screen coordinates (get_screenshot(all_screens=True) and UIA coordinates can be negative), this can cause the wrong region to be monitored on multi-monitor setups. Consider using all_screens=True for these grabs (baseline and subsequent polls).

    template_path: str,
    region: list[int] | None = None,
    threshold: float = 0.8,
) -> str:

Copilot AI Mar 8, 2026

find_image does not validate the threshold range. Values outside [0.0, 1.0] yield confusing behavior (e.g., always-match for negative thresholds or never-match for thresholds > 1). Consider validating/clamping the threshold and returning a clear error for invalid values.

Suggested change
-) -> str:
+) -> str:
+    # Validate threshold to avoid confusing always-match/never-match behavior
+    if not 0.0 <= threshold <= 1.0:
+        return "Error: threshold must be between 0.0 and 1.0 (inclusive)"

@@ -1,4 +1,4 @@
-from windows_mcp.desktop.utils import ps_quote, ps_quote_for_xml
+from windows_mcp.desktop.utils import ps_quote, ps_quote_for_xml, _approximate_color_name

Copilot AI Mar 8, 2026

Importing _approximate_color_name (underscore-prefixed) from desktop.utils makes a private helper part of an inter-module contract. Consider making it a public utility (rename without underscore) or keeping the naming private but moving the helper into service.py to avoid implying it is internal-only while being used externally.

    if len(size) != 2:
        return "Error: size must be [width, height]"
    x, y = loc[0], loc[1]
    w, h = size[0], size[1]

Copilot AI Mar 8, 2026

highlight_region validates list lengths but doesn't validate that width/height are positive. Negative or zero sizes will produce an inverted/empty rectangle or unexpected GDI behavior. Consider adding w > 0 and h > 0 validation and returning a clear error if not.

Suggested change
-w, h = size[0], size[1]
+w, h = size[0], size[1]
+if w <= 0 or h <= 0:
+    return "Error: width and height must be positive"

Comment on lines +1288 to +1290
    img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
else:
    img = ImageGrab.grab()

Copilot AI Mar 8, 2026

read_screen_text uses ImageGrab.grab()/grab(bbox=...) without all_screens=True, which can make region coordinates inconsistent with the virtual screen coordinates returned by UIA (and with get_screenshot() which uses all_screens=True). Consider using all_screens=True for both full-screen and bbox captures so regions can target non-primary monitors reliably.

Suggested change
-    img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
-else:
-    img = ImageGrab.grab()
+    img = ImageGrab.grab(bbox=(x, y, x + w, y + h), all_screens=True)
+else:
+    img = ImageGrab.grab(all_screens=True)

Comment on lines +1438 to +1441
    screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
else:
    x, y = 0, 0
    screen_img = ImageGrab.grab()

Copilot AI Mar 8, 2026

find_image uses ImageGrab.grab() / grab(bbox=...) without all_screens=True. This can make template matching fail or search the wrong pixels on multi-monitor setups (especially when using virtual screen coordinates or negative offsets). Consider using all_screens=True for both the full-screen capture and the regional capture for consistency with the rest of the desktop coordinate system.

Suggested change
-    screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
-else:
-    x, y = 0, 0
-    screen_img = ImageGrab.grab()
+    screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h), all_screens=True)
+else:
+    x, y = 0, 0
+    screen_img = ImageGrab.grab(all_screens=True)

Comment on lines +40 to +45

def _approximate_color_name(r: int, g: int, b: int) -> str:
    """Find the closest named color using Euclidean distance."""
    best_name = "unknown"
    best_dist = float("inf")
    for name, (nr, ng, nb) in _NAMED_COLORS.items():

Copilot AI Mar 8, 2026

_NAMED_COLORS contains duplicate RGB values (cyan and aqua are both (0, 255, 255)). Because _approximate_color_name picks the first minimum, the returned name for that color is effectively an implementation detail of dict ordering. Consider removing duplicates or explicitly choosing a preferred label for duplicates to keep results stable/intentional.

Suggested change
-def _approximate_color_name(r: int, g: int, b: int) -> str:
-    """Find the closest named color using Euclidean distance."""
-    best_name = "unknown"
-    best_dist = float("inf")
-    for name, (nr, ng, nb) in _NAMED_COLORS.items():
+# Canonical color names used for approximation; one name per unique RGB triple.
+# For duplicate RGB values in _NAMED_COLORS (e.g., "cyan" and "aqua"), we choose a
+# single preferred label here to keep _approximate_color_name deterministic.
+_CANONICAL_COLORS = {
+    "black": (0, 0, 0),
+    "white": (255, 255, 255),
+    "red": (255, 0, 0),
+    "green": (0, 128, 0),
+    "blue": (0, 0, 255),
+    "yellow": (255, 255, 0),
+    "cyan": (0, 255, 255),  # preferred label for (0, 255, 255) over "aqua"
+    "magenta": (255, 0, 255),
+    "orange": (255, 165, 0),
+    "purple": (128, 0, 128),
+    "pink": (255, 192, 203),
+    "brown": (139, 69, 19),
+    "gray": (128, 128, 128),
+    "silver": (192, 192, 192),
+    "navy": (0, 0, 128),
+    "teal": (0, 128, 128),
+    "maroon": (128, 0, 0),
+    "olive": (128, 128, 0),
+    "lime": (0, 255, 0),
+}
+def _approximate_color_name(r: int, g: int, b: int) -> str:
+    """Find the closest named color using Euclidean distance."""
+    best_name = "unknown"
+    best_dist = float("inf")
+    for name, (nr, ng, nb) in _CANONICAL_COLORS.items():

- Add input shape validation in DPI helpers (_to_physical, _region_to_physical,
  _path_to_physical) with clear ValueError messages
- Add all_screens=True to all ImageGrab.grab() calls for multi-monitor support
  (get_pixel_color, read_screen_text, wait_for_change, find_image)
- Add threshold range validation [0.0, 1.0] in wait_for_change and find_image
- Add DPI scaling factor to get_screen_info output
- Add positive width/height validation in highlight_region
- Rename _approximate_color_name -> approximate_color_name (public API)
- Remove duplicate aqua entry from _NAMED_COLORS (same RGB as cyan)
- Add input validation tests to test_coordinate_system.py

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Vaibhav-api-code (Author)

All 12 Copilot review comments have been addressed in commit 66ac4b3:

Input Validation:

  • DPI helpers now validate input shape (len check) with clear ValueError messages
  • highlight_region validates width/height > 0
  • wait_for_change and find_image validate threshold in [0.0, 1.0]

Multi-Monitor Support:

  • Added all_screens=True to all ImageGrab.grab() calls (get_pixel_color, read_screen_text, wait_for_change, find_image)

API Improvements:

  • get_screen_info now includes DPI scaling factor in output
  • Renamed _approximate_color_name to approximate_color_name (public API)
  • Removed duplicate aqua entry from _NAMED_COLORS (same RGB as cyan)

Banker's Rounding Note:

  • Kept round() for DPI conversion — the 1px difference at exact .5 values is negligible for coordinate targeting. DPI scales in practice (1.25, 1.5, 2.0) rarely produce ambiguous half-pixel values.
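
For reference, a small illustration of the rounding behavior under discussion. Python's built-in `round()` sends exact .5 ties to the nearest even integer, and at a 1.5 scale every odd logical coordinate lands exactly on a .5 value. The `round_half_up` helper is a hypothetical alternative shown only for comparison; it is not part of the PR.

```python
import math


def round_half_up(value: float) -> int:
    """Round to the nearest integer, sending exact .5 ties upward."""
    return math.floor(value + 0.5)


scale = 1.5
for x in (1, 3, 5, 7):
    scaled = x * scale
    # Columns: logical x, scaled value, built-in round(), half-up rounding
    print(x, scaled, round(scaled), round_half_up(scaled))
# prints:
# 1 1.5 2 2
# 3 4.5 4 5
# 5 7.5 8 8
# 7 10.5 10 11
```

The two approaches differ by at most one physical pixel, and only on exact ties, which matches the magnitude described in the note above.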

4 new input validation tests added to test_coordinate_system.py.

Vaibhav-api-code and others added 8 commits March 7, 2026 22:46
System Control: VolumeControl (COM AudioEndpointVolume), BrightnessControl
(WMI), AppList (Get-Process), Dialog (WinForms), SystemInfoExtended
(WMI/Registry), DarkMode (Registry), SayText (SAPI)

Dev Workflow: PortCheck (Get-NetTCPConnection/UDPEndpoint), FileWatcher
(polling), SearchFiles (Get-ChildItem/Select-String), NetworkDiagnostics
(Test-Connection/Resolve-DnsName/Invoke-WebRequest), AccessibilityInspector
(UIAutomation)

Review fixes applied:
- Removed raw PowerShell execution path in SearchFiles (security)
- Fixed PowerShell injection in SayText error message via safe_voice
- Fixed file_watcher saw_delete not reset + deduplicated condition
- Added -SimpleMatch for literal content search in SearchFiles
- Escaped filesystem wildcard chars in SearchFiles name filter
- Added vtable comment for volume set COM interface
- Clarified cancel vs empty prompt ambiguity in Dialog

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- BrightnessControl: handle multi-monitor WMI collections properly
- Dialog: add TopMost=true to custom forms for z-order visibility
- SearchFiles: default to $env:USERPROFILE instead of C:\ root (performance)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- app_is_running: strip .exe/.EXE extension before process lookup
- say_text: use SpeakAsync + polling loop to avoid blocking the MCP thread

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document all new tools: VolumeControl, BrightnessControl, AppList, Dialog,
SystemInfoExtended, DarkMode, SayText, PortCheck, FileWatcher, SearchFiles,
NetworkDiagnostics, AccessibilityInspector

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add missing tempfile import (ScreenReader tool was crashing)
- Replace ThreadPoolExecutor with sequential draw (PIL ImageDraw not thread-safe)
- multi_select: fix mutable default locs=[], add try/finally for Ctrl key,
  guard ReleaseKey with press_ctrl check
- is_overlay_window: change OR to AND (was filtering legitimate childless windows)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
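
The mutable default fix mentioned in this commit refers to a general Python pitfall. A minimal illustration follows, using hypothetical function names rather than the actual `multi_select` code: a default list is created once at function definition and then shared across calls.

```python
def collect_shared(loc, locs=[]):
    """Buggy pattern: the default list is created once and reused by every call."""
    locs.append(loc)
    return locs


def collect_fresh(loc, locs=None):
    """Fixed pattern: use None as the sentinel and build a new list per call."""
    if locs is None:
        locs = []
    locs.append(loc)
    return locs


first = len(collect_shared((1, 2)))
second = len(collect_shared((3, 4)))
print(second - first)         # the shared default grew between calls
print(collect_fresh((1, 2)))  # always starts from an empty list
```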
New tools: UIElement (7 modes), WindowScreenshot, MultiMonitor,
ScreenRecord, MenuClick, QuickLook, WindowTiling, ClipboardInfo.

Enhanced App tool with minimize/maximize/close/fullscreen/restore.

Total: 46 tools (was 38). Covers all 55 macOS automation-mcp
capabilities with Windows equivalents.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
CRITICAL:
- ffmpeg stop: use CTRL_BREAK_EVENT + CREATE_NEW_PROCESS_GROUP
  instead of os.kill(pid, 2) which corrupts MP4 on Windows
- TOCTOU race: atomic O_EXCL create for screen recording state file

HIGH:
- _search_element: explicit continue on role mismatch instead of
  ambiguous pass-through
- screen_record: validate output_path (extension, no leading -)
  to prevent path traversal and ffmpeg option injection
- window_tiling cascade: try/except ValueError for int parsing
  of PowerShell handle output

MEDIUM:
- screen_record stop: verify PID is ffmpeg before sending signal
  (prevents PID recycling / injection attacks)
- window_screenshot_tool: fix return type annotation -> list | str
- get_clipboard_info: add retry loop for clipboard contention

Co-Authored-By: Claude Opus 4.6 <[email protected]>
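
A sketch of the atomic O_EXCL create technique named in the TOCTOU fix above, not the PR's actual implementation. The function name and the state file name are hypothetical. Passing `os.O_CREAT | os.O_EXCL` makes the existence check and the creation a single operation, so two callers cannot both believe they created the file.

```python
import os
import tempfile


def try_acquire_state_file(path: str, pid: int) -> bool:
    """Create the state file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))
    return True


# Hypothetical file name in a fresh temporary directory.
state_path = os.path.join(tempfile.mkdtemp(), "recording.state")
print(try_acquire_state_file(state_path, os.getpid()))  # True: first caller wins
print(try_acquire_state_file(state_path, os.getpid()))  # False: file already exists
```

By contrast, a separate `os.path.exists()` check followed by `open()` leaves a window in which another process can create the file between the two calls.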
CRITICAL:
- quick_look: block executables/scripts (.exe,.bat,.ps1,.vbs,etc.)
  via extension blocklist — prevents arbitrary code execution

HIGH:
- SendKeys escaping: escape +^%~()[] chars that SendKeys interprets
  as Shift/Ctrl/Alt/Enter/grouping — prevents unintended key combos
- ui_element_type_into: fail fast if SetFocus() fails instead of
  typing into whatever window is currently active
- screen_record start: cleanup state file if Popen fails, preventing
  permanent lock on future recording starts

MEDIUM:
- screen_record stop: wait up to 5s for ffmpeg to exit and finalize
  MP4 container before reporting success
- ui_element_set_value: SelectionItemPattern now validates element
  name matches requested value before selecting
- fullscreen/restore: restore now clears TOPMOST flag that fullscreen
  sets, preventing window from staying always-on-top
- App tool description: removed incorrect mode count

Co-Authored-By: Claude Opus 4.6 <[email protected]>
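
A minimal sketch of the SendKeys escaping described in the commit above, assuming the usual SendKeys convention of wrapping a special character in braces to send it literally. The function name is hypothetical, and the character set is limited to the characters the commit message lists.

```python
# Characters the commit message names as interpreted by SendKeys.
SENDKEYS_SPECIAL = set("+^%~()[]")


def escape_sendkeys(text: str) -> str:
    """Wrap each special character in braces so it is typed literally."""
    return "".join("{" + ch + "}" if ch in SENDKEYS_SPECIAL else ch for ch in text)


print(escape_sendkeys("100% (ok)"))  # prints: 100{%} {(}ok{)}
```

Brace characters are themselves special in SendKeys and would need their own escapes (`{{}` and `{}}`); this sketch leaves them out because the commit message does not mention them.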
2 participants