feat: add 9 advanced automation tools with DPI coordinate system support #93
Vaibhav-api-code wants to merge 10 commits into CursorTouch:main
Add cursor position, pixel color, key hold/release, screen info,
highlight region, mouse path, OCR screen reader, wait-for-change,
and find-image (template matching) tools.
All coordinate-accepting tools support a `coordinate_system` parameter
("physical" or "logical") for DPI-aware operation.
New optional dependencies for vision (opencv) and OCR (pytesseract)
in pyproject.toml. Includes 10 comprehensive test files (80+ tests).
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Pull request overview
This PR expands windows-mcp desktop automation by adding 9 new MCP tools and related Desktop service implementations, plus DPI-aware coordinate conversion helpers and optional dependency extras for OCR/vision features.
Changes:
- Added 3 DPI coordinate conversion helpers (`_to_physical`, `_region_to_physical`, `_path_to_physical`) and registered 9 new MCP tools in `__main__.py`.
- Implemented the 9 new tool backends in `Desktop` (cursor position, pixel color, key hold, screen info, highlight, mouse path, OCR, change detection, template matching).
- Added color-name approximation utility, optional dependency groups (`vision`, `ocr`, `all`), README tool list updates, and a new test suite covering the new tools/helpers.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `src/windows_mcp/__main__.py` | Adds DPI coordinate conversion helpers and registers new MCP tools that wrap Desktop methods. |
| `src/windows_mcp/desktop/service.py` | Adds tool implementations in Desktop, plus VK/color constants for KeyHold/Highlight. |
| `src/windows_mcp/desktop/utils.py` | Adds named-color palette and `_approximate_color_name` helper for PixelColor output. |
| `pyproject.toml` | Adds optional dependency extras for vision (OpenCV/numpy) and OCR (pytesseract). |
| `README.md` | Documents the newly added tools in the README tool list. |
| `tests/test_cursor_position.py` | Tests CursorPosition behavior via mocked UIA cursor coordinates. |
| `tests/test_pixel_color.py` | Tests PixelColor output formatting and color name approximation helper. |
| `tests/test_key_hold.py` | Tests key hold/release behavior and VK map essentials. |
| `tests/test_screen_info.py` | Tests monitor parsing and fallback behavior for ScreenInfo. |
| `tests/test_highlight.py` | Tests ScreenHighlight input validation and color map presence. |
| `tests/test_mouse_path.py` | Tests mouse path validation and endpoint visitation behavior. |
| `tests/test_screen_reader.py` | Tests OCR flow, region capture bbox behavior, and error handling. |
| `tests/test_wait_for_change.py` | Tests change detection, timeout, invalid input, and baseline capture errors. |
| `tests/test_find_image.py` | Tests missing deps messaging, path/extension validation, and match/no-match flows. |
| `tests/test_coordinate_system.py` | Tests DPI conversion helper behavior for physical/logical coordinate modes. |
    def _to_physical(loc: list[int], coordinate_system: str) -> list[int]:
        """Convert coordinates to physical space if needed.

        Args:
            loc: [x, y] coordinates.
            coordinate_system: "physical" (no conversion) or "logical" (multiply by DPI scale).

        Returns:
            [x, y] in physical coordinates ready for pyautogui.
        """
        if coordinate_system == "logical":
            if desktop is None:
                raise RuntimeError("Desktop service is not initialized.")
            scale = desktop.get_dpi_scaling()
            return [round(loc[0] * scale), round(loc[1] * scale)]
        return loc


    def _region_to_physical(region: list[int], coordinate_system: str) -> list[int]:
        """Convert a region [x, y, width, height] to physical space if needed."""
        if coordinate_system == "logical":
            if desktop is None:
                raise RuntimeError("Desktop service is not initialized.")
            scale = desktop.get_dpi_scaling()
            return [round(v * scale) for v in region]
        return region


    def _path_to_physical(path: list[list[int]], coordinate_system: str) -> list[list[int]]:
        """Convert a list of [x, y] waypoints to physical space if needed."""
        if coordinate_system == "logical":
            if desktop is None:
                raise RuntimeError("Desktop service is not initialized.")
            scale = desktop.get_dpi_scaling()
            return [[round(p[0] * scale), round(p[1] * scale)] for p in path]
        return path
In logical mode _to_physical and _path_to_physical index into loc / each waypoint without validating shape first. For invalid inputs (e.g., loc=[100] or a malformed waypoint), this raises IndexError and the tool wrapper returns a generic Error: list index out of range instead of the intended "loc must be [x, y]" / "waypoint must be [x, y]" messages. Consider validating lengths in the helpers (or before calling them) and raising a clear ValueError/returning an unchanged value so the downstream validation runs.
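One way to act on this comment is to check the shape before indexing. The following is a minimal sketch, not the PR's actual fix: the function names drop the leading underscore, and `scale` is passed in as a hypothetical parameter (the real helpers read it from `desktop.get_dpi_scaling()`) so the example runs on its own.

```python
def to_physical(loc: list[int], coordinate_system: str, scale: float) -> list[int]:
    """Convert [x, y] to physical pixels, validating shape before indexing."""
    if len(loc) != 2:
        # Raised in both modes so callers see a clear message, not IndexError.
        raise ValueError("loc must be [x, y]")
    if coordinate_system == "logical":
        return [round(loc[0] * scale), round(loc[1] * scale)]
    return list(loc)


def path_to_physical(
    path: list[list[int]], coordinate_system: str, scale: float
) -> list[list[int]]:
    """Convert each waypoint, reporting the first malformed one."""
    converted = []
    for point in path:
        if len(point) != 2:
            raise ValueError("waypoint must be [x, y]")
        converted.append(to_physical(point, coordinate_system, scale))
    return converted
```

With this shape, `to_physical([100], "logical", 1.5)` raises `ValueError("loc must be [x, y]")` instead of surfacing `list index out of range`.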
src/windows_mcp/desktop/service.py
Outdated
        return "Error: loc must be [x, y]"
    x, y = loc[0], loc[1]
    try:
        img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))
get_pixel_color captures via ImageGrab.grab(bbox=...) without all_screens=True, while get_screenshot() uses all_screens=True for the virtual screen. This likely breaks pixel sampling on multi-monitor setups (especially with negative coordinates / non-primary monitors). Consider passing all_screens=True here (and in the other new coordinate-based capture helpers) for consistent coordinate semantics.
    - img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))
    + img = ImageGrab.grab(bbox=(x, y, x + 1, y + 1), all_screens=True)
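Since the same change recurs in several tools, it could be centralized. The sketch below is an illustration under assumptions: `region_to_bbox` and `grab_virtual_screen` are hypothetical names, not functions from the PR, and the `grab` argument exists only so the helper can be exercised without a display. Pillow documents `all_screens` as a Windows-only option of `ImageGrab.grab`.

```python
from typing import Callable, Optional


def region_to_bbox(x: int, y: int, w: int = 1, h: int = 1) -> tuple[int, int, int, int]:
    """Turn an x, y, width, height region into Pillow's (left, top, right, bottom) box."""
    return (x, y, x + w, y + h)


def grab_virtual_screen(bbox=None, grab: Optional[Callable] = None):
    """Capture from the full virtual screen so non-primary monitors are reachable."""
    if grab is None:
        # Imported lazily so the helper can be tested without Pillow installed.
        from PIL import ImageGrab

        grab = ImageGrab.grab
    # all_screens=True keeps bbox coordinates in virtual-screen space,
    # which is where negative coordinates on secondary monitors live.
    return grab(bbox=bbox, all_screens=True)
```

A single-pixel sample on a monitor to the left of the primary one would then be `grab_virtual_screen(region_to_bbox(-1920, 100))`.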
        timeout: float = 30.0,
        threshold: float = 0.05,
        poll_interval: float = 0.5,
    ) -> str:
wait_for_change accepts threshold as a ratio but doesn't validate its range. For example, a negative threshold will always trigger immediate "Change detected", and a threshold > 1.0 can never be met. Consider clamping/validating threshold to [0.0, 1.0] (and returning a clear error when invalid).
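To make the range check concrete, here is a small sketch. It assumes the threshold is compared against the fraction of pixels that differ from the baseline; the PR's actual comparison is not shown on this page, and both function names are hypothetical.

```python
from typing import Optional


def validate_threshold(threshold: float) -> Optional[str]:
    """Return an error string if threshold is outside [0.0, 1.0], else None."""
    if not 0.0 <= threshold <= 1.0:
        return "Error: threshold must be between 0.0 and 1.0 (inclusive)"
    return None


def changed_ratio(baseline: list, current: list) -> float:
    """Fraction of pixels that differ between two equally sized pixel lists."""
    if len(baseline) != len(current):
        raise ValueError("pixel lists must have the same length")
    if not baseline:
        return 0.0
    differing = sum(1 for old, new in zip(baseline, current) if old != new)
    return differing / len(baseline)
```

Because `changed_ratio` always lies in [0.0, 1.0], a threshold outside that range can never behave sensibly, which is why rejecting it up front is clearer than clamping silently.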
    try:
        baseline = list(ImageGrab.grab(bbox=bbox).getdata())
    except Exception as e:
        return f"Error capturing baseline: {str(e)}"
wait_for_change captures baseline/current pixels using ImageGrab.grab(bbox=...) without all_screens=True. Given the rest of the code uses virtual-screen coordinates (get_screenshot(all_screens=True) and UIA coordinates can be negative), this can cause the wrong region to be monitored on multi-monitor setups. Consider using all_screens=True for these grabs (baseline and subsequent polls).
        template_path: str,
        region: list[int] | None = None,
        threshold: float = 0.8,
    ) -> str:
find_image does not validate the threshold range. Values outside [0.0, 1.0] yield confusing behavior (e.g., always-match for negative thresholds or never-match for thresholds > 1). Consider validating/clamping the threshold and returning a clear error for invalid values.
    - ) -> str:
    + ) -> str:
    +     # Validate threshold to avoid confusing always-match/never-match behavior
    +     if not 0.0 <= threshold <= 1.0:
    +         return "Error: threshold must be between 0.0 and 1.0 (inclusive)"
src/windows_mcp/desktop/service.py
Outdated
    @@ -1,4 +1,4 @@
    - from windows_mcp.desktop.utils import ps_quote, ps_quote_for_xml
    + from windows_mcp.desktop.utils import ps_quote, ps_quote_for_xml, _approximate_color_name
Importing _approximate_color_name (underscore-prefixed) from desktop.utils makes a private helper part of an inter-module contract. Consider making it a public utility (rename without underscore) or keeping the naming private but moving the helper into service.py to avoid implying it is internal-only while being used externally.
    if len(size) != 2:
        return "Error: size must be [width, height]"
    x, y = loc[0], loc[1]
    w, h = size[0], size[1]
highlight_region validates list lengths but doesn't validate that width/height are positive. Negative or zero sizes will produce an inverted/empty rectangle or unexpected GDI behavior. Consider adding w > 0 and h > 0 validation and returning a clear error if not.
    - w, h = size[0], size[1]
    + w, h = size[0], size[1]
    + if w <= 0 or h <= 0:
    +     return "Error: width and height must be positive"
src/windows_mcp/desktop/service.py
Outdated
        img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
    else:
        img = ImageGrab.grab()
read_screen_text uses ImageGrab.grab()/grab(bbox=...) without all_screens=True, which can make region coordinates inconsistent with the virtual screen coordinates returned by UIA (and with get_screenshot() which uses all_screens=True). Consider using all_screens=True for both full-screen and bbox captures so regions can target non-primary monitors reliably.
    -     img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
    - else:
    -     img = ImageGrab.grab()
    +     img = ImageGrab.grab(bbox=(x, y, x + w, y + h), all_screens=True)
    + else:
    +     img = ImageGrab.grab(all_screens=True)
src/windows_mcp/desktop/service.py
Outdated
        screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
    else:
        x, y = 0, 0
        screen_img = ImageGrab.grab()
find_image uses ImageGrab.grab() / grab(bbox=...) without all_screens=True. This can make template matching fail or search the wrong pixels on multi-monitor setups (especially when using virtual screen coordinates or negative offsets). Consider using all_screens=True for both the full-screen capture and the regional capture for consistency with the rest of the desktop coordinate system.
    -     screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h))
    - else:
    -     x, y = 0, 0
    -     screen_img = ImageGrab.grab()
    +     screen_img = ImageGrab.grab(bbox=(x, y, x + w, y + h), all_screens=True)
    + else:
    +     x, y = 0, 0
    +     screen_img = ImageGrab.grab(all_screens=True)
    def _approximate_color_name(r: int, g: int, b: int) -> str:
        """Find the closest named color using Euclidean distance."""
        best_name = "unknown"
        best_dist = float("inf")
        for name, (nr, ng, nb) in _NAMED_COLORS.items():
_NAMED_COLORS contains duplicate RGB values (cyan and aqua are both (0, 255, 255)). Because _approximate_color_name picks the first minimum, the returned name for that color is effectively an implementation detail of dict ordering. Consider removing duplicates or explicitly choosing a preferred label for duplicates to keep results stable/intentional.
    - def _approximate_color_name(r: int, g: int, b: int) -> str:
    -     """Find the closest named color using Euclidean distance."""
    -     best_name = "unknown"
    -     best_dist = float("inf")
    -     for name, (nr, ng, nb) in _NAMED_COLORS.items():
    + # Canonical color names used for approximation; one name per unique RGB triple.
    + # For duplicate RGB values in _NAMED_COLORS (e.g., "cyan" and "aqua"), we choose a
    + # single preferred label here to keep _approximate_color_name deterministic.
    + _CANONICAL_COLORS = {
    +     "black": (0, 0, 0),
    +     "white": (255, 255, 255),
    +     "red": (255, 0, 0),
    +     "green": (0, 128, 0),
    +     "blue": (0, 0, 255),
    +     "yellow": (255, 255, 0),
    +     "cyan": (0, 255, 255),  # preferred label for (0, 255, 255) over "aqua"
    +     "magenta": (255, 0, 255),
    +     "orange": (255, 165, 0),
    +     "purple": (128, 0, 128),
    +     "pink": (255, 192, 203),
    +     "brown": (139, 69, 19),
    +     "gray": (128, 128, 128),
    +     "silver": (192, 192, 192),
    +     "navy": (0, 0, 128),
    +     "teal": (0, 128, 128),
    +     "maroon": (128, 0, 0),
    +     "olive": (128, 128, 0),
    +     "lime": (0, 255, 0),
    + }
    + def _approximate_color_name(r: int, g: int, b: int) -> str:
    +     """Find the closest named color using Euclidean distance."""
    +     best_name = "unknown"
    +     best_dist = float("inf")
    +     for name, (nr, ng, nb) in _CANONICAL_COLORS.items():
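The suggestion above is cut off at the loop header. As a self-contained illustration of the nearest-color lookup it describes, the sketch below completes the loop under assumptions: the loop body is not visible on this page, so the squared-distance comparison is a guess at the obvious implementation, and the palette is a shortened subset of the one in the suggestion.

```python
# Shortened palette for illustration; one name per unique RGB triple.
CANONICAL_COLORS = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "cyan": (0, 255, 255),
    "gray": (128, 128, 128),
}


def has_duplicate_rgb(palette: dict) -> bool:
    """True if two names share an RGB triple, which would make ties order-dependent."""
    values = list(palette.values())
    return len(set(values)) != len(values)


def approximate_color_name(r: int, g: int, b: int) -> str:
    """Find the closest named color; squared distance ranks the same as Euclidean."""
    best_name = "unknown"
    best_dist = float("inf")
    for name, (nr, ng, nb) in CANONICAL_COLORS.items():
        dist = (r - nr) ** 2 + (g - ng) ** 2 + (b - nb) ** 2
        if dist < best_dist:  # strict "<" keeps the first of any equidistant names
            best_name, best_dist = name, dist
    return best_name
```

A check such as `has_duplicate_rgb` could run in a unit test so that a reintroduced alias like "aqua" is caught rather than silently resolved by dict order.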
- Add input shape validation in DPI helpers (_to_physical, _region_to_physical, _path_to_physical) with clear ValueError messages
- Add all_screens=True to all ImageGrab.grab() calls for multi-monitor support (get_pixel_color, read_screen_text, wait_for_change, find_image)
- Add threshold range validation [0.0, 1.0] in wait_for_change and find_image
- Add DPI scaling factor to get_screen_info output
- Add positive width/height validation in highlight_region
- Rename _approximate_color_name -> approximate_color_name (public API)
- Remove duplicate aqua entry from _NAMED_COLORS (same RGB as cyan)
- Add input validation tests to test_coordinate_system.py
Co-Authored-By: Claude Opus 4.6 <[email protected]>
All 12 Copilot review comments have been addressed in commit 66ac4b3:
Input Validation:
Multi-Monitor Support:
API Improvements:
Bankers Rounding Note:
4 new input validation tests added to test_coordinate_system.py.
System Control: VolumeControl (COM AudioEndpointVolume), BrightnessControl (WMI), AppList (Get-Process), Dialog (WinForms), SystemInfoExtended (WMI/Registry), DarkMode (Registry), SayText (SAPI)
Dev Workflow: PortCheck (Get-NetTCPConnection/UDPEndpoint), FileWatcher (polling), SearchFiles (Get-ChildItem/Select-String), NetworkDiagnostics (Test-Connection/Resolve-DnsName/Invoke-WebRequest), AccessibilityInspector (UIAutomation)
Review fixes applied:
- Removed raw PowerShell execution path in SearchFiles (security)
- Fixed PowerShell injection in SayText error message via safe_voice
- Fixed file_watcher saw_delete not reset + deduplicated condition
- Added -SimpleMatch for literal content search in SearchFiles
- Escaped filesystem wildcard chars in SearchFiles name filter
- Added vtable comment for volume set COM interface
- Clarified cancel vs empty prompt ambiguity in Dialog
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- BrightnessControl: handle multi-monitor WMI collections properly
- Dialog: add TopMost=true to custom forms for z-order visibility
- SearchFiles: default to $env:USERPROFILE instead of C:\ root (performance)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- app_is_running: strip .exe/.EXE extension before process lookup
- say_text: use SpeakAsync + polling loop to avoid blocking the MCP thread
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document all new tools: VolumeControl, BrightnessControl, AppList, Dialog, SystemInfoExtended, DarkMode, SayText, PortCheck, FileWatcher, SearchFiles, NetworkDiagnostics, AccessibilityInspector
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add missing tempfile import (ScreenReader tool was crashing)
- Replace ThreadPoolExecutor with sequential draw (PIL ImageDraw not thread-safe)
- multi_select: fix mutable default locs=[], add try/finally for Ctrl key, guard ReleaseKey with press_ctrl check
- is_overlay_window: change OR to AND (was filtering legitimate childless windows)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
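The multi_select fixes in this commit follow a common pattern, sketched here under assumptions. The real method's signature is not shown on this page, so the `click`, `press_key`, and `release_key` callables are hypothetical stand-ins for the input backend, injected so the example runs anywhere.

```python
from typing import Callable, Optional


def multi_select(
    click: Callable[[int, int], None],
    press_key: Callable[[str], None],
    release_key: Callable[[str], None],
    locs: Optional[list[list[int]]] = None,
    press_ctrl: bool = True,
) -> int:
    """Click each location, holding Ctrl if requested; return the click count."""
    # None instead of [] avoids sharing one list object across calls.
    locs = [] if locs is None else locs
    if press_ctrl:
        press_key("ctrl")
    try:
        for x, y in locs:
            click(x, y)
    finally:
        # Release only what was pressed, even if a click raised.
        if press_ctrl:
            release_key("ctrl")
    return len(locs)
```

The `finally` block is what keeps a failed click from leaving Ctrl logically held down for the rest of the session.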
New tools: UIElement (7 modes), WindowScreenshot, MultiMonitor, ScreenRecord, MenuClick, QuickLook, WindowTiling, ClipboardInfo.
Enhanced App tool with minimize/maximize/close/fullscreen/restore.
Total: 46 tools (was 38). Covers all 55 macOS automation-mcp capabilities with Windows equivalents.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
CRITICAL:
- ffmpeg stop: use CTRL_BREAK_EVENT + CREATE_NEW_PROCESS_GROUP instead of os.kill(pid, 2) which corrupts MP4 on Windows
- TOCTOU race: atomic O_EXCL create for screen recording state file
HIGH:
- _search_element: explicit continue on role mismatch instead of ambiguous pass-through
- screen_record: validate output_path (extension, no leading -) to prevent path traversal and ffmpeg option injection
- window_tiling cascade: try/except ValueError for int parsing of PowerShell handle output
MEDIUM:
- screen_record stop: verify PID is ffmpeg before sending signal (prevents PID recycling / injection attacks)
- window_screenshot_tool: fix return type annotation -> list | str
- get_clipboard_info: add retry loop for clipboard contention
Co-Authored-By: Claude Opus 4.6 <[email protected]>
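The TOCTOU fix relies on creating the state file atomically rather than checking for it first. Below is a minimal sketch of that technique; `try_acquire_state_file` is a hypothetical name and the payload format is an assumption, since the PR's state file layout is not shown on this page.

```python
import os


def try_acquire_state_file(path: str, payload: str) -> bool:
    """Create the state file only if it does not exist; False means already held."""
    # O_CREAT | O_EXCL makes "check and create" a single atomic operation,
    # so two concurrent starts cannot both succeed.
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        fd = os.open(path, flags)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return True
```

A caller that fails to start the recorder after acquiring the file would need to delete it again, which matches the state-file cleanup described in a later commit.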
CRITICAL:
- quick_look: block executables/scripts (.exe,.bat,.ps1,.vbs,etc.) via extension blocklist (prevents arbitrary code execution)
HIGH:
- SendKeys escaping: escape +^%~()[] chars that SendKeys interprets as Shift/Ctrl/Alt/Enter/grouping (prevents unintended key combos)
- ui_element_type_into: fail fast if SetFocus() fails instead of typing into whatever window is currently active
- screen_record start: cleanup state file if Popen fails, preventing permanent lock on future recording starts
MEDIUM:
- screen_record stop: wait up to 5s for ffmpeg to exit and finalize MP4 container before reporting success
- ui_element_set_value: SelectionItemPattern now validates element name matches requested value before selecting
- fullscreen/restore: restore now clears TOPMOST flag that fullscreen sets, preventing window from staying always-on-top
- App tool description: removed incorrect mode count
Co-Authored-By: Claude Opus 4.6 <[email protected]>
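The SendKeys escaping item can be illustrated with a short sketch. SendKeys treats a character as literal when it is wrapped in braces, so escaping is a single pass over the text. The function name is hypothetical, and including `{` and `}` in the set goes beyond the characters listed in the commit message; that addition is an assumption based on SendKeys also reserving braces.

```python
# Characters SendKeys interprets specially; braces are added as an assumption.
SENDKEYS_SPECIAL = set("+^%~()[]{}")


def escape_sendkeys(text: str) -> str:
    """Wrap each special character in braces so SendKeys types it literally."""
    # A single pass per character avoids re-escaping braces added by earlier steps.
    return "".join("{" + ch + "}" if ch in SENDKEYS_SPECIAL else ch for ch in text)
```

Sequential `str.replace` calls would be a tempting alternative, but replacing `+` with `{+}` and then escaping `{` would corrupt the first substitution, hence the per-character pass.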
Summary
Adds 9 new tools to Windows-MCP that expand its desktop automation capabilities, bringing it closer to feature parity with macOS automation solutions. All coordinate-accepting tools support a `coordinate_system` parameter (`"physical"` or `"logical"`) for DPI-aware operation.
New Tools
Cursor position, pixel color, key hold/release (down/up), screen info, highlight region, mouse path, OCR screen reader, wait-for-change, and find-image (template matching).
DPI Coordinate System
All 6 coordinate-accepting tools support a `coordinate_system` parameter:
- `"physical"` (default): raw pixel coordinates, no conversion
- `"logical"`: coordinates are multiplied by the system DPI scale factor
Three internal helpers handle the conversion:
- `_to_physical(loc, system)`: for `[x, y]` coordinates
- `_region_to_physical(region, system)`: for `[x, y, w, h]` regions
- `_path_to_physical(path, system)`: for `[[x,y], ...]` waypoint lists
Dependencies
No new required dependencies. All tools work with the existing `pillow` + `pywin32` stack.
Optional dependency groups added to `pyproject.toml`:
- `vision`: OpenCV and numpy
- `ocr`: pytesseract
- `all`
Tools gracefully degrade with clear install instructions when optional deps are missing.
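Graceful degradation of this kind is usually a guarded import. The sketch below shows one possible shape, under assumptions: `require_optional` is a hypothetical helper, and the error wording is illustrative rather than the message the PR actually returns.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Return (module, None) if importable, else (None, error message)."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        return None, (
            f"Error: '{module_name}' is not installed. "
            f"Install the optional '{extra}' dependency group to use this tool."
        )
```

A tool such as the template matcher could call `require_optional("cv2", "vision")` at the top of its body and return the message string directly when the module is missing.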
Code Changes
This PR is purely additive: no existing code is modified except one import line in `service.py` (adding `_approximate_color_name` to the utils import).
- `src/windows_mcp/__main__.py`
- `src/windows_mcp/desktop/service.py`
- `src/windows_mcp/desktop/utils.py`
- `pyproject.toml`
- `README.md`
- `tests/`
Testing
10 comprehensive test files with 80+ test cases covering:
All tests use `unittest.mock` to avoid Windows-specific runtime dependencies.
Test Plan
- `pytest tests/`
- `@mcp.tool()`, `@with_analytics`, `Desktop` class methods)
🤖 Generated with Claude Code