Skip to content

codemode: fix Start() fail-fast and use tools.As for wrapper unwrapping#2007

Open
dgageot wants to merge 1 commit intodocker:mainfrom
dgageot:board/fix-code-mode-71471ec1
Open

codemode: fix Start() fail-fast and use tools.As for wrapper unwrapping#2007
dgageot wants to merge 1 commit intodocker:mainfrom
dgageot:board/fix-code-mode-71471ec1

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Mar 9, 2026

Problem

codeModeTool.Start() had two issues:

  1. Fail-fast behavior — it returned on the first error, so one broken toolset (e.g. an unreachable MCP server) prevented all remaining toolsets from starting.
  2. Direct type assertion — both Start() and Stop() used t.(tools.Startable) instead of tools.As, which cannot discover Startable through wrapper chains (e.g. StartableToolSet).

Fix

  • Collect all errors with errors.Join in Start(), matching the existing Stop() behavior.
  • Switch to tools.As[tools.Startable](t) so the capability is found at any depth in the wrapper chain.
  • Extract the now-identical Start/Stop loops into a shared forEachStartable helper to remove duplication.

@dgageot dgageot requested a review from a team as a code owner March 9, 2026 16:28
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Assessment: 🔴 CRITICAL

This refactoring introduces a significant semantic change to Start() behavior. The method now attempts to start all toolsets even after failures (collect-all-errors), whereas the original implementation failed-fast on the first error.

Key concern: Without state tracking for which toolsets successfully started, the system can be left in a partially-started state with no rollback mechanism. If toolset 1 starts successfully but toolset 2 fails, there's no way to clean up toolset 1.


Findings

  • 🔴 1 CRITICAL issue
  • 🟡 1 NOTABLE issue

Note: This review uses COMMENT mode only. Human review is required for merge decisions.

}

return nil
return c.forEachStartable(func(s tools.Startable) error { return s.Start(ctx) })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL: Semantic breaking change — fail-fast → collect-all-errors

The original Start() implementation returned immediately on the first error:

if err := startable.Start(ctx); err != nil {
    return err  // fail-fast
}

The new implementation attempts all toolsets regardless of failures:

return c.forEachStartable(func(s tools.Startable) error { return s.Start(ctx) })
// forEachStartable collects all errors with errors.Join

Problem: If toolset A starts successfully but toolset B fails, the system is left in a partially-started state. There's no tracking of which toolsets succeeded, so:

  1. The caller cannot determine which toolsets are running
  2. No rollback mechanism exists to stop successfully-started toolsets
  3. Stop() will attempt to stop all toolsets regardless of whether they started

Impact: This changes startup semantics from "abort on first failure" to "try everything and report all failures". While the PR description mentions this is intentional ("one broken toolset shouldn't prevent others"), this could leave the system in an inconsistent state.

Recommendation: Consider adding state tracking (e.g., a started []tools.Startable field) to enable cleanup/rollback when Start() returns an error, or document that toolset Stop() implementations must handle being stopped without being started.

}

return nil
return c.forEachStartable(func(s tools.Startable) error { return s.Start(ctx) })
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 NOTABLE: No state tracking for cleanup after partial startup failures

When Start() returns an error after attempting all toolsets, the caller has no visibility into which toolsets successfully started and which failed. This makes cleanup/rollback impossible.

The codeModeTool maintains no internal state about which toolsets are running, so if Start() fails:

  • Successfully-started toolsets remain running
  • No mechanism exists to selectively stop only the started ones
  • Calling Stop() will attempt to stop all toolsets, including those that never started (which may or may not handle this gracefully)

Recommendation: Either:

  1. Add internal state tracking: started []tools.Startable
  2. Implement automatic rollback in Start() on any error
  3. Document that toolset Stop() implementations must be idempotent and safe to call on non-started toolsets

Start() returned on the first error, so one broken toolset (e.g. an
unreachable MCP server) prevented all remaining toolsets from starting.
Collect all errors with errors.Join, matching the existing Stop()
behavior.

Both Start() and Stop() used a direct type assertion to find Startable,
which cannot see through wrapper chains (e.g. StartableToolSet). Switch
to tools.As so the capability is discovered at any depth.

Extract the now-identical loops into a shared forEachStartable helper to
remove the duplication.

Assisted-By: docker-agent
@dgageot dgageot force-pushed the board/fix-code-mode-71471ec1 branch from dbf90e1 to a6e51ce Compare March 9, 2026 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant