Feature Request: complete TSO failover to PD in microservice mode when TSO nodes are unavailable #10485

## Feature Request

### Describe your feature request related problem

  PD already partially supports TSO dynamic switching in microservice mode: when `micro-service.enable-tso-dynamic-switching` is
  enabled and no TSO microservice instances are discovered, PD can resume serving TSO.

  However, this behavior does not appear to fully cover keyspaces assigned to non-default keyspace groups. In the current client-side
  TSO discovery logic, if a keyspace has `tso_keyspace_group_id != 0`, the client explicitly does not fall back to group 0 when all
  TSO microservice endpoints are unavailable; it returns an error instead.
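
  To make the gap concrete, here is a minimal Go sketch of the behavior described above. It is not PD's actual client code:
  `resolveTSOTarget`, `errNoTSOEndpoint`, and the boolean input are hypothetical names used only for illustration, and the sketch
  assumes that only keyspaces in the default group (group 0) can be served after fallback.

```go
package main

import "errors"

// defaultKeyspaceGroupID is the default keyspace group (group 0).
const defaultKeyspaceGroupID uint32 = 0

// errNoTSOEndpoint is a hypothetical error value for this sketch.
var errNoTSOEndpoint = errors.New("no available TSO microservice endpoint")

// resolveTSOTarget is a hypothetical helper, not a PD API. Given the keyspace
// group a keyspace is assigned to and whether any TSO microservice endpoint
// was discovered, it returns the group that should serve TSO requests.
func resolveTSOTarget(assignedGroupID uint32, tsoEndpointsFound bool) (uint32, error) {
	if tsoEndpointsFound {
		// Normal path: the assigned keyspace group serves the request.
		return assignedGroupID, nil
	}
	if assignedGroupID == defaultKeyspaceGroupID {
		// Keyspaces in the default group can still be served after fallback.
		return defaultKeyspaceGroupID, nil
	}
	// Non-default groups do not fall back to group 0, so the caller gets an error.
	return 0, errNoTSOEndpoint
}
```

  With this shape, `resolveTSOTarget(3, false)` returns `errNoTSOEndpoint`, which is the availability gap this request is about.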

  As a result, a PD cluster in microservice mode can still lose TSO availability during a full TSO service outage, especially for
  clusters already using non-default keyspace groups. This weakens the disaster recovery story and makes migration from classic PD to
  the microservice architecture harder.

### Describe the feature you'd like

  Complete the TSO failover path in microservice mode so that when TSO nodes are unavailable, PD can automatically and safely take
  over TSO serving without manual intervention.

  Ideally:
  - PD should detect that the TSO microservice is unavailable and resume serving TSO automatically.
  - The behavior should be well-defined for non-default keyspace groups as well, while preserving TSO monotonicity.
  - If merging all keyspace groups back to the default group is required before takeover, that workflow should be automated or
  clearly integrated into the failover path.
  - When the TSO microservice recovers, the system should be able to switch back in a controlled and safe way.
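
  The points above could be sketched as a small decision function. This is only an illustration under stated assumptions, not a
  proposed implementation: `clusterState`, `nextAction`, and `safeStartPhysical` are hypothetical names, and the sketch assumes
  that PD can observe the number of healthy TSO nodes, the number of non-default keyspace groups, and the last persisted
  timestamp upper bound.

```go
package main

// failoverAction is a hypothetical enum describing what PD should do next.
type failoverAction int

const (
	stayOnTSONodes    failoverAction = iota // TSO microservice keeps (or resumes) serving
	mergeThenTakeOver                       // merge non-default groups into group 0, then PD serves
	takeOverOnPD                            // PD serves TSO directly
)

// clusterState holds the inputs this sketch assumes PD can observe.
type clusterState struct {
	switchingEnabled bool // mirrors enable-tso-dynamic-switching
	healthyTSONodes  int  // discovered, healthy TSO microservice instances
	nonDefaultGroups int  // keyspace groups other than group 0
}

// nextAction picks the failover step for the given state.
func nextAction(s clusterState) failoverAction {
	if s.healthyTSONodes > 0 || !s.switchingEnabled {
		// Either TSO nodes are available (including after recovery), or
		// automatic takeover is disabled and an operator must intervene.
		return stayOnTSONodes
	}
	if s.nonDefaultGroups > 0 {
		return mergeThenTakeOver
	}
	return takeOverOnPD
}

// safeStartPhysical returns a physical time (in ms) strictly greater than the
// last persisted upper bound, so timestamps stay monotonic across a takeover.
func safeStartPhysical(persistedUpperBoundMs, nowMs int64) int64 {
	if nowMs > persistedUpperBoundMs {
		return nowMs
	}
	return persistedUpperBoundMs + 1
}
```

  Keeping the decision separate from the monotonicity check makes it easier to reason about switching back: once
  `healthyTSONodes` is positive again, `nextAction` returns `stayOnTSONodes`, and the same start-time rule would apply to the
  TSO nodes that resume serving.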

### Describe alternatives you've considered

  - Manually restarting or recovering TSO nodes. This still leaves an availability gap.
  - Manually merging all keyspace groups into the default group before fallback. This is operationally heavy and not automatic.
  - Relying only on documentation or operational playbooks. This increases operator burden and makes migration less predictable.

### Teachability, Documentation, Adoption, Migration Strategy

  This feature would be helpful for:
  - TSO disaster recovery.
  - Migration of existing PD clusters to the microservice architecture.
  - Temporary TSO service outages during upgrade or deployment mistakes.
  - Clearer operational semantics for clusters using non-default keyspace groups.

  It would also be helpful to document:
  - whether `enable-tso-dynamic-switching` is recommended in production,
  - what guarantees currently hold for default vs non-default keyspace groups,
  - and what the expected failover / recovery procedure is in microservice mode.
