Feature Request: complete TSO failover to PD in microservice mode when TSO nodes are unavailable #10485

@King-Dylan

Feature Request

Describe your feature request related problem

PD already has partial support for TSO dynamic switching in microservice mode. When micro-service.enable-tso-dynamic-switching is
enabled and no TSO microservice instances are discovered, PD can resume serving TSO.

However, this behavior does not appear to fully cover keyspaces assigned to non-default keyspace groups. In the current client-side
TSO discovery logic, if a keyspace has tso_keyspace_group_id != 0, it explicitly does not fall back to group 0 when all TSO
microservice endpoints are unavailable, and the client returns an error instead.
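The decision described above can be sketched roughly as follows. This is only an illustrative model of the reported behavior, not the actual PD client code: `pickTSOTarget`, `tsoTarget`, and `errNoTSOEndpoint` are hypothetical names, and the sketch assumes dynamic switching is enabled so that the default group can fall back to PD.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultKeyspaceGroupID is the default keyspace group (group 0 in the text above).
const defaultKeyspaceGroupID uint32 = 0

// errNoTSOEndpoint is a hypothetical error used only in this sketch.
var errNoTSOEndpoint = errors.New("no TSO microservice endpoint available")

// tsoTarget describes where the client would send TSO requests.
type tsoTarget struct {
	UsePD   bool   // true when PD itself serves TSO
	GroupID uint32 // keyspace group the request is routed to
	Addr    string // address of the chosen endpoint
}

// pickTSOTarget models the reported behavior: with no TSO endpoints
// discovered, only the default keyspace group falls back to PD; any
// other group gets an error instead of being redirected to group 0.
func pickTSOTarget(groupID uint32, tsoEndpoints []string, pdAddr string) (tsoTarget, error) {
	if len(tsoEndpoints) > 0 {
		// Normal path: a TSO microservice instance is available.
		return tsoTarget{GroupID: groupID, Addr: tsoEndpoints[0]}, nil
	}
	if groupID != defaultKeyspaceGroupID {
		// The gap this issue is about: no fallback for non-default groups.
		return tsoTarget{}, fmt.Errorf("keyspace group %d: %w", groupID, errNoTSOEndpoint)
	}
	return tsoTarget{UsePD: true, GroupID: defaultKeyspaceGroupID, Addr: pdAddr}, nil
}

func main() {
	// Simulate a full TSO outage (no endpoints) for groups 0 and 3.
	for _, g := range []uint32{0, 3} {
		target, err := pickTSOTarget(g, nil, "pd-0:2379")
		fmt.Printf("group=%d target=%+v err=%v\n", g, target, err)
	}
}
```

In this model a keyspace in group 3 fails during a full TSO outage while a keyspace in group 0 keeps working, which is the asymmetry this request asks to close.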

As a result, a PD cluster in microservice mode can still lose TSO availability during a full TSO service outage, especially for
clusters already using non-default keyspace groups. This weakens the disaster recovery story and makes migration from classic PD to
the microservice architecture harder.

Describe the feature you'd like

Complete the TSO failover path in microservice mode so that when TSO nodes are unavailable, PD can automatically and safely take
over TSO serving without manual intervention.

Ideally:

  • PD should detect that the TSO microservice is unavailable and resume serving TSO automatically.
  • The behavior should be well-defined for non-default keyspace groups as well, while preserving TSO monotonicity.
  • If merging all keyspace groups back to the default group is required before takeover, that workflow should be automated or
    clearly integrated into the failover path.
  • When the TSO microservice recovers, the system should be able to switch back in a controlled and safe way.
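One possible way to preserve monotonicity across a takeover is sketched below. It rests on an assumption not stated in this issue: that each keyspace group persists an upper bound on the physical timestamps it has issued, and that PD can read all of those bounds before serving. `takeoverFloor`, `nextPhysical`, and the guard interval are hypothetical names and parameters for illustration only.

```go
package main

import "fmt"

// takeoverFloor returns the lowest physical timestamp (in milliseconds)
// that PD may hand out after taking over. It is the highest persisted
// bound across all keyspace groups plus a guard interval, so nothing PD
// issues can collide with a timestamp any group issued before the outage.
// guardMs is assumed to be positive.
func takeoverFloor(lastSavedMs map[uint32]int64, guardMs int64) int64 {
	var highest int64
	for _, ts := range lastSavedMs {
		if ts > highest {
			highest = ts
		}
	}
	return highest + guardMs
}

// nextPhysical picks the physical part of the next timestamp: the wall
// clock if it is already past the floor, otherwise the floor itself.
func nextPhysical(nowMs, floorMs int64) int64 {
	if nowMs > floorMs {
		return nowMs
	}
	return floorMs
}

func main() {
	// Hypothetical persisted bounds for three keyspace groups.
	saved := map[uint32]int64{0: 1000, 1: 1500, 2: 1200}
	floor := takeoverFloor(saved, 50)
	fmt.Println("floor:", floor)
	fmt.Println("clock behind floor:", nextPhysical(1400, floor))
	fmt.Println("clock ahead of floor:", nextPhysical(2000, floor))
}
```

The same floor computation would apply in the other direction: before switching back, the recovered TSO microservice would need to start above the highest timestamp PD issued while it was in charge.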

Describe alternatives you've considered

  • Manually restarting or recovering TSO nodes. This still leaves an availability gap.
  • Manually merging all keyspace groups into the default group before fallback. This is operationally heavy and not automatic.
  • Relying only on documentation or operational playbooks. This increases operator burden and makes migration less predictable.

Teachability, Documentation, Adoption, Migration Strategy

This feature would be useful for:

  • TSO disaster recovery.
  • Migration of existing PD clusters to the microservice architecture.
  • Temporary TSO service outages during upgrade or deployment mistakes.
  • Clearer operational semantics for clusters using non-default keyspace groups.

It would also be helpful to document:

  • whether enable-tso-dynamic-switching is recommended in production,
  • what guarantees currently hold for default vs non-default keyspace groups,
  • and what the expected failover / recovery procedure is in microservice mode.

Metadata

    Labels

    contribution, type/feature-request
