Support multi-node TSO keyspace group discovery #10454

@bufferflies

Problem

When a keyspace group has multiple TSO nodes (>2), client and server behavior is not robust when a request reaches a node that has watched the keyspace group metadata but is not currently serving that group's allocator.

This can lead to:

  • FindGroupByKeyspaceID returning an error instead of usable keyspace group metadata
  • TSO client discovery failing to retry through another valid node
  • Health checks reporting internal errors when the allocator is absent on the current node
  • split-transition paths touching allocator state without guarding against nil allocators
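
To make the first and last failure modes concrete, here is a minimal sketch of a lookup that tolerates a missing allocator. All names (`manager`, `groupMeta`, `allocator`, `lookup`, `primaryOf`) are hypothetical stand-ins, not the actual PD types; the only assumption is that watched metadata and locally served allocators are tracked separately.

```go
package main

import "fmt"

// allocator is a hypothetical stand-in for a per-group TSO allocator.
type allocator struct{ primary string }

// groupMeta is a hypothetical, simplified keyspace group record.
type groupMeta struct {
	ID      uint32
	Members []string
}

// manager tracks every watched group, but only holds allocators
// for the groups this node actually serves.
type manager struct {
	groups     map[uint32]*groupMeta
	allocators map[uint32]*allocator
}

// lookup returns metadata for any watched group and reports separately
// whether this node serves it. A missing allocator is not an error.
func (m *manager) lookup(id uint32) (*groupMeta, bool, error) {
	meta, ok := m.groups[id]
	if !ok {
		return nil, false, fmt.Errorf("keyspace group %d not found", id)
	}
	return meta, m.allocators[id] != nil, nil
}

// primaryOf guards against a nil allocator before touching its state.
func (m *manager) primaryOf(id uint32) (string, bool) {
	a := m.allocators[id]
	if a == nil {
		return "", false
	}
	return a.primary, true
}

func main() {
	m := &manager{
		groups:     map[uint32]*groupMeta{1: &groupMeta{ID: 1, Members: []string{"tso-1", "tso-2"}}},
		allocators: map[uint32]*allocator{}, // this node serves nothing
	}
	meta, serving, err := m.lookup(1)
	fmt.Println(meta.Members, serving, err)
}
```

Keeping "known here" and "served here" as separate answers is what lets a non-member node stay useful to a client instead of failing the request.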

What is expected?

For multi-node TSO deployments:

  • non-serving nodes should still return enough keyspace group metadata for clients to continue discovery
  • clients should retry against another TSO node when the first node is not serving the allocator
  • health checks should distinguish "allocator not found" from internal failures
  • membership update logic should be safe when allocators are temporarily absent
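
The client-side expectation can be sketched as a loop over candidate nodes. This is an illustration under stated assumptions, not the real TSO client: `FindFunc`, `GroupMeta`, `discover`, and `ErrNoServingNode` are hypothetical names, and the sketch assumes each node reports whether it serves the group alongside the metadata it returns.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoServingNode is a hypothetical error returned when every
// candidate was tried and none serves the keyspace group.
var ErrNoServingNode = errors.New("no TSO node is serving the keyspace group")

// GroupMeta is a hypothetical, simplified keyspace group record.
type GroupMeta struct {
	ID      uint32
	Members []string
}

// FindFunc asks one TSO node for the group metadata and whether
// that node currently serves the group's allocator.
type FindFunc func(addr string) (meta *GroupMeta, serving bool, err error)

// discover tries the candidates in order and returns the first node
// that serves the group. Nodes that fail or do not serve are skipped.
func discover(candidates []string, find FindFunc) (*GroupMeta, string, error) {
	for _, addr := range candidates {
		meta, serving, err := find(addr)
		if err != nil || !serving {
			continue // try the next TSO node
		}
		return meta, addr, nil
	}
	return nil, "", ErrNoServingNode
}

func main() {
	// tso-0 has watched the group but is not a member.
	serving := map[string]bool{"tso-0": false, "tso-1": true, "tso-2": true}
	find := func(addr string) (*GroupMeta, bool, error) {
		return &GroupMeta{ID: 1, Members: []string{"tso-1", "tso-2"}}, serving[addr], nil
	}
	meta, addr, err := discover([]string{"tso-0", "tso-1", "tso-2"}, find)
	fmt.Println(meta.ID, addr, err)
}
```

A real client could also use the `Members` list returned by a non-serving node to choose the next candidate; the sketch simply walks a fixed list.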

Reproduction

  1. Start a TSO cluster with 3 nodes.
  2. Create a keyspace group with only a subset of nodes as members.
  3. Send FindGroupByKeyspaceID or client discovery traffic to a non-member TSO node.
  4. Observe discovery and health behavior.

Proposal

  • treat ErrGetAllocator as a non-fatal "not served here" signal
  • keep returning keyspace group metadata to the client
  • retry client discovery through another TSO node
  • return 404 from health API when allocator is absent
  • add integration coverage for multi-node behavior and split flow
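
The health API part of the proposal amounts to mapping one specific error to 404 and everything else to 500. A minimal sketch, assuming the allocator-lookup error can be matched with `errors.Is`; `errAllocatorNotFound` is a hypothetical stand-in for ErrGetAllocator, and `healthStatus` and `healthHandler` are illustrative names rather than the actual handler.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errAllocatorNotFound is a hypothetical stand-in for ErrGetAllocator.
var errAllocatorNotFound = errors.New("allocator not found")

// healthStatus separates "not served here" from internal failures.
func healthStatus(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errAllocatorNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

// healthHandler wraps a check function in an HTTP handler.
func healthHandler(check func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		err := check()
		code := healthStatus(err)
		if err != nil {
			http.Error(w, err.Error(), code)
			return
		}
		w.WriteHeader(code)
	}
}

func main() {
	// Simulate a node that has no allocator for the requested group.
	h := healthHandler(func() error {
		return fmt.Errorf("keyspace group 1: %w", errAllocatorNotFound)
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	fmt.Println(rec.Code)
}
```

Matching with `errors.Is` rather than comparing strings keeps the 404 path working even when the error is wrapped with extra context.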

Labels

  • affects-8.5: This bug affects the 8.5.x (LTS) versions.
  • severity/major
  • type/bug: The issue is confirmed as a bug.
