doc: add design proposal for topology-aware cluster selection #5986
WMP wants to merge 1 commit into ceph:devel from
Conversation
Add a design document describing topology-aware multi-cluster volume provisioning. This enables the CSI driver to dynamically select the appropriate Ceph cluster at CreateVolume time based on the node's topology zone.

The proposal introduces two configuration mechanisms:
- `topologyDomainLabels` field in config.json cluster entries
- `clusterIDs` StorageClass parameter (comma-separated list)

Ref: ceph#5177

Signed-off-by: Marcin Janowski <marcin.janowski@assecobs.pl>
Hey @WMP, thanks for the contributions!
We would love to review and accept contributions. Before you propose a design for a new feature, have you tested and understood how topology-based provisioning currently works in k8s, CSI, and cephcsi today? This would help you understand the feature in depth and review the proposed improvements yourself. It would certainly boost our confidence when reviewing this design document.
Sounds like a nice feature to me. There are Rook users that deploy/maintain different Ceph clusters at different locations, but still use a single large Kubernetes cluster. @travisn might be interested in this feature too? Any input/feedback is appreciated.
There is already topology provisioning for a single ceph cluster. For reference, see this example for topology-based provisioning of an external cluster. The approach is similar for an internal ceph cluster as well, though we don't have that example in the docs currently.

Trying to support multiple ceph clusters from a single storage class sounds to me like a big change for the csi driver to support, although I'll let others decide on that. But let's be clear about the needed scenario. Is the need for scale, multi-tenancy, or something else? I would be surprised if Ceph didn't already scale enough to support all the storage needs for a single K8s cluster. And the existing topology-based provisioning based on pools in a single ceph cluster can already handle multi-tenancy. Separate pools can already be defined using device classes.
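For context, the existing single-cluster mechanism mentioned above is expressed through the `topologyConstrainedPools` StorageClass parameter. A rough sketch, assuming two zone-local pools in the same ceph cluster (name, pool, and zone values are placeholders):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-topology
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>   # placeholder: the single ceph cluster
  # One pool per topology segment, all within the same ceph cluster:
  topologyConstrainedPools: |
    [{"poolName": "pool-zone-a",
      "domainSegments": [{"domainLabel": "zone", "value": "zone-a"}]},
     {"poolName": "pool-zone-b",
      "domainSegments": [{"domainLabel": "zone", "value": "zone-b"}]}]
volumeBindingMode: WaitForFirstConsumer
```

Note that this selects a pool per zone, not a cluster per zone, which is the gap the proposal below addresses.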
Please allow me to describe our use case for multiple ceph cluster support. I hope this provides motivation for why this feature could be useful to the community.

We run our infrastructure in several independent deployments that we call "zones". Each zone has all components (hardware/software/config) necessary to be able to stand on its own. While zones are currently connected via high-bandwidth, low-latency links, we make no strict assumptions of that remaining true in the future (e.g. zone migration to a different data center may raise latency or decrease cross-zone throughput).

Kubernetes clusters span multiple zones for redundancy reasons. We require the apps (inside k8s) to remain functional even when a whole zone is offline. Zones' configuration management is structured such that worst-case errors would only affect a single zone at a time. If we were to run a single ceph cluster spanning multiple zones, we would encounter the following trade-offs (please correct me if I misunderstand the topologyConstrainedPools use case):

To keep the zones independent and CRUSH maps simple, we went with the multi-cluster approach. This also helps us roll out upgrades to our ceph clusters with little stress, at the expense of a few more repetitions.
There are two statements that seem conflicting:
How do you expect to have each zone be independent, and also apps remain functional even when a zone is offline? I would expect that if the apps are to remain online, their data must also be online, which would require the data to be replicated across multiple zones.
I apologize for the confusion here. The independence of a zone is not at the application level. "Software" here means the hypervisor layer, not the end-user's application.

You can think of a zone as a rack of machines: it spans multiple physical machines "vertically". A zone can function on its own as a hosting platform for VMs/k8s-nodes. A zone does not care about the end-user apps. A zone's goal is to ensure that VMs are running with as little downtime and as much performance as possible. We consider a zone to be a failure domain for the apps (the end-user apps know about zones, but zones are not aware of apps and do not have a goal of singlehandedly supporting an end-user app).

k8s clusters span zones "horizontally" - e.g. multiple physical machines from different racks (in different zones) host VMs that are part of a k8s cluster. k8s clusters are highly available - control plane/ingress/load balancing are distributed across all zones and remain functional even when one zone is offline. "We require the apps (inside k8s) to remain functional even when a whole zone is offline" means that it is the end-user app's responsibility to ensure its state is properly synchronized between the zones, not the storage layer's task.

For other storage backends, like local NVMes, it is very easy to expose them as a storage class in such a cluster even when VMs are in different zones. We would like ceph to be another such backend - local to each zone, but exposed as a single storage class within the k8s cluster for the apps to use. Also, some of the apps do not even require state synchronization (you can think of them as jobs), but they need some storage (e.g. scratch space) to function. An end-user app with many worker pods just needs enough capacity (roughly equal across all zones) to remain functional, and may not even care what zone a new pod launches in as long as there is capacity.

We would want to make it an easy problem by utilizing the same storage class across all zones.
Do the applications have their own redundancy, and thus do not require the storage to be replicated across zones? In that case, you can create a single storage class with topology-awareness using device classes and Ceph pools for each device class on OSDs in separate zones. Did you read the topology-awareness example I linked in the previous comment?
Yes, end-user applications are responsible for their own data replication/redundancy. This would require a single ceph cluster, which is exactly what we are trying to avoid. I do not see where the example covers the case of a single storage class composed from multiple ceph clusters.
Describe what this PR does
Design proposal for topology-aware multi-cluster volume provisioning.
This enables the CSI driver to dynamically select the appropriate Ceph
cluster at CreateVolume time based on the node's topology zone.
The proposal introduces two new configuration mechanisms:

- `topologyDomainLabels` field in config.json cluster entries: associates each cluster with Kubernetes topology labels
- `clusterIDs` StorageClass parameter: a comma-separated list of candidate cluster IDs for topology-based selection
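A hypothetical config.json snippet illustrating the proposed field. The cluster IDs and monitor addresses are placeholders, and the exact shape of `topologyDomainLabels` is part of the proposal and may change during review:

```json
[
  {
    "clusterID": "ceph-zone-a",
    "monitors": ["10.0.1.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-a"
    }
  },
  {
    "clusterID": "ceph-zone-b",
    "monitors": ["10.0.2.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-b"
    }
  }
]
```

A matching StorageClass would then set `clusterIDs: "ceph-zone-a,ceph-zone-b"` and `volumeBindingMode: WaitForFirstConsumer`, letting the provisioner pick the cluster whose labels match the scheduled node's zone.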
This is a design-only PR. Implementation will follow in a separate PR
once the design is approved.
Is there anything that requires special attention
- StorageClasses with a single `clusterID` work unchanged.
- `volumeBindingMode: WaitForFirstConsumer` is required for topology-based selection (Kubernetes must provide AccessibilityRequirements).
- The `clusterID` parameter takes priority when present; topology selection is only used as a fallback via the new `clusterIDs` parameter.
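The selection rules above can be sketched as a small Go function. This is an illustrative sketch, not the proposed implementation: `ClusterEntry`, `selectCluster`, and the parameter names mirror the proposal but are assumptions here.

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterEntry mirrors a hypothetical config.json entry carrying the
// proposed topologyDomainLabels field (shape is illustrative).
type ClusterEntry struct {
	ClusterID            string
	TopologyDomainLabels map[string]string // e.g. {"topology.kubernetes.io/zone": "zone-a"}
}

// selectCluster applies the proposed priority: an explicit clusterID
// parameter wins; otherwise the first clusterIDs candidate whose
// topology labels all match the CreateVolume accessibility segments is used.
func selectCluster(params map[string]string, clusters []ClusterEntry, segments map[string]string) (string, error) {
	if id := params["clusterID"]; id != "" {
		return id, nil // explicit clusterID takes priority
	}
	for _, id := range strings.Split(params["clusterIDs"], ",") {
		id = strings.TrimSpace(id)
		for _, c := range clusters {
			if c.ClusterID != id {
				continue
			}
			matched := true
			for label, want := range c.TopologyDomainLabels {
				if segments[label] != want {
					matched = false
					break
				}
			}
			if matched {
				return id, nil
			}
		}
	}
	return "", fmt.Errorf("no candidate cluster matches topology %v", segments)
}

func main() {
	clusters := []ClusterEntry{
		{"ceph-a", map[string]string{"topology.kubernetes.io/zone": "zone-a"}},
		{"ceph-b", map[string]string{"topology.kubernetes.io/zone": "zone-b"}},
	}
	seg := map[string]string{"topology.kubernetes.io/zone": "zone-b"}
	id, _ := selectCluster(map[string]string{"clusterIDs": "ceph-a,ceph-b"}, clusters, seg)
	fmt.Println(id) // prints ceph-b
}
```

The sketch also shows why `WaitForFirstConsumer` matters: without the node's topology segments there is nothing to match `topologyDomainLabels` against.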
Related issues
Ref: #5177
Future concerns
- `clusterID` fully optional when `clusterIDs` is provided
- `topologyConstrainedPools` for selecting both cluster and pool
Checklist: