doc: add design proposal for topology-aware cluster selection #5986
WMP wants to merge 1 commit into ceph:devel from
Conversation
Add a design document describing topology-aware multi-cluster volume provisioning. This enables the CSI driver to dynamically select the appropriate Ceph cluster at CreateVolume time based on the node's topology zone.

The proposal introduces two configuration mechanisms:
- `topologyDomainLabels` field in config.json cluster entries
- `clusterIDs` StorageClass parameter (comma-separated list)

Ref: ceph#5177

Signed-off-by: Marcin Janowski <marcin.janowski@assecobs.pl>
Hey @WMP, thanks for the contributions!
We would love to review and accept contributions. Before you propose a design for a new feature, have you tested and understood how topology-based provisioning currently works in k8s, CSI, and cephcsi today? This would help you understand the feature in depth and review the proposed improvements yourself. It would certainly boost our confidence when reviewing this design document.
Sounds like a nice feature to me. There are Rook users that deploy/maintain different Ceph clusters at different locations, but still use a single large Kubernetes cluster. @travisn might be interested in this feature too? Any input/feedback is appreciated.
There is already topology provisioning for a single ceph cluster. For reference, see this example for topology-based provisioning of an external cluster. The approach is similar for an internal ceph cluster as well, though we don't have that example in the docs currently.

Trying to support multiple ceph clusters from a single storage class sounds to me like a big change for the csi driver to support, although I'll let others decide on that. But let's be clear about the needed scenario. Is the need for scale, multi-tenancy, or something else? I would be surprised if Ceph didn't already scale enough to support all the storage needs for a single K8s cluster. And the existing topology-based provisioning based on pools in a single ceph cluster can already handle multi-tenancy. Separate pools can already be defined using device classes.
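For context, the existing single-cluster mechanism mentioned above is expressed through the `topologyConstrainedPools` StorageClass parameter. A rough sketch, assuming two zone-local pools in the same ceph cluster (name, pool, and zone values are placeholders):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-topology
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>   # placeholder: the single ceph cluster
  # One pool per topology segment, all within the same ceph cluster:
  topologyConstrainedPools: |
    [{"poolName": "pool-zone-a",
      "domainSegments": [{"domainLabel": "zone", "value": "zone-a"}]},
     {"poolName": "pool-zone-b",
      "domainSegments": [{"domainLabel": "zone", "value": "zone-b"}]}]
volumeBindingMode: WaitForFirstConsumer
```

Note that this selects a pool per zone, not a cluster per zone, which is the gap the proposal below addresses.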
Please allow me to describe our use case for multiple ceph cluster support. I hope this provides motivation for why this feature could be useful to the community.

We run our infrastructure in several independent deployments that we call "zones". Each zone has all components (hardware/software/config) necessary to be able to stand on its own. While zones are currently connected via high-bandwidth, low-latency links, we make no strict assumptions of that remaining true in the future (e.g. zone migration to a different data center may raise latency or decrease cross-zone throughput).

Kubernetes clusters span multiple zones for redundancy reasons. We require the apps (inside k8s) to remain functional even when a whole zone is offline. Zones' configuration management is structured such that worst-case errors would only affect a single zone at a time. If we were to run a single ceph cluster spanning multiple zones, we would encounter the following trade-offs (please correct me if I misunderstand the topologyConstrainedPools use case):

To keep the zones independent and CRUSH maps simple, we went with the multi-cluster approach. This also helps us roll out upgrades to our ceph clusters with little stress, at the expense of a few more repetitions.
There are two statements that seem conflicting:
How do you expect to have each zone be independent, and also apps remain functional even when a zone is offline? I would expect that if the apps are to remain online, their data must also be online, which would require the data to be replicated across multiple zones.
I apologize for the confusion here. The independence of a zone is not at the application level. "Software" here means the hypervisor layer, not the end-user's application.

You can think of a zone as a rack of machines: it spans multiple physical machines "vertically". A zone can function on its own as a hosting platform for VMs/k8s-nodes. A zone does not care about the end-user apps. A zone's goal is to ensure that VMs are running with as little downtime and as much performance as possible. We consider a zone to be a failure domain for the apps (the end-user apps know about zones, but zones are not aware of apps and do not have a goal of singlehandedly supporting an end-user app).

k8s clusters span zones "horizontally" - e.g. multiple physical machines from different racks (in different zones) host VMs that are part of a k8s cluster. k8s clusters are highly available - control plane/ingress/load balancing are distributed across all zones and remain functional even when one zone is offline. "We require the apps (inside k8s) to remain functional even when a whole zone is offline" means that it is the end-user app's responsibility to ensure its state is properly synchronized between the zones, not the storage layer's task.

For other storage backends, like local NVMes, it is very easy to expose them as a storage class in such a cluster even when VMs are in different zones. We would like ceph to be another such backend - local to each zone, but exposed as a single storage class within the k8s cluster for the apps to use. Also, some of the apps do not even require state synchronization (you can think of them as jobs), but they need some storage (e.g. scratch space) to function. An end-user app with many worker pods just needs enough capacity (roughly equal across all zones) to remain functional, and may not even care what zone a new pod launches in as long as there is capacity.

We would want to make it an easy problem by utilizing the same storage class across all zones.
Do the applications have their own redundancy, and thus do not require the storage to be replicated across zones? In that case, you can create a single storage class with topology-awareness using device classes and Ceph pools for each device class on OSDs in separate zones. Did you read the topology-awareness example I linked in the previous comment?
Yes, end-user applications are responsible for their own data replication/redundancy. This would require a single ceph cluster, which is exactly what we are trying to avoid. I do not see where the example covers the case of a single storage class composed from multiple ceph clusters.
Describe what this PR does
Design proposal for topology-aware multi-cluster volume provisioning.
This enables the CSI driver to dynamically select the appropriate Ceph
cluster at CreateVolume time based on the node's topology zone.
The proposal introduces two new configuration mechanisms:

- `topologyDomainLabels` field in config.json cluster entries: associates each cluster with Kubernetes topology labels
- `clusterIDs` StorageClass parameter: a comma-separated list of candidate cluster IDs for topology-based selection
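A hypothetical config.json snippet illustrating the proposed field. The cluster IDs and monitor addresses are placeholders, and the exact shape of `topologyDomainLabels` is part of the proposal and may change during review:

```json
[
  {
    "clusterID": "ceph-zone-a",
    "monitors": ["10.0.1.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-a"
    }
  },
  {
    "clusterID": "ceph-zone-b",
    "monitors": ["10.0.2.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-b"
    }
  }
]
```

A matching StorageClass would then set `clusterIDs: "ceph-zone-a,ceph-zone-b"` and `volumeBindingMode: WaitForFirstConsumer`, letting the provisioner pick the cluster whose labels match the scheduled node's zone.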
This is a design-only PR. Implementation will follow in a separate PR
once the design is approved.
Is there anything that requires special attention
- StorageClasses with a single `clusterID` work unchanged.
- `volumeBindingMode: WaitForFirstConsumer` is required for topology-based selection (Kubernetes must provide AccessibilityRequirements).
- The `clusterID` parameter takes priority when present; topology selection is only used as a fallback via the new `clusterIDs` parameter.
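The selection rules above can be sketched as a small Go function. This is an illustrative sketch, not the proposed implementation: `ClusterEntry`, `selectCluster`, and the parameter names mirror the proposal but are assumptions here.

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterEntry mirrors a hypothetical config.json entry carrying the
// proposed topologyDomainLabels field (shape is illustrative).
type ClusterEntry struct {
	ClusterID            string
	TopologyDomainLabels map[string]string // e.g. {"topology.kubernetes.io/zone": "zone-a"}
}

// selectCluster applies the proposed priority: an explicit clusterID
// parameter wins; otherwise the first clusterIDs candidate whose
// topology labels all match the CreateVolume accessibility segments is used.
func selectCluster(params map[string]string, clusters []ClusterEntry, segments map[string]string) (string, error) {
	if id := params["clusterID"]; id != "" {
		return id, nil // explicit clusterID takes priority
	}
	for _, id := range strings.Split(params["clusterIDs"], ",") {
		id = strings.TrimSpace(id)
		for _, c := range clusters {
			if c.ClusterID != id {
				continue
			}
			matched := true
			for label, want := range c.TopologyDomainLabels {
				if segments[label] != want {
					matched = false
					break
				}
			}
			if matched {
				return id, nil
			}
		}
	}
	return "", fmt.Errorf("no candidate cluster matches topology %v", segments)
}

func main() {
	clusters := []ClusterEntry{
		{"ceph-a", map[string]string{"topology.kubernetes.io/zone": "zone-a"}},
		{"ceph-b", map[string]string{"topology.kubernetes.io/zone": "zone-b"}},
	}
	seg := map[string]string{"topology.kubernetes.io/zone": "zone-b"}
	id, _ := selectCluster(map[string]string{"clusterIDs": "ceph-a,ceph-b"}, clusters, seg)
	fmt.Println(id) // prints ceph-b
}
```

The sketch also shows why `WaitForFirstConsumer` matters: without the node's topology segments there is nothing to match `topologyDomainLabels` against.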
Related issues
Ref: #5177
Future concerns
- `clusterID` fully optional when `clusterIDs` is provided
- `topologyConstrainedPools` for selecting both cluster and pool
Checklist: