Commit b26cb70

doc: add design proposal for K8s SA based volume access restriction

Signed-off-by: Rakshith R <rar@redhat.com>

# Kubernetes ServiceAccount Based Volume Access Restriction

## Introduction

This proposal introduces an optional mechanism to restrict volume access based
on the Kubernetes ServiceAccount of the Pod mounting the volume. When
configured, only Pods running with the specified ServiceAccount are allowed to
mount the volume. All other mount attempts are rejected with a
`PermissionDenied` error.

The restriction is stored as metadata on the backend Ceph object (RBD image
metadata or CephFS subvolume metadata) and is enforced at mount time through
the CSI [`podInfoOnMount`][pod-info-on-mount] mechanism.

[pod-info-on-mount]:
<https://kubernetes-csi.github.io/docs/pod-info.html#pod-info-on-mount-with-csi-driver-object>

## Motivation

Ceph-CSI volumes are accessible to any Pod that has a valid PVC reference and
the necessary RBAC to use the StorageClass. In multi-tenant and data pipeline
environments, this is insufficient: there are scenarios where a volume should
be exclusively accessible to a specific workload identity even when other Pods
in the same namespace can reference the PVC.

### Use Case: Ceph VolSync Plugin Replication Destination PVC Protection

A primary motivator for this feature is the custom
[Ceph VolSync Plugin](https://github.com/RamenDR/ceph-volsync-plugin), which
performs incremental data replication across clusters. In a disaster recovery
or migration workflow:

1. A `ReplicationDestination` controller creates a PVC on the destination
   cluster to receive replicated data.
1. A replication worker Pod, running under a dedicated ServiceAccount (e.g.
   `volsync-worker-sa`), incrementally syncs data from the source cluster into
   this destination PVC.
1. The destination PVC must remain writable only by the replication worker
   until the replication is complete and a failover is triggered.

Without a ServiceAccount based restriction, any Pod in the namespace with a
reference to the destination PVC could write to it, potentially corrupting the
replicated data or breaking the incremental sync state. By binding the
destination volume to the replication worker's ServiceAccount, the volume is
protected from unintended writes throughout the replication lifecycle. On
failover, the restriction is removed so the application workload can mount
the volume.

### Other Potential Use Cases

- **Sensitive data volumes**: restrict access to volumes containing regulated
  data to only the ServiceAccount authorized to process them.
- **Similar custom use cases**: any scenario where a workload identity needs
  exclusive access to a volume for data integrity or security reasons.

## Dependency

- The [`podInfoOnMount`][pod-info-on-mount] field must be set to `true` in the
  CSIDriver specification. This causes Kubelet to inject Pod information
  (including the ServiceAccount name) into the volume context during
  `NodePublishVolume`. Without this, the restriction cannot be enforced. Since
  this parameter is a mutable field in the CSIDriver spec, it will be enabled
  by default going forward (Ceph-CSI v3.17.0+).

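As a sketch, a CSIDriver object with this field enabled might look like the
following (the driver name is illustrative; only `podInfoOnMount: true` is the
point here):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: rbd.csi.ceph.com
spec:
  # Kubelet injects Pod details, including
  # csi.storage.k8s.io/serviceAccount.name, into the volume context
  # on NodePublishVolume.
  podInfoOnMount: true
```
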
## Design

### Metadata Keys

Each driver type uses a driver-specific metadata key to store the allowed
ServiceAccount name:

| Driver | Metadata Key | Storage |
|--------|--------------|---------|
| RBD | `.rbd.csi.ceph.com/serviceaccount` | RBD image metadata |
| CephFS | `.cephfs.csi.ceph.com/serviceaccount` | CephFS subvolume metadata |
| NVMe-oF | `.rbd.csi.ceph.com/serviceaccount` | RBD image metadata (via RBD backend) |
| NFS | `.cephfs.csi.ceph.com/serviceaccount` | CephFS subvolume metadata (via CephFS backend) |

Only a single ServiceAccount can be specified per volume.

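The table can be encoded as a small lookup. `serviceAccountMetadataKey` is a
hypothetical helper (not a function from the Ceph-CSI codebase), shown only to
make the delegation explicit:

```go
package main

import "fmt"

// serviceAccountMetadataKey mirrors the table above: NVMe-oF reuses the RBD
// key and NFS reuses the CephFS key, because they delegate to those backends.
func serviceAccountMetadataKey(driver string) (string, error) {
	switch driver {
	case "rbd", "nvmeof":
		return ".rbd.csi.ceph.com/serviceaccount", nil
	case "cephfs", "nfs":
		return ".cephfs.csi.ceph.com/serviceaccount", nil
	default:
		return "", fmt.Errorf("no ServiceAccount metadata key for driver %q", driver)
	}
}

func main() {
	for _, d := range []string{"rbd", "cephfs", "nvmeof", "nfs"} {
		key, err := serviceAccountMetadataKey(d)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-7s -> %s\n", d, key)
	}
}
```
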
### CSI Flow

The restriction is enforced across two CSI RPCs:

1. **ControllerPublishVolume**: The controller reads the ServiceAccount
   metadata from the Ceph backend. If present, it is included in the publish
   context passed to the node.
1. **NodePublishVolume**: The node plugin compares the publish context value
   against the Pod's ServiceAccount (provided by Kubelet via
   `csi.storage.k8s.io/serviceAccount.name` in the volume context). A mismatch
   results in a `PermissionDenied` error. If no restriction is set, or if
   `podInfoOnMount` is not enabled, the mount is allowed.

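A minimal sketch of the `NodePublishVolume` check described above, assuming
plain maps for the publish and volume contexts. The real validation returns a
gRPC `PermissionDenied` status; a plain error stands in for it here:

```go
package main

import "fmt"

const (
	// Key under which the controller propagates the restriction in the
	// publish context (the RBD key from the table above; illustrative).
	saPublishCtxKey = ".rbd.csi.ceph.com/serviceaccount"
	// Key under which Kubelet injects the Pod's ServiceAccount when
	// podInfoOnMount is enabled.
	saVolumeCtxKey = "csi.storage.k8s.io/serviceAccount.name"
)

// validateServiceAccountRestriction allows the mount when no restriction is
// present or when Pod info is unavailable, and rejects it on a mismatch.
func validateServiceAccountRestriction(publishCtx, volumeCtx map[string]string) error {
	allowedSA, restricted := publishCtx[saPublishCtxKey]
	if !restricted {
		return nil // no restriction stored on the volume
	}
	podSA, ok := volumeCtx[saVolumeCtxKey]
	if !ok || podSA == "" {
		return nil // podInfoOnMount disabled: restriction cannot be enforced
	}
	if podSA != allowedSA {
		return fmt.Errorf("PermissionDenied: volume is restricted to ServiceAccount %q, Pod runs as %q",
			allowedSA, podSA)
	}
	return nil
}

func main() {
	publishCtx := map[string]string{saPublishCtxKey: "volsync-worker-sa"}
	fmt.Println(validateServiceAccountRestriction(publishCtx,
		map[string]string{saVolumeCtxKey: "volsync-worker-sa"})) // <nil>
	fmt.Println(validateServiceAccountRestriction(publishCtx,
		map[string]string{saVolumeCtxKey: "default"}) != nil) // true
}
```
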
### Implementation

A shared validation function, `ValidateServiceAccountRestriction` in
`internal/util/validate.go`, is called at the beginning of `NodePublishVolume`
in all four drivers (RBD, CephFS, NFS, NVMe-oF), ensuring consistent
enforcement.

Each driver reads the restriction metadata in `ControllerPublishVolume` using
its backend:

- **RBD**: reads via `GetMetadata` in `internal/rbd/controllerserver.go`.
- **CephFS**: reads via `ListMetadata` in
  `internal/cephfs/controllerserver.go`.
- **NVMe-oF**: delegates to the RBD backend and propagates the publish context
  in `internal/nvmeof/controller/controllerserver.go`.
- **NFS**: delegates to the CephFS backend in
  `internal/nfs/controller/controllerserver.go`.

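The controller-side half can be sketched the same way, with a generic metadata
getter standing in for the driver-specific `GetMetadata`/`ListMetadata` calls
(`addServiceAccountToPublishContext` is a hypothetical helper, not code from
the repository):

```go
package main

import "fmt"

// RBD key from the table above; CephFS-backed drivers use their own key.
const saMetadataKey = ".rbd.csi.ceph.com/serviceaccount"

// addServiceAccountToPublishContext models the ControllerPublishVolume step:
// if the backend stores a restriction, copy it into the publish context so
// NodePublishVolume can enforce it.
func addServiceAccountToPublishContext(publishCtx map[string]string,
	getMetadata func(key string) (string, bool)) {
	if sa, ok := getMetadata(saMetadataKey); ok && sa != "" {
		publishCtx[saMetadataKey] = sa
	}
}

func main() {
	// Fake backend standing in for RBD image metadata.
	backend := map[string]string{saMetadataKey: "volsync-worker-sa"}
	getMetadata := func(key string) (string, bool) {
		v, ok := backend[key]
		return v, ok
	}

	publishCtx := map[string]string{}
	addServiceAccountToPublishContext(publishCtx, getMetadata)
	fmt.Println(publishCtx[saMetadataKey]) // volsync-worker-sa
}
```
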
## Setting and Removing the Restriction

The restriction is managed through Ceph CLI commands. Refer to the
"Kubernetes ServiceAccount Based Volume Access" sections in
[RBD deploy.md](../../rbd/deploy.md) and
[CephFS deploy.md](../../cephfs/deploy.md) for usage instructions and
examples.

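For orientation, the commands typically look like the following. This is a
sketch using the standard `rbd image-meta` and `ceph fs subvolume metadata`
commands with placeholder names; the linked deploy.md documents are
authoritative:

```shell
# RBD: set, inspect, and remove the restriction on an image
rbd image-meta set <pool>/<image> ".rbd.csi.ceph.com/serviceaccount" volsync-worker-sa
rbd image-meta get <pool>/<image> ".rbd.csi.ceph.com/serviceaccount"
rbd image-meta remove <pool>/<image> ".rbd.csi.ceph.com/serviceaccount"

# CephFS: the same operations on subvolume metadata
ceph fs subvolume metadata set <fsname> <subvolume> ".cephfs.csi.ceph.com/serviceaccount" volsync-worker-sa --group_name <group>
ceph fs subvolume metadata get <fsname> <subvolume> ".cephfs.csi.ceph.com/serviceaccount" --group_name <group>
ceph fs subvolume metadata rm <fsname> <subvolume> ".cephfs.csi.ceph.com/serviceaccount" --group_name <group>
```
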
## Ceph VolSync Plugin Integration Example

1. The replication destination worker sets the ServiceAccount restriction on
   the backing Ceph object (RBD image or CephFS subvolume) to the replication
   worker's ServiceAccount (e.g. `volsync-worker-sa`) on first use.
1. Only the worker Pod mounts the destination PVC successfully because its
   ServiceAccount matches. Any other Pod attempting to mount the same PVC is
   rejected with `PermissionDenied` during the `NodePublishVolume` call,
   protecting data integrity during incremental sync.
1. On replication destination deletion, the controller spins up a cleanup job
   that removes the ServiceAccount restriction metadata, allowing the
   application workload to mount the volume.

## Limitations

- Only a single ServiceAccount can be specified per volume.
- Enforced at CSI mount time only; does not prevent direct access to the
  underlying Ceph storage from outside Kubernetes.
- If `podInfoOnMount` is not enabled, the restriction is silently unenforced.
- Changing the restriction on an already-mounted volume does not affect
  existing mounts. The volume must be unmounted and remounted.
- Managed through Ceph CLI commands, not Kubernetes-native APIs.

## Future Enhancements

- Support restriction based on other Pod attributes (e.g. name, namespace) in
  addition to ServiceAccount.
- Add support for multiple allowed ServiceAccounts per volume.
- Provide more flexible key-value configuration options (e.g. accepting
  arbitrary expected key-value pairs in the volume context instead of a single
  ServiceAccount name).
