Validation Checklist
Version
master
Detailed Description
Hello,
I'm proposing an optional common/observability kustomize component that gives Kubeflow users a
ready-to-use monitoring foundation for a simple and quick observability setup.
The initial content focuses on GPU workloads (NVIDIA DCGM + AMD ROCm), since those are the
highest-value metrics for Kubeflow use cases, but the component is structured to grow with
additional dashboards and ServiceMonitors over time.
Motivation
Kubeflow manifests currently ship with no observability layer. Users who want visibility into
their cluster (GPU utilization, allocation, resource availability) must wire up the monitoring
stack themselves from scratch: Prometheus Operator, ServiceMonitor configuration, and Grafana
dashboard provisioning. This is a non-trivial barrier, especially for teams new to the ecosystem.
This contribution closes that gap by providing a concrete starting point that works with
standard upstream tooling (kube-prometheus-stack, NVIDIA/AMD GPU Operators), requires no
vendor-specific configuration, and integrates naturally into the existing common/ + kustomize
pattern used by the rest of the manifests repo.
Prior art
We have been running this observability stack in production at CERN on a Kubeflow cluster with
mixed NVIDIA and AMD GPU nodes, and we would be happy to contribute it to the community as a
starting point. Happy to open a PR if there is interest.
Our reference architecture is described here: https://architecture.cncf.io/architectures/cern-scientific-computing/
Proposed potential solution
As a start, I'm proposing a new directory common/observability/ following the same
base/ + overlays/kubeflow/ pattern used by other common/ components.
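A minimal base kustomization for the component could look like the following sketch. The file names and layout are illustrative assumptions, not a final design; depending on how the component is consumed, `kind: Component` (apiVersion `kustomize.config.k8s.io/v1alpha1`) may be preferable to a plain `Kustomization`:

```yaml
# common/observability/base/kustomization.yaml (sketch; file names are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - nvidia-dcgm-service-monitor.yaml
  - amd-gpu-service-monitor.yaml
  - kepler-service-monitor.yaml
  - dashboards/gpu-cluster-usage-configmap.yaml
  - dashboards/gpu-namespace-usage-configmap.yaml
  - dashboards/gpu-availability-configmap.yaml
```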
Contents
ServiceMonitors (requires Prometheus Operator):
- nvidia-dcgm-service-monitor.yaml — scrapes the DCGM exporter deployed by the NVIDIA GPU Operator (gpu-operator namespace, service label app: nvidia-dcgm-exporter, port gpu-metrics)
- amd-gpu-service-monitor.yaml — scrapes the AMD ROCm device metrics exporter deployed by the AMD GPU Operator (kube-amd-gpu namespace, service label app.kubernetes.io/name: device-metrics-exporter, port metrics)
- kepler-service-monitor.yaml — scrapes Kepler for per-pod energy consumption metrics (CPU + GPU power draw)
Unlike the GPU exporters (which are deployed by their respective GPU Operators), Kepler is not bundled with any other operator, so the component would also include Kepler's deployment manifests as an opt-in sub-component.
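For illustration, the NVIDIA ServiceMonitor could look roughly like this. The namespace, service label, and port name come from the description above; the `release` label and scrape interval are assumptions that would need to match the local kube-prometheus-stack deployment:

```yaml
# nvidia-dcgm-service-monitor.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  labels:
    release: kube-prometheus-stack  # assumed; must match Prometheus' serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 30s  # assumed scrape interval
```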
Grafana dashboards (provisioned as ConfigMaps with grafana_dashboard: "1"):
- GPU Cluster Usage — cluster-wide GPU utilization, memory, and resource usage per node
- GPU Namespace Usage — per-namespace GPU allocation and usage breakdown
- GPU Availability & Allocation — allocation ratios per GPU type, time-to-acquire-GPU metrics, pending GPU session tracking
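Provisioning would rely on the Grafana dashboard sidecar shipped with kube-prometheus-stack, which discovers any ConfigMap carrying the grafana_dashboard: "1" label. A sketch, with an assumed name and an empty placeholder in place of the exported dashboard JSON:

```yaml
# dashboards/gpu-cluster-usage-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard  # illustrative name
  labels:
    grafana_dashboard: "1"  # sidecar discovery label
data:
  # Real exported dashboard JSON goes here; an empty skeleton is shown as a placeholder.
  gpu-cluster-usage.json: |
    {"title": "GPU Cluster Usage", "panels": []}
```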
Steps to Reproduce
N/A
Screenshots or Videos (Optional)
No response