[Feature] Add opt-in observability stack #3426

@amine-lah

Description

Validation Checklist

  • I confirm that this is a Kubeflow-related issue.
  • I am reporting this in the appropriate repository.
  • I have followed the Kubeflow installation guidelines.
  • The issue report is detailed and includes version numbers where applicable.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).
  • This issue pertains to Kubeflow development.
  • I am available to work on this issue.
  • You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Version

master

Detailed Description

Hello,

I'm proposing an optional common/observability kustomize component that gives Kubeflow users a
ready-to-use monitoring foundation, aimed at those who want a simple, quick observability setup.

The initial content focuses on GPU workloads (NVIDIA DCGM + AMD ROCm), since those are the
highest-value metrics for Kubeflow use cases, but the component is structured to grow with
additional dashboards and ServiceMonitors over time.

Motivation

Kubeflow manifests currently ship with no observability layer. Users who want visibility into
their cluster (GPU utilization, allocation, resource availability) must wire up the monitoring
stack themselves from scratch: Prometheus Operator, ServiceMonitor configuration, Grafana
dashboard provisioning. This is a non-trivial barrier, especially for teams new to the ecosystem.

This contribution closes that gap by providing a concrete starting point that works with
standard upstream tooling (kube-prometheus-stack, NVIDIA/AMD GPU Operators), requires no
vendor-specific configuration, and integrates naturally into the existing common/ + kustomize
pattern used by the rest of the manifests repo.

Prior art

We have been running this observability stack in production at CERN on a Kubeflow cluster with
mixed NVIDIA and AMD GPU nodes, and we would be happy to contribute it to the community as a
starting point. Happy to open a PR if there is interest.
Our reference architecture is described here: https://architecture.cncf.io/architectures/cern-scientific-computing/

Proposed potential solution

As a starting point, I'm proposing a new directory, common/observability/, following the same
base/ + overlays/kubeflow/ pattern used by other common/ components.
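A possible layout, matching the base/ + overlays/kubeflow/ convention (file names are illustrative and would be finalized in the PR):

```
common/observability/
├── base/
│   ├── kustomization.yaml
│   ├── nvidia-dcgm-service-monitor.yaml
│   ├── amd-gpu-service-monitor.yaml
│   ├── kepler-service-monitor.yaml
│   └── dashboards/
└── overlays/
    └── kubeflow/
        └── kustomization.yaml
```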

Contents

ServiceMonitors (requires Prometheus Operator):

  • nvidia-dcgm-service-monitor.yaml — scrapes the DCGM exporter deployed by the NVIDIA GPU Operator (gpu-operator namespace, service label app: nvidia-dcgm-exporter, port gpu-metrics)
  • amd-gpu-service-monitor.yaml — scrapes the AMD ROCm device metrics exporter deployed by the AMD GPU Operator (kube-amd-gpu namespace, service label app.kubernetes.io/name: device-metrics-exporter, port metrics)
  • kepler-service-monitor.yaml — scrapes Kepler for per-pod energy consumption metrics (CPU + GPU power draw)

Unlike the GPU exporters (which are deployed by their respective GPU Operators), Kepler is not bundled with any other operator, so the component would also include Kepler's deployment manifests as an opt-in sub-component.
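As a sketch of what one of these ServiceMonitors would look like, here is the NVIDIA DCGM one, using the namespace, label, and port named above (the scrape interval is an illustrative assumption, and the selector would need to match the actual Service created by the GPU Operator):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 30s  # illustrative; would follow the cluster's default scrape interval
```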

Grafana dashboards (provisioned as ConfigMaps with grafana_dashboard: "1"):

  • GPU Cluster Usage — cluster-wide GPU utilization, memory, and resource usage per node
  • GPU Namespace Usage — per-namespace GPU allocation and usage breakdown
  • GPU Availability & Allocation — allocation ratios per GPU type, time-to-acquire-GPU metrics, pending GPU session tracking
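For reference, each dashboard would ship as a ConfigMap carrying the dashboard JSON, labeled so that the standard Grafana sidecar picks it up (ConfigMap name and file name are illustrative; the JSON body is elided):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard
  labels:
    grafana_dashboard: "1"
data:
  gpu-cluster-usage.json: |
    { ...dashboard JSON... }
```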

Steps to Reproduce

N/A

Screenshots or Videos (Optional)

No response
