Validation Checklist
Version
master
Detailed Description
Hello,
I'm proposing an optional common/observability kustomize component that gives Kubeflow users a
ready-to-use monitoring foundation for a simple and quick observability setup.
The initial content focuses on GPU workloads (NVIDIA DCGM + AMD ROCm), since those are the
highest-value metrics for Kubeflow use cases, but the component is structured to grow with
additional dashboards and ServiceMonitors over time.
Motivation
Kubeflow manifests currently ship with no observability layer. Users who want visibility into
their cluster (GPU utilization, allocation, resource availability) must wire up the monitoring
stack themselves from scratch: Prometheus Operator, ServiceMonitor configuration, and Grafana
dashboard provisioning. This is a non-trivial barrier, especially for teams new to the ecosystem.
This contribution closes that gap by providing a concrete starting point that works with
standard upstream tooling (kube-prometheus-stack, NVIDIA/AMD GPU Operators), requires no
vendor-specific configuration, and integrates naturally into the existing common/ + kustomize
pattern used by the rest of the manifests repo.
Prior art
We have been running this observability stack in production at CERN on a Kubeflow cluster with
mixed NVIDIA and AMD GPU nodes, and we would be happy to contribute it to the community as a
starting point. Happy to open a PR if there is interest.
Our reference architecture is described here: https://architecture.cncf.io/architectures/cern-scientific-computing/
Proposed potential solution
As a start, I'm proposing a new directory common/observability/ following the same
base/ + overlays/kubeflow/ pattern used by other common/ components.
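A minimal base kustomization for the component could look like the following sketch. The file names and layout are illustrative assumptions, not a final design; depending on how the component is consumed, `kind: Component` (apiVersion `kustomize.config.k8s.io/v1alpha1`) may be preferable to a plain `Kustomization`:

```yaml
# common/observability/base/kustomization.yaml (sketch; file names are assumptions)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - nvidia-dcgm-service-monitor.yaml
  - amd-gpu-service-monitor.yaml
  - kepler-service-monitor.yaml
  - dashboards/gpu-cluster-usage-configmap.yaml
  - dashboards/gpu-namespace-usage-configmap.yaml
  - dashboards/gpu-availability-configmap.yaml
```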
Contents
ServiceMonitors (requires Prometheus Operator):
- nvidia-dcgm-service-monitor.yaml — scrapes the DCGM exporter deployed by the NVIDIA GPU Operator (gpu-operator namespace, service label app: nvidia-dcgm-exporter, port gpu-metrics)
- amd-gpu-service-monitor.yaml — scrapes the AMD ROCm device metrics exporter deployed by the AMD GPU Operator (kube-amd-gpu namespace, service label app.kubernetes.io/name: device-metrics-exporter, port metrics)
- kepler-service-monitor.yaml — scrapes Kepler for per-pod energy consumption metrics (CPU + GPU power draw)
Unlike the GPU exporters (which are deployed by their respective GPU Operators), Kepler is not bundled with any other operator, so the component would also include Kepler's deployment manifests as an opt-in sub-component.
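For illustration, the NVIDIA ServiceMonitor could look roughly like this. The namespace, service label, and port name come from the description above; the `release` label and scrape interval are assumptions that would need to match the local kube-prometheus-stack deployment:

```yaml
# nvidia-dcgm-service-monitor.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  labels:
    release: kube-prometheus-stack  # assumed; must match Prometheus' serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 30s  # assumed scrape interval
```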
Grafana dashboards (provisioned as ConfigMaps with grafana_dashboard: "1"):
- GPU Cluster Usage — cluster-wide GPU utilization, memory, and resource usage per node
- GPU Namespace Usage — per-namespace GPU allocation and usage breakdown
- GPU Availability & Allocation — allocation ratios per GPU type, time-to-acquire-GPU metrics, pending GPU session tracking
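Provisioning would rely on the Grafana dashboard sidecar shipped with kube-prometheus-stack, which discovers any ConfigMap carrying the grafana_dashboard: "1" label. A sketch, with an assumed name and an empty placeholder in place of the exported dashboard JSON:

```yaml
# dashboards/gpu-cluster-usage-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard  # illustrative name
  labels:
    grafana_dashboard: "1"  # sidecar discovery label
data:
  # Real exported dashboard JSON goes here; an empty skeleton is shown as a placeholder.
  gpu-cluster-usage.json: |
    {"title": "GPU Cluster Usage", "panels": []}
```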
Steps to Reproduce
N/A
Screenshots or Videos (Optional)
No response