CI Commands
The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
docker push ${{ env.NPD_IMAGE }}:latest

- name: Build and push Sentinel image
  uses: docker/build-push-action@v6
Check warning from Code scanning / CodeQL: Unpinned tag for a non-immutable Action in workflow (Medium)
"labels": dict(node.metadata.labels or {}),
"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
Check failure from Code scanning / CodeQL: Incomplete URL substring sanitization (High)
Summary:
gcm-sentinel: AI-powered GPU cluster investigation agent
Summary
Adds gcm-sentinel, a new component to the GCM ecosystem that uses an LLM (Claude or GPT) to investigate GPU hardware failures detected by GCM Health Checks. When a node condition changes (e.g. GcmXidErrorsProblem), the agent queries Prometheus metrics, Kubernetes state, pod logs, and GPU exporter data, then produces a severity assessment, root cause analysis, and recommended remediation action.
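The investigation flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual gcm-sentinel code: the class and function names (`PromDataSource`, `investigate`, `fake_assess`) are hypothetical, and the LLM backend is stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    severity: str            # e.g. "critical", "warning", "info"
    root_cause: str
    recommended_action: str

class PromDataSource:
    """Stand-in for a Prometheus data-source plugin."""
    def collect(self, condition):
        # A real plugin would run PromQL queries scoped to the node
        return {"xid_errors_total": 3}   # fabricated sample metric

def investigate(condition, sources, assess):
    # Gather evidence from every registered data source, then hand
    # the bundle to the LLM backend for a structured assessment.
    evidence = {name: s.collect(condition) for name, s in sources.items()}
    return assess(condition, evidence)

# Stubbed LLM backend for illustration only
def fake_assess(condition, evidence):
    return Assessment("critical",
                      f"{condition}: repeated XID errors",
                      "cordon node")

result = investigate("GcmXidErrorsProblem",
                     {"prometheus": PromDataSource()},
                     fake_assess)
print(result.severity)  # critical
```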
Key design decisions
- Observe-only by default (actionMode=recommend). Remediation tools are absent from the LLM's schema unless explicitly set to execute mode. Safety enforced at the code level, not prompt level.
- Plugin-based data sources. Each data source (Prometheus, DCGM direct, node-exporter, K8s core, workloads, GCM health, Alertmanager) is a self-contained Python class. Adding a new one is one file + one line of registration.
- Multi-LLM support. Backend abstraction supports Anthropic and OpenAI APIs. Default: Claude Sonnet 4.6.
- Separate Helm chart at charts/gcm-sentinel/ (alongside existing charts/gcm/). Separate PyPI package (gcm-sentinel). Independent deployment, RBAC, and lifecycle from the core GCM DaemonSets.
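The code-level safety guarantee from the first bullet, where remediation tools never appear in the LLM's tool schema unless execute mode is set, could look something like this sketch (hypothetical names; the real module layout and tool definitions may differ):

```python
# Tools always exposed to the LLM: read-only investigation
READ_ONLY_TOOLS = [
    {"name": "query_prometheus", "description": "Run a PromQL query"},
    {"name": "get_pod_logs", "description": "Fetch logs for a pod"},
]

# Tools that mutate cluster state; withheld unless explicitly enabled
REMEDIATION_TOOLS = [
    {"name": "cordon_node", "description": "Mark a node unschedulable"},
]

def tools_for_mode(action_mode: str):
    """Build the tool schema sent to the LLM. In the default
    'recommend' mode the remediation tools are simply absent, so the
    model cannot invoke them regardless of what the prompt says."""
    if action_mode == "execute":
        return READ_ONLY_TOOLS + REMEDIATION_TOOLS
    return READ_ONLY_TOOLS
```

Gating at schema-construction time (rather than filtering tool calls after the fact) is what makes the safety property hold at the code level instead of the prompt level.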
Deploy to a cluster (observe-only):
```
# Create API key secret
kubectl create secret generic gcm-sentinel-llm \
--namespace monitoring \
--from-literal=api-key=YOUR_KEY
# Deploy — observe-only, single node
helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel \
--namespace monitoring \
--set llm.existingSecret=gcm-sentinel-llm \
--set sentinel.nodeAllowlist="YOUR_GPU_NODE_NAME"
# Verify
kubectl logs -n monitoring -l app=gcm-sentinel | head -10
kubectl get events -A --field-selector reason=GCMSentinel --watch
```
Differential Revision: D96650734
Claude finished @luccabb's task in 1m 50s

PR Review: AI Sentinel Agent
This is a well-structured addition. The plugin-based datasource architecture, three-tier safety model (LLM schema → code guards → RBAC), and the observe-by-default posture are all solid design choices. Below are findings ranging from critical to minor.

Critical: Race condition on