
AI sentinel agent#98

Open
luccabb wants to merge 4 commits into main from export-D96650734

Conversation


@luccabb luccabb commented Mar 15, 2026

Summary:
gcm-sentinel: AI-powered GPU cluster investigation agent


3 lines to release:

```
# 1. Deploy GCM health checks
$ helm install gcm oci://ghcr.io/facebookresearch/charts/gcm --set monitoring.enabled=false

# 2. Deploy Sentinel (AI investigation agent)
$ kubectl create secret generic gcm-sentinel-llm --from-literal=api-key=YOUR_KEY -n monitoring
$ helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel --set llm.existingSecret=gcm-sentinel-llm
```

Summary

Adds gcm-sentinel, a new component to the GCM ecosystem that uses an LLM (Claude or GPT) to investigate GPU hardware failures detected by GCM Health Checks. When a node condition changes (e.g. GcmXidErrorsProblem), the agent queries Prometheus metrics, Kubernetes state, pod logs, and GPU exporter data, then produces a severity assessment, root cause analysis, and recommended remediation action.
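As a rough sketch of the assessment the agent produces (field names here are invented for illustration, not the actual gcm-sentinel schema):

```python
# Illustrative shape of an investigation result: severity, root cause,
# and a recommended action. Names are assumptions, not the real schema.
from dataclasses import dataclass


@dataclass
class Assessment:
    severity: str        # e.g. "critical", "warning", "info"
    root_cause: str      # the LLM's root cause analysis
    recommendation: str  # recommended remediation action


def summarize(a: Assessment) -> str:
    """One-line summary suitable for a Kubernetes event message."""
    return f"[{a.severity}] {a.root_cause} -> {a.recommendation}"
```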

Key design decisions

  • Observe-only by default (actionMode=recommend). Remediation tools are absent from the LLM's schema unless explicitly set to execute mode. Safety is enforced at the code level, not the prompt level.
  • Plugin-based data sources. Each data source (Prometheus, DCGM direct, node-exporter, K8s core, workloads, GCM health, Alertmanager) is a self-contained Python class. Adding a new one is one file plus one line of registration.
  • Multi-LLM support. Backend abstraction supports Anthropic and OpenAI APIs. Default: Claude Sonnet 4.6.
  • Separate Helm chart at charts/gcm-sentinel/ (alongside existing charts/gcm/). Separate PyPI package (gcm-sentinel). Independent deployment, RBAC, and lifecycle from the core GCM DaemonSets.
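As a hedged sketch of what the "one file plus one line of registration" plugin pattern might look like (class and function names below are invented for illustration, not the actual gcm-sentinel API), a decorator-based registry keeps registration to a single line per source:

```python
# Sketch of a plugin-based datasource registry. All names here are
# assumptions for illustration, not the real gcm-sentinel code.
from abc import ABC, abstractmethod

REGISTRY: dict = {}


def register(name: str):
    """Class decorator: the 'one line of registration' for a new source."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


class DataSource(ABC):
    @abstractmethod
    def collect(self, node: str) -> dict:
        """Return data about `node` for the LLM's investigation context."""


@register("node_labels")
class NodeLabelSource(DataSource):
    def collect(self, node: str) -> dict:
        # A real source would query Prometheus, DCGM, the K8s API, etc.
        return {"node": node, "labels": {}}


def build_datasources() -> list:
    """Instantiate every registered source."""
    return [cls() for cls in REGISTRY.values()]
```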

Deploy to a cluster (observe-only):

```
# Create API key secret
kubectl create secret generic gcm-sentinel-llm \
  --namespace monitoring \
  --from-literal=api-key=YOUR_KEY

# Deploy — observe-only, single node
helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel \
  --namespace monitoring \
  --set llm.existingSecret=gcm-sentinel-llm \
  --set sentinel.nodeAllowlist="YOUR_GPU_NODE_NAME"

# Verify
kubectl logs -n monitoring -l app=gcm-sentinel | head -10
kubectl get events -A --field-selector reason=GCMSentinel --watch
```

Differential Revision: D96650734
@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.


meta-codesync Bot commented Mar 15, 2026

@luccabb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96650734.

Code scanning / CodeQL check warning (Medium): Unpinned tag for a non-immutable Action in workflow

In the "Release GCM Monitoring and Health Checks on version bump" workflow, the "Build and push Sentinel image" step uses 'docker/build-push-action' with ref 'v6', not a pinned commit hash:

```
- name: Build and push Sentinel image
  uses: docker/build-push-action@v6
```

Code scanning / CodeQL check failure (High): Incomplete URL substring sanitization

The string node.kubernetes.io/ may be at an arbitrary position in the sanitized URL:

```
"labels": dict(node.metadata.labels or {}),
"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
```
@luccabb changed the title from "ai triage" to "ai sentinel" Mar 15, 2026
@luccabb changed the title from "ai sentinel" to "AI sentinel" Mar 15, 2026
@luccabb changed the title from "AI sentinel" to "AI sentinel helm" Mar 15, 2026
@meta-codesync Bot changed the title from "AI sentinel helm" to "ai triage" Mar 15, 2026
@luccabb changed the title from "ai triage" to "AI sentinel agent" Mar 15, 2026

meta-codesync Bot commented Mar 15, 2026

@luccabb has imported this pull request. If you are a Meta employee, you can view this in D96650734.

luccabb commented Mar 16, 2026

@claude

claude Bot commented Mar 16, 2026

Claude finished @luccabb's task in 1m 50s


PR Review: AI Sentinel Agent

  • Read repo context and changed files
  • Analyze core logic (engine, watcher, datasources, backends)
  • Review security findings from automated scanners
  • Review Helm chart and RBAC
  • Provide comprehensive review feedback

This is a well-structured addition. The plugin-based datasource architecture, three-tier safety model (LLM schema → code guards → RBAC), and the observe-by-default posture are all solid design choices. Below are findings ranging from critical to minor.


Critical: Race condition on _allowed_node in execute mode

File: gcm-sentinel/src/gcm_sentinel/datasources/remediation.py:29 and gcm-sentinel/src/gcm_sentinel/engine.py:92-94

RemediationDataSource._allowed_node is instance state shared across concurrent investigations. If two node conditions fire simultaneously:

  1. Investigation A calls ds.set_allowed_node("node-A")
  2. Investigation B calls ds.set_allowed_node("node-B") before A completes
  3. Investigation A now has _allowed_node == "node-B" — the guard at line 75 of remediation.py would allow A to act on node-B

This is a TOCTOU race in execute mode. The fix is to pass allowed_node as a parameter to execute() rather than storing it as mutable state, or create a per-investigation RemediationDataSource instance.
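A minimal sketch of the parameter-passing fix (signatures here are illustrative, not the actual gcm-sentinel API):

```python
# Sketch: pass the allowed node per call instead of storing it on a
# shared datasource instance. Names are assumptions for illustration.
class RemediationDataSource:
    def execute(self, action: str, node: str, allowed_node: str) -> str:
        # The guard reads a per-investigation argument, so concurrent
        # investigations cannot overwrite each other's scope.
        if node != allowed_node:
            raise PermissionError(
                f"refusing {action!r} on {node}: "
                f"investigation is scoped to {allowed_node}"
            )
        return f"{action} ok on {node}"
```

Since the allowed node travels with the call, there is no window in which another investigation can change it.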



High: Shared registry, concurrent set_allowed_node not isolated

Directly related to the above — engine.py:91-94 mutates the shared self._registry in NodeConditionWatcher._handle_event, which is called from concurrent asyncio tasks (see watcher.py:108). Even without the race on _allowed_node, the design of mutating shared datasource state per-investigation is inherently unsafe for concurrency. Consider making the registry stateless or scoping it per-investigation.


High: Automated scanner finding — kubernetes.py:30 annotation filter

The GitHub Advanced Security finding flags node.kubernetes.io/ at line 30:

"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
},

This is likely a false positive: startswith always checks from position 0. However, note that this leaks all annotations prefixed with node.kubernetes.io/ into the LLM context. If a future annotation contains sensitive data under that prefix (e.g., node topology secrets), it would be sent to the external LLM API. Consider making the annotation allowlist more restrictive or explicit.
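One possible shape for an explicit allowlist (the specific keys below are placeholders, not a vetted list):

```python
# Sketch: match annotation keys exactly against an explicit allowlist
# instead of prefix matching. The keys here are illustrative only.
ALLOWED_ANNOTATIONS = {
    "gcm-sentinel/last-investigation",
    "node.kubernetes.io/instance-type",
}


def filter_annotations(annotations) -> dict:
    """Keep only exactly-allowlisted annotation keys."""
    return {
        k: v for k, v in (annotations or {}).items()
        if k in ALLOWED_ANNOTATIONS
    }
```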


Medium: No container security context in Helm chart

File: charts/gcm-sentinel/templates/deployment.yaml

The deployment has no securityContext. For a long-running cluster agent, best practice is:

securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]



Medium: No liveness/readiness probes in deployment

File: charts/gcm-sentinel/templates/deployment.yaml

The container is a long-running watcher with no health checks. If the watch loop hangs silently (e.g., on a deadlocked investigation), Kubernetes has no way to detect and restart it. Consider adding a liveness probe based on a heartbeat file or HTTP endpoint.
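A heartbeat-file approach could look like the following sketch (path and freshness threshold are illustrative assumptions): the watch loop touches a file on each iteration, and a Kubernetes exec liveness probe checks that the file is fresh.

```python
# Sketch of a heartbeat-file liveness mechanism. The path and the
# freshness threshold are illustrative, not real gcm-sentinel values.
import os
import time

HEARTBEAT_PATH = "/tmp/gcm-sentinel-heartbeat"


def beat(path: str = HEARTBEAT_PATH) -> None:
    """Called from the watch loop each iteration to prove liveness."""
    with open(path, "w") as f:
        f.write(str(time.time()))


def is_alive(path: str = HEARTBEAT_PATH, max_age_s: float = 120.0) -> bool:
    """True if the heartbeat file was touched within max_age_s seconds."""
    try:
        return time.time() - os.path.getmtime(path) < max_age_s
    except OSError:
        return False  # file missing: loop never ran or the node restarted
```

The liveness probe would then exec a check of the file's mtime (or a tiny script calling the equivalent of `is_alive`).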


Medium: Unpinned GitHub Actions tag (scanner finding)

File: .github/workflows/release.yml:350

docker/build-push-action@v6 should be pinned to a commit SHA for supply chain security. This is a legitimate concern:

# Instead of:
uses: docker/build-push-action@v6
# Pin to a full commit SHA, e.g.:
uses: docker/build-push-action@<full-commit-sha>  # v6

Note: You mentioned you cannot modify .github/workflows/ files — this is flagged for awareness.


Medium: drain_node is fire-and-forget

File: gcm-sentinel/src/gcm_sentinel/tools/kubernetes.py:222-245

The drain implementation evicts pods but doesn't wait for them to terminate. In execute mode, the LLM receives "Drained node X: evicted N pods" immediately, but pods may still be running. This could cause confusion in the investigation log and mislead the LLM into thinking the drain completed successfully.
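A polling wait could close that gap, sketched here with a stand-in for the real Kubernetes client call (`list_pods` and the timeouts are illustrative assumptions):

```python
# Sketch: after issuing evictions, poll until the evicted pods are gone
# (or a timeout elapses) before reporting the drain as complete.
import time
from typing import Callable, List


def wait_for_drain(
    list_pods: Callable[[], List[str]],  # stand-in for a real K8s list call
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not list_pods():
            return True  # all evicted pods have terminated
        time.sleep(poll_s)
    return False  # timed out; report the drain as incomplete to the LLM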


Medium: emit_k8s_event hardcodes default namespace

File: gcm-sentinel/src/gcm_sentinel/notify.py:36

get_core_api().create_namespaced_event("default", ...)

Events for node issues are written to the default namespace, but the sentinel runs in monitoring. Most cluster operators look at events in the same namespace as the relevant workload. Consider making this configurable or using the sentinel's own namespace.
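One way to make this configurable, sketched with illustrative names (falling back to the pod's own namespace via the serviceaccount mount):

```python
# Sketch: resolve the event namespace from explicit config, then the
# pod's own namespace file, then "default". Names are illustrative.
def event_namespace(
    configured=None,
    sa_ns_file="/var/run/secrets/kubernetes.io/serviceaccount/namespace",
) -> str:
    if configured:
        return configured
    try:
        with open(sa_ns_file) as f:
            return f.read().strip()
    except OSError:
        return "default"
```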


Low: Duplicate docstring line in remediation.py

File: gcm-sentinel/src/gcm_sentinel/datasources/remediation.py:9-10

Lines 9 and 10 are identical:

- Node name validation (can only act on the node being investigated)
- Node name validation (can only act on the node being investigated)

Low: Empty API key starts silently, fails on first use

File: gcm-sentinel/src/gcm_sentinel/config.py:39

api_key defaults to "". The agent starts successfully, logs its banner, and only fails when the first investigation fires an API call. Consider adding a startup validation that warns (or fails) if api_key is empty and llm_api is not a local endpoint.
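A fail-fast check at startup could look like this sketch (field names and the local-endpoint heuristic are assumptions):

```python
# Sketch of fail-fast startup validation for the LLM API key.
# Field names and the locality heuristic are illustrative only.
def validate_config(api_key: str, llm_api: str) -> None:
    is_local = llm_api.startswith(("http://localhost", "http://127."))
    if not api_key and not is_local:
        raise ValueError(
            "llm api_key is empty and the endpoint is not local; "
            "set llm.existingSecret or point llm_api at a local endpoint"
        )
```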


Low: Lazy _registry initialization not thread-safe

File: gcm-sentinel/src/gcm_sentinel/watcher.py:76-79

if self._registry is None:
    self._registry = build_datasources(self.cfg)

This check-then-set is not atomic in asyncio. If two events arrive before the first investigation completes initialization, build_datasources could be called twice. While the result is likely harmless (second assignment overwrites the first), it's cleaner to initialize in __init__ or watch_loop before the first event.
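The eager-initialization variant is a one-liner, sketched here with a stand-in factory (names are illustrative):

```python
# Sketch: build the registry eagerly in __init__ so there is no
# check-then-set race. `build_datasources` is a stand-in factory.
def build_datasources(cfg: dict) -> list:
    return []  # placeholder for the real datasource factory


class NodeConditionWatcher:
    def __init__(self, cfg: dict):
        self.cfg = cfg
        # Eager init: no race between concurrent event handlers, and
        # construction failures surface at startup, not on first event.
        self._registry = build_datasources(cfg)
```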


Positive Observations

  • Three-layer safety model (LLM schema, code guards, RBAC) is excellent design. The fact that remediation tools are literally absent from the LLM's tool schema in non-execute modes is far more robust than prompt-level instructions.
  • _parse_assessment uses a bracket-scanning fallback rather than naive json.loads — handles LLM markdown fences cleanly.
  • Rate-limiter using a sliding deque window is correct and efficient.
  • fnmatch allowlist for node names supports glob patterns like gpu-node-0* — a nice usability touch.
  • Test coverage of the critical safety paths (mode gating, rate limiter, cooldown, allowlist) is good.
  • Helm chart RBAC is correctly conditioned on actionMode — the ClusterRole grants patch and eviction only when needed.
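For reference, a sliding-window rate limiter in the spirit of the deque-based implementation praised above might look like this (parameters and method names are illustrative, not the actual gcm-sentinel code):

```python
# Sketch of a sliding-window rate limiter backed by a deque of
# timestamps. Names and defaults are illustrative assumptions.
import time
from collections import deque


class RateLimiter:
    def __init__(self, max_events: int, window_s: float):
        self.max_events = max_events
        self.window_s = window_s
        self._times = deque()

    def allow(self, now=None) -> bool:
        """Record and allow an event unless the window is saturated."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) < self.max_events:
            self._times.append(now)
            return True
        return False
```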
