
AI sentinel agent#98

Open
luccabb wants to merge 4 commits into main from export-D96650734

Conversation


@luccabb luccabb commented Mar 15, 2026

Summary:
gcm-sentinel: AI-powered GPU cluster investigation agent


3 lines to release:

```
# 1. Deploy GCM health checks
$ helm install gcm oci://ghcr.io/facebookresearch/charts/gcm --set monitoring.enabled=false

# 2. Deploy Sentinel (AI investigation agent)
$ kubectl create secret generic gcm-sentinel-llm --from-literal=api-key=YOUR_KEY -n monitoring
$ helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel --set llm.existingSecret=gcm-sentinel-llm
```

Summary

Adds gcm-sentinel, a new component to the GCM ecosystem that uses an LLM (Claude or GPT) to investigate GPU hardware failures detected by GCM Health Checks. When a node condition changes (e.g. GcmXidErrorsProblem), the agent queries Prometheus metrics, Kubernetes state, pod logs, and GPU exporter data, then produces a severity assessment, root cause analysis, and recommended remediation action.
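As a rough sketch of the assessment the agent produces (field names here are invented for illustration, not the actual gcm-sentinel schema):

```python
# Illustrative shape of an investigation result: severity, root cause,
# and a recommended action. Names are assumptions, not the real schema.
from dataclasses import dataclass


@dataclass
class Assessment:
    severity: str        # e.g. "critical", "warning", "info"
    root_cause: str      # the LLM's root cause analysis
    recommendation: str  # recommended remediation action


def summarize(a: Assessment) -> str:
    """One-line summary suitable for a Kubernetes event message."""
    return f"[{a.severity}] {a.root_cause} -> {a.recommendation}"
```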

Key design decisions

  • Observe-only by default (actionMode=recommend). Remediation tools are absent from the LLM's schema unless explicitly set to execute mode. Safety is enforced at the code level, not the prompt level.
  • Plugin-based data sources. Each data source (Prometheus, DCGM direct, node-exporter, K8s core, workloads, GCM health, Alertmanager) is a self-contained Python class. Adding a new one is one file plus one line of registration.
  • Multi-LLM support. Backend abstraction supports Anthropic and OpenAI APIs. Default: Claude Sonnet 4.6.
  • Separate Helm chart at charts/gcm-sentinel/ (alongside existing charts/gcm/). Separate PyPI package (gcm-sentinel). Independent deployment, RBAC, and lifecycle from the core GCM DaemonSets.
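As a hedged sketch of what the "one file plus one line of registration" plugin pattern might look like (class and function names below are invented for illustration, not the actual gcm-sentinel API), a decorator-based registry keeps registration to a single line per source:

```python
# Sketch of a plugin-based datasource registry. All names here are
# assumptions for illustration, not the real gcm-sentinel code.
from abc import ABC, abstractmethod

REGISTRY: dict = {}


def register(name: str):
    """Class decorator: the 'one line of registration' for a new source."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


class DataSource(ABC):
    @abstractmethod
    def collect(self, node: str) -> dict:
        """Return data about `node` for the LLM's investigation context."""


@register("node_labels")
class NodeLabelSource(DataSource):
    def collect(self, node: str) -> dict:
        # A real source would query Prometheus, DCGM, the K8s API, etc.
        return {"node": node, "labels": {}}


def build_datasources() -> list:
    """Instantiate every registered source."""
    return [cls() for cls in REGISTRY.values()]
```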

Deploy to a cluster (observe-only):

```
# Create API key secret
kubectl create secret generic gcm-sentinel-llm \
  --namespace monitoring \
  --from-literal=api-key=YOUR_KEY

# Deploy — observe-only, single node
helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel \
  --namespace monitoring \
  --set llm.existingSecret=gcm-sentinel-llm \
  --set sentinel.nodeAllowlist="YOUR_GPU_NODE_NAME"

# Verify
kubectl logs -n monitoring -l app=gcm-sentinel | head -10
kubectl get events -A --field-selector reason=GCMSentinel --watch
```

Differential Revision: D96650734
@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.


meta-codesync Bot commented Mar 15, 2026

@luccabb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96650734.

Code scanning / CodeQL check warning (Medium): Unpinned tag for a non-immutable Action in workflow

In the "Release GCM Monitoring and Health Checks on version bump" workflow, the "Build and push Sentinel image" step uses 'docker/build-push-action' with ref 'v6', not a pinned commit hash:

```
- name: Build and push Sentinel image
  uses: docker/build-push-action@v6
```

Code scanning / CodeQL check failure (High): Incomplete URL substring sanitization

The string node.kubernetes.io/ may be at an arbitrary position in the sanitized URL:

```
"labels": dict(node.metadata.labels or {}),
"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
```
@luccabb changed the title from "ai triage" to "ai sentinel" Mar 15, 2026
@luccabb changed the title from "ai sentinel" to "AI sentinel" Mar 15, 2026
@luccabb changed the title from "AI sentinel" to "AI sentinel helm" Mar 15, 2026
@meta-codesync Bot changed the title from "AI sentinel helm" to "ai triage" Mar 15, 2026
@luccabb changed the title from "ai triage" to "AI sentinel agent" Mar 15, 2026

meta-codesync Bot commented Mar 15, 2026

@luccabb has imported this pull request. If you are a Meta employee, you can view this in D96650734.

luccabb commented Mar 16, 2026

@claude

claude Bot commented Mar 16, 2026

Claude finished @luccabb's task in 1m 50s


PR Review: AI Sentinel Agent

  • Read repo context and changed files
  • Analyze core logic (engine, watcher, datasources, backends)
  • Review security findings from automated scanners
  • Review Helm chart and RBAC
  • Provide comprehensive review feedback

This is a well-structured addition. The plugin-based datasource architecture, three-tier safety model (LLM schema → code guards → RBAC), and the observe-by-default posture are all solid design choices. Below are findings ranging from critical to minor.


Critical: Race condition on _allowed_node in execute mode

File: gcm-sentinel/src/gcm_sentinel/datasources/remediation.py:29 and gcm-sentinel/src/gcm_sentinel/engine.py:92-94

RemediationDataSource._allowed_node is instance state shared across concurrent investigations. If two node conditions fire simultaneously:

  1. Investigation A calls ds.set_allowed_node("node-A")
  2. Investigation B calls ds.set_allowed_node("node-B") before A completes
  3. Investigation A now has _allowed_node == "node-B" — the guard at line 75 of remediation.py would allow A to act on node-B

This is a TOCTOU race in execute mode. The fix is to pass allowed_node as a parameter to execute() rather than storing it as mutable state, or create a per-investigation RemediationDataSource instance.
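A minimal sketch of the parameter-passing fix (signatures here are illustrative, not the actual gcm-sentinel API):

```python
# Sketch: pass the allowed node per call instead of storing it on a
# shared datasource instance. Names are assumptions for illustration.
class RemediationDataSource:
    def execute(self, action: str, node: str, allowed_node: str) -> str:
        # The guard reads a per-investigation argument, so concurrent
        # investigations cannot overwrite each other's scope.
        if node != allowed_node:
            raise PermissionError(
                f"refusing {action!r} on {node}: "
                f"investigation is scoped to {allowed_node}"
            )
        return f"{action} ok on {node}"
```

Since the allowed node travels with the call, there is no window in which another investigation can change it.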



High: Shared registry, concurrent set_allowed_node not isolated

Directly related to the above — engine.py:91-94 mutates the shared self._registry in NodeConditionWatcher._handle_event, which is called from concurrent asyncio tasks (see watcher.py:108). Even without the race on _allowed_node, the design of mutating shared datasource state per-investigation is inherently unsafe for concurrency. Consider making the registry stateless or scoping it per-investigation.


High: Automated scanner finding — kubernetes.py:30 annotation filter

The GitHub Advanced Security finding flags node.kubernetes.io/ at line 30:

"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
},

This is likely a false positive: startswith always checks from position 0. However, note that this leaks all annotations prefixed with node.kubernetes.io/ into the LLM context. If a future annotation contains sensitive data under that prefix (e.g., node topology secrets), it would be sent to the external LLM API. Consider making the annotation allowlist more restrictive or explicit.
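One possible shape for an explicit allowlist (the specific keys below are placeholders, not a vetted list):

```python
# Sketch: match annotation keys exactly against an explicit allowlist
# instead of prefix matching. The keys here are illustrative only.
ALLOWED_ANNOTATIONS = {
    "gcm-sentinel/last-investigation",
    "node.kubernetes.io/instance-type",
}


def filter_annotations(annotations) -> dict:
    """Keep only exactly-allowlisted annotation keys."""
    return {
        k: v for k, v in (annotations or {}).items()
        if k in ALLOWED_ANNOTATIONS
    }
```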


Medium: No container security context in Helm chart

File: charts/gcm-sentinel/templates/deployment.yaml

The deployment has no securityContext. For a long-running cluster agent, best practice is:

securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]



Medium: No liveness/readiness probes in deployment

File: charts/gcm-sentinel/templates/deployment.yaml

The container is a long-running watcher with no health checks. If the watch loop hangs silently (e.g., on a deadlocked investigation), Kubernetes has no way to detect and restart it. Consider adding a liveness probe based on a heartbeat file or HTTP endpoint.
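A heartbeat-file approach could look like the following sketch (path and freshness threshold are illustrative assumptions): the watch loop touches a file on each iteration, and a Kubernetes exec liveness probe checks that the file is fresh.

```python
# Sketch of a heartbeat-file liveness mechanism. The path and the
# freshness threshold are illustrative, not real gcm-sentinel values.
import os
import time

HEARTBEAT_PATH = "/tmp/gcm-sentinel-heartbeat"


def beat(path: str = HEARTBEAT_PATH) -> None:
    """Called from the watch loop each iteration to prove liveness."""
    with open(path, "w") as f:
        f.write(str(time.time()))


def is_alive(path: str = HEARTBEAT_PATH, max_age_s: float = 120.0) -> bool:
    """True if the heartbeat file was touched within max_age_s seconds."""
    try:
        return time.time() - os.path.getmtime(path) < max_age_s
    except OSError:
        return False  # file missing: loop never ran or the node restarted
```

The liveness probe would then exec a check of the file's mtime (or a tiny script calling the equivalent of `is_alive`).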


Medium: Unpinned GitHub Actions tag (scanner finding)

File: .github/workflows/release.yml:350

docker/build-push-action@v6 should be pinned to a commit SHA for supply chain security. This is a legitimate concern:

# Instead of:
uses: docker/build-push-action@v6
# Pin to a full commit SHA, e.g.:
uses: docker/build-push-action@<full-commit-sha>  # v6

Note: You mentioned you cannot modify .github/workflows/ files — this is flagged for awareness.


Medium: drain_node is fire-and-forget

File: gcm-sentinel/src/gcm_sentinel/tools/kubernetes.py:222-245

The drain implementation evicts pods but doesn't wait for them to terminate. In execute mode, the LLM receives "Drained node X: evicted N pods" immediately, but pods may still be running. This could cause confusion in the investigation log and mislead the LLM into thinking the drain completed successfully.
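A polling wait could close that gap, sketched here with a stand-in for the real Kubernetes client call (`list_pods` and the timeouts are illustrative assumptions):

```python
# Sketch: after issuing evictions, poll until the evicted pods are gone
# (or a timeout elapses) before reporting the drain as complete.
import time
from typing import Callable, List


def wait_for_drain(
    list_pods: Callable[[], List[str]],  # stand-in for a real K8s list call
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not list_pods():
            return True  # all evicted pods have terminated
        time.sleep(poll_s)
    return False  # timed out; report the drain as incomplete to the LLM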


Medium: emit_k8s_event hardcodes default namespace

File: gcm-sentinel/src/gcm_sentinel/notify.py:36

get_core_api().create_namespaced_event("default", ...)

Events for node issues are written to the default namespace, but the sentinel runs in monitoring. Most cluster operators look at events in the same namespace as the relevant workload. Consider making this configurable or using the sentinel's own namespace.
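One way to make this configurable, sketched with illustrative names (falling back to the pod's own namespace via the serviceaccount mount):

```python
# Sketch: resolve the event namespace from explicit config, then the
# pod's own namespace file, then "default". Names are illustrative.
def event_namespace(
    configured=None,
    sa_ns_file="/var/run/secrets/kubernetes.io/serviceaccount/namespace",
) -> str:
    if configured:
        return configured
    try:
        with open(sa_ns_file) as f:
            return f.read().strip()
    except OSError:
        return "default"
```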


Low: Duplicate docstring line in remediation.py

File: gcm-sentinel/src/gcm_sentinel/datasources/remediation.py:9-10

Lines 9 and 10 are identical:

- Node name validation (can only act on the node being investigated)
- Node name validation (can only act on the node being investigated)

Low: Empty API key starts silently, fails on first use

File: gcm-sentinel/src/gcm_sentinel/config.py:39

api_key defaults to "". The agent starts successfully, logs its banner, and only fails when the first investigation fires an API call. Consider adding a startup validation that warns (or fails) if api_key is empty and llm_api is not a local endpoint.
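A fail-fast check at startup could look like this sketch (field names and the local-endpoint heuristic are assumptions):

```python
# Sketch of fail-fast startup validation for the LLM API key.
# Field names and the locality heuristic are illustrative only.
def validate_config(api_key: str, llm_api: str) -> None:
    is_local = llm_api.startswith(("http://localhost", "http://127."))
    if not api_key and not is_local:
        raise ValueError(
            "llm api_key is empty and the endpoint is not local; "
            "set llm.existingSecret or point llm_api at a local endpoint"
        )
```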


Low: Lazy _registry initialization not thread-safe

File: gcm-sentinel/src/gcm_sentinel/watcher.py:76-79

if self._registry is None:
    self._registry = build_datasources(self.cfg)

This check-then-set is not atomic in asyncio. If two events arrive before the first investigation completes initialization, build_datasources could be called twice. While the result is likely harmless (second assignment overwrites the first), it's cleaner to initialize in __init__ or watch_loop before the first event.
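The eager-initialization variant is a one-liner, sketched here with a stand-in factory (names are illustrative):

```python
# Sketch: build the registry eagerly in __init__ so there is no
# check-then-set race. `build_datasources` is a stand-in factory.
def build_datasources(cfg: dict) -> list:
    return []  # placeholder for the real datasource factory


class NodeConditionWatcher:
    def __init__(self, cfg: dict):
        self.cfg = cfg
        # Eager init: no race between concurrent event handlers, and
        # construction failures surface at startup, not on first event.
        self._registry = build_datasources(cfg)
```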


Positive Observations

  • Three-layer safety model (LLM schema, code guards, RBAC) is excellent design. The fact that remediation tools are literally absent from the LLM's tool schema in non-execute modes is far more robust than prompt-level instructions.
  • _parse_assessment uses a bracket-scanning fallback rather than naive json.loads — handles LLM markdown fences cleanly.
  • Rate-limiter using a sliding deque window is correct and efficient.
  • fnmatch allowlist for node names supports glob patterns like gpu-node-0* — a nice usability touch.
  • Test coverage of the critical safety paths (mode gating, rate limiter, cooldown, allowlist) is good.
  • Helm chart RBAC is correctly conditioned on actionMode — the ClusterRole grants patch and eviction only when needed.
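For reference, a sliding-window rate limiter in the spirit of the deque-based implementation praised above might look like this (parameters and method names are illustrative, not the actual gcm-sentinel code):

```python
# Sketch of a sliding-window rate limiter backed by a deque of
# timestamps. Names and defaults are illustrative assumptions.
import time
from collections import deque


class RateLimiter:
    def __init__(self, max_events: int, window_s: float):
        self.max_events = max_events
        self.window_s = window_s
        self._times = deque()

    def allow(self, now=None) -> bool:
        """Record and allow an event unless the window is saturated."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) < self.max_events:
            self._times.append(now)
            return True
        return False
```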
