Skip to content

Analysis fails to persist when LLM omits proposal.estimatedImpact — schema/CRD contract drift #162

Description

@harche

Summary

A successful analysis run can fail to persist because the JSON schema sent to the agent for structured output does not require proposal.estimatedImpact, but the AnalysisResult CRD does. When the LLM omits the field (non-deterministic), the operator's status patch is rejected by CRD validation, the analysis is marked Failed, and the sandbox pod is orphaned (never torn down).

Impact

  • Valid analyses are lost — AnalysisResult persists with 0 options despite the agent producing a correct option.
  • The proposal is marked Analyzed=False (Failed) even though the LLM call succeeded.
  • The analysis sandbox pod stays Running indefinitely (observed 3h+), since the reconciler never reaches completion.
  • Non-deterministic: runs only fail when the LLM happens to omit estimatedImpact, so it passes intermittently and is easy to miss.

Root cause

Commit 710771d ("Address everettraven PR review") changed ProposalResult.EstimatedImpact in the CRD from +optional to +required (MinLength=1), but the corresponding LLM output schema in controller/proposal/schemas.go was not updated. The proposal.required array is still:

["description", "actions", "risk", "reversible"]

estimatedImpact is missing, so the LLM is free to omit it. (Note the lists have also drifted: the schema requires reversible, which the CRD treats as +optional.)

Observed failure (live cluster)

Proposal validate-acm-policy: Analyzed=False (Failed)
  create analysis result: patch AnalysisResult validate-acm-policy-analysis-1 status:
  AnalysisResult.agentic.openshift.io "validate-acm-policy-analysis-1" is invalid:
  [status.options[0].proposal.estimatedImpact: Required value, <nil>: Invalid value: null:
   some validation rules were not checked because the object was invalid]

AnalysisResult validate-acm-policy-analysis-1: options=0, Completed=True (Failed)
Pod ls-analysis-validate-acm-policy: 1/1 Running (3h17m, orphaned)

Agent log confirms the LLM call itself succeeded (success=True, cost=$0.3242) and returned a well-formed option — only the persist step failed.

Reproduces deterministically in Default analysis mode (spec.analysisOutput unset), which forces a full proposal object.

Related latent drift (same class, currently dormant)

The verification branch of the schema has the same CRD-vs-schema gap and will fail identically once exercised:

  • CRD VerificationStep requires name + type; schema verification.steps[] items have no required list.
  • CRD VerificationPlan requires description; schema verification object has no required list.

These don't fire today only because the affected proposals have no verification step.

Proposed fix

The schema sent to the LLM and the AnalysisResult CRD currently disagree on whether proposal.estimatedImpact is required. Either side can be moved to make them consistent — two viable approaches:

Option A — Tighten the LLM schema (make it required everywhere)

In controller/proposal/schemas.go, add estimatedImpact to proposal.required so the LLM is contractually obligated to produce it, matching the CRD.

  • Pros: Keeps estimatedImpact as a guaranteed, always-present signal for approval decisions (alongside risk and reversible); no CRD change, so no migration/regeneration concern for existing stored objects; honors the intent of 710771d.
  • Cons: Forces the field even for analyses where impact is genuinely not meaningful (e.g. pure validation/advisory proposals like validate-acm-policy), where the LLM must emit filler such as "No remediation; validation only."

Option B — Relax the CRD (make it optional again)

Revert ProposalResult.EstimatedImpact to +optional (drop MinLength=1), reverting that part of 710771d.

  • Pros: Matches the reality that not every proposal has a meaningful remediation impact (validation/advisory flows); omitempty already handled the "unknown impact" case the original comment described.
  • Cons: estimatedImpact becomes a best-effort field consumers can't rely on for approval UX; requires CRD regeneration (make manifests) and a redeploy.

Also fix the dormant drift (independent of A/B)

Regardless of which approach is chosen for estimatedImpact, align the verification branch in schemas.go with the CRD to prevent the next identical incident:

  • Add required: ["name", "type"] to verification.steps[] items.
  • Add required: ["description"] to the verification object.

Consider a guard test asserting every CRD-required field appears in the corresponding schema required list, to catch this class of drift automatically.

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugCategorizes issue or PR as related to a bug.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions