Summary
A successful analysis run can fail to persist because the JSON schema sent to the agent for structured output does not require proposal.estimatedImpact, but the AnalysisResult CRD does. When the LLM omits the field (non-deterministic), the operator's status patch is rejected by CRD validation, the analysis is marked Failed, and the sandbox pod is orphaned (never torn down).
Impact
- Valid analyses are lost —
AnalysisResult persists with 0 options despite the agent producing a correct option.
- The proposal is marked
Analyzed=False (Failed) even though the LLM call succeeded.
- The analysis sandbox pod stays
Running indefinitely (observed 3h+), since the reconciler never reaches completion.
- Non-deterministic: runs only fail when the LLM happens to omit
estimatedImpact, so it passes intermittently and is easy to miss.
Root cause
Commit 710771d ("Address everettraven PR review") changed ProposalResult.EstimatedImpact in the CRD from +optional to +required (MinLength=1), but the corresponding LLM output schema in controller/proposal/schemas.go was not updated. The proposal.required array is still:
["description", "actions", "risk", "reversible"]
estimatedImpact is missing, so the LLM is free to omit it. (Note the lists have also drifted: the schema requires reversible, which the CRD treats as +optional.)
Observed failure (live cluster)
Proposal validate-acm-policy: Analyzed=False (Failed)
create analysis result: patch AnalysisResult validate-acm-policy-analysis-1 status:
AnalysisResult.agentic.openshift.io "validate-acm-policy-analysis-1" is invalid:
[status.options[0].proposal.estimatedImpact: Required value, <nil>: Invalid value: null:
some validation rules were not checked because the object was invalid]
AnalysisResult validate-acm-policy-analysis-1: options=0, Completed=True (Failed)
Pod ls-analysis-validate-acm-policy: 1/1 Running (3h17m, orphaned)
Agent log confirms the LLM call itself succeeded (success=True, cost=$0.3242) and returned a well-formed option — only the persist step failed.
Reproduces deterministically in Default analysis mode (spec.analysisOutput unset), which forces a full proposal object.
Related latent drift (same class, currently dormant)
The verification branch of the schema has the same CRD-vs-schema gap and will fail identically once exercised:
- CRD
VerificationStep requires name + type; schema verification.steps[] items have no required list.
- CRD
VerificationPlan requires description; schema verification object has no required list.
These don't fire today only because the affected proposals have no verification step.
Proposed fix
The schema sent to the LLM and the AnalysisResult CRD currently disagree on whether proposal.estimatedImpact is required. Either side can be moved to make them consistent — two viable approaches:
Option A — Tighten the LLM schema (make it required everywhere)
In controller/proposal/schemas.go, add estimatedImpact to proposal.required so the LLM is contractually obligated to produce it, matching the CRD.
- Pros: Keeps
estimatedImpact as a guaranteed, always-present signal for approval decisions (alongside risk and reversible); no CRD change, so no migration/regeneration concern for existing stored objects; honors the intent of 710771d.
- Cons: Forces the field even for analyses where impact is genuinely not meaningful (e.g. pure validation/advisory proposals like
validate-acm-policy), where the LLM must emit filler such as "No remediation; validation only."
Option B — Relax the CRD (make it optional again)
Revert ProposalResult.EstimatedImpact to +optional (drop MinLength=1), reverting that part of 710771d.
- Pros: Matches the reality that not every proposal has a meaningful remediation impact (validation/advisory flows);
omitempty already handled the "unknown impact" case the original comment described.
- Cons:
estimatedImpact becomes a best-effort field consumers can't rely on for approval UX; requires CRD regeneration (make manifests) and a redeploy.
Also fix the dormant drift (independent of A/B)
Regardless of which approach is chosen for estimatedImpact, align the verification branch in schemas.go with the CRD to prevent the next identical incident:
- Add
required: ["name", "type"] to verification.steps[] items.
- Add
required: ["description"] to the verification object.
Consider a guard test asserting every CRD-required field appears in the corresponding schema required list, to catch this class of drift automatically.
Summary
A successful analysis run can fail to persist because the JSON schema sent to the agent for structured output does not require
proposal.estimatedImpact, but theAnalysisResultCRD does. When the LLM omits the field (non-deterministic), the operator's status patch is rejected by CRD validation, the analysis is markedFailed, and the sandbox pod is orphaned (never torn down).Impact
AnalysisResultpersists with 0 options despite the agent producing a correct option.Analyzed=False (Failed)even though the LLM call succeeded.Runningindefinitely (observed 3h+), since the reconciler never reaches completion.estimatedImpact, so it passes intermittently and is easy to miss.Root cause
Commit
710771d("Address everettraven PR review") changedProposalResult.EstimatedImpactin the CRD from+optionalto+required(MinLength=1), but the corresponding LLM output schema incontroller/proposal/schemas.gowas not updated. Theproposal.requiredarray is still:estimatedImpactis missing, so the LLM is free to omit it. (Note the lists have also drifted: the schema requiresreversible, which the CRD treats as+optional.)Observed failure (live cluster)
Agent log confirms the LLM call itself succeeded (
success=True, cost=$0.3242) and returned a well-formed option — only the persist step failed.Reproduces deterministically in Default analysis mode (
spec.analysisOutputunset), which forces a fullproposalobject.Related latent drift (same class, currently dormant)
The
verificationbranch of the schema has the same CRD-vs-schema gap and will fail identically once exercised:VerificationSteprequiresname+type; schemaverification.steps[]items have norequiredlist.VerificationPlanrequiresdescription; schemaverificationobject has norequiredlist.These don't fire today only because the affected proposals have no verification step.
Proposed fix
The schema sent to the LLM and the
AnalysisResultCRD currently disagree on whetherproposal.estimatedImpactis required. Either side can be moved to make them consistent — two viable approaches:Option A — Tighten the LLM schema (make it required everywhere)
In
controller/proposal/schemas.go, addestimatedImpacttoproposal.requiredso the LLM is contractually obligated to produce it, matching the CRD.estimatedImpactas a guaranteed, always-present signal for approval decisions (alongsideriskandreversible); no CRD change, so no migration/regeneration concern for existing stored objects; honors the intent of710771d.validate-acm-policy), where the LLM must emit filler such as "No remediation; validation only."Option B — Relax the CRD (make it optional again)
Revert
ProposalResult.EstimatedImpactto+optional(dropMinLength=1), reverting that part of710771d.omitemptyalready handled the "unknown impact" case the original comment described.estimatedImpactbecomes a best-effort field consumers can't rely on for approval UX; requires CRD regeneration (make manifests) and a redeploy.Also fix the dormant drift (independent of A/B)
Regardless of which approach is chosen for
estimatedImpact, align theverificationbranch inschemas.gowith the CRD to prevent the next identical incident:required: ["name", "type"]toverification.steps[]items.required: ["description"]to theverificationobject.Consider a guard test asserting every CRD-required field appears in the corresponding schema
requiredlist, to catch this class of drift automatically.