Analysis fails to persist when LLM omits proposal.estimatedImpact — schema/CRD contract drift

## Summary

A successful analysis run can fail to persist because the JSON schema sent to the agent for structured output does not require `proposal.estimatedImpact`, but the `AnalysisResult` CRD does. When the LLM omits the field (non-deterministic), the operator's status patch is rejected by CRD validation, the analysis is marked `Failed`, and the sandbox pod is orphaned (never torn down).

## Impact

- Valid analyses are lost — `AnalysisResult` persists with **0 options** despite the agent producing a correct option.
- The proposal is marked `Analyzed=False (Failed)` even though the LLM call succeeded.
- The analysis sandbox pod stays `Running` indefinitely (observed 3h+), since the reconciler never reaches completion.
- Non-deterministic: runs only fail when the LLM happens to omit `estimatedImpact`, so it passes intermittently and is easy to miss.

## Root cause

Commit `710771d` ("Address everettraven PR review") changed `ProposalResult.EstimatedImpact` in the CRD from `+optional` to **`+required`** (`MinLength=1`), but the corresponding LLM output schema in `controller/proposal/schemas.go` was not updated. The `proposal.required` array is still:

```json
["description", "actions", "risk", "reversible"]
```

`estimatedImpact` is missing, so the LLM is free to omit it. (Note the lists have also drifted: the schema requires `reversible`, which the CRD treats as `+optional`.)

## Observed failure (live cluster)

```
Proposal validate-acm-policy: Analyzed=False (Failed)
  create analysis result: patch AnalysisResult validate-acm-policy-analysis-1 status:
  AnalysisResult.agentic.openshift.io "validate-acm-policy-analysis-1" is invalid:
  [status.options[0].proposal.estimatedImpact: Required value, <nil>: Invalid value: null:
   some validation rules were not checked because the object was invalid]

AnalysisResult validate-acm-policy-analysis-1: options=0, Completed=True (Failed)
Pod ls-analysis-validate-acm-policy: 1/1 Running (3h17m, orphaned)
```

Agent log confirms the LLM call itself succeeded (`success=True, cost=$0.3242`) and returned a well-formed option — only the persist step failed.

Reproduces deterministically in **Default** analysis mode (`spec.analysisOutput` unset), which forces a full `proposal` object.

## Related latent drift (same class, currently dormant)

The `verification` branch of the schema has the same CRD-vs-schema gap and will fail identically once exercised:

- CRD `VerificationStep` requires `name` + `type`; schema `verification.steps[]` items have **no** `required` list.
- CRD `VerificationPlan` requires `description`; schema `verification` object has **no** `required` list.

These don't fire today only because the affected proposals have no verification step.

## Proposed fix

The schema sent to the LLM and the `AnalysisResult` CRD currently disagree on whether `proposal.estimatedImpact` is required. Either side can be moved to make them consistent — two viable approaches:

### Option A — Tighten the LLM schema (make it required everywhere)

In `controller/proposal/schemas.go`, add `estimatedImpact` to `proposal.required` so the LLM is contractually obligated to produce it, matching the CRD.

- **Pros:** Keeps `estimatedImpact` as a guaranteed, always-present signal for approval decisions (alongside `risk` and `reversible`); no CRD change, so no migration/regeneration concern for existing stored objects; honors the intent of `710771d`.
- **Cons:** Forces the field even for analyses where impact is genuinely not meaningful (e.g. pure validation/advisory proposals like `validate-acm-policy`), where the LLM must emit filler such as "No remediation; validation only."

### Option B — Relax the CRD (make it optional again)

Revert `ProposalResult.EstimatedImpact` to `+optional` (drop `MinLength=1`), reverting that part of `710771d`.

- **Pros:** Matches the reality that not every proposal has a meaningful remediation impact (validation/advisory flows); `omitempty` already handled the "unknown impact" case the original comment described.
- **Cons:** `estimatedImpact` becomes a best-effort field consumers can't rely on for approval UX; requires CRD regeneration (`make manifests`) and a redeploy.

### Also fix the dormant drift (independent of A/B)

Regardless of which approach is chosen for `estimatedImpact`, align the `verification` branch in `schemas.go` with the CRD to prevent the next identical incident:

- Add `required: ["name", "type"]` to `verification.steps[]` items.
- Add `required: ["description"]` to the `verification` object.

Consider a guard test asserting every CRD-required field appears in the corresponding schema `required` list, to catch this class of drift automatically.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Analysis fails to persist when LLM omits proposal.estimatedImpact — schema/CRD contract drift #162

Summary

Impact

Root cause

Observed failure (live cluster)

Related latent drift (same class, currently dormant)

Proposed fix

Option A — Tighten the LLM schema (make it required everywhere)

Option B — Relax the CRD (make it optional again)

Also fix the dormant drift (independent of A/B)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Analysis fails to persist when LLM omits proposal.estimatedImpact — schema/CRD contract drift #162

Description

Summary

Impact

Root cause

Observed failure (live cluster)

Related latent drift (same class, currently dormant)

Proposed fix

Option A — Tighten the LLM schema (make it required everywhere)

Option B — Relax the CRD (make it optional again)

Also fix the dormant drift (independent of A/B)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions