[fm] Add disk diagnoser with typed fact tables by smklein · Pull Request #10541 · oxidecomputer/omicron

smklein · 2026-06-03T18:42:12Z

This PR provides an alternative to #10460 , using strongly-typed "enum-ish" tables for each DE's set of facts, rather than a JSON blob. It should be mutually exclusive with that PR (we either do JSON, or we don't).

The first fault management diagnosis engine: opens a case for any non-Online zpool whose backing physical disk is currently in service in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

DiagnosisEngineKind::Disk variant (Rust + DB enum)
fm_case_fact child table for per-engine state (one case has 0..N mmutable facts; stable UUIDs across sitreps; participates in copy-forward + GC like other sitrep child tables)
CaseBuilder::{add_fact, remove_fact, facts} API
InServiceDisk nexus-types projection consumed by FM, populated from the existing zpool_list_all_external_batched datastore method with policy filtering done in the background task

smklein · 2026-06-03T18:49:52Z

-                    )
-                    .execute_async(&conn)
-                    .await?;
+                    // Delete every child row by sitrep_id. Driving this off


This looks a little involved, but it's worth noting: This is only used in tests? So doing some one-time complexity to avoid a need for future updates on each new fact type seems like a nice tradeoff.

…lerate duplicate zpools - The per-kind CHECK constraint on fm_fact_physical_disk is now an implication (kind != 'zpool_unhealthy' OR columns present), so future fact kinds add their own constraint instead of rewriting this one - The fm_analysis in-service disk projection skips soft-deleted physical_disk rows, matching what its comment already claimed - A duplicate physical disk in the zpool listing now logs a warning and keeps the first zpool instead of panicking the background task

SitrepBuilder::cases is seeded from the open cases of the Input the builder was constructed from, but diagnosis::analyze() previously took the Input as a separate argument; nothing tied the two together, and an engine handed a mismatched pair would panic looking up a case that was never seeded. Store the Input on the builder and have engines read it from there, making the mismatch unrepresentable.

Facts are only read during sitrep load, which paginates over the (sitrep_id, id) primary key; nothing queries by case directly. Easy to re-add with a migration if omdb grows a case-scoped fact query.

Replace the per-disk linear scan of parent cases with a one-time inverse index. First case ID wins on (pathological) collisions, same as the scan it replaces.

smklein · 2026-06-10T21:14:32Z

+    /// The analysis input this builder was constructed from. `cases` is
+    /// seeded from this input's open cases, so diagnosis engines must read
+    /// their inputs from here rather than taking a separate `Input` argument
+    /// that might disagree.
+    input: &'a analysis_input::Input,


So: I'm shoving this in here because we previously called analyze with SitrepBuilder and input as two distinct arguments.

Realistically, they would come from the same input, but I want to prevent them from ever diverging, so... now they're both part of SitrepBuilder.

(This is a very tiny change but it does change the API of analyze)

smklein · 2026-06-10T21:15:53Z

+    -- UUID of the case this fact attaches to.
+    case_id UUID NOT NULL,


originally I had an index for "lookup fact by case ID" but I actually removed it; it's not necessary with how we do sitrep loading currently.

[fm] Add disk diagnoser with typed fact tables

6e094f4

smklein commented Jun 3, 2026

View reviewed changes

smklein added 7 commits June 3, 2026 11:51

better errors

bfedaa5

clearer constraints, a non-null column for physical_id

1b5106a

merge

ae1ec41

drop unused lookup_fm_fact_physical_disk_for_case index

14c2a06

Facts are only read during sitrep load, which paginates over the (sitrep_id, id) primary key; nothing queries by case directly. Easy to re-add with a migration if omdb grows a case-scoped fact query.

[fm] disk diagnoser: index parent cases by disk

d2a6767

Replace the per-disk linear scan of parent cases with a one-time inverse index. First case ID wins on (pathological) collisions, same as the scan it replaces.

smklein marked this pull request as ready for review June 10, 2026 21:12

smklein commented Jun 10, 2026

View reviewed changes

smklein mentioned this pull request Jun 11, 2026

[fm] Add saga diagnosis engine #10592

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fm] Add disk diagnoser with typed fact tables#10541

[fm] Add disk diagnoser with typed fact tables#10541
smklein wants to merge 8 commits into
mainfrom
fm-disk-diagnoser-typed

smklein commented Jun 3, 2026

Uh oh!

smklein Jun 3, 2026

Uh oh!

smklein Jun 10, 2026

Uh oh!

smklein Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		-- UUID of the case this fact attaches to.
		case_id UUID NOT NULL,

Conversation

smklein commented Jun 3, 2026

Uh oh!

smklein Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

smklein Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

smklein Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant