Skip to content

[fm] Add disk diagnoser with typed fact tables#10541

Open
smklein wants to merge 8 commits into
mainfrom
fm-disk-diagnoser-typed
Open

[fm] Add disk diagnoser with typed fact tables#10541
smklein wants to merge 8 commits into
mainfrom
fm-disk-diagnoser-typed

Conversation

@smklein

@smklein smklein commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

This PR provides an alternative to #10460 , using strongly-typed "enum-ish" tables for each DE's set of facts, rather than a JSON blob. It should be mutually exclusive with that PR (we either do JSON, or we don't).

The first fault management diagnosis engine: opens a case for any non-Online zpool whose backing physical disk is currently in service in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

  • DiagnosisEngineKind::Disk variant (Rust + DB enum)
  • fm_case_fact child table for per-engine state (one case has 0..N mmutable facts; stable UUIDs across sitreps; participates in copy-forward + GC like other sitrep child tables)
  • CaseBuilder::{add_fact, remove_fact, facts} API
  • InServiceDisk nexus-types projection consumed by FM, populated from the existing zpool_list_all_external_batched datastore method with policy filtering done in the background task

)
.execute_async(&conn)
.await?;
// Delete every child row by sitrep_id. Driving this off

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a little involved, but it's worth noting: This is only used in tests? So doing some one-time complexity to avoid a need for future updates on each new fact type seems like a nice tradeoff.

smklein added 7 commits June 3, 2026 11:51
…lerate duplicate zpools

- The per-kind CHECK constraint on fm_fact_physical_disk is now an
  implication (kind != 'zpool_unhealthy' OR columns present), so future
  fact kinds add their own constraint instead of rewriting this one
- The fm_analysis in-service disk projection skips soft-deleted
  physical_disk rows, matching what its comment already claimed
- A duplicate physical disk in the zpool listing now logs a warning and
  keeps the first zpool instead of panicking the background task
SitrepBuilder::cases is seeded from the open cases of the Input the
builder was constructed from, but diagnosis::analyze() previously took
the Input as a separate argument; nothing tied the two together, and an
engine handed a mismatched pair would panic looking up a case that was
never seeded. Store the Input on the builder and have engines read it
from there, making the mismatch unrepresentable.
Facts are only read during sitrep load, which paginates over the
(sitrep_id, id) primary key; nothing queries by case directly. Easy to
re-add with a migration if omdb grows a case-scoped fact query.
Replace the per-disk linear scan of parent cases with a one-time
inverse index. First case ID wins on (pathological) collisions, same
as the scan it replaces.
@smklein smklein marked this pull request as ready for review June 10, 2026 21:12
Comment thread nexus/fm/src/builder.rs
Comment on lines +27 to +31
/// The analysis input this builder was constructed from. `cases` is
/// seeded from this input's open cases, so diagnosis engines must read
/// their inputs from here rather than taking a separate `Input` argument
/// that might disagree.
input: &'a analysis_input::Input,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So: I'm shoving this in here because we previously called analyze with SitrepBuilder and input as two distinct arguments.

Realistically, they would come from the same input, but I want to prevent them from ever diverging, so... now they're both part of SitrepBuilder.

(This is a very tiny change but it does change the API of analyze)

Comment thread schema/crdb/dbinit.sql
Comment on lines +7734 to +7735
-- UUID of the case this fact attaches to.
case_id UUID NOT NULL,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally I had an index for "lookup fact by case ID" but I actually removed it; it's not necessary with how we do sitrep loading currently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant