Skip to content

[fm] stash sitrep analysis reports in the db for retrieval by omdb#10556

Open
hawkw wants to merge 11 commits into
mainfrom
eliza/sitrep-reports
Open

[fm] stash sitrep analysis reports in the db for retrieval by omdb#10556
hawkw wants to merge 11 commits into
mainfrom
eliza/sitrep-reports

Conversation

@hawkw

@hawkw hawkw commented Jun 5, 2026

Copy link
Copy Markdown
Member

As described in #10505, the fm_analysis background task that produces a sitrep also emits two human-readable report types that contain debugging information describing the analysis step that produced the sitrep, and the inputs to that analysis step. Currently, these are included in the status object for the fm_analysis task, meaning that they can be viewed by omdb nexus background-tasks status. However, once a new analysis step has begun, they are no longer accessible. And, even if no new analysis has been performed yet, accessing them through omdb nexus background-tasks status requires tracking down the Nexus that actually produced a given sitrep.

This branch extends this system to also store these reports in a CockroachDB table, so that they may be retrieved and displayed by omdb. Now, they are accessible as long as the sitrep that they describe remains in the database. These entries are children of the sitrep, so that they can be GCed when the sitrep is deleted. In addition, this branch adds an omdb db sitrep analysis-reports subcommand for retrieving and displaying these reports. We attempt to use the typed representation from nexus_types::fm when possible, but fall back to printing whatever JSON is stored when the value cannot be decoded as the typed representation (i.e. because it changed in an incompatible way). This way, we can always display historical reports from older Nexus versions, but we don't have to worry too much about versioning this debug-only data. We also include the Git SHA of the Nexus build that produced a report so that we can include it in the omdb output, which is especially helpful when the JSON can't deserialize as the current typed representation.

Closes #10505

@hawkw hawkw force-pushed the eliza/sitrep-reports branch from 6c2088b to 0ddba26 Compare June 7, 2026 16:55
@hawkw hawkw changed the title [wip][fm] stash sitrep analysis reports in the db for future retrieval [fm] stash sitrep analysis reports in the db for retrieval by omdb Jun 7, 2026
@hawkw hawkw marked this pull request as ready for review June 7, 2026 17:02
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Jun 7, 2026
@hawkw hawkw self-assigned this Jun 7, 2026
@hawkw hawkw added this to the 21 milestone Jun 7, 2026
@hawkw hawkw requested review from mergeconflict and smklein and removed request for mergeconflict June 7, 2026 17:02
Comment thread schema/crdb/dbinit.sql
--
-- These are for diagnostic purposes only (i.e. for use by omdb, inclusion in
-- support bundles, etc), and are not intended as operational data.
CREATE TABLE IF NOT EXISTS omicron.public.fm_sitrep_analysis_report (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Organizational question: This seems like it'll have a lifetime which has parity with fm_sitrep, basically as an exact match. What's the motivation to store this in a separate table?

I guess I can understand the argument of "we don't want to read it on the loading path, and we want to keep the metadata logically separate"? But if we opt for this separate table approach, it might be worth documenting that "This thing should basically share the lifetime of the sitrep itself".

(Why I bring this up: I always wonder about the GC of rows like this, and try to monitor for "can this leak". I think we're safe here, but understanding the lifetime is an important piece of that puzzle)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOLLOW-UP: I see fm_sitrep_insert leaves this as an optional argument. I think in prod, we'll be reliably inserting it, but I can see the argument for "technically it is optional and separate".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOLLOW-UP: I see fm_sitrep_insert leaves this as an optional argument. I think in prod, we'll be reliably inserting it, but I can see the argument for "technically it is optional and separate".

We will generally insert it as long as serializing it doesn't fail --- which we currently record a warning about. I didn't want to block the correct functioning of the FM subsystem on a failure to serialize the report (which shouldn't happen)

But, yeah, the rationale here was that I wanted to make it a separate table mostly as a way of making it "feel" distinct from the operational data in the sitrep that the FM system actually reads; these are supposed to be optional and separate.

They are currently tied to the lifecycle of the sitrep, but they don't have to stay that way. I can kind of imagine them getting pretty big and wanting to be able to delete them more aggressively than the sitreps themselves. At the same time, I can also maybe imagine us wanting to be able to "pin" a given report so that it doesn't get automatically deleted if we're in the middle of actively debugging something. Although maybe that's better served by just exporting it to a file...

analysis_report,
} = report;

let our_git_commit = env!("VERGEN_GIT_SHA");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the Nexus side of things also checks VERGEN_GIT_DIRTY.

I know that if that's set, all bets are basically off, but it might be worth checking here, too? There are some weird equality rules at play, where:

Nexus Sha = X Nexus Sha = X-Dirty
Omdb Sha = X reports match don't necessarily match
Omdb Sha = X-Dirty don't necessarily match don't necessarily match

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I'll go fix this --- though it's worth noting that the omdb db blueprint planner-report show command has the same behavior, where the actual planning reports in the database may include -dirty but OMDB never checks for it. Which I blindly copied from without thinking too hard about it. 😅

I'll fix this, but maybe I should fix that too, while I'm at it. It might be nice to have a function someplace for "what's the Git sha string to use for this" so that it's always consistent. I might do that in a separate PR and then stack this on top of that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually, upon thinking through this, the current behavior basically implements this, just in a non-obvious way. omdb will never add the -dirty suffix, so if the Nexus commit was dirty, it will never be equal to omdb's our_git_commit, even if the omdb build was also from a dirty repo. However, this behavior is a bit un-obvious, so I probably will replace it with an explicit PartialEq impl.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of #10578? It may be a little bit overengineered but I'm pleased with it...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My big issue is within #10578 (comment) - I think the following could happen:

  1. start on commit abc123 with a clean git tree. cargo build runs the build script, git sees that things are clean, so VERGEN_GIT_DIRTY=false is baked into the binary.
  2. We edit something in nexus and rebuild. New API, sitrep analysis, etc. omicron-nexus is recompiled, but omicron-git-version is untouched. .git/HEAD is also untouched. the build script never re-runs.
  3. The newly compiled Nexus binary contains all the modified code, but it still contains VERGEN_GIT_DIRTY=false, even though it's clearly actually dirty.

(The same issue applies in reverse too - if we build starting from a dirty git repo, and go back to clean, we will keep seeing the -dirty suffix until we force .git/HEAD to change some other way, or do a rebuild from scratch somehow, like cargo clean).

Comment thread dev-tools/omdb/src/bin/omdb/db/sitrep.rs Outdated
Comment thread nexus/src/app/background/tasks/fm_analysis.rs Outdated
Comment thread nexus/types/src/fm/analysis_reports.rs
thanks @smklein!

Co-authored-by: Sean Klein <seanmarionklein@gmail.com>
Ok(dsl::fm_sitrep_analysis_report
.filter(dsl::sitrep_id.eq(sitrep.into_untyped_uuid()))
.select(model::fm::SitrepAnalysisReport::as_select())
.first_async(&*datastore.pool_connection_for_tests().await?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably doesn't matter at all, but if we don't have an analysis report for the sitrep, we'll return NotFound here. This could technically happen if SitrepAnalysisReport::new fails, or if there were somehow a preexisting sitrep in the db from before this PR merges.

Would it make sense for this to return Result<Option<...>>? I'm guessing probably not, we do expect an analysis report in the happy path...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think returning the error and printing "not found" in the error message is fine here. I thought about making it return an option, but I felt like in this case, all the error does is mean that we will print the error message out and exit, which is basically what we'd do if we got a None, too. And I think returning a non-zero exit code is basically fine here: you asked for the analysis report for a particular sitrep, and we couldn't find one; it's fine for omdb to treat that as an "error", IMO. 🤷‍♀️

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to check if the sitrep exists at all, though, so that we can print a different message if "you asked for a sitrep that doesn't exist" versus "you asked for the analysis report for a sitrep that exists but has no analysis report"...

.internal_context(
"failed to insert sitrep analysis report",
)
})?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails for whatever reason, we'll bail out here rather than continuing. In analyze you commented that saving the analysis report should be "purely diagnostic" but this breaks that. I'm guessing you'll probably want to just log the error here instead of propagating it out.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good catch --- will fix.

Comment on lines +435 to +455
fn print_analysis_reports(
report: &model::fm::SitrepAnalysisReport,
json: bool,
) -> anyhow::Result<()> {
use nexus_types::fm::analysis_reports::{AnalysisReport, InputReport};

let model::fm::SitrepAnalysisReport {
sitrep_id: _,
git_commit,
input_report,
analysis_report,
} = report;

let our_git_commit = env!("VERGEN_GIT_SHA");
if our_git_commit != git_commit {
eprintln!(
"note: these sitrep analysis reports were produced by a Nexus \
on git commit {git_commit}. this omdb was built from \
{our_git_commit}."
);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "reports" or "report"?

})?;
}

// Store the analysis reports describing how this sitrep was produced,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: as above in sitrep.rs, are we dealing with multiple reports here or just one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[fm] put sitrep analysis and input reports in the db when committing a sitrep

3 participants