[fm] stash sitrep analysis reports in the db for retrieval by omdb#10556
[fm] stash sitrep analysis reports in the db for retrieval by omdb#10556hawkw wants to merge 11 commits into
Conversation
these look bad with the new omdb command
6c2088b to
0ddba26
Compare
| -- | ||
| -- These are for diagnostic purposes only (i.e. for use by omdb, inclusion in | ||
| -- support bundles, etc), and are not intended as operational data. | ||
| CREATE TABLE IF NOT EXISTS omicron.public.fm_sitrep_analysis_report ( |
There was a problem hiding this comment.
Organizational question: This seems like it'll have a lifetime which has parity with fm_sitrep, basically as an exact match. What's the motivation to store this in a separate table?
I guess I can understand the argument of "we don't want to read it on the loading path, and we want to keep the metadata logically separate"? But if we opt for this separate table approach, it might be worth documenting that "This thing should basically share the lifetime of the sitrep itself".
(Why I bring this up: I always wonder about the GC of rows like this, and try to monitor for "can this leak". I think we're safe here, but understanding the lifetime is an important piece of that puzzle)
There was a problem hiding this comment.
FOLLOW-UP: I see fm_sitrep_insert leaves this as an optional argument. I think in prod, we'll be reliably inserting it, but I can see the argument for "technically it is optional and separate".
There was a problem hiding this comment.
FOLLOW-UP: I see
fm_sitrep_insertleaves this as an optional argument. I think in prod, we'll be reliably inserting it, but I can see the argument for "technically it is optional and separate".
We will generally insert it as long as serializing it doesn't fail --- which we currently record a warning about. I didn't want to block the correct functioning of the FM subsystem on a failure to serialize the report (which shouldn't happen)
But, yeah, the rationale here was that I wanted to make it a separate table mostly as a way of making it "feel" distinct from the operational data in the sitrep that the FM system actually reads; these are supposed to be optional and separate.
They are currently tied to the lifecycle of the sitrep, but they don't have to stay that way. I can kind of imagine them getting pretty big and wanting to be able to delete them more aggressively than the sitreps themselves. At the same time, I can also maybe imagine us wanting to be able to "pin" a given report so that it doesn't get automatically deleted if we're in the middle of actively debugging something. Although maybe that's better served by just exporting it to a file...
| analysis_report, | ||
| } = report; | ||
|
|
||
| let our_git_commit = env!("VERGEN_GIT_SHA"); |
There was a problem hiding this comment.
Note that the Nexus side of things also checks VERGEN_GIT_DIRTY.
I know that if that's set, all bets are basically off, but it might be worth checking here, too? There are some weird equality rules at play, where:
| Nexus Sha = X | Nexus Sha = X-Dirty | |
|---|---|---|
| Omdb Sha = X | reports match | don't necessarily match |
| Omdb Sha = X-Dirty | don't necessarily match | don't necessarily match |
There was a problem hiding this comment.
Yup, I'll go fix this --- though it's worth noting that the omdb db blueprint planner-report show command has the same behavior, where the actual planning reports in the database may include -dirty but OMDB never checks for it. Which I blindly copied from without thinking too hard about it. 😅
I'll fix this, but maybe I should fix that too, while I'm at it. It might be nice to have a function someplace for "what's the Git sha string to use for this" so that it's always consistent. I might do that in a separate PR and then stack this on top of that?
There was a problem hiding this comment.
Oh, actually, upon thinking through this, the current behavior basically implements this, just in a non-obvious way. omdb will never add the -dirty suffix, so if the Nexus commit was dirty, it will never be equal to omdb's our_git_commit, even if the omdb build was also from a dirty repo. However, this behavior is a bit un-obvious, so I probably will replace it with an explicit PartialEq impl.
There was a problem hiding this comment.
What do you think of #10578? It may be a little bit overengineered but I'm pleased with it...
There was a problem hiding this comment.
My big issue is within #10578 (comment) - I think the following could happen:
- start on commit abc123 with a clean git tree. cargo build runs the build script, git sees that things are clean, so
VERGEN_GIT_DIRTY=falseis baked into the binary. - We edit something in
nexusand rebuild. New API, sitrep analysis, etc.omicron-nexusis recompiled, butomicron-git-versionis untouched..git/HEADis also untouched. the build script never re-runs. - The newly compiled Nexus binary contains all the modified code, but it still contains
VERGEN_GIT_DIRTY=false, even though it's clearly actually dirty.
(The same issue applies in reverse too - if we build starting from a dirty git repo, and go back to clean, we will keep seeing the -dirty suffix until we force .git/HEAD to change some other way, or do a rebuild from scratch somehow, like cargo clean).
thanks @smklein! Co-authored-by: Sean Klein <seanmarionklein@gmail.com>
| Ok(dsl::fm_sitrep_analysis_report | ||
| .filter(dsl::sitrep_id.eq(sitrep.into_untyped_uuid())) | ||
| .select(model::fm::SitrepAnalysisReport::as_select()) | ||
| .first_async(&*datastore.pool_connection_for_tests().await?) |
There was a problem hiding this comment.
This probably doesn't matter at all, but if we don't have an analysis report for the sitrep, we'll return NotFound here. This could technically happen if SitrepAnalysisReport::new fails, or if there were somehow a preexisting sitrep in the db from before this PR merges.
Would it make sense for this to return Result<Option<...>>? I'm guessing probably not, we do expect an analysis report in the happy path...
There was a problem hiding this comment.
hmm, i think returning the error and printing "not found" in the error message is fine here. I thought about making it return an option, but I felt like in this case, all the error does is mean that we will print the error message out and exit, which is basically what we'd do if we got a None, too. And I think returning a non-zero exit code is basically fine here: you asked for the analysis report for a particular sitrep, and we couldn't find one; it's fine for omdb to treat that as an "error", IMO. 🤷♀️
There was a problem hiding this comment.
we might want to check if the sitrep exists at all, though, so that we can print a different message if "you asked for a sitrep that doesn't exist" versus "you asked for the analysis report for a sitrep that exists but has no analysis report"...
| .internal_context( | ||
| "failed to insert sitrep analysis report", | ||
| ) | ||
| })?; |
There was a problem hiding this comment.
If this fails for whatever reason, we'll bail out here rather than continuing. In analyze you commented that saving the analysis report should be "purely diagnostic" but this breaks that. I'm guessing you'll probably want to just log the error here instead of propagating it out.
There was a problem hiding this comment.
yeah, good catch --- will fix.
| fn print_analysis_reports( | ||
| report: &model::fm::SitrepAnalysisReport, | ||
| json: bool, | ||
| ) -> anyhow::Result<()> { | ||
| use nexus_types::fm::analysis_reports::{AnalysisReport, InputReport}; | ||
|
|
||
| let model::fm::SitrepAnalysisReport { | ||
| sitrep_id: _, | ||
| git_commit, | ||
| input_report, | ||
| analysis_report, | ||
| } = report; | ||
|
|
||
| let our_git_commit = env!("VERGEN_GIT_SHA"); | ||
| if our_git_commit != git_commit { | ||
| eprintln!( | ||
| "note: these sitrep analysis reports were produced by a Nexus \ | ||
| on git commit {git_commit}. this omdb was built from \ | ||
| {our_git_commit}." | ||
| ); | ||
| } |
There was a problem hiding this comment.
Nit: "reports" or "report"?
| })?; | ||
| } | ||
|
|
||
| // Store the analysis reports describing how this sitrep was produced, |
There was a problem hiding this comment.
Nit: as above in sitrep.rs, are we dealing with multiple reports here or just one?
As described in #10505, the
fm_analysisbackground task that produces a sitrep also emits two human-readable report types that contain debugging information describing the analysis step that produced the sitrep, and the inputs to that analysis step. Currently, these are included in the status object for thefm_analysistask, meaning that they can be viewed byomdb nexus background-tasks status. However, once a new analysis step has begun, they are no longer accessible. And, even if no new analysis has been performed yet, accessing them throughomdb nexus background-tasks statusrequires tracking down the Nexus that actually produced a given sitrep.This branch extends this system to also store these reports in a CockroachDB table, so that they may be retrieved and displayed by
omdb. Now, they are accessible as long as the sitrep that they describe remains in the database. These entries are children of the sitrep, so that they can be GCed when the sitrep is deleted. In addition, this branch adds anomdb db sitrep analysis-reportssubcommand for retrieving and displaying these reports. We attempt to use the typed representation fromnexus_types::fmwhen possible, but fall back to printing whatever JSON is stored when the value cannot be decoded as the typed representation (i.e. because it changed in an incompatible way). This way, we can always display historical reports from older Nexus versions, but we don't have to worry too much about versioning this debug-only data. We also include the Git SHA of the Nexus build that produced a report so that we can include it in the omdb output, which is especially helpful when the JSON can't deserialize as the current typed representation.Closes #10505