Skip to content

[alerts] add explicit versions to alert payloads#10555

Merged
hawkw merged 6 commits into
mainfrom
eliza/alert-versions
Jun 10, 2026
Merged

[alerts] add explicit versions to alert payloads#10555
hawkw merged 6 commits into
mainfrom
eliza/alert-versions

Conversation

@hawkw

@hawkw hawkw commented Jun 5, 2026

Copy link
Copy Markdown
Member

This branch adds an explicit notion of version numbers for alert payloads. This is implemented by changing the entry points for publishing alerts (either via Nexus::alert_publish or SitrepBuilder::request_alert for FM) to take a new AlertPayload trait implemented by types which describe alert payloads. These must provide an alert class and version constant, in addition to implementing serde::Serialize and JsonSchema. Eventually we will also use these to generate the schemas for each version of the alert so that they can be communicated to potential consumers. For now, though, we are at least enforcing that a given schema always provides the same alert class and version number (rather than being able to pass in any arbitrary class).

Now, when an alert is created, the schema version is recorded in the alert record. We backfill any existing alert records as being version 0. This doesn't actually matter because, you know, there aren't any, but we may as well do the right thing there. When an alert is delivered to a webhook receiver, the version number is included in a JSON field and a new x-oxide-alert-version header.

@hawkw hawkw added this to the 21 milestone Jun 5, 2026
@hawkw hawkw requested review from mergeconflict and smklein June 5, 2026 21:16
@hawkw hawkw added nexus Related to nexus fault-management Everything related to the fault-management initiative (RFD480 and others) labels Jun 5, 2026
Comment thread schema/crdb/dbinit.sql Outdated
comment TEXT NOT NULL,

-- The version of the alert class' schema that this alert's payload conforms
-- to.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment indicating that this is zero-indexed?

I know this is sorta implied by the type, but convention sometimes implies starting with 1, so it's worth being explicit

Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/src/app/webhook.rs Outdated
Comment on lines 195 to 199
const CLASS: AlertClass = AlertClass::Probe;
const VERSION: u32 =
<alert_types::Probe as alert_types::AlertPayload>::VERSION;
static DATA: LazyLock<serde_json::Value> =
LazyLock::new(|| serde_json::json!({}));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not matter much right now for probes, but to set a good precedent:

  • Should CLASS reference alert_types::Probe::CLASS ? Currently it points directly at the db-model enum variant.
  • Should DATA be referencing alert_types::Probe? These happen to both be empty JSON, but could theoretically get out-of-sync?

Comment thread nexus/types/src/alert.rs
use std::fmt;

/// Trait implemented by alerts.
pub trait AlertPayload: Serialize + JsonSchema + std::fmt::Debug {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of thoughts:

  • This trait requires JsonSchema, but AFAICT, nothing is consuming that right now? I understand that we want to use this in the future though.
  • From what I can tell, nothing actually validates that "the version should be bumped when the schema changes". If someone alters a payload struct and doesn't bump the version, nothing fails (or notifies them at all).

I know you touched on this a bit in the commit message, but:

  • How hard would it be to have a registry of all AlertPayload impls, with the schema of each JsonSchema saved somewhere keyed by (class, version)? Then changing a struct without bumping VERSION could cause a test failure, and make it clear what a programmer needs to change.

We don't technically need to block the PR behind this, but it seems like we should move VERSION to be guarded by something better than "the honor system" ASAP.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of what you're describing is the eventual goal. For now, my intent for this PR was just to have the base notion of having versions, so that we could add version 0 for some alerts. I wanted to get the bare minimum infrastructure for saying that alerts have versions at all before we added any alert classes. I definitely agree that we will need to be able to generate the schemas and emit them somewhere, and that we need tooling for actually validating that versions are correct. I want to be able to do that separately from writing a DE that emits v0 alerts, though.

@mergeconflict mergeconflict left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! i think @smklein already found all the fun stuff, so just one nit from me.

Comment thread nexus/types/src/alert.rs

impl AlertPayload for $Name {
const CLASS: AlertClass = AlertClass::$Class;
const VERSION: u32 = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i wonder if we could vary the version (maybe spit out arbitrary version numbers from proptest) so that if the production code happened to have an "oops, the version is always zero" bug, we'd catch it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think testing for this is part of the work on actually adding tooling and infrastructure for supporting multiple versions, which I'd like to defer for a later PR.

hawkw added a commit that referenced this pull request Jun 10, 2026
As per @smklein in
#10555 (comment)
this should be documented, and I figured we should have a constraint for
it too.
hawkw added a commit that referenced this pull request Jun 10, 2026
This addresses the suggestion from
#10555 (comment)
hawkw and others added 5 commits June 10, 2026 11:12
This branch adds an explicit notion of version numbers for alert
payloads. This is implemented by changing the entry points for
publishing alerts (either via `Nexus::alert_publish` or
`SitrepBuilder::request_alert` for FM) to take a new `AlertPayload`
trait implemented by types which describe alert payloads. These must
provide an alert class and version constant, in addition to implementing
`serde::Serialize` and `JsonSchema`. Eventually we will also use these
to generate the schemas for each version of the alert so that they can
be communicated to potential consumers. For now, though, we are at least
enforcing that a given schema always provides the same alert class and
version number (rather than being able to pass in any arbitrary class).

Now, when an alert is created, the schema version is recorded in the
alert record. We backfill any existing alert records as being version 0.
This doesn't actually matter because, you know, there aren't any, but we
may as well do the right thing there. When an alert is delivered to a
webhook receiver, the version number is included in a JSON field and a
new `x-oxide-alert-version` header.
Co-authored-by: Sean Klein <sean@oxide.computer>
As per @smklein in
#10555 (comment)
this should be documented, and I figured we should have a constraint for
it too.
This addresses the suggestion from
#10555 (comment)
@hawkw hawkw force-pushed the eliza/alert-versions branch from 153b061 to 2729e97 Compare June 10, 2026 18:12
@hawkw hawkw enabled auto-merge (squash) June 10, 2026 18:13
@hawkw hawkw merged commit 6b52a71 into main Jun 10, 2026
18 checks passed
@hawkw hawkw deleted the eliza/alert-versions branch June 10, 2026 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others) nexus Related to nexus

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants