chore: add metric for missed interruptions events #9037

Open
ryan-mist wants to merge 1 commit into aws:main from ryan-mist:missed-drain-metric

@ryan-mist ryan-mist commented Mar 27, 2026

Fixes #N/A

Description

  • adds a new metric, karpenter_interruption_missed_termination_total, that is incremented when an interruptible instance (i.e. spot or from an IODCR) is terminated and Karpenter did not receive an SQS interruption warning
    • this is done by adding a new annotation, karpenter.k8s.aws/instance-interrupted, when Karpenter receives the interruption event; an interruptible instance that is terminated without the annotation is counted as missed

Note that we can't do this through NodeClaim status conditions, because Drained with status Unknown is the indicating status condition both in case (1), where the SQS interruption warning is not received, and in case (2), where the node was not fully drained before the interruption period ended. We also can't use the InstanceTerminating status condition for the same reason: in cases (1) and (2) it is never set on the NodeClaim. (code ref in core)
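
For reviewers who want the decision logic in one place, here is a minimal sketch of the annotation-based check described above. It is not the code in this PR: the nodeClaim struct, the missedTerminations counter, and the helper names are hypothetical stand-ins, and a mutex-guarded map takes the place of the real metrics registration. Treating every "spot" or "reserved" capacity type as interruptible is also a simplifying assumption; the description above only names spot and IODCR-backed instances.

```go
package main

import "sync"

// Annotation key taken from the PR description.
const interruptedAnnotation = "karpenter.k8s.aws/instance-interrupted"

// nodeClaim is a hypothetical, trimmed-down stand-in for the real NodeClaim.
type nodeClaim struct {
	Annotations  map[string]string
	CapacityType string // e.g. "spot", "reserved", "on-demand"
}

// missedTerminations is a stand-in for the counter
// karpenter_interruption_missed_termination_total, keyed by capacity_type.
type missedTerminations struct {
	mu     sync.Mutex
	counts map[string]int
}

func newMissedTerminations() *missedTerminations {
	return &missedTerminations{counts: map[string]int{}}
}

// markInterrupted records that an interruption warning was received
// for this NodeClaim by setting the annotation.
func markInterrupted(nc *nodeClaim) {
	if nc.Annotations == nil {
		nc.Annotations = map[string]string{}
	}
	nc.Annotations[interruptedAnnotation] = "true"
}

// isInterruptible is an assumption for this sketch: spot and reserved
// capacity are treated as interruptible, everything else is not.
func isInterruptible(capacityType string) bool {
	return capacityType == "spot" || capacityType == "reserved"
}

// observeTermination is called when an instance is found terminated.
// It increments the counter and returns true only when the instance was
// interruptible and no interruption warning had been recorded.
func (m *missedTerminations) observeTermination(nc *nodeClaim) bool {
	if !isInterruptible(nc.CapacityType) {
		return false
	}
	// Reading from a nil map is safe in Go and yields the zero value.
	if nc.Annotations[interruptedAnnotation] == "true" {
		return false
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[nc.CapacityType]++
	return true
}
```

In this sketch, a NodeClaim that went through markInterrupted before termination leaves the counter untouched, while a "reserved" NodeClaim that terminates without the annotation bumps counts["reserved"], which mirrors the two manual test scenarios below.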

How was this change tested?

  • unit tests and manual testing

Manual Testing

With correct SQS Queue Configured
When an instance is interrupted, the annotation is added:

Annotations:  karpenter.k8s.aws/ec2nodeclass-hash: 13930302764067181154
              karpenter.k8s.aws/ec2nodeclass-hash-version: v4
              karpenter.k8s.aws/instance-interrupted: true

and the metric is not incremented:

ryanmist@c889f3b6ff52 ~ % curl http://localhost:8080/metrics | grep 'karpenter_interruption'
...
karpenter_interruption_received_messages_total{message_type="capacity_reservation_interrupted"} 2
...
ryanmist@c889f3b6ff52 ~ % curl http://localhost:8080/metrics | grep 'missed_termination_total'
ryanmist@c889f3b6ff52 ~ %

With incorrect SQS Queue Configured
When an instance is interrupted, the annotation is not added. When it's terminated, the metric is incremented:

ryanmist@c889f3b6ff52 ~ % curl http://localhost:8080/metrics | grep 'karpenter_interruption'
# TYPE karpenter_interruption_missed_termination_total counter
karpenter_interruption_missed_termination_total{capacity_type="reserved"} 1
ryanmist@c889f3b6ff52 ~ %

and associated logs:

{"level":"INFO","time":"2026-03-27T01:14:16.926Z","logger":"controller","caller":"cloudprovider/cloudprovider.go:258","message":"detected instance termination without interruption notification","commit":"7d6a7ff-dirty","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"iodcr-pool-dd75k"},"namespace":"","name":"iodcr-pool-dd75k","reconcileID":"2b74ffa4-0036-4865-a43a-158139c7ee5e","provider-id":"aws:///us-west-2c/i-...","Node":{"name":"...","capacity-type":"reserved"}

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ryan-mist ryan-mist requested a review from a team as a code owner March 27, 2026 01:25
@ryan-mist ryan-mist requested a review from bwagner5 March 27, 2026 01:25
@github-actions

Preview deployment ready!

Preview URL: https://pr-9037.d18coufmbnnaag.amplifyapp.com

Built from commit b54ff6939618b520aa807b2a483003ef9db2f6ed

@AndrewMitchell25 AndrewMitchell25 left a comment

LGTM 🚀
