chore: add metric for missed interruptions events #9037
Open
Fixes #N/A
Description
Adds the metric karpenter_interruption_missed_termination_total, which is incremented when an interruptible instance (i.e. spot or from an IODCR) is terminated and Karpenter did not receive an SQS interruption warning.
Adds the annotation karpenter.k8s.aws/instance-interrupted when Karpenter receives the interruption event.
Note that we can't do this through NodeClaim status conditions because Drained as Unknown is the indicating status condition for both case (1), where the SQS interruption warning is not received, and case (2), where the node was not fully drained before the interruption period is up. We also can't use the InstanceTerminating status condition for the same reason: in cases (1) and (2) it is never set on the NodeClaim. (code ref in core)
How was this change tested?
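For reviewers, the behaviour exercised by the manual tests below can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: the instance type, the markInterrupted and onTerminated helpers, the plain integer counter, and the annotation value "true" are all hypothetical stand-ins, and the sketch does not model how terminations are detected or how the real Prometheus counter is registered.

```go
package main

import "fmt"

// Annotation key named in the PR description.
const interruptedAnnotation = "karpenter.k8s.aws/instance-interrupted"

// instance is a hypothetical, simplified stand-in for a NodeClaim and its backing instance.
type instance struct {
	Name          string
	Interruptible bool // e.g. spot, or launched from an IODCR
	Annotations   map[string]string
}

// missedTerminations stands in for karpenter_interruption_missed_termination_total.
var missedTerminations int

// markInterrupted records that an SQS interruption warning was received.
// The value "true" is an assumption made for this sketch.
func markInterrupted(i *instance) {
	if i.Annotations == nil {
		i.Annotations = map[string]string{}
	}
	i.Annotations[interruptedAnnotation] = "true"
}

// onTerminated increments the counter only when an interruptible instance
// is terminated without the annotation. It reports whether it counted.
func onTerminated(i *instance) bool {
	if !i.Interruptible {
		return false
	}
	if _, warned := i.Annotations[interruptedAnnotation]; warned {
		return false
	}
	missedTerminations++
	return true
}

func main() {
	// Correct queue: the warning arrives, so the termination is not counted.
	warned := &instance{Name: "spot-a", Interruptible: true}
	markInterrupted(warned)
	fmt.Println("counted:", onTerminated(warned))

	// Incorrect queue: no warning arrives, so the termination is counted.
	missed := &instance{Name: "spot-b", Interruptible: true}
	fmt.Println("counted:", onTerminated(missed))
	fmt.Println("missed terminations:", missedTerminations)
}
```

Keying the decision on an annotation rather than a status condition follows the reasoning above: the annotation is written only when the warning is actually received, so its absence at termination time distinguishes case (1) from case (2).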
Manual Testing
With correct SQS Queue Configured
When an instance is interrupted, the annotation is added:
and the metric is not incremented:
With incorrect SQS Queue Configured
When an instance is interrupted, the annotation is not added. When it's terminated, the metric is incremented:
and associated logs:
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.