fix(KFLUXINFRA-3529): etcd-shield total_size check + severity fix (staging)#11319
fix(KFLUXINFRA-3529): etcd-shield total_size check + severity fix (staging)#11319peet-rh wants to merge 1 commit intoredhat-appstudio:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: peet-rh The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @peet-rh. Thanks for your PR. I'm waiting for a redhat-appstudio member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Review Summary by QodoFix etcd-shield to detect total DB size capacity issues
WalkthroughsDescription• Add etcd_mvcc_db_total_size_in_bytes check to etcd-shield recording rule • Fix alert severity from warning to critical for consistency • Extend hysteresis condition to check both in-use and total DB size metrics • Staging deployment only (2 clusters, low risk) Diagramflowchart LR
A["etcd-shield trigger rule"] -->|add total_size check| B["Detect 80% total DB size"]
A -->|extend hysteresis| C["Check both in-use and total metrics"]
D["EtcdShieldDenyAdmission alert"] -->|severity upgrade| E["warning → critical"]
B --> F["Prevent missed capacity alerts"]
C --> F
File Changes1. components/etcd-shield/base/etcd_shield_alerts.yaml
|
Code Review by Qodo
|
Kustomize Render DiffComparing
Total: 2 components, +8 -4 lines 📋 Full diff available in the workflow summary and as a downloadable artifact. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #11319 +/- ##
=======================================
Coverage 51.62% 51.62%
=======================================
Files 18 18
Lines 1263 1263
=======================================
Hits 652 652
Misses 539 539
Partials 72 72
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) or | ||
| (etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70))) == 1) and |
There was a problem hiding this comment.
1. Hysteresis or is broken 🐞 Bug ≡ Correctness
In etcd_shield_trigger’s 70% hysteresis clause, (in_use >= bool …) or (total_size >= bool …) is evaluated before filtering to true (== 1), but >= bool keeps 0-valued samples on both sides, so PromQL’s set-operator or will keep the left-hand (in_use) sample when labelsets match and effectively ignore the total_size branch. This can cause EtcdShieldDenyAdmission to resolve even when etcd_mvcc_db_total_size_in_bytes stays >=70% after being triggered by the new 80% total-size check.
Agent Prompt
### Issue description
The 70% hysteresis clause uses `or` on the output of `>= bool` before filtering to true (`== 1`), which prevents `etcd_mvcc_db_total_size_in_bytes` from contributing when labelsets match.
### Issue Context
At 80% you already use the safe pattern: `((metric >= bool threshold) == 1) or ((other_metric >= bool threshold) == 1)`.
### Fix
Update the 70% clause to filter each side *before* `or` (or remove `bool` so false series are absent), e.g.:
Option A (minimal change, consistent style):
```promql
((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
(count without (alertname, alertstate, severity) (ALERTS{...}) == bool 1))
```
Option B (simpler PromQL):
```promql
(((etcd_mvcc_db_total_size_in_use_in_bytes >= (etcd_server_quota_backend_bytes * 0.70)) or
(etcd_mvcc_db_total_size_in_bytes >= (etcd_server_quota_backend_bytes * 0.70))) and
(count without (...) (ALERTS{...}) == bool 1))
```
### Fix Focus Areas
- components/etcd-shield/base/etcd_shield_alerts.yaml[21-30]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
The 70% clause already filters each side to == 1 before or:
((in_use >= bool (quota * 0.70)) == 1) or ((total_size >= bool (quota * 0.70)) == 1)
This is the same pattern used on the 80% lines. False series (value 0) are dropped by == 1 before or sees them, so both branches contribute correctly.
…hield (staging) Co-Authored-By: Claude Opus 4.6 <[email protected]> Signed-off-by: Peter Kirkpatrick <[email protected]>
9ccb33d to
6ac24b4
Compare
|
/review |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
|
/ok-to-test |
🤖 Pipeline Failure AnalysisCategory: Configuration The pipeline failed because the 📋 Technical DetailsImmediate CauseThe Contributing FactorsThis deployment failure led to the overall ArgoCD synchronization timing out after multiple attempts, as the ImpactThe failure to properly deploy and stabilize the 🔍 Evidenceappstudio-e2e-tests/redhat-appstudio-e2eCategory: Logs:
|
|
/retest |
What
Add etcd_mvcc_db_total_size_in_bytes check to etcd-shield recording rule and fix severity warning → critical to match hysteresis ALERTS filter. Staging only.
Why
ITN-2026-00103: etcd-shield missed stone-prd-rh01 because only in_use bytes were checked — physical DB size hit 80% undetected.
Validation
Tested against live Prometheus on stone-prd-rh01. All 3 conditions verified.
Risk Assessment
Risk Level: Low — Staging only (2 clusters), single revert rollback.