fix(KFLUXINFRA-3529): etcd-shield total_size check + severity fix (staging) by peet-rh · Pull Request #11319 · redhat-appstudio/infra-deployments

peet-rh · 2026-04-16T01:48:50Z

What

Add an etcd_mvcc_db_total_size_in_bytes check to the etcd-shield recording rule, and change the alert severity from warning to critical so that it matches the hysteresis ALERTS filter. Staging only.

Why

ITN-2026-00103: etcd-shield missed stone-prd-rh01 because only in_use bytes were checked — physical DB size hit 80% undetected.

Validation

Tested against live Prometheus on stone-prd-rh01. All 3 conditions verified.

Risk Assessment

Risk Level: Low — Staging only (2 clusters), single revert rollback.

openshift-ci · 2026-04-16T01:48:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peet-rh
Once this PR has been reviewed and has the lgtm label, please assign gbenhaim for approval. For more information see the Code Review Process.

Needs approval from an approver in each of these files:

components/etcd-shield/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-04-16T01:49:01Z

Hi @peet-rh. Thanks for your PR.

I'm waiting for a redhat-appstudio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

qodo-code-review · 2026-04-16T01:49:02Z

Review Summary by Qodo

Fix etcd-shield to detect total DB size capacity issues

🐞 Bug fix

Walkthroughs

Description

• Add etcd_mvcc_db_total_size_in_bytes check to etcd-shield recording rule
• Fix alert severity from warning to critical for consistency
• Extend hysteresis condition to check both in-use and total DB size metrics
• Staging deployment only (2 clusters, low risk)

Diagram

flowchart LR
  A["etcd-shield trigger rule"] -->|add total_size check| B["Detect 80% total DB size"]
  A -->|extend hysteresis| C["Check both in-use and total metrics"]
  D["EtcdShieldDenyAdmission alert"] -->|severity upgrade| E["warning → critical"]
  B --> F["Prevent missed capacity alerts"]
  C --> F

File Changes

1. components/etcd-shield/base/etcd_shield_alerts.yaml 🐞 Bug fix +4/-2

Add total_size metric and fix alert severity

• Changed EtcdShieldDenyAdmission alert severity from warning to critical
• Added etcd_mvcc_db_total_size_in_bytes >= 0.80 * quota condition to recording rule
• Extended hysteresis condition to check both etcd_mvcc_db_total_size_in_use_in_bytes and
 etcd_mvcc_db_total_size_in_bytes at 70% threshold
• Ensures physical DB size capacity issues are detected alongside in-use bytes

components/etcd-shield/base/etcd_shield_alerts.yaml

qodo-code-review · 2026-04-16T01:49:03Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

🐞 ≡ Correctness (1)

1. Hysteresis OR is broken 🐞 ≡

Description

In etcd_shield_trigger’s 70% hysteresis clause, (in_use >= bool …) or (total_size >= bool …) is
evaluated before filtering to true (== 1), but >= bool keeps 0-valued samples on both sides, so
PromQL’s set-operator or will keep the left-hand (in_use) sample when labelsets match and
effectively ignore the total_size branch. This can cause EtcdShieldDenyAdmission to resolve even
when etcd_mvcc_db_total_size_in_bytes stays >=70% after being triggered by the new 80% total-size
check.

Code

components/etcd-shield/base/etcd_shield_alerts.yaml[R23-24]

+            ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) or
+                (etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70))) == 1) and

Evidence

The 80% checks correctly filter each branch to only-true series before `or`
(`(... >= bool ...) == 1` on both lines 21-22), but the new 70% clause applies `or` between two
`>= bool` expressions and only then compares the result to 1 (lines 23-24). Since `>= bool`
produces a present time series with value 0 or 1, `or` operates as a label-set union (not
value-level boolean OR) and will select the left-hand sample for matching labelsets, preventing
the total-size condition from contributing to the 70% hysteresis path.

components/etcd-shield/base/etcd_shield_alerts.yaml[21-25]
components/etcd-shield/base/etcd_shield_alerts.yaml[23-24]
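
To make the set-operator behaviour described in this finding concrete, here is a minimal Python sketch that models an instant vector as a dict from label set to value. It is a toy model for illustration, not Prometheus code: the helper names (`ge_bool`, `eq_filter`, `vector_or`), the `instance="etcd-0"` label, and the numbers are all hypothetical. It assumes both metrics carry the same label set once the metric name is dropped, which is the case the finding is about.

```python
# Toy model of PromQL instant vectors: {label_set: sample_value}.
# Metric names are left out of the label sets, since comparison
# operators drop them before `or` does its matching.

def ge_bool(vec, threshold):
    """Model `vec >= bool threshold`: every series survives with value 1.0 or 0.0."""
    return {labels: 1.0 if value >= threshold else 0.0 for labels, value in vec.items()}

def eq_filter(vec, scalar):
    """Model `vec == scalar` (no bool modifier): non-matching series are dropped."""
    return {labels: value for labels, value in vec.items() if value == scalar}

def vector_or(lhs, rhs):
    """Model `lhs or rhs`: all of lhs, plus rhs series whose label set is absent from lhs."""
    result = dict(rhs)
    result.update(lhs)  # on a label-set collision the left-hand sample wins
    return result

# Hypothetical sample data: in-use bytes at 50% of quota, total size at 75%.
etcd0 = frozenset({("instance", "etcd-0")})
quota = 100.0
in_use = {etcd0: 50.0}
total = {etcd0: 75.0}
threshold = 0.70 * quota

# `or` first, `== 1` afterwards: the left-hand 0 shadows the right-hand 1.
buggy = eq_filter(vector_or(ge_bool(in_use, threshold), ge_bool(total, threshold)), 1.0)

# `== 1` on each side first, then `or`: the total-size branch survives.
fixed = vector_or(
    eq_filter(ge_bool(in_use, threshold), 1.0),
    eq_filter(ge_bool(total, threshold), 1.0),
)

print(buggy)       # {}
print(len(fixed))  # 1
```

In this model the first form yields an empty result even though the total size is above the 70% threshold, while the second form keeps one series with value 1.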

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The 70% hysteresis clause uses `or` on the output of `>= bool` before filtering to true (`== 1`), which prevents `etcd_mvcc_db_total_size_in_bytes` from contributing when labelsets match.

### Issue Context
At 80% you already use the safe pattern: `((metric >= bool threshold) == 1) or ((other_metric >= bool threshold) == 1)`.

### Fix
Update the 70% clause to filter each side *before* `or` (or remove `bool` so false series are absent), e.g.:

Option A (minimal change, consistent style):
```promql
((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
 ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
(count without (alertname, alertstate, severity) (ALERTS{...}) == bool 1))
```

Option B (simpler PromQL):
```promql
(((etcd_mvcc_db_total_size_in_use_in_bytes >= (etcd_server_quota_backend_bytes * 0.70)) or
 (etcd_mvcc_db_total_size_in_bytes >= (etcd_server_quota_backend_bytes * 0.70))) and
(count without (...) (ALERTS{...}) == bool 1))
```

### Fix Focus Areas
- components/etcd-shield/base/etcd_shield_alerts.yaml[21-30]

github-actions · 2026-04-16T01:49:20Z

Kustomize Render Diff

Comparing bd9a404d6 → 132b6de9f

Component	Environment	Changes
`components/etcd-shield/staging/stone-stage-p01`	staging	+4 -2
`components/etcd-shield/staging/stone-stg-rh01`	staging	+4 -2

Total: 2 components, +8 -4 lines

📋 Full diff available in the workflow summary and as a downloadable artifact.

codecov · 2026-04-16T01:49:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.62%. Comparing base (bd9a404) to head (6ac24b4).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #11319   +/-   ##
=======================================
  Coverage   51.62%   51.62%           
=======================================
  Files          18       18           
  Lines        1263     1263           
=======================================
  Hits          652      652           
  Misses        539      539           
  Partials       72       72

Flag	Coverage Δ
go	`51.62% <ø> (ø)`

qodo-code-review · 2026-04-16T01:52:14Z

+            ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) or
+                (etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70))) == 1) and


1. Hysteresis or is broken 🐞 Bug ≡ Correctness

In etcd_shield_trigger’s 70% hysteresis clause, (in_use >= bool …) or (total_size >= bool …) is evaluated before filtering to true (== 1), but >= bool keeps 0-valued samples on both sides, so PromQL’s set-operator or will keep the left-hand (in_use) sample when labelsets match and effectively ignore the total_size branch. This can cause EtcdShieldDenyAdmission to resolve even when etcd_mvcc_db_total_size_in_bytes stays >=70% after being triggered by the new 80% total-size check.

The 70% clause already filters each side to == 1 before or:

((in_use >= bool (quota * 0.70)) == 1) or ((total_size >= bool (quota * 0.70)) == 1)

This is the same pattern used on the 80% lines. False series (value 0) are dropped by == 1 before or sees them, so both branches contribute correctly.

…hield (staging) Co-Authored-By: Claude Opus 4.6 <[email protected]> Signed-off-by: Peter Kirkpatrick <[email protected]>

peet-rh · 2026-04-16T02:21:34Z

/review

qodo-code-review · 2026-04-16T02:21:51Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

PromQL Logic

Validate the updated `etcd_shield_trigger` PromQL boolean logic is correct and does not change the intended hysteresis behavior. In particular, confirm the added `etcd_mvcc_db_total_size_in_bytes` checks at both 80% and 70% thresholds are combined as intended with the existing `ALERTS{alertname="EtcdShieldDenyAdmission", alertstate="firing"}` gating, and that operator precedence/parentheses yield the expected truth table.

expr: |
  (((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
  ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
  ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
  ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
  (count without (alertname, alertstate, severity) (ALERTS{
  alertname="EtcdShieldDenyAdmission",
  alertstate="firing",

YAML Structure

Confirm `labels` contains all required keys and the indentation renders valid YAML for PrometheusRule/alerting rule parsing. A mis-indented `severity` (or an empty/invalid `labels` mapping) can cause the rule to be rejected or the label not to be applied.

labels:
  severity: critical
annotations:

Metric Availability

Ensure `etcd_mvcc_db_total_size_in_bytes` is reliably present for all targeted clusters/etcd versions and has compatible labelsets with `etcd_server_quota_backend_bytes`, otherwise the new comparisons may produce unexpected empty vectors or mismatched series.

(((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
(count without (alertname, alertstate, severity)
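
As an aid for checking the truth table mentioned in the reviewer guide, the following Python sketch models the intended hysteresis as a plain boolean function. It is a simplification under stated assumptions: the function name `shield_trigger` and the `already_firing` flag are hypothetical, the flag stands in for the `ALERTS{alertname="EtcdShieldDenyAdmission", alertstate="firing"}` gating, and label matching between series is ignored entirely.

```python
def shield_trigger(in_use: float, total: float, quota: float, already_firing: bool) -> bool:
    """Return True when the modelled trigger condition holds.

    trip: either metric is at or above 80% of quota.
    hold: either metric is at or above 70% of quota AND the alert is already firing.
    """
    trip = in_use >= 0.80 * quota or total >= 0.80 * quota
    hold = (in_use >= 0.70 * quota or total >= 0.70 * quota) and already_firing
    return trip or hold


# Walk one hypothetical etcd member through a rise and fall of total DB size,
# with in-use bytes staying at 50% of quota throughout.
quota = 100.0
firing = False
history = []
for total in (65.0, 75.0, 85.0, 75.0, 65.0):
    firing = shield_trigger(in_use=50.0, total=total, quota=quota, already_firing=firing)
    history.append(firing)

print(history)  # [False, False, True, True, False]
```

The walk shows the expected shape: 75% does not trip the shield on the way up, 85% trips it, 75% keeps it held on the way down, and 65% releases it.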

enkeefe00 · 2026-04-16T15:38:06Z

/ok-to-test

konflux-ci-qe-bot · 2026-04-16T18:12:28Z

🤖 Pipeline Failure Analysis

Category: Configuration

The pipeline failed because the build-service-controller-manager deployment did not stabilize within its progress deadline, causing ArgoCD synchronization to time out during cluster bootstrapping.

📋 Technical Details

Immediate Cause

The build-service-controller-manager deployment, a key component of the build-service-in-cluster-local ArgoCD application, failed to reach a healthy state and stabilize within its allocated progress deadline. This degradation was explicitly reported during the ArgoCD synchronization process.

Contributing Factors

This deployment failure led to the overall ArgoCD synchronization timing out after multiple attempts, as the build-service-in-cluster-local application remained unhealthy. Consequently, the cluster bootstrapping process for the appstudio-e2e environment failed, exhausting the maximum number of attempts.

Impact

Because the build-service-controller-manager did not deploy and stabilize, the redhat-appstudio-e2e environment could not be set up. This blocked the entire appstudio-e2e-tests step, so no end-to-end tests ran and the pipeline failed.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: configuration
Root Cause: The build-service-controller-manager deployment, a component of the build-service-in-cluster-local ArgoCD application, repeatedly failed to initialize and stabilize within its progress deadline, causing the overall ArgoCD synchronization to time out. This suggests a configuration error or an environmental constraint preventing the build-service-controller-manager from deploying successfully.

Logs: