fix(KFLUXINFRA-3529): etcd-shield total_size check + severity fix (staging)#11319

Open
peet-rh wants to merge 1 commit into redhat-appstudio:main from peet-rh:fix/etcd-shield-staging

Conversation

Contributor

peet-rh commented Apr 16, 2026

What

Add an etcd_mvcc_db_total_size_in_bytes check to the etcd-shield recording rule, and change the alert severity from warning to critical so that it matches the hysteresis ALERTS filter. Staging only.
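
As a rough illustration of the intended behavior, the trigger logic can be modeled in plain Python. This is only a sketch based on the expression quoted later in this thread: the function name `shield_trigger`, the `alert_firing` flag (standing in for the ALERTS gating), and the quota units are assumptions made for the example, not part of the rule file.

```python
def shield_trigger(in_use: float, total_size: float, quota: float,
                   alert_firing: bool) -> bool:
    """Sketch of the etcd_shield_trigger condition (not the rule itself)."""
    # Trip: either size metric has reached 80% of the backend quota.
    trip = in_use >= 0.80 * quota or total_size >= 0.80 * quota
    # Hold: the alert is already firing and either metric is still at or above 70%.
    hold = alert_firing and (in_use >= 0.70 * quota or total_size >= 0.70 * quota)
    return trip or hold


QUOTA = 100.0  # hypothetical quota in arbitrary units

# Physical size at 85% trips the shield even though in-use bytes are low.
tripped = shield_trigger(in_use=50.0, total_size=85.0, quota=QUOTA, alert_firing=False)
# At 75% the shield stays on only while the alert is already firing.
held = shield_trigger(in_use=50.0, total_size=75.0, quota=QUOTA, alert_firing=True)
not_held = shield_trigger(in_use=50.0, total_size=75.0, quota=QUOTA, alert_firing=False)
# Once both metrics are below 70% the shield releases.
released = shield_trigger(in_use=50.0, total_size=65.0, quota=QUOTA, alert_firing=True)
print(tripped, held, not_held, released)
```

The model ignores evaluation intervals and any delay before the alert starts firing, so it only describes the steady-state truth table.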

Why

ITN-2026-00103: etcd-shield missed stone-prd-rh01 because only in_use bytes were checked — physical DB size hit 80% undetected.

Validation

Tested against live Prometheus on stone-prd-rh01. All 3 conditions verified.

Risk Assessment

Risk Level: Low — Staging only (2 clusters), single revert rollback.

@openshift-ci openshift-ci bot requested review from enkeefe00 and glevi-rh April 16, 2026 01:48

openshift-ci bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peet-rh
Once this PR has been reviewed and has the lgtm label, please assign gbenhaim for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Apr 16, 2026

Hi @peet-rh. Thanks for your PR.

I'm waiting for a redhat-appstudio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@qodo-code-review

Review Summary by Qodo

Fix etcd-shield to detect total DB size capacity issues

🐞 Bug fix

Walkthroughs

Description
• Add etcd_mvcc_db_total_size_in_bytes check to etcd-shield recording rule
• Fix alert severity from warning to critical for consistency
• Extend hysteresis condition to check both in-use and total DB size metrics
• Staging deployment only (2 clusters, low risk)
Diagram
flowchart LR
  A["etcd-shield trigger rule"] -->|add total_size check| B["Detect 80% total DB size"]
  A -->|extend hysteresis| C["Check both in-use and total metrics"]
  D["EtcdShieldDenyAdmission alert"] -->|severity upgrade| E["warning → critical"]
  B --> F["Prevent missed capacity alerts"]
  C --> F

File Changes

1. components/etcd-shield/base/etcd_shield_alerts.yaml 🐞 Bug fix +4/-2

Add total_size metric and fix alert severity

• Changed EtcdShieldDenyAdmission alert severity from warning to critical
• Added etcd_mvcc_db_total_size_in_bytes >= 0.80 * quota condition to recording rule
• Extended hysteresis condition to check both etcd_mvcc_db_total_size_in_use_in_bytes and
 etcd_mvcc_db_total_size_in_bytes at 70% threshold
• Ensures physical DB size capacity issues are detected alongside in-use bytes



qodo-code-review bot commented Apr 16, 2026

Code Review by Qodo

🐞 Bugs (1)   📘 Rule violations (0)   📎 Requirement gaps (0)
🐞 ≡ Correctness (1)


Action required

1. Hysteresis OR is broken 🐞
Description
In etcd_shield_trigger’s 70% hysteresis clause, (in_use >= bool …) or (total_size >= bool …) is
evaluated before filtering to true (== 1), but >= bool keeps 0-valued samples on both sides, so
PromQL’s set-operator or will keep the left-hand (in_use) sample when labelsets match and
effectively ignore the total_size branch. This can cause EtcdShieldDenyAdmission to resolve even
when etcd_mvcc_db_total_size_in_bytes stays >=70% after being triggered by the new 80% total-size
check.
Code

components/etcd-shield/base/etcd_shield_alerts.yaml[R23-24]

+            ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) or
+                (etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70))) == 1) and
Evidence
The 80% checks correctly filter each branch to only-true series before `or` (`(... >= bool ...) == 1` on both lines 21-22), but the new 70% clause applies `or` between two `>= bool` expressions and only then compares the result to 1 (lines 23-24). Since `>= bool` produces a present time series with value 0 or 1, `or` operates as a label-set union (not a value-level boolean OR) and will select the left-hand sample for matching labelsets, preventing the total-size condition from contributing to the 70% hysteresis path.

components/etcd-shield/base/etcd_shield_alerts.yaml[21-25]
components/etcd-shield/base/etcd_shield_alerts.yaml[23-24]
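
The difference between filtering before and after `or` can be reproduced with a small Python model of instant vectors. This is a simplified illustration, not Prometheus code: vectors are plain dicts keyed by labelset, the metric name is left out of the key (comparisons with `bool` drop it), and the label `instance="etcd-0"` and the numeric values are made up for the example.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def ge_bool(lhs: Vector, rhs: Vector) -> Vector:
    """Model of `lhs >= bool rhs`: every matched series is kept, valued 0 or 1."""
    return {k: float(v >= rhs[k]) for k, v in lhs.items() if k in rhs}


def eq_filter(vec: Vector, scalar: float) -> Vector:
    """Model of `vec == scalar` without `bool`: non-matching series are dropped."""
    return {k: v for k, v in vec.items() if v == scalar}


def vector_or(lhs: Vector, rhs: Vector) -> Vector:
    """Model of `lhs or rhs`: all of lhs, plus rhs series whose labelset is not in lhs."""
    out = dict(lhs)
    for k, v in rhs.items():
        out.setdefault(k, v)
    return out


node: Labels = (("instance", "etcd-0"),)  # hypothetical labelset
threshold: Vector = {node: 70.0}          # stands in for quota * 0.70
in_use: Vector = {node: 40.0}             # below the threshold
total_size: Vector = {node: 75.0}         # above the threshold

# Filter after `or`: the 0-valued in_use sample shadows the total_size sample.
filter_after = eq_filter(
    vector_or(ge_bool(in_use, threshold), ge_bool(total_size, threshold)), 1.0)

# Filter before `or`: the false in_use sample is already gone, so total_size contributes.
filter_before = vector_or(
    eq_filter(ge_bool(in_use, threshold), 1.0),
    eq_filter(ge_bool(total_size, threshold), 1.0))

print(filter_after)
print(filter_before)
```

In this model `filter_after` is empty while `filter_before` still contains the series, which is the behavior the review comment describes.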

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The 70% hysteresis clause uses `or` on the output of `>= bool` before filtering to true (`== 1`), which prevents `etcd_mvcc_db_total_size_in_bytes` from contributing when labelsets match.

### Issue Context
At 80% you already use the safe pattern: `((metric >= bool threshold) == 1) or ((other_metric >= bool threshold) == 1)`.

### Fix
Update the 70% clause to filter each side *before* `or` (or remove `bool` so false series are absent), e.g.:

Option A (minimal change, consistent style):
```promql
((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
 ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
(count without (alertname, alertstate, severity) (ALERTS{...}) == bool 1))
```

Option B (simpler PromQL):
```promql
(((etcd_mvcc_db_total_size_in_use_in_bytes >= (etcd_server_quota_backend_bytes * 0.70)) or
 (etcd_mvcc_db_total_size_in_bytes >= (etcd_server_quota_backend_bytes * 0.70))) and
(count without (...) (ALERTS{...}) == bool 1))
```

### Fix Focus Areas
- components/etcd-shield/base/etcd_shield_alerts.yaml[21-30]


Contributor

github-actions bot commented Apr 16, 2026

Kustomize Render Diff

Comparing bd9a404d6132b6de9f

| Component | Environment | Changes |
| --- | --- | --- |
| components/etcd-shield/staging/stone-stage-p01 | staging | +4 -2 |
| components/etcd-shield/staging/stone-stg-rh01 | staging | +4 -2 |

Total: 2 components, +8 -4 lines

📋 Full diff available in the workflow summary and as a downloadable artifact.


codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.62%. Comparing base (bd9a404) to head (6ac24b4).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #11319   +/-   ##
=======================================
  Coverage   51.62%   51.62%           
=======================================
  Files          18       18           
  Lines        1263     1263           
=======================================
  Hits          652      652           
  Misses        539      539           
  Partials       72       72           
Flag Coverage Δ
go 51.62% <ø> (ø)


Comment on lines +23 to +24
((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) or
(etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70))) == 1) and

Action required

1. Hysteresis or is broken 🐞 Bug ≡ Correctness

In etcd_shield_trigger’s 70% hysteresis clause, (in_use >= bool …) or (total_size >= bool …) is
evaluated before filtering to true (== 1), but >= bool keeps 0-valued samples on both sides, so
PromQL’s set-operator or will keep the left-hand (in_use) sample when labelsets match and
effectively ignore the total_size branch. This can cause EtcdShieldDenyAdmission to resolve even
when etcd_mvcc_db_total_size_in_bytes stays >=70% after being triggered by the new 80% total-size
check.

Contributor Author


The 70% clause already filters each side to == 1 before or:

((in_use >= bool (quota * 0.70)) == 1) or ((total_size >= bool (quota * 0.70)) == 1)

This is the same pattern used on the 80% lines. False series (value 0) are dropped by == 1 before or sees them, so both branches contribute correctly.

…hield (staging)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Peter Kirkpatrick <[email protected]>
@peet-rh peet-rh force-pushed the fix/etcd-shield-staging branch from 9ccb33d to 6ac24b4 Compare April 16, 2026 02:15
Contributor Author

peet-rh commented Apr 16, 2026

/review

@qodo-code-review

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

PromQL Logic

Validate the updated etcd_shield_trigger PromQL boolean logic is correct and does not change the intended hysteresis behavior. In particular, confirm the added etcd_mvcc_db_total_size_in_bytes checks at both 80% and 70% thresholds are combined as intended with the existing ALERTS{alertname="EtcdShieldDenyAdmission", alertstate="firing"} gating, and that operator precedence/parentheses yield the expected truth table.

expr: |
  (((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
      ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
      ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
          ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
          (count without (alertname, alertstate, severity)
                (ALERTS{
                  alertname="EtcdShieldDenyAdmission",
                  alertstate="firing",

YAML Structure

Confirm labels contains all required keys and the indentation renders valid YAML for PrometheusRule/alerting rule parsing. A mis-indented severity (or an empty/invalid labels mapping) can cause the rule to be rejected or the label not to be applied.

labels:
  severity: critical
annotations:

Metric Availability

Ensure etcd_mvcc_db_total_size_in_bytes is reliably present for all targeted clusters/etcd versions and has compatible labelsets with etcd_server_quota_backend_bytes, otherwise the new comparisons may produce unexpected empty vectors or mismatched series.

(((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
    ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.80)) == 1) or
    ((((etcd_mvcc_db_total_size_in_use_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1) or
        ((etcd_mvcc_db_total_size_in_bytes >= bool (etcd_server_quota_backend_bytes * 0.70)) == 1)) and
        (count without (alertname, alertstate, severity)
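
To illustrate the labelset concern, here is a small Python model of a filtering comparison with default one-to-one matching. It is a sketch under assumptions: vectors are dicts keyed by labelset, the helper `at_least_fraction` and the labels `instance` and `member` are hypothetical, and the real label layout on the target clusters would need to be checked against live data.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def at_least_fraction(metric: Vector, quota: Vector, fraction: float) -> Vector:
    """Model of `metric >= quota * fraction` with default one-to-one matching."""
    # Only series whose labelset also exists on the quota side are compared.
    return {labels: value for labels, value in metric.items()
            if labels in quota and value >= quota[labels] * fraction}


quota: Vector = {(("instance", "etcd-0"),): 100.0}
same_labels: Vector = {(("instance", "etcd-0"),): 90.0}
# Hypothetical extra label that the quota series does not carry.
extra_label: Vector = {(("instance", "etcd-0"), ("member", "a")): 90.0}
missing: Vector = {}  # the metric is not exported at all

matched = at_least_fraction(same_labels, quota, 0.80)
unmatched = at_least_fraction(extra_label, quota, 0.80)
absent = at_least_fraction(missing, quota, 0.80)
print(matched, unmatched, absent)
```

In the model, both the mismatched and the missing case produce an empty result rather than an error, so the condition would stay silent even at 90% of quota.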

@enkeefe00
Contributor

/ok-to-test

@konflux-ci-qe-bot

🤖 Pipeline Failure Analysis

Category: Configuration

The pipeline failed because the build-service-controller-manager deployment did not stabilize within its progress deadline, causing ArgoCD synchronization to time out during cluster bootstrapping.

📋 Technical Details

Immediate Cause

The build-service-controller-manager deployment, a key component of the build-service-in-cluster-local ArgoCD application, failed to reach a healthy state and stabilize within its allocated progress deadline. This degradation was explicitly reported during the ArgoCD synchronization process.

Contributing Factors

This deployment failure led to the overall ArgoCD synchronization timing out after multiple attempts, as the build-service-in-cluster-local application remained unhealthy. Consequently, the cluster bootstrapping process for the appstudio-e2e environment failed, exhausting the maximum number of attempts.

Impact

The failure to properly deploy and stabilize the build-service-controller-manager prevented the essential redhat-appstudio-e2e environment from being set up. This blockage halted the entire appstudio-e2e-tests step, ensuring no end-to-end tests could execute and resulting in a complete pipeline failure.

🔍 Evidence

appstudio-e2e-tests/redhat-appstudio-e2e

Category: configuration
Root Cause: The build-service-controller-manager deployment, a component of the build-service-in-cluster-local ArgoCD application, repeatedly failed to initialize and stabilize within its progress deadline, causing the overall ArgoCD synchronization to time out. This suggests a configuration error or an environmental constraint preventing the build-service-controller-manager from deploying successfully.

Logs:

artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
[2026-04-16 16:43:17] [ERROR] TIMEOUT: Applications failed to sync within 45 minutes
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
[2026-04-16 16:43:19] [ERROR]   Health Status: Degraded
[2026-04-16 16:43:20] [ERROR]   Unhealthy Resources:
    - Deployment/build-service-controller-manager: Degraded - Deployment "build-service-controller-manager" exceeded its progress deadline
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
[2026-04-16 16:43:21] [ERROR] Application: konflux-kite-in-cluster-local
[2026-04-16 16:43:23] [ERROR]   Sync Status: OutOfSync
[2026-04-16 16:43:23] [ERROR]   Health Status: Healthy
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
[2026-04-16 17:42:13] [ERROR] TIMEOUT: Applications failed to sync within 45 minutes
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
[2026-04-16 17:42:17] [ERROR]   Unhealthy Resources:
    - Deployment/build-service-controller-manager: Degraded - Deployment "build-service-controller-manager" exceeded its progress deadline
artifacts/appstudio-e2e-tests/redhat-appstudio-e2e/build-log.txt
Error: error when bootstrapping cluster: reached maximum number of attempts (2). error: exit status 1

Analysis powered by prow-failure-analysis | Build: 2044802545100001280

Contributor Author

peet-rh commented Apr 16, 2026

/retest
