Skip to content

OCPEDGE-1829: Add e2e tests for pacemaker out-of-service taint and annotation on two node fencing clusters#31304

Open
vimauro wants to merge 2 commits into
openshift:mainfrom
vimauro:taintint-test-suite
Open

OCPEDGE-1829: Add e2e tests for pacemaker out-of-service taint and annotation on two node fencing clusters#31304
vimauro wants to merge 2 commits into
openshift:mainfrom
vimauro:taintint-test-suite

Conversation

@vimauro

@vimauro vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • Tests
    • Added an end-to-end test suite that validates the full fencing taint lifecycle in a dual-node scenario, including pacemaker fencing alerts, journal verification, disruptive network behavior, and post-recovery taint/annotation cleanup.
  • Bug Fixes
    • Improved taint/untaint lifecycle validation coverage by adding more robust journal and service completion checks across both nodes.
  • Chores
    • Added helper utilities to detect out-of-service taints/annotations and to query node journal/service logs for expected markers.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 858f9f18-c77b-4548-8551-34c24c223a3f

📥 Commits

Reviewing files that changed from the base of the PR and between 523df25 and b7582d4.

📒 Files selected for processing (2)
  • test/extended/edge_topologies/tnf_taint.go
  • test/extended/edge_topologies/utils/services/taint.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/edge_topologies/utils/services/taint.go
  • test/extended/edge_topologies/tnf_taint.go

Walkthrough

Adds two new files implementing TNF (Two-Node with Fencing) fencing taint support. utils/services/taint.go introduces exported constants for taint/annotation keys, pacemaker alert identifiers, and systemd unit templates, plus node inspection predicates and debug-based command helpers. tnf_taint.go adds a Ginkgo e2e suite with a pacemaker alert config validation test and a disruptive fencing taint lifecycle test using a concurrent taintObserver.

Changes

TNF Fencing Taint Implementation

Layer / File(s) Summary
Taint/annotation constants and node predicates
test/extended/edge_topologies/utils/services/taint.go
Defines all exported package-level constants for taint/annotation keys and values, pacemaker alert IDs, alert script paths, log tags, and systemd unit name templates; adds HasOutOfServiceTaint and HasOutOfServiceAnnotation predicates to detect taint/annotation presence on nodes.
Debug and Kubernetes API command helpers
test/extended/edge_topologies/utils/services/taint.go
Provides JournalGrepViaDebug, GetTimestampViaDebug, PcsAlertConfigViaDebug, SystemdServiceJournalGrep, FetchNodeObject, and RemoveTaintAndAnnotation, each wrapping node debug/chroot commands or Kubernetes API calls for journal inspection, timestamp capture, pacemaker config querying, and best-effort taint/annotation cleanup.
Test suite infrastructure and taint observer
test/extended/edge_topologies/tnf_taint.go
Establishes test package setup with shared polling constants, introduces checkJournalOnNodes for multi-node journal scanning across a time window, and implements the mutex-protected taintObserver struct with background polling to detect and record the first observed out-of-service taint occurrence.
Pacemaker alert configuration validation test
test/extended/edge_topologies/tnf_taint.go
Non-disruptive Ginkgo test case that retrieves a control-plane node, queries pacemaker alert configuration via debug, verifies both taint and untaint alert IDs are registered, and confirms both alert script paths exist on disk with executable permissions.
Disruptive fencing taint lifecycle test
test/extended/edge_topologies/tnf_taint.go
DualReplica-gated disruptive test that establishes suite setup and cleanup, captures a baseline timestamp, starts the taint observer, triggers network disruption between two nodes, validates etcd recovery state, determines which node was fenced via taint checks and observer evidence, verifies journal and systemd service completion logs, awaits learner rejoin/promotion, confirms taint/annotation removal, and validates untaint alert logs across both nodes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Test code uses undefined constants (memberIsLeaderTimeout, memberRejoinedLearnerTimeout, memberPromotedVotingTimeout, networkDisruptionDuration, logFinalClusterStatus) not defined in file, preventi... Add missing constant definitions from tnf_recovery.go and ensure function logFinalClusterStatus is defined or imported in tnf_taint.go.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main changes: adding end-to-end tests for pacemaker out-of-service taint and annotation on two-node fencing clusters, which aligns with the new test files and utility functions added.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 4 Ginkgo test definitions (2 Describe, 2 It) use static, deterministic names with no dynamic values, node names, timestamps, UUIDs, or other changing content.
Microshift Test Compatibility ✅ Passed Both new test cases have [apigroup:config.openshift.io] tags, which causes MicroShift CI to automatically skip them; tests are properly protected from MicroShift execution.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Tests are protected from SNO execution via SkipIfNotTopology(oc, v1.DualReplicaTopologyMode) checks in BeforeEach, which skips on SNO's SingleReplicaTopologyMode.
Topology-Aware Scheduling Compatibility ✅ Passed This PR adds only test files (Ginkgo e2e tests and test utilities) with no deployment manifests, operator code, controllers, or scheduling constraints. The check is not applicable to test-only code...
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations found. Both new files use only framework.Logf/e2e.Logf for logging and contain no fmt.Print*, klog, or stdout writes at process-level (init, main, BeforeSui...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed New tests use dynamic IP family detection via TriggerNetworkDisruption (selects iptables/ip6tables based on ip.To4() check), cluster-internal APIs only, and node address retrieval from Kubernetes A...
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found in added files.
Container-Privileges ✅ Passed PR adds only Go test code files with no Kubernetes manifests or container specifications. No privileged settings (privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalatio...
No-Sensitive-Data-In-Logs ✅ Passed Logging statements in the new test files only expose node names, timestamps, taint statuses, and filtered application logs—no passwords, tokens, API keys, PII, or customer data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot

openshift-ci-robot commented Jun 15, 2026

Copy link
Copy Markdown

@vimauro: This pull request references OCPEDGE-1829 which is a valid jira issue.

Details

In response to this:

Summary by CodeRabbit

  • Tests
  • Added end-to-end test suite for dual-node fencing taint lifecycle validation, including network disruption scenarios, pacemaker alert verification, and node recovery confirmation.
  • Added utility functions for systemd journal inspection, node taint/annotation detection, and cluster state verification in edge topology scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 15, 2026
@vimauro

vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

/verified by @vimauro on a live TNF cluster with the code changes from those PRs included: openshift/machine-config-operator#6092 and openshift/cluster-etcd-operator#1625

@vimauro

vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_taint.go`:
- Around line 161-166: Add explicit validation of the node count after
retrieving nodes with utils.GetNodes. Before accessing nodes.Items with
rand.Intn() and the modulo arithmetic, verify that the slice contains at least 2
nodes using an expectation check (similar to the existing error check pattern).
This ensures a clear failure message rather than a panic if nodes.Items is
empty, and confirms there are enough nodes for the randomIndex and targetNode
selection logic that follows.
- Around line 95-97: The Stop() method in the taintObserver type will panic if
called more than once because it attempts to close t.stopCh without guarding
against multiple invocations. To fix this, protect the close operation using
sync.Once by adding a sync.Once field to the taintObserver struct and wrapping
the close(t.stopCh) call in the Stop() method with a once.Do() guard to ensure
it only executes on the first call.

In `@test/extended/edge_topologies/utils/services/taint.go`:
- Around line 167-169: The node Update call in the cleanup path is susceptible
to resourceVersion conflicts on this high-churn resource, and a single failed
update can leave taints/annotations behind and contaminate subsequent tests.
Wrap the oc.AdminKubeClient().CoreV1().Nodes().Update() call in a retry loop
that retries on conflict errors (check for ResourceVersion conflict scenarios),
fetching the latest node state before each retry attempt. This ensures the
cleanup operation is more robust and reduces the chance of test contamination
from failed cleanup attempts.
- Around line 120-123: Replace context.Background() with timeout-bounded
contexts in the Kubernetes API calls to prevent indefinite blocking during
control-plane degradation. In the FetchNodeObject function at lines 120-123,
replace the context.Background() passed to the Get() call with a context that
includes a timeout (such as context.WithTimeout). Additionally, apply the same
fix at lines 167-168 where an Update() call also uses context.Background(). Both
calls should use a reasonably bounded timeout context (typically a few seconds)
instead of the unbounded background context to ensure the test can complete
within acceptable timeframes.
- Around line 87-90: The JournalGrepViaDebug and SystemdServiceJournalGrep
functions construct bash commands by directly interpolating unvalidated
variables (tag, unitName, sinceTimestamp, pattern) into fmt.Sprintf calls that
are passed to bash -c commands, creating a potential shell injection vector.
Harden this by adding allow-list validation for each interpolated variable
before use in command construction. Define validation rules (e.g., regex
patterns or character allowlists) appropriate for each variable type to ensure
only expected formats are accepted, and reject any input that does not match the
allow-list. Apply this validation consistently at both the initial
JournalGrepViaDebug function (around the journalctl command construction at
lines 87-90) and the SystemdServiceJournalGrep function (around the systemctl
command construction at lines 128-130).
- Around line 99-100: The date command in the bash expression used by
exutil.DebugNodeRetryWithOptionsAndChroot currently returns a UTC timestamp
without timezone information (using '+%Y-%m-%d %H:%M:%S'), which can be
misinterpreted as local time by journalctl --since. Update the date format
string to include timezone qualification by appending '+00:00' or another
timezone marker to indicate UTC, ensuring journalctl correctly interprets the
timestamp regardless of the host's local timezone setting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a1b9cc77-3a79-473d-8757-c387c9f66bca

📥 Commits

Reviewing files that changed from the base of the PR and between f0a8565 and 523df25.

📒 Files selected for processing (2)
  • test/extended/edge_topologies/tnf_taint.go
  • test/extended/edge_topologies/utils/services/taint.go

Comment thread test/extended/edge_topologies/tnf_taint.go
Comment thread test/extended/edge_topologies/tnf_taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go Outdated
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@vimauro: This PR has been marked as verified by @vimauro on a live TNF cluster with the code changes from those PRs included: https://github.com/openshift/machine-config-operator/pull/6092 and https://github.com/openshift/cluster-etcd-operator/pull/1625.

Details

In response to this:

/verified by @vimauro on a live TNF cluster with the code changes from those PRs included: openshift/machine-config-operator#6092 and openshift/cluster-etcd-operator#1625

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 15, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 15, 2026

Copy link
Copy Markdown

@vimauro: This pull request references OCPEDGE-1829 which is a valid jira issue.

Details

In response to this:

Summary by CodeRabbit

  • Tests
  • Added an end-to-end test suite that validates the full fencing taint lifecycle in a dual-node scenario, including pacemaker fencing alerts, journal verification, disruptive network behavior, and post-recovery taint/annotation cleanup.
  • Bug Fixes
  • Improved taint/untaint lifecycle validation coverage by adding more robust journal and service completion checks across both nodes.
  • Chores
  • Added helper utilities to detect out-of-service taints/annotations and to query node journal/service logs for expected markers.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vimauro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@vimauro: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-two-node-fencing b7582d4 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-metal-ovn-two-node-fencing-recovery b7582d4 link false /test e2e-metal-ovn-two-node-fencing-recovery
ci/prow/e2e-aws-ovn-fips b7582d4 link true /test e2e-aws-ovn-fips

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants