
fix(tests): create MinIO buckets from test to avoid flaky operator bu…#882

Merged
nammn merged 4 commits into master from fix-minio-backup-test-bucket-flakiness
Mar 11, 2026


@nammn nammn commented Mar 10, 2026

Summary

The e2e_om_ops_manager_backup_restore_minio test continued to experience intermittent failures even after PR #852 fixed the MinIO operator's CA trust issue. The remaining flakiness was caused by a race condition between the MinIO operator's bucket creation attempts and MinIO's readiness to accept HTTPS connections.

Why the MinIO operator still fails to create buckets (even with CA trust fixed):

Even though PR #852 resolved the x509 certificate trust issue by providing the test CA to the MinIO operator, the operator still occasionally fails with "connection refused" errors when attempting to create buckets. This happens because:

  1. The MinIO operator's reconcile loop triggers immediately after the MinIO tenant pods become "Ready"
  2. Pod readiness does not guarantee that MinIO is fully ready to accept HTTPS connections
  3. The operator attempts to create buckets via HTTPS API calls before MinIO's TLS listener is fully initialized
  4. This results in intermittent "connection refused" errors, causing bucket creation to fail silently

The test then times out waiting for buckets that were never created.
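
To make the gap between pod readiness and TLS readiness concrete, here is a minimal sketch of a probe that reports ready only once a full TLS handshake succeeds. It is not part of this PR (the fix retries boto3 calls instead); the names `tls_listener_ready` and `wait_for_tls`, their parameters, and the default timings are assumptions for illustration only.

```python
import socket
import ssl
import time


def tls_listener_ready(host, port, ca_file=None, timeout=3.0):
    """Return True only if a TCP connect AND a TLS handshake both succeed."""
    # ca_file is an optional path to a CA bundle (for example the test CA).
    context = ssl.create_default_context(cafile=ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers "connection refused", timeouts, and ssl.SSLError
        # (ssl.SSLError is a subclass of OSError).
        return False


def wait_for_tls(host, port, ca_file=None, timeout=120.0, interval=5.0,
                 probe=tls_listener_ready, sleep=time.sleep):
    """Poll the probe until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(host, port, ca_file):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A pod can pass its readiness check while this probe still returns False, which is the window in which the operator's bucket requests are refused.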

The fix:

Instead of relying on the MinIO operator's timing-dependent bucket creation, we now create the buckets directly from the test code using boto3:

  1. After the MinIO tenant pods are ready, call _create_minio_buckets() which uses boto3's S3 client
  2. Configure boto3 to use the test CA certificate for TLS verification
  3. Implement retry logic (120s timeout, 5s intervals) to handle MinIO startup timing
  4. Check if buckets already exist (created by operator) or create them via boto3
  5. Log which method created each bucket for observability

This approach completely bypasses the operator's unreliable bucket creation and makes the test deterministic.
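
The steps above can be sketched roughly as follows. This is an illustrative approximation, not the code merged in this PR: the helper names `make_s3_client` and `ensure_buckets`, their parameters, and the log wording are assumptions. Only the boto3 calls (`boto3.client`, `head_bucket`, `create_bucket`) and the 120s/5s timings come from the description. The sketch catches `Exception` broadly so that connection errors are retried as well; the actual test code catches botocore's `ClientError`.

```python
import logging
import time

logger = logging.getLogger(__name__)


def make_s3_client(endpoint_url, access_key, secret_key, ca_bundle_path):
    """Build a boto3 S3 client that verifies TLS against the test CA (hypothetical helper)."""
    import boto3  # imported lazily; only needed when talking to a real endpoint

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        verify=ca_bundle_path,  # path to the CA bundle used for TLS verification
    )


def ensure_buckets(s3, buckets, timeout=120, interval=5, sleep=time.sleep):
    """Return {bucket: source} once every bucket exists, or raise TimeoutError."""
    source = {}
    deadline = time.monotonic() + timeout
    while True:
        for bucket in buckets:
            if bucket in source:
                continue
            try:
                # Succeeds if the bucket is already there (e.g. created by the operator).
                s3.head_bucket(Bucket=bucket)
                source[bucket] = "already existed"
                continue
            except Exception as exc:
                logger.debug("head_bucket(%s) failed: %s", bucket, exc)
            try:
                s3.create_bucket(Bucket=bucket)
                source[bucket] = "boto3"
            except Exception as exc:
                # MinIO may not accept connections yet; retry on the next pass.
                logger.info("bucket %s not ready yet: %s", bucket, exc)
        if all(bucket in source for bucket in buckets):
            break
        if time.monotonic() >= deadline:
            missing = [b for b in buckets if b not in source]
            raise TimeoutError(f"buckets not ready after {timeout}s: {missing}")
        sleep(interval)
    for bucket, how in source.items():
        logger.info("bucket %s ready (%s)", bucket, how)
    return source
```

Because `head_bucket` runs first on every pass, a bucket that the operator creates between two attempts is picked up on the next iteration without any special handling of the `create_bucket` error.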

Proof of Work

5 independent test patches were run to verify the fix eliminates flakiness:

| Patch # | Patch ID | Status | Build URL |
| --- | --- | --- | --- |
| 1 | 69b02324c32e6b00075f770b | ✅ Success (4/4) | https://evergreen.mongodb.com/version/69b02324c32e6b00075f770b |
| 2 | 69b0232e8a5d860007735fc9 | ✅ Success (4/4) | https://evergreen.mongodb.com/version/69b0232e8a5d860007735fc9 |
| 3 | 69b02339c03a4d00075eac0c | ✅ Success (4/4) | https://evergreen.mongodb.com/version/69b02339c03a4d00075eac0c |
| 4 | 69b02343f4bdb00007c3f6cd | ✅ Success (4/4) | https://evergreen.mongodb.com/version/69b02343f4bdb00007c3f6cd |
| 5 | 69b0234cc1f1690007bb9f21 | ✅ Success (4/4) | https://evergreen.mongodb.com/version/69b0234cc1f1690007bb9f21 |

Result: 20/20 tests passed (100% success rate). In every run, the buckets were created by boto3.

Checklist

  • Have you linked a Jira ticket and/or is the ticket in the title?
  • Have you checked whether your Jira ticket requires DOCSP changes?
  • Have you added a changelog file?

…cket creation

The MinIO operator creates buckets over HTTPS. With custom TLS (tls-ssl-minio
signed by the test CA), the operator often fails with x509 or connection
refused, making test_install_minio flaky. Create buckets from the test via
boto3 using the test CA so the test no longer depends on operator timing.

Log whether each bucket was created via boto3 or already existed (MinIO operator).

Made-with: Cursor
@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.1 Release Notes

Other Changes

  • Container images: Merged the init-database and init-appdb init container images into a single init-database image. The init-appdb image will no longer be published and does not affect existing deployments.
  • Helm Chart: Removed operator.baseName Helm value. This value was never intended to be consumed by operator users and was never documented. The value controls the prefix for workload RBAC resource names (mongodb-kubernetes default), but changing it could break the operator and workloads because the operator is not aware of custom prefixes. With this change, the Helm chart will no longer allow customisation and the relevant resources will be deployed with predefined names (ServiceAccount with names mongodb-kubernetes-appdb, mongodb-kubernetes-database-pods, mongodb-kubernetes-ops-manager, Role with name mongodb-kubernetes-appdb and RoleBinding with name mongodb-kubernetes-appdb).

@nammn nammn added the skip-changelog label (Use this label in Pull Request to not require new changelog entry file) Mar 10, 2026
@nammn nammn marked this pull request as ready for review March 10, 2026 15:58
@nammn nammn requested review from a team as code owners March 10, 2026 15:58
@nammn nammn requested review from Julien-Ben and lucian-tosa March 10, 2026 15:58
    try:
        s3.create_bucket(Bucket=bucket)
        ready.add(bucket)
    except ClientError as ce:

@MaciejKaras MaciejKaras Mar 11, 2026

nit: this check is unnecessary; we will find out about an existing bucket in the s3.head_bucket(Bucket=bucket) call

@nammn nammn merged commit b770203 into master Mar 11, 2026
31 checks passed
@nammn nammn deleted the fix-minio-backup-test-bucket-flakiness branch March 11, 2026 12:54
