fix(e2e): Fixed sporadic end-to-end (e2e) failures caused by slow pod…#170

Merged
cheyang merged 10 commits into sgl-project:main from Syspretor:fix/sporadic-e2e-failure
Mar 5, 2026

Conversation

Collaborator

@Syspretor Syspretor commented Mar 3, 2026

… processing.

Ⅰ. Motivation

Ⅱ. Modifications

Summary

Fix sporadic e2e test failures and enhance diagnostic capabilities.

Problems Fixed

1. E2E case ConfigMap Not Found

Root Cause: ensureDiscoveryConfigMode was called after constructAndUpdateRoleStatuses, which populates rbg.Status.RoleStatuses. With the status already populated, shouldUseLegacyDiscoveryConfig misclassified new RBGs as legacy ones, so the ConfigMap was not created before the workloads were created.

Fix: Reorder the reconcile steps so that discovery config mode initialization and ConfigMap creation run before status construction.
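
The ordering problem can be illustrated with a toy model. This is a sketch under assumptions, not the controller's code: the rbg struct, the DiscoveryMode field, and the heuristic inside shouldUseLegacy (an object that has role statuses but no recorded mode is treated as legacy) are all hypothetical stand-ins chosen to mirror the description above.

```go
package main

// rbg is a toy stand-in for the RoleBasedGroup object. Only the fields
// needed to show the ordering problem are modeled (hypothetical).
type rbg struct {
	RoleStatuses  []string // stand-in for rbg.Status.RoleStatuses
	DiscoveryMode string   // "" until decided, then "legacy" or "configmap"
}

// shouldUseLegacy is an ASSUMED heuristic: an object that already has role
// statuses but no recorded discovery mode is treated as pre-existing.
func shouldUseLegacy(o *rbg) bool {
	return o.DiscoveryMode == "" && len(o.RoleStatuses) > 0
}

// ensureDiscoveryMode decides the mode once and records it on the object.
func ensureDiscoveryMode(o *rbg) {
	if o.DiscoveryMode != "" {
		return
	}
	if shouldUseLegacy(o) {
		o.DiscoveryMode = "legacy"
	} else {
		o.DiscoveryMode = "configmap"
	}
}

// constructRoleStatuses populates the status from the desired roles.
func constructRoleStatuses(o *rbg, roles []string) {
	o.RoleStatuses = append([]string(nil), roles...)
}

// reconcileOld builds the status first, so a brand-new object already
// looks "legacy" by the time the mode is decided.
func reconcileOld(roles []string) *rbg {
	o := &rbg{}
	constructRoleStatuses(o, roles)
	ensureDiscoveryMode(o)
	return o
}

// reconcileFixed decides the mode before the status is touched.
func reconcileFixed(roles []string) *rbg {
	o := &rbg{}
	ensureDiscoveryMode(o)
	constructRoleStatuses(o, roles)
	return o
}
```

In this model, reconcileOld marks a fresh object as "legacy" while reconcileFixed marks it as "configmap", which is the behavior change the reordering aims for.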

2. Deployment Status Overwritten

Root Cause: When pod_controller calls setRestartCondition, it patches the entire status, including stale RoleStatuses held in memory. This overwrites the latest values written by rbg_controller (both controllers use the same FieldManager="rbg").

Fix: Add toRBGApplyConfigurationForConditionsOnly, which patches only the conditions and leaves RoleStatuses untouched.
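
A small in-memory model shows why narrowing the patch helps. It is only a sketch: the status and statusPatch types and the "nil means not part of the patch" rule are assumptions made for illustration, and they do not reproduce Kubernetes apply semantics or the real toRBGApplyConfigurationForConditionsOnly.

```go
package main

// status is a toy version of the stored object status (hypothetical).
type status struct {
	Conditions   []string
	RoleStatuses map[string]int // role name -> ready replicas
}

// statusPatch models a patch: a nil field is not part of the patch.
type statusPatch struct {
	Conditions   []string
	RoleStatuses map[string]int
}

// applyPatch overwrites only the fields that the patch carries.
func applyPatch(stored *status, p statusPatch) {
	if p.Conditions != nil {
		stored.Conditions = p.Conditions
	}
	if p.RoleStatuses != nil {
		stored.RoleStatuses = p.RoleStatuses
	}
}

// fullPatch sends everything from the caller's cached copy, including
// role statuses that may be out of date.
func fullPatch(cached status, cond string) statusPatch {
	return statusPatch{
		Conditions:   append(append([]string(nil), cached.Conditions...), cond),
		RoleStatuses: cached.RoleStatuses,
	}
}

// conditionsOnlyPatch sends the conditions and nothing else.
func conditionsOnlyPatch(cached status, cond string) statusPatch {
	return statusPatch{
		Conditions: append(append([]string(nil), cached.Conditions...), cond),
	}
}
```

With a stored value of 3 ready replicas and a cached copy that still says 1, applying fullPatch resets the stored value to 1, whereas conditionsOnlyPatch adds the condition and leaves the 3 in place.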

3. Image Pull Timeout in CI

Fix: Pre-load the e2e test images into the Kind cluster before running tests.
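
One way to pre-load images from a Go test harness is to shell out to `kind load docker-image`, which copies images from the local Docker daemon onto the cluster nodes. The sketch below assumes the kind binary is on PATH and that the images were already pulled locally; the helper names, the cluster name, and the image names are placeholders, and the PR may well do this step in a CI script instead.

```go
package main

import (
	"fmt"
	"os/exec"
)

// kindLoadArgs builds the arguments for
// `kind load docker-image <image>... --name <cluster>`.
func kindLoadArgs(cluster string, images []string) []string {
	args := []string{"load", "docker-image"}
	args = append(args, images...)
	return append(args, "--name", cluster)
}

// preloadImages loads the given local images into the named Kind cluster.
// It is a no-op when the image list is empty.
func preloadImages(cluster string, images []string) error {
	if len(images) == 0 {
		return nil
	}
	out, err := exec.Command("kind", kindLoadArgs(cluster, images)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("kind load docker-image failed: %w: %s", err, out)
	}
	return nil
}
```

Calling preloadImages once before the suite starts means pods can start from images already present on the nodes instead of pulling them during the test timeout window.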

Diagnostic Enhancements

  • Add comprehensive debug info output for all test cases (RBG status, workload status, pod conditions)
  • Dump debug info only when a test fails
  • Query and display Pod events for not-ready pods to help diagnose root causes
  • Increase test timeout from 90s to 150s
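
The failure-only dump for not-ready pods could look roughly like the following. This is a simplified sketch: the pod struct and dumpOnFailure are hypothetical, and a real e2e suite would read pod conditions and events from the cluster API rather than from an in-memory slice.

```go
package main

import (
	"fmt"
	"strings"
)

// pod is a toy record of what the diagnostics need (hypothetical).
type pod struct {
	Name   string
	Ready  bool
	Events []string
}

// dumpOnFailure returns a debug report for not-ready pods, but only when
// the test has failed. Passing tests produce no output.
func dumpOnFailure(failed bool, pods []pod) string {
	if !failed {
		return ""
	}
	var b strings.Builder
	for _, p := range pods {
		if p.Ready {
			continue // ready pods are not interesting for diagnosis
		}
		fmt.Fprintf(&b, "pod %s not ready\n", p.Name)
		for _, e := range p.Events {
			fmt.Fprintf(&b, "  event: %s\n", e)
		}
	}
	return b.String()
}
```

Gating the dump on the failure flag keeps logs of passing runs short while still surfacing events such as image pull errors when a test does fail.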

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code with make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@Syspretor Syspretor requested a review from cheyang March 3, 2026 02:52
@Syspretor Syspretor force-pushed the fix/sporadic-e2e-failure branch from 4358553 to 6b28f15 Compare March 3, 2026 03:28
@Syspretor Syspretor force-pushed the fix/sporadic-e2e-failure branch from 6b28f15 to 475af17 Compare March 3, 2026 03:43
Collaborator

@cheyang cheyang left a comment

/lgtm
/approve

@Syspretor Syspretor force-pushed the fix/sporadic-e2e-failure branch from 5d4d933 to ad3ca38 Compare March 3, 2026 06:36
@Syspretor Syspretor force-pushed the fix/sporadic-e2e-failure branch from ad3ca38 to c3ddbd0 Compare March 3, 2026 07:06
@Syspretor Syspretor force-pushed the fix/sporadic-e2e-failure branch from 67fe91b to d5e101a Compare March 5, 2026 08:59
Collaborator

@cheyang cheyang left a comment

/lgtm
/approve

@cheyang cheyang merged commit 6a0350b into sgl-project:main Mar 5, 2026
8 checks passed