
e2e(SchedulerSharding): add setup and base e2e tests for Volcano Scheduler Sharding #4992

Open

agrawalcodes wants to merge 3 commits into volcano-sh:master from agrawalcodes:e2e-schedulersharding

Conversation


@agrawalcodes commented Jan 25, 2026

Summary

Add E2E test infrastructure and base tests for Volcano Scheduler with sharding support.

Base setup from parent PR: #4990

This is part of the LFX mentorship project; additional test cases covering the allocate, preempt, reclaim, and backfill actions under different sharding modes will be added during the mentorship.

Changes

  • test/e2e/schedulersharding/: Add Volcano Scheduler Sharding E2E test suite (4 test cases)
  • .github/workflows/e2e_scheduler_sharding.yaml: Add CI workflow

Test Cases

  • Volcano NodeShard should exist with nodes assigned
  • Both NodeShards should be accessible
  • Volcano jobs should be scheduled to cluster nodes
  • Shard state should remain consistent during job execution
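
For a sense of shape, here is a minimal Ginkgo sketch of the first case. The e2eutil.ListNodeShards helper and the VolcanoShardName constant come from the review snippets quoted later in this thread; InitTestContext/CleanupTestContext and the Status.Nodes field are assumptions for illustration, not necessarily the PR's API.

// Illustrative sketch only; see hedges in the paragraph above.
package schedulersharding

import (
    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"

    e2eutil "volcano.sh/volcano/test/e2e/util"
)

var _ = Describe("Volcano Scheduler Sharding", func() {
    It("Volcano NodeShard should exist with nodes assigned", func() {
        ctx := e2eutil.InitTestContext(e2eutil.Options{})
        defer e2eutil.CleanupTestContext(ctx)

        shards, err := e2eutil.ListNodeShards(ctx)
        Expect(err).NotTo(HaveOccurred())

        for i := range shards.Items {
            if shards.Items[i].Name == VolcanoShardName {
                // Field name assumed; the real assertion depends on the NodeShard API.
                Expect(shards.Items[i].Status.Nodes).NotTo(BeEmpty())
                return
            }
        }
        Fail("volcano NodeShard not found")
    })
})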

How to Run

make e2e-test-schedulersharding

Test Plan

  • make e2e-test-schedulersharding passes locally
  • CI workflow runs successfully

Related Issues

Ref: Part of #4883

Part of LFX mentorship project: E2E Test Suite for Volcano Agent Scheduling

Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign k82cn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist

Summary of Changes

Hello @agrawalcodes, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational end-to-end testing framework for the Volcano Scheduler's sharding capabilities. It integrates new test suites into the build system, configures the Volcano Helm chart to enable the sharding controller, and provides initial test cases to validate the core functionality of NodeShard management and basic job scheduling in a sharded cluster. This work is a crucial step in ensuring the reliability and correctness of the Volcano scheduler's distributed scheduling features.

Highlights

  • New E2E Test Suites: Introduced dedicated end-to-end test suites for both the Volcano Scheduler Sharding and the ShardingController, enabling comprehensive testing of sharding functionalities.
  • Build System Integration: Added new make targets (e2e-test-shardingcontroller and e2e-test-schedulersharding) and updated the E2E runner script (hack/run-e2e-kind.sh) to support the execution of these new test types.
  • Volcano Helm Chart Configuration: Enhanced the Volcano Helm chart to allow explicit control over enabled controllers via a new controller_controllers parameter, which is crucial for enabling the sharding controller during E2E tests.
  • Initial Sharding E2E Tests: Implemented base test cases to verify NodeShard creation, accessibility, basic Volcano job scheduling within sharded environments, and the consistency of shard state during job execution.
  • E2E Utility Functions: Added a new utility file (test/e2e/util/shard.go) containing helper functions for creating, managing, and asserting NodeShard resources within E2E tests.
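
As a rough sketch of what one such helper might look like: the ctx.Vcclient.ShardV1alpha1().NodeShards() accessor appears in the review snippets below, while the function signature, the TestContext type, and the API package import path are assumptions.

// Hypothetical shape for a helper in test/e2e/util/shard.go.
import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    shardv1alpha1 "volcano.sh/apis/pkg/apis/shard/v1alpha1" // import path assumed
)

// ListNodeShards lists all NodeShard resources via the Volcano clientset
// exposed on the e2e test context (as in the review snippets below).
func ListNodeShards(ctx *TestContext) (*shardv1alpha1.NodeShardList, error) {
    return ctx.Vcclient.ShardV1alpha1().NodeShards().List(
        context.Background(),
        metav1.ListOptions{},
    )
}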


Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/e2e_scheduler_sharding.yaml
    • .github/workflows/e2e_sharding_controller.yaml

@volcano-sh-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Jan 25, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request does a great job of adding the E2E test infrastructure and initial test cases for Volcano Scheduler Sharding. The addition of helper functions for NodeShard in test/e2e/util/shard.go is a good step towards reusable test code.

My review has identified a few areas for improvement:

  • There is significant code duplication in hack/run-e2e-kind.sh which should be refactored to improve maintainability.
  • The helper function waitForNodeShardsCreated is duplicated across two new test suites and should be moved to a shared utility file.
  • There is extensive use of context.TODO() in the new test files. While this is for tests, it's a good practice to use a more specific context like context.Background() or pass a context down.

Overall, this is a solid contribution that lays a good foundation for more comprehensive sharding tests. Addressing the feedback will enhance the quality and maintainability of the new test code.

Comment on lines +194 to +291
"SHARDINGCONTROLLER"|"SCHEDULERSHARDING")
echo "Install volcano chart with crd version $crd_version and sharding controller enabled"
cat <<EOF | helm install ${CLUSTER_NAME} installer/helm/chart/volcano \
--namespace ${NAMESPACE} \
--kubeconfig ${KUBECONFIG} \
--values - \
--wait
basic:
image_pull_policy: IfNotPresent
image_tag_version: ${TAG}
scheduler_config_file: config/volcano-scheduler-ci.conf
crd_version: ${crd_version}

custom:
scheduler_log_level: 5
controller_log_level: 5
controller_controllers: "*"
admission_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
controller_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
scheduler_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
default_ns:
node-role.kubernetes.io/control-plane: ""
scheduler_feature_gates: ${FEATURE_GATES}
enabled_admissions: "/pods/mutate,/queues/mutate,/podgroups/mutate,/jobs/mutate,/jobs/validate,/jobflows/validate,/pods/validate,/queues/validate,/podgroups/validate,/hypernodes/validate,/cronjobs/validate"
ignored_provisioners: ${IGNORED_PROVISIONERS:-""}
EOF
;;
"AGENTSCHEDULER")
echo "Install volcano chart with crd version $crd_version, sharding controller and agent scheduler enabled"
cat <<EOF | helm install ${CLUSTER_NAME} installer/helm/chart/volcano \
--namespace ${NAMESPACE} \
--kubeconfig ${KUBECONFIG} \
--values - \
--wait
basic:
image_pull_policy: IfNotPresent
image_tag_version: ${TAG}
scheduler_config_file: config/volcano-scheduler-ci.conf
crd_version: ${crd_version}

custom:
scheduler_log_level: 5
controller_log_level: 5
controller_controllers: "*"
agent_scheduler_enable: true
admission_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
controller_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
scheduler_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
agent_scheduler_tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
default_ns:
node-role.kubernetes.io/control-plane: ""
scheduler_feature_gates: ${FEATURE_GATES}
enabled_admissions: "/pods/mutate,/queues/mutate,/podgroups/mutate,/jobs/mutate,/jobs/validate,/jobflows/validate,/pods/validate,/queues/validate,/podgroups/validate,/hypernodes/validate,/cronjobs/validate"
ignored_provisioners: ${IGNORED_PROVISIONERS:-""}
EOF
;;


high

There's a lot of duplicated code for the helm install command across different E2E_TYPE cases. This makes the script difficult to read and maintain. Consider refactoring this by defining a common set of helm values and overriding specific ones within each case, then having a single helm install call at the end. Alternatively, a bash function could be used to encapsulate the helm install logic.
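
A sketch of what that could look like in the script's own bash, with the per-case values passed into one shared function. The function and variable names here are illustrative, not from the PR, and the shared tolerations and remaining common values are elided for brevity.

# Illustrative refactor: one function owns the helm install invocation and
# the common values; each E2E_TYPE case only supplies its extra "custom:"
# entries, pre-indented two spaces to nest correctly in the heredoc YAML.
helm_install_volcano() {
  local extra_custom="$1"
  cat <<EOF | helm install ${CLUSTER_NAME} installer/helm/chart/volcano \
    --namespace ${NAMESPACE} --kubeconfig ${KUBECONFIG} --values - --wait
basic:
  image_pull_policy: IfNotPresent
  image_tag_version: ${TAG}
  scheduler_config_file: config/volcano-scheduler-ci.conf
  crd_version: ${crd_version}

custom:
  scheduler_log_level: 5
  controller_log_level: 5
${extra_custom}
EOF
}

case ${E2E_TYPE} in
  "SHARDINGCONTROLLER"|"SCHEDULERSHARDING")
    helm_install_volcano '  controller_controllers: "*"'
    ;;
  "AGENTSCHEDULER")
    helm_install_volcano '  controller_controllers: "*"
  agent_scheduler_enable: true'
    ;;
esac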

Comment on lines +68 to +97
waitForNodeShardsCreated := func() {
    By("Waiting for ShardingController to create NodeShards")
    err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
        func(c context.Context) (bool, error) {
            shards, err := e2eutil.ListNodeShards(ctx)
            if err != nil {
                GinkgoWriter.Printf("Error listing NodeShards: %v\n", err)
                return false, nil
            }
            // Check if both expected shards exist
            foundVolcano := false
            foundAgent := false
            for _, shard := range shards.Items {
                if shard.Name == VolcanoShardName {
                    foundVolcano = true
                }
                if shard.Name == AgentSchedulerShardName {
                    foundAgent = true
                }
            }
            if foundVolcano && foundAgent {
                GinkgoWriter.Printf("Found NodeShards: volcano=%v, agent-scheduler=%v\n", foundVolcano, foundAgent)
                return true, nil
            }
            GinkgoWriter.Printf("Waiting for NodeShards... volcano=%v, agent-scheduler=%v, total=%d\n",
                foundVolcano, foundAgent, len(shards.Items))
            return false, nil
        })
    Expect(err).NotTo(HaveOccurred(), "NodeShards should be created by ShardingController")
}


medium

This helper function waitForNodeShardsCreated is also defined in test/e2e/shardingcontroller/sharding_controller_test.go. To avoid code duplication and improve reusability, consider moving this function to a shared utility package, such as the new test/e2e/util/shard.go.
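
One possible shape for the shared helper, with the expected shard names parameterized so both suites can use it. The signature and package placement are suggestions, not code from the PR; the polling structure mirrors the snippet above.

// Suggested shared helper for test/e2e/util/shard.go (illustrative):
// poll until every named NodeShard exists, tolerating transient list errors.
func WaitForNodeShards(ctx *TestContext, interval, timeout time.Duration, names ...string) error {
    return wait.PollUntilContextTimeout(context.Background(), interval, timeout, true,
        func(c context.Context) (bool, error) {
            shards, err := ListNodeShards(ctx)
            if err != nil {
                return false, nil // transient error: keep polling
            }
            found := make(map[string]bool, len(shards.Items))
            for _, s := range shards.Items {
                found[s.Name] = true
            }
            for _, name := range names {
                if !found[name] {
                    return false, nil
                }
            }
            return true, nil
        })
}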

// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
    By("Waiting for ShardingController to create NodeShards")
    err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,


medium

Using context.TODO() is generally discouraged as it's a placeholder for an unknown or undecided context. It's better to use context.Background() for top-level contexts. This applies to other places in this file where context.TODO() is used (e.g., line 163).

Suggested change
- err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
+ err := wait.PollUntilContextTimeout(context.Background(), pollInterval, shardCreationTimeout, true,
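
Going one step further than context.Background(), Ginkgo v2 can inject a per-spec context, so the poll is also cancelled if the spec times out. A sketch, with an illustrative It body built from this PR's helper calls:

// Sketch: let Ginkgo v2 supply the context instead of context.TODO().
// SpecContext implements context.Context and is cancelled on spec timeout.
It("creates both NodeShards", func(specCtx SpecContext) {
    err := wait.PollUntilContextTimeout(specCtx, pollInterval, shardCreationTimeout, true,
        func(c context.Context) (bool, error) {
            shards, err := e2eutil.ListNodeShards(ctx) // ctx is the e2e test context, not a context.Context
            if err != nil {
                return false, nil
            }
            return len(shards.Items) > 0, nil
        })
    Expect(err).NotTo(HaveOccurred())
})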

Comment on lines +68 to +98
// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
    By("Waiting for ShardingController to create NodeShards")
    err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
        func(c context.Context) (bool, error) {
            shards, err := e2eutil.ListNodeShards(ctx)
            if err != nil {
                GinkgoWriter.Printf("Error listing NodeShards: %v\n", err)
                return false, nil
            }
            // Check if both expected shards exist
            foundVolcano := false
            foundAgent := false
            for _, shard := range shards.Items {
                if shard.Name == VolcanoShardName {
                    foundVolcano = true
                }
                if shard.Name == AgentSchedulerShard {
                    foundAgent = true
                }
            }
            if foundVolcano && foundAgent {
                GinkgoWriter.Printf("Found NodeShards: volcano=%v, agent-scheduler=%v\n", foundVolcano, foundAgent)
                return true, nil
            }
            GinkgoWriter.Printf("Waiting for NodeShards... volcano=%v, agent-scheduler=%v, total=%d\n",
                foundVolcano, foundAgent, len(shards.Items))
            return false, nil
        })
    Expect(err).NotTo(HaveOccurred(), "NodeShards should be created by ShardingController")
}


medium

This helper function waitForNodeShardsCreated is also defined in test/e2e/schedulersharding/volcano_scheduler_sharding_test.go. To avoid code duplication and improve reusability, consider moving this function to a shared utility package, such as the new test/e2e/util/shard.go.

// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
    By("Waiting for ShardingController to create NodeShards")
    err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,


medium

Using context.TODO() is generally discouraged. Please consider using context.Background() instead for better practice, even in test code. This applies to other places in this file where context.TODO() is used (e.g., line 127).

Suggested change
- err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
+ err := wait.PollUntilContextTimeout(context.Background(), pollInterval, shardCreationTimeout, true,

Comment on lines +49 to +53
nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
    context.TODO(),
    nodeShard,
    metav1.CreateOptions{},
)


medium

This function uses context.TODO(). It's a best practice to avoid context.TODO() in utility functions. Please consider replacing it with context.Background(). Ideally, these utility functions would accept a context.Context as an argument to give callers control over cancellation and deadlines. This comment applies to all other functions in this file that use context.TODO().

Suggested change
- nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
-     context.TODO(),
-     nodeShard,
-     metav1.CreateOptions{},
- )
+ nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
+     context.Background(),
+     nodeShard,
+     metav1.CreateOptions{},
+ )
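
Taking the reviewer's last point one step further, the helper itself could accept the caller's context directly. A minimal sketch, where the function name, parameter order, and types are assumptions:

// Illustrative: CreateNodeShard takes the caller's context.Context rather
// than creating one internally with context.TODO().
func CreateNodeShard(c context.Context, ctx *TestContext, shard *shardv1alpha1.NodeShard) (*shardv1alpha1.NodeShard, error) {
    return ctx.Vcclient.ShardV1alpha1().NodeShards().Create(c, shard, metav1.CreateOptions{})
}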

@agrawalcodes
Author

agrawalcodes commented Jan 25, 2026

@qi-min @JesseStutler, this is one of the e2e base suites for the Sharding Scheduler; it extends #4990.

@JesseStutler
Member

It's a great contribution, thanks. Will take a look later.

@agrawalcodes
Author

agrawalcodes commented Jan 26, 2026

It's a great contribution, thanks. Will take a look later.

Cool @JesseStutler. I'll adjust the base and approach per your review; this should give us a solid foundation for the Agent Scheduling e2e setup. The Gemini review comments seem fairly generic, so I'll resolve them after your review.

The CI ran successfully in GitHub Actions!
