e2e(SchedulerSharding): add setup and base e2e tests for Volcano Scheduler Sharding #4992
agrawalcodes wants to merge 3 commits into volcano-sh:master
Conversation
Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
…duler Sharding Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request establishes the foundational end-to-end testing framework for the Volcano Scheduler's sharding capabilities. It integrates new test suites into the build system, configures the Volcano Helm chart to enable the sharding controller, and provides initial test cases to validate the core functionality of NodeShard management and basic job scheduling in a sharded cluster. This work is a crucial step in ensuring the reliability and correctness of the Volcano scheduler's distributed scheduling features.
Code Review
This pull request does a great job of adding the E2E test infrastructure and initial test cases for Volcano Scheduler Sharding. The addition of helper functions for NodeShard in test/e2e/util/shard.go is a good step towards reusable test code.
My review has identified a few areas for improvement:
- There is significant code duplication in `hack/run-e2e-kind.sh` which should be refactored to improve maintainability.
- The helper function `waitForNodeShardsCreated` is duplicated across two new test suites and should be moved to a shared utility file.
- There is extensive use of `context.TODO()` in the new test files. While this is for tests, it's a good practice to use a more specific context like `context.Background()` or pass a context down.
Overall, this is a solid contribution that lays a good foundation for more comprehensive sharding tests. Addressing the feedback will enhance the quality and maintainability of the new test code.
| "SHARDINGCONTROLLER"|"SCHEDULERSHARDING") | ||
| echo "Install volcano chart with crd version $crd_version and sharding controller enabled" | ||
| cat <<EOF | helm install ${CLUSTER_NAME} installer/helm/chart/volcano \ | ||
| --namespace ${NAMESPACE} \ | ||
| --kubeconfig ${KUBECONFIG} \ | ||
| --values - \ | ||
| --wait | ||
| basic: | ||
| image_pull_policy: IfNotPresent | ||
| image_tag_version: ${TAG} | ||
| scheduler_config_file: config/volcano-scheduler-ci.conf | ||
| crd_version: ${crd_version} | ||
|
|
||
| custom: | ||
| scheduler_log_level: 5 | ||
| controller_log_level: 5 | ||
| controller_controllers: "*" | ||
| admission_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| controller_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| scheduler_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| default_ns: | ||
| node-role.kubernetes.io/control-plane: "" | ||
| scheduler_feature_gates: ${FEATURE_GATES} | ||
| enabled_admissions: "/pods/mutate,/queues/mutate,/podgroups/mutate,/jobs/mutate,/jobs/validate,/jobflows/validate,/pods/validate,/queues/validate,/podgroups/validate,/hypernodes/validate,/cronjobs/validate" | ||
| ignored_provisioners: ${IGNORED_PROVISIONERS:-""} | ||
| EOF | ||
| ;; | ||
| "AGENTSCHEDULER") | ||
| echo "Install volcano chart with crd version $crd_version, sharding controller and agent scheduler enabled" | ||
| cat <<EOF | helm install ${CLUSTER_NAME} installer/helm/chart/volcano \ | ||
| --namespace ${NAMESPACE} \ | ||
| --kubeconfig ${KUBECONFIG} \ | ||
| --values - \ | ||
| --wait | ||
| basic: | ||
| image_pull_policy: IfNotPresent | ||
| image_tag_version: ${TAG} | ||
| scheduler_config_file: config/volcano-scheduler-ci.conf | ||
| crd_version: ${crd_version} | ||
|
|
||
| custom: | ||
| scheduler_log_level: 5 | ||
| controller_log_level: 5 | ||
| controller_controllers: "*" | ||
| agent_scheduler_enable: true | ||
| admission_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| controller_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| scheduler_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| agent_scheduler_tolerations: | ||
| - key: "node-role.kubernetes.io/control-plane" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| - key: "node-role.kubernetes.io/master" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" | ||
| default_ns: | ||
| node-role.kubernetes.io/control-plane: "" | ||
| scheduler_feature_gates: ${FEATURE_GATES} | ||
| enabled_admissions: "/pods/mutate,/queues/mutate,/podgroups/mutate,/jobs/mutate,/jobs/validate,/jobflows/validate,/pods/validate,/queues/validate,/podgroups/validate,/hypernodes/validate,/cronjobs/validate" | ||
| ignored_provisioners: ${IGNORED_PROVISIONERS:-""} | ||
| EOF | ||
| ;; |
There's a lot of duplicated code for the helm install command across different E2E_TYPE cases. This makes the script difficult to read and maintain. Consider refactoring this by defining a common set of helm values and overriding specific ones within each case, then having a single helm install call at the end. Alternatively, a bash function could be used to encapsulate the helm install logic.
```go
waitForNodeShardsCreated := func() {
	By("Waiting for ShardingController to create NodeShards")
	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
		func(c context.Context) (bool, error) {
			shards, err := e2eutil.ListNodeShards(ctx)
			if err != nil {
				GinkgoWriter.Printf("Error listing NodeShards: %v\n", err)
				return false, nil
			}
			// Check if both expected shards exist
			foundVolcano := false
			foundAgent := false
			for _, shard := range shards.Items {
				if shard.Name == VolcanoShardName {
					foundVolcano = true
				}
				if shard.Name == AgentSchedulerShardName {
					foundAgent = true
				}
			}
			if foundVolcano && foundAgent {
				GinkgoWriter.Printf("Found NodeShards: volcano=%v, agent-scheduler=%v\n", foundVolcano, foundAgent)
				return true, nil
			}
			GinkgoWriter.Printf("Waiting for NodeShards... volcano=%v, agent-scheduler=%v, total=%d\n",
				foundVolcano, foundAgent, len(shards.Items))
			return false, nil
		})
	Expect(err).NotTo(HaveOccurred(), "NodeShards should be created by ShardingController")
}
```
```go
// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
	By("Waiting for ShardingController to create NodeShards")
	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
```
Using context.TODO() is generally discouraged as it's a placeholder for an unknown or undecided context. It's better to use context.Background() for top-level contexts. This applies to other places in this file where context.TODO() is used (e.g., line 163).
```diff
-	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
+	err := wait.PollUntilContextTimeout(context.Background(), pollInterval, shardCreationTimeout, true,
```
```go
// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
	By("Waiting for ShardingController to create NodeShards")
	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
		func(c context.Context) (bool, error) {
			shards, err := e2eutil.ListNodeShards(ctx)
			if err != nil {
				GinkgoWriter.Printf("Error listing NodeShards: %v\n", err)
				return false, nil
			}
			// Check if both expected shards exist
			foundVolcano := false
			foundAgent := false
			for _, shard := range shards.Items {
				if shard.Name == VolcanoShardName {
					foundVolcano = true
				}
				if shard.Name == AgentSchedulerShard {
					foundAgent = true
				}
			}
			if foundVolcano && foundAgent {
				GinkgoWriter.Printf("Found NodeShards: volcano=%v, agent-scheduler=%v\n", foundVolcano, foundAgent)
				return true, nil
			}
			GinkgoWriter.Printf("Waiting for NodeShards... volcano=%v, agent-scheduler=%v, total=%d\n",
				foundVolcano, foundAgent, len(shards.Items))
			return false, nil
		})
	Expect(err).NotTo(HaveOccurred(), "NodeShards should be created by ShardingController")
}
```
```go
// waitForNodeShardsCreated waits for NodeShards to be created by the ShardingController
waitForNodeShardsCreated := func() {
	By("Waiting for ShardingController to create NodeShards")
	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
```
Using context.TODO() is generally discouraged. Please consider using context.Background() instead for better practice, even in test code. This applies to other places in this file where context.TODO() is used (e.g., line 127).
```diff
-	err := wait.PollUntilContextTimeout(context.TODO(), pollInterval, shardCreationTimeout, true,
+	err := wait.PollUntilContextTimeout(context.Background(), pollInterval, shardCreationTimeout, true,
```
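Taking the two review points above together (the duplicated `waitForNodeShardsCreated` helper and the hard-coded `context.TODO()`), here is a minimal sketch of what a shared helper in `test/e2e/util/shard.go` could look like. The function name `WaitForNodeShards`, the `TestContext` parameter, and the variadic shard-name list are assumptions for illustration; only `ListNodeShards`, the `shards.Items`/`shard.Name` fields, and `wait.PollUntilContextTimeout` come from the snippets in this PR:

```go
package util

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// WaitForNodeShards (hypothetical shared helper) polls until every named
// NodeShard exists or the timeout expires. The caller supplies ctx, so
// cancellation and deadlines stay under the caller's control instead of
// being hidden behind context.TODO().
func WaitForNodeShards(ctx context.Context, testCtx *TestContext, interval, timeout time.Duration, names ...string) error {
	return wait.PollUntilContextTimeout(ctx, interval, timeout, true,
		func(_ context.Context) (bool, error) {
			shards, err := ListNodeShards(testCtx)
			if err != nil {
				// Treat list errors as transient and keep polling until the timeout.
				return false, nil
			}
			found := make(map[string]bool, len(shards.Items))
			for _, shard := range shards.Items {
				found[shard.Name] = true
			}
			for _, name := range names {
				if !found[name] {
					return false, nil
				}
			}
			return true, nil
		})
}
```

Both suites could then call something like `e2eutil.WaitForNodeShards(context.Background(), ctx, pollInterval, shardCreationTimeout, VolcanoShardName, AgentSchedulerShardName)` instead of each defining its own closure.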
```go
nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
	context.TODO(),
	nodeShard,
	metav1.CreateOptions{},
)
```
This function uses context.TODO(). It's a best practice to avoid context.TODO() in utility functions. Please consider replacing it with context.Background(). Ideally, these utility functions would accept a context.Context as an argument to give callers control over cancellation and deadlines. This comment applies to all other functions in this file that use context.TODO().
```diff
-nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
-	context.TODO(),
-	nodeShard,
-	metav1.CreateOptions{},
-)
+nodeShard, err := ctx.Vcclient.ShardV1alpha1().NodeShards().Create(
+	context.Background(),
+	nodeShard,
+	metav1.CreateOptions{},
+)
```
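Along the same lines, here is a sketch of a context-aware variant of the NodeShard create utility. The function name `CreateNodeShard`, the `TestContext` type, and the shard API import path are assumptions; the `Vcclient.ShardV1alpha1().NodeShards().Create(...)` call itself is taken from the snippet above:

```go
package util

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import path for the NodeShard API added in this PR; adjust to
	// wherever the shard v1alpha1 types actually live.
	shardv1alpha1 "volcano.sh/apis/pkg/apis/shard/v1alpha1"
)

// CreateNodeShard is a hypothetical context-aware variant of the create
// utility: the caller passes ctx rather than the helper using context.TODO().
func CreateNodeShard(ctx context.Context, testCtx *TestContext, shard *shardv1alpha1.NodeShard) (*shardv1alpha1.NodeShard, error) {
	return testCtx.Vcclient.ShardV1alpha1().NodeShards().Create(ctx, shard, metav1.CreateOptions{})
}
```

Callers that don't need cancellation can still pass `context.Background()`, which preserves the current behavior while giving tests with deadlines a way to propagate them.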
@qi-min @JesseStutler, this is one of the e2e base suites for the Sharding Scheduler; it extends #4990.
It's a great contribution, thanks, will take a look after
Cool @JesseStutler. I'll make changes to the base and approach as per your review. This should give us a good foundational base for the e2e setup of Agent Scheduling. The Gemini review seems to be very generic; I'll resolve those comments after your review. The CI ran successfully in GitHub Actions!
Summary
Add E2E test infrastructure and base tests for Volcano Scheduler with sharding support.
Base setup from parent PR: #4990
This is part of the LFX mentorship project. Additional test cases covering allocate, preempt, reclaim, backfill actions under different sharding modes will be added as part of the mentorship.
Changes
Test Cases
How to Run
make e2e-test-schedulersharding

Test Plan
make e2e-test-schedulersharding passes locally

Related Issues
Ref: Part of #4883
Part of LFX mentorship project: E2E Test Suite for Volcano Agent Scheduling