Cleanup unnecessary error return value #5043
hzxuzhonghu wants to merge 8 commits into volcano-sh:master from
Conversation
Signed-off-by: Zhonghu Xu <[email protected]>
Signed-off-by: Zhonghu Xu <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none). The full list of commands accepted by this bot can be found here. Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Summary of Changes (Gemini Code Assist): This pull request refactors error handling within the scheduler's core components, specifically in how tasks and pods are managed within the cache and API. By removing error return values from several key functions, the change aims to simplify the codebase, reduce the need for intricate rollback mechanisms, and streamline the overall control flow. Potential issues that previously resulted in errors are now handled by logging warnings, making the operations more robust and less prone to complex error propagation, which directly addresses the complexity of rolling back statement operations.
Activity
Can review commit by commit.
Code Review
This pull request aims to simplify the codebase by removing unnecessary error return values from various functions, particularly those related to cache operations. While this is a good goal for reducing complexity, the current implementation introduces critical issues in pkg/scheduler/api/node_info.go. Specifically, AddTask and UpdateTask can now fail silently, leading to an inconsistent cache state. For example, a task might be queued for binding even if it couldn't be added to a node's resource accounting, or an update operation could leave a task removed but not re-added. My review includes detailed comments on these critical issues with suggestions on how to address them by ensuring that callers are aware of operation failures.
pkg/scheduler/api/node_info.go
Outdated
This function was changed to no longer return an error, but it has early return statements that can cause silent failures. For example:
- If the task already exists on the node (lines 436-439).
- If the node has insufficient resources for a `Binding` task (lines 457-459).
In these cases, AddTask returns without adding the task, but callers like AddBindTask in pkg/scheduler/cache/cache.go and pkg/agentscheduler/cache/cache.go are unaware of the failure. They proceed as if the task was successfully added, leading to an inconsistent cache state and incorrect scheduling behavior (e.g., a task is sent for binding but not accounted for on the node).
To fix this, AddTask should signal success or failure to its callers, for example by returning a boolean. Callers must then handle the failure case, for instance by rolling back state changes or not proceeding with further actions like binding.
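For illustration, a minimal sketch of that suggestion, reusing only identifiers visible elsewhere in this PR; the boolean signature itself is hypothetical and not what this PR implements:

// Hypothetical: AddTask reports whether the task was actually accounted for.
if added := node.AddTask(task); !added {
	// Callers such as AddBindTask must not queue the task for binding if the
	// node never accounted for it; surface the failure instead.
	return fmt.Errorf("failed to add task <%s/%s> to node <%s>",
		task.Namespace, task.Name, node.Name)
}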
pkg/scheduler/api/node_info.go
Outdated
This simplification of UpdateTask is problematic because AddTask can now fail silently (as noted in another comment). If AddTask fails after RemoveTask has succeeded, the task will be removed but not added back, leaving NodeInfo in an inconsistent state. This can break callers like Evict in pkg/scheduler/cache/cache.go, which rely on UpdateTask to correctly transition a task's state on the node.
UpdateTask should be atomic. If AddTask can fail, UpdateTask needs to handle that failure, perhaps by checking if the addition is possible before removing the old task, or by rolling back the removal if the addition fails.
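A hedged sketch of the atomic-UpdateTask idea; the feasibility check is a hypothetical helper, and a real implementation would reuse the node's existing resource accounting:

// Hypothetical ordering: confirm the re-add can succeed before removing.
func (ni *NodeInfo) UpdateTask(ti *TaskInfo) error {
	if !ni.canAdmit(ti) { // hypothetical feasibility check, not an existing method
		return fmt.Errorf("node <%s> cannot re-admit task <%s/%s>",
			ni.Name, ti.Namespace, ti.Name)
	}
	ni.RemoveTask(ti)
	ni.AddTask(ti)
	return nil
}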
Pull request overview
This pull request removes error return values from several core scheduler operations, including AddTask, RemoveTask, UpdateTask, UpdateTaskStatus, and DeleteTaskInfo. The stated goal is to simplify rollback complexity in statement operations following PR #5010, which fixed preemption eviction rollback issues.
Changes:
- Modified core API methods (NodeInfo, JobInfo) to use warning logs instead of returning errors
- Updated Statement and Session operations to call these methods without error handling
- Removed the validateStatusUpdate function
- Updated cache operations to ignore errors from underlying API calls
- Modified tests to expect success instead of errors in failure scenarios
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scheduler/api/node_info.go | Changed AddTask, RemoveTask, UpdateTask from error-returning to void functions with warning logs |
| pkg/scheduler/api/job_info.go | Changed UpdateTaskStatus and DeleteTaskInfo to void functions |
| pkg/scheduler/api/types.go | Removed validateStatusUpdate function |
| pkg/scheduler/framework/statement.go | Updated Evict, Pipeline, Allocate to call updated API methods without error handling |
| pkg/scheduler/framework/session.go | Updated Pipeline, Allocate, Evict to call updated API methods without error handling |
| pkg/scheduler/cache/cache.go | Removed error handling and rollback logic from Evict and AddBindTask |
| pkg/scheduler/cache/event_handlers.go | Removed error handling from addTask and deleteTask |
| pkg/scheduler/cache/event_handlers_test.go | Changed test expectations from errors to success |
| pkg/scheduler/cache/cache_test.go | Changed test to expect bind success even with insufficient resources |
| pkg/agentscheduler/cache/cache.go | Removed error handling from AddBindTask |
| pkg/agentscheduler/cache/event_handlers.go | Changed deleteTask and deletePod to void functions |
| pkg/scheduler/actions/reclaim/reclaim.go | Removed error handling from Evict call |
| pkg/scheduler/actions/preempt/preempt.go | Removed error handling from Evict and prepareCandidate |
| pkg/scheduler/plugins/util/k8s/snapshot_test.go | Removed error handling from AddTask in tests |
| pkg/scheduler/api/node_info_test.go | Removed error checking from AddTask test |
Comments suppressed due to low confidence (6)
pkg/scheduler/actions/preempt/preempt.go:490
- Critical issue: When stmt.Pipeline fails (line 481), the code attempts to rollback with UnPipeline (line 484), but it does NOT return an error or discard the statement. According to PR #5010, which this PR is related to, when Pipeline fails after victims have been evicted (via prepareCandidate on line 479), the entire statement should be discarded to prevent victim evictions from being committed.
Currently, the function returns (true, nil) on line 490 regardless of whether Pipeline succeeded or failed. This means:
- Victims are evicted via prepareCandidate (line 479)
- If Pipeline fails, UnPipeline attempts to rollback the preemptor
- But the statement containing the victim evictions is NOT discarded
- When the statement is later committed, victims are evicted even though preemption failed
This breaks the atomicity assumption described in PR #5010 and can cause victim tasks to be evicted without the preemptor being scheduled.
prepareCandidate(bestCandidate, preemptor.Pod, stmt)
if err := stmt.Pipeline(preemptor, bestCandidate.Name(), true); err != nil {
klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
preemptor.Namespace, preemptor.Name, bestCandidate.Name())
if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
preemptor.UID, bestCandidate.Name(), ssn.UID, rollbackErr)
}
}
return true, nil
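For comparison, a sketch of how the failure could be propagated so the caller can discard the statement, keeping everything else from the snippet above unchanged; the (bool, error) return shape is taken from the surrounding function as described:

prepareCandidate(bestCandidate, preemptor.Pod, stmt)
if err := stmt.Pipeline(preemptor, bestCandidate.Name(), true); err != nil {
	klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
		preemptor.Namespace, preemptor.Name, bestCandidate.Name())
	if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
		klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
			preemptor.UID, bestCandidate.Name(), ssn.UID, rollbackErr)
	}
	// Propagate the failure so the caller can discard the statement, and with
	// it the victim evictions, instead of committing them.
	return false, err
}
return true, nil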
pkg/scheduler/cache/cache_test.go:212
- This test's expectations have been changed but may not accurately reflect the actual system behavior after the refactoring. The test expects:
- Bind to succeed even with oversubscription (line 198)
- Task status to be Binding (line 205-206)
- Node to remain unchanged (line 210-212)
However, when AddTask is called on a node with insufficient resources for a Binding task, it logs a warning and returns early without adding the task (see node_info.go lines 457-459). This creates an inconsistent state where the task status is Binding in the job, but the task is not actually added to the node. This inconsistency is not checked by the test.
The test should verify that either:
a) The bind operation properly detects and handles the inconsistency (e.g., by calling resyncTask), or
b) The bind is rejected entirely to maintain consistency
The current test expectations may mask a real bug where partial state updates can occur.
if err != nil {
t.Errorf("expected bind to succeed even when node is oversubscribed: %v", err)
}
_, taskAfterBind, err := cache.findJobAndTask(task)
if err != nil {
t.Errorf("expected to find task after failed bind")
}
if taskAfterBind.Status != api.Binding {
t.Errorf("expected task status to be Binding after bind attempt, got %v", taskAfterBind.Status)
}
nodeAfterBind := cache.Nodes["n1"].Clone()
if !reflect.DeepEqual(nodeBeforeBind, nodeAfterBind) {
t.Errorf("expected node to remain the same after failed bind")
}
pkg/scheduler/framework/statement.go:164
- When Pipeline is called and the job is found, UpdateTaskStatus is called unconditionally (line 144), but if the node is not found (line 155-164), the task status has already been updated to Pipelined even though the node operations cannot be performed.
Similarly, when the node is found, node.AddTask is called (line 156) which may silently fail and return early without actually adding the task (e.g., if the task already exists on the node). In this case:
- The job thinks the task is Pipelined
- The task.NodeName is set to hostname (line 152)
- But the task may not actually be in the node's task list if AddTask failed silently
This creates an inconsistent state where the session's view doesn't match reality. The error is collected in errInfos and returned, but the partial state changes (job status update) are not reverted.
job, found := s.ssn.Jobs[task.Job]
if found {
job.UpdateTaskStatus(task, api.Pipelined)
} else {
err := fmt.Errorf("Failed to find Job <%s> in Session <%s> index when pipeline.",
task.Job, s.ssn.UID)
klog.Errorf("%v", err)
errInfos = append(errInfos, err)
}
task.NodeName = hostname
task.EvictionOccurred = evictionOccurred
if node, found := s.ssn.Nodes[hostname]; found {
node.AddTask(task)
klog.V(3).Infof("After pipelined Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
} else {
err := fmt.Errorf("Failed to find Node <%s> in Session <%s> index when pipeline.",
hostname, s.ssn.UID)
klog.Errorf("%v", err)
errInfos = append(errInfos, err)
}
pkg/scheduler/framework/session.go:811
- Same issue as in Session.Pipeline: when the job is found, UpdateTaskStatus is called unconditionally (line 794) and both task.Pod.Spec.NodeName (line 789) and task.NodeName (line 801) are set. If the node is not found (lines 803-811), an error is returned AFTER these state changes. If node.AddTask silently fails (line 804), no error is returned.
This creates inconsistent state where the job and task think allocation succeeded, but the node doesn't actually have the task. The problem is compounded by the dispatch logic (lines 822-830) which may be called based on incorrect state assumptions.
job, found := ssn.Jobs[task.Job]
if found {
job.UpdateTaskStatus(task, api.Allocated)
} else {
klog.Errorf("Failed to find Job <%s> in Session <%s> index when binding.",
task.Job, ssn.UID)
return fmt.Errorf("failed to find job %s", task.Job)
}
task.NodeName = hostname
if node, found := ssn.Nodes[hostname]; found {
node.AddTask(task)
klog.V(3).Infof("After allocated Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
} else {
klog.Errorf("Failed to find Node <%s> in Session <%s> index when binding.",
hostname, ssn.UID)
return fmt.Errorf("failed to find node %s", hostname)
}
pkg/scheduler/framework/statement.go:258
- Same issue as in Pipeline: when Allocate is called and the job is found, UpdateTaskStatus is called unconditionally (line 240), but if the node is not found (lines 249-258) or node.AddTask silently fails (line 250), the task status has already been updated to Allocated even though the allocation didn't fully succeed.
Specifically:
- The job thinks the task is Allocated
- task.Pod.Spec.NodeName is set (line 235)
- task.NodeName is set (line 248)
- But if node.AddTask fails silently (e.g., task already exists or insufficient resources for Binding status), the node doesn't actually have the task
This creates an inconsistent state. While errors are collected and returned, the partial state changes are not reverted, violating atomicity.
job, found := s.ssn.Jobs[task.Job]
if found {
job.UpdateTaskStatus(task, api.Allocated)
} else {
err := fmt.Errorf("Failed to find Job <%s> in Session <%s> index when allocating.",
task.Job, s.ssn.UID)
klog.Errorf("%v", err)
errInfos = append(errInfos, err)
}
task.NodeName = hostname
if node, found := s.ssn.Nodes[hostname]; found {
node.AddTask(task)
klog.V(3).Infof("After allocated Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
} else {
err := fmt.Errorf("Failed to find Node <%s> in Session <%s> index when allocating.",
hostname, s.ssn.UID)
klog.Errorf("%v", err)
errInfos = append(errInfos, err)
}
pkg/scheduler/framework/session.go:773
- In Session.Pipeline, when the job is found, UpdateTaskStatus is called unconditionally (line 756) and task.NodeName is set (line 763), but if the node is not found (lines 765-773), an error is returned AFTER these state changes have already been made. Similarly, if node.AddTask silently fails (line 766), no error is returned at all.
This creates inconsistent state:
- If the node is not found: job status is Pipelined, task.NodeName is set, but an error is returned and the node doesn't have the task
- If node.AddTask fails silently: job status is Pipelined, task.NodeName is set, the function returns success (nil), but the node doesn't actually have the task
The inconsistency is more severe here than in Statement.Pipeline because there's no mechanism to rollback these changes when errors occur.
job, found := ssn.Jobs[task.Job]
if found {
job.UpdateTaskStatus(task, api.Pipelined)
} else {
klog.Errorf("Failed to find Job <%s> in Session <%s> index when pipeline.",
task.Job, ssn.UID)
return fmt.Errorf("failed to find job %s when pipeline", task.Job)
}
task.NodeName = hostname
if node, found := ssn.Nodes[hostname]; found {
node.AddTask(task)
klog.V(3).Infof("After pipelined Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
} else {
klog.Errorf("Failed to find Node <%s> in Session <%s> index when pipeline.",
hostname, ssn.UID)
return fmt.Errorf("failed to find node %s", hostname)
}
Hey @hzxuzhonghu, thanks for being open to reverting that! I appreciate you taking another look. I'm seeing similar concerns in a few other places in this PR where removing error returns might cause issues. The main thing I'm worried about is how this affects the atomicity fix in #5010: if these operations can fail without signalling it, the preemption flow loses the ability to react. Would you be open to discussing this here in the PR? Maybe we can find a way to simplify error handling without losing the ability to detect failures and maintain consistency? The specific concern is: when preemption tries to evict victims and then pipeline the preemptor, if pipeline fails we need to know it failed so we can discard those evictions. Without error returns, we have no way to detect the failure and make that decision. Happy to explain more about what I'm seeing if it helps!
Signed-off-by: Zhonghu Xu <[email protected]>
Signed-off-by: Zhonghu Xu <[email protected]>
Signed-off-by: Zhonghu Xu <[email protected]>
if err := ni.AddTask(ti); err != nil {
	// This should never happen if task removal was successful,
	// because only possible error during task addition is when task is still on a node.
	klog.Fatalf("Failed to add Task <%s,%s> to Node <%s> during task update",
I think we need to revisit this place; fatal should not be used here.
We should just return err here instead of klog.Fatalf().
Yeah, but this would add an error return value, which would then require undo/error handling in many places.
As the comment says, this can never happen at all.
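For reference, the alternative being debated here would look roughly like this, a sketch that assumes UpdateTask keeps (or regains) an error return so the failure can be propagated instead of terminating the process:

if err := ni.AddTask(ti); err != nil {
	// Surface the (supposedly impossible) failure to the caller instead of
	// calling klog.Fatalf, at the cost of reintroducing an error return.
	return fmt.Errorf("failed to add Task <%s/%s> back to Node <%s> during task update: %v",
		ti.Namespace, ti.Name, ni.Name, err)
}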
Signed-off-by: Zhonghu Xu <[email protected]>
Please take a closer look; all I have done here is remove redundant errors that can never be returned.
Signed-off-by: Zhonghu Xu <[email protected]>
Removing these error returns breaks what #5010 is fixing. The fix depends on detecting pipeline failures:

err := tempStmt.Pipeline(preemptor, node)
if err != nil {
	tempStmt.Discard() // throw away evictions
} else {
	stmt.Merge(tempStmt) // commit evictions
}

Without the error signal, we can't tell if pipeline succeeded or failed. We'd be guessing whether to commit evictions. When we guess wrong, victims get evicted even though the preemptor never scheduled; that's the bug. I know you're trying to simplify things, but we need that error signal for the transactional pattern to work.
hajnalmt left a comment
I want to add my comment, which I elaborated on here:
I feel that this change is really risky; instead of deleting the error handling, we should move in the direction of returning more errors and handling them properly everywhere.
// UpdateTaskStatus is used to update task's status in a job.
// If error occurs both task and job are guaranteed to be in the original state.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) error {
	if err := validateStatusUpdate(task.Status, status); err != nil {
validateStatusUpdate is currently empty, but I think it is needed functionality; we just haven't implemented it yet because nobody was willing to implement the state restrictions for it.
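A purely hypothetical sketch of what a non-empty validateStatusUpdate could look like; the concrete transition rules are undecided, so the single rule below is only an example:

func validateStatusUpdate(oldStatus, newStatus TaskStatus) error {
	// Example rule only: the real state restrictions have not been defined.
	if oldStatus == Binding && newStatus == Pipelined {
		return fmt.Errorf("invalid task status transition %v -> %v", oldStatus, newStatus)
	}
	return nil
}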
Not sure; I always prefer cleaning up unused stuff to keep the code more readable.
if err := ni.AddTask(ti); err != nil {
	// This should never happen if task removal was successful,
	// because only possible error during task addition is when task is still on a node.
	klog.Fatalf("Failed to add Task <%s,%s> to Node <%s> during task update",
We should just return err here instead of klog.Fatalf().
I do not actually change any behavior, and this does not conflict with the direction we should take for the fix. It just removes unused return values and keeps the error handling simpler.
@Aman-Cool Please do not comment without looking; node_info.go and job_info.go are the main changes, which remove errors that are never returned. How could that influence commit or discard?
hzxuzhonghu left a comment
The cleanup is very simple; you only need to take a look at job_info.go and node_info.go.
if !found {
	klog.Warningf("failed to find task <%v/%v> on host <%v>",
		ti.Namespace, ti.Name, ni.Name)
	return nil
I should note for reviewers: this is not really an error, just a warning, e.g. when the task does not belong to this node at all.
// UpdateTaskStatus is used to update task's status in a job.
// If error occurs both task and job are guaranteed to be in the original state.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) error {
	if err := validateStatusUpdate(task.Status, status); err != nil {
	}

// DeleteTaskInfo is used to delete a task from a job
func (ji *JobInfo) DeleteTaskInfo(ti *TaskInfo) error {
There is never an error at all.
Signed-off-by: Zhonghu Xu <[email protected]>
@hzxuzhonghu I did look at node_info.go and job_info.go. The concern isn't whether these specific functions return errors today; it's that removing error returns from operations like these takes away the failure signal. The Merge/Discard pattern needs error signals to know whether operations succeeded or failed. Without them, we can't maintain atomicity in the preemption flow.
This can help reduce the complexity of rolling back in statement operations,
triggered by #5010
cc @hajnalmt