
Cleanup unnecessary error return value #5043

Open
hzxuzhonghu wants to merge 8 commits into volcano-sh:master from hzxuzhonghu:cleanup

Conversation

@hzxuzhonghu
Member

This can help reduce the complexity of rolling back in statement operations.

triggered by #5010

cc @hajnalmt

Copilot AI review requested due to automatic review settings February 13, 2026 04:01
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 13, 2026
@gemini-code-assist

Summary of Changes

Hello @hzxuzhonghu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors error handling within the scheduler's core components, specifically in how tasks and pods are managed within the cache and API. By removing error return values from several key functions, the change aims to simplify the codebase, reduce the need for intricate rollback mechanisms, and streamline the overall control flow. Potential issues that previously resulted in errors are now handled by logging warnings, making the operations more robust and less prone to complex error propagation, which directly addresses the complexity of rolling back statement operations.

Highlights

  • Simplified Error Handling: Removed error return values from numerous functions across the scheduler cache and API, including AddTask, RemoveTask, UpdateTask, UpdateTaskStatus, DeleteTaskInfo, Evict, Pipeline, Allocate, dispatch, and UnPipeline.
  • Reduced Complexity: Eliminated complex error-reversion logic, such as reverting task statuses, by making these operations idempotent and logging warnings instead of returning errors.
  • Streamlined Control Flow: Refactored code paths to no longer check for and propagate errors from internal task/node manipulation functions, leading to cleaner and more direct logic.
  • Test Updates: Updated corresponding test cases to reflect the changes in function signatures and expected behavior, particularly for scenarios that previously expected errors but now expect successful execution or warning logs.
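
As a rough illustration of the signature change described above (a minimal sketch; the actual function bodies in this PR differ and are omitted here):

// Before: callers had to handle an error that, in practice, was never returned.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) error {
	if err := validateStatusUpdate(task.Status, status); err != nil {
		return err
	}
	// ... move the task between status buckets ...
	return nil
}

// After: the operation is void; unexpected situations are only logged as warnings.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) {
	// ... move the task between status buckets, logging klog warnings on anomalies ...
}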


Changelog
  • pkg/agentscheduler/cache/cache.go
    • Removed error handling and originalStatus variable from resyncTask and AddBindTask functions.
  • pkg/agentscheduler/cache/event_handlers.go
    • Removed error handling from addTask, deleteTask, deletePod, and DeletePodFromCache functions.
  • pkg/scheduler/actions/preempt/preempt.go
    • Removed error handling from normalPreempt, topologyAwarePreempt, and prepareCandidate functions.
    • Removed error handling from SelectVictimsOnNode for RemoveTask and AddTask calls.
  • pkg/scheduler/actions/reclaim/reclaim.go
    • Removed error handling from reclaimForTask function.
  • pkg/scheduler/api/job_info.go
    • Modified UpdateTaskStatus and DeleteTaskInfo functions to no longer return errors.
    • Removed validateStatusUpdate function.
  • pkg/scheduler/api/node_info.go
    • Modified AddTask, RemoveTask, and UpdateTask functions to no longer return errors.
    • Changed error returns to klog.Warningf and return in AddTask.
  • pkg/scheduler/api/node_info_test.go
    • Updated TestNodeInfo_AddPod to reflect AddTask no longer returning an error.
  • pkg/scheduler/api/types.go
    • Removed validateStatusUpdate function.
  • pkg/scheduler/cache/cache.go
    • Removed error handling and originalStatus variable from Evict and AddBindTask functions.
  • pkg/scheduler/cache/cache_test.go
    • Removed equality import.
    • Updated TestSchedulerCache_Bind_NodeWithInsufficientResources to expect success and check task status instead of error.
  • pkg/scheduler/cache/event_handlers.go
    • Removed error handling from addTask and deleteTask functions.
  • pkg/scheduler/cache/event_handlers_test.go
    • Removed fmt import.
    • Updated TestSchedulerCache_updateTask and TestSchedulerCache_UpdatePod to expect no errors.
  • pkg/scheduler/framework/session.go
    • Removed error handling from Pipeline, Allocate, dispatch, and Evict functions for job.UpdateTaskStatus and node.AddTask/UpdateTask.
  • pkg/scheduler/framework/statement.go
    • Modified Evict, UnPipeline, Allocate, unallocate, and RecoverOperations functions to no longer return errors from internal job.UpdateTaskStatus and node.UpdateTask/AddTask/RemoveTask calls.
  • pkg/scheduler/plugins/util/k8s/snapshot_test.go
    • Updated TestSnapshot_AddOrUpdateNodes and TestSnapshot_DeleteNode to reflect AddTask no longer returning an error.

@hzxuzhonghu
Member Author

This can be reviewed commit by commit.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to simplify the codebase by removing unnecessary error return values from various functions, particularly those related to cache operations. While this is a good goal for reducing complexity, the current implementation introduces critical issues in pkg/scheduler/api/node_info.go. Specifically, AddTask and UpdateTask can now fail silently, leading to an inconsistent cache state. For example, a task might be queued for binding even if it couldn't be added to a node's resource accounting, or an update operation could leave a task removed but not re-added. My review includes detailed comments on these critical issues with suggestions on how to address them by ensuring that callers are aware of operation failures.

critical

This function was changed to no longer return an error, but it has early return statements that can cause silent failures. For example:

  1. If the task already exists on the node (lines 436-439).
  2. If the node has insufficient resources for a Binding task (lines 457-459).

In these cases, AddTask returns without adding the task, but callers like AddBindTask in pkg/scheduler/cache/cache.go and pkg/agentscheduler/cache/cache.go are unaware of the failure. They proceed as if the task was successfully added, leading to an inconsistent cache state and incorrect scheduling behavior (e.g., a task is sent for binding but not accounted for on the node).

To fix this, AddTask should signal success or failure to its callers, for example by returning a boolean. Callers must then handle the failure case, for instance by rolling back state changes or not proceeding with further actions like binding.
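
A rough sketch of this suggestion (hypothetical signature, not code from this PR), in which AddTask reports success with a boolean and callers stop before binding:

// Hypothetical variant: AddTask reports whether the task was actually accounted for on the node.
func (ni *NodeInfo) AddTask(ti *TaskInfo) bool {
	if _, found := ni.Tasks[PodKey(ti.Pod)]; found {
		klog.Warningf("task <%s/%s> already exists on node <%s>", ti.Namespace, ti.Name, ni.Name)
		return false
	}
	// ... resource accounting and insertion into ni.Tasks ...
	return true
}

// Caller side (e.g. AddBindTask), sketched: stop before binding if accounting failed.
if !node.AddTask(task) {
	return fmt.Errorf("failed to add task <%s/%s> to node <%s>", task.Namespace, task.Name, node.Name)
}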

Comment on lines 589 to 591

critical

This simplification of UpdateTask is problematic because AddTask can now fail silently (as noted in another comment). If AddTask fails after RemoveTask has succeeded, the task will be removed but not added back, leaving NodeInfo in an inconsistent state. This can break callers like Evict in pkg/scheduler/cache/cache.go, which rely on UpdateTask to correctly transition a task's state on the node.

UpdateTask should be atomic. If AddTask can fail, UpdateTask needs to handle that failure, perhaps by checking if the addition is possible before removing the old task, or by rolling back the removal if the addition fails.
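
One way to keep UpdateTask atomic under that approach, building on the hypothetical bool-returning AddTask above (illustrative only):

// Illustrative sketch: if re-adding fails, restore the removed task so NodeInfo
// never ends up with the task removed but not re-added.
func (ni *NodeInfo) UpdateTask(ti *TaskInfo) bool {
	old, found := ni.Tasks[PodKey(ti.Pod)]
	if !found {
		return false
	}
	ni.RemoveTask(old)
	if !ni.AddTask(ti) {
		ni.AddTask(old) // roll back the removal
		return false
	}
	return true
}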

Contributor

Copilot AI left a comment

Pull request overview

This pull request removes error return values from several core scheduler operations, including AddTask, RemoveTask, UpdateTask, UpdateTaskStatus, and DeleteTaskInfo. The stated goal is to simplify rollback complexity in statement operations following PR #5010, which fixed preemption eviction rollback issues.

Changes:

  • Modified core API methods (NodeInfo, JobInfo) to use warning logs instead of returning errors
  • Updated Statement and Session operations to call these methods without error handling
  • Removed the validateStatusUpdate function
  • Updated cache operations to ignore errors from underlying API calls
  • Modified tests to expect success instead of errors in failure scenarios

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pkg/scheduler/api/node_info.go Changed AddTask, RemoveTask, UpdateTask from error-returning to void functions with warning logs
pkg/scheduler/api/job_info.go Changed UpdateTaskStatus and DeleteTaskInfo to void functions
pkg/scheduler/api/types.go Removed validateStatusUpdate function
pkg/scheduler/framework/statement.go Updated Evict, Pipeline, Allocate to call updated API methods without error handling
pkg/scheduler/framework/session.go Updated Pipeline, Allocate, Evict to call updated API methods without error handling
pkg/scheduler/cache/cache.go Removed error handling and rollback logic from Evict and AddBindTask
pkg/scheduler/cache/event_handlers.go Removed error handling from addTask and deleteTask
pkg/scheduler/cache/event_handlers_test.go Changed test expectations from errors to success
pkg/scheduler/cache/cache_test.go Changed test to expect bind success even with insufficient resources
pkg/agentscheduler/cache/cache.go Removed error handling from AddBindTask
pkg/agentscheduler/cache/event_handlers.go Changed deleteTask and deletePod to void functions
pkg/scheduler/actions/reclaim/reclaim.go Removed error handling from Evict call
pkg/scheduler/actions/preempt/preempt.go Removed error handling from Evict and prepareCandidate
pkg/scheduler/plugins/util/k8s/snapshot_test.go Removed error handling from AddTask in tests
pkg/scheduler/api/node_info_test.go Removed error checking from AddTask test
Comments suppressed due to low confidence (6)

pkg/scheduler/actions/preempt/preempt.go:490

  • Critical issue: When stmt.Pipeline fails (line 481), the code attempts to rollback with UnPipeline (line 484), but it does NOT return an error or discard the statement. According to PR #5010, which this PR is related to, when Pipeline fails after victims have been evicted (via prepareCandidate on line 479), the entire statement should be discarded to prevent victim evictions from being committed.

Currently, the function returns (true, nil) on line 490 regardless of whether Pipeline succeeded or failed. This means:

  1. Victims are evicted via prepareCandidate (line 479)
  2. If Pipeline fails, UnPipeline attempts to rollback the preemptor
  3. But the statement containing the victim evictions is NOT discarded
  4. When the statement is later committed, victims are evicted even though preemption failed

This breaks the atomicity assumption described in PR #5010 and can cause victim tasks to be evicted without the preemptor being scheduled.

	prepareCandidate(bestCandidate, preemptor.Pod, stmt)

	if err := stmt.Pipeline(preemptor, bestCandidate.Name(), true); err != nil {
		klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
			preemptor.Namespace, preemptor.Name, bestCandidate.Name())
		if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
			klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
				preemptor.UID, bestCandidate.Name(), ssn.UID, rollbackErr)
		}
	}

	return true, nil
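
A sketch of the behavior this comment asks for, following the pattern from #5010 (illustrative; names follow the excerpt above):

	prepareCandidate(bestCandidate, preemptor.Pod, stmt)

	if err := stmt.Pipeline(preemptor, bestCandidate.Name(), true); err != nil {
		klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
			preemptor.Namespace, preemptor.Name, bestCandidate.Name())
		if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
			klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
				preemptor.UID, bestCandidate.Name(), ssn.UID, rollbackErr)
		}
		// Propagate the failure so the caller can discard the statement containing
		// the victim evictions instead of committing it.
		return false, err
	}

	return true, nil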

pkg/scheduler/cache/cache_test.go:212

  • This test's expectations have been changed but may not accurately reflect the actual system behavior after the refactoring. The test expects:
  1. Bind to succeed even with oversubscription (line 198)
  2. Task status to be Binding (line 205-206)
  3. Node to remain unchanged (line 210-212)

However, when AddTask is called on a node with insufficient resources for a Binding task, it logs a warning and returns early without adding the task (see node_info.go lines 457-459). This creates an inconsistent state where the task status is Binding in the job, but the task is not actually added to the node. This inconsistency is not checked by the test.

The test should verify that either:
a) The bind operation properly detects and handles the inconsistency (e.g., by calling resyncTask), or
b) The bind is rejected entirely to maintain consistency

The current test expectations may mask a real bug where partial state updates can occur.

	if err != nil {
		t.Errorf("expected bind to succeed even when node is oversubscribed: %v", err)
	}

	_, taskAfterBind, err := cache.findJobAndTask(task)
	if err != nil {
		t.Errorf("expected to find task after failed bind")
	}
	if taskAfterBind.Status != api.Binding {
		t.Errorf("expected task status to be Binding after bind attempt, got %v", taskAfterBind.Status)
	}

	nodeAfterBind := cache.Nodes["n1"].Clone()
	if !reflect.DeepEqual(nodeBeforeBind, nodeAfterBind) {
		t.Errorf("expected node to remain the same after failed bind")
	}
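
One way the missing consistency check could look in the test (a sketch; assumes NodeInfo exposes its Tasks map keyed by api.PodKey):

	// Sketch: additionally assert that the job's view and the node's view agree,
	// so a silently skipped AddTask is caught rather than masked.
	if _, onNode := nodeAfterBind.Tasks[api.PodKey(taskAfterBind.Pod)]; !onNode {
		t.Errorf("task status is %v in the job but the task is missing from node n1",
			taskAfterBind.Status)
	}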

pkg/scheduler/framework/statement.go:164

  • When Pipeline is called and the job is found, UpdateTaskStatus is called unconditionally (line 144), but if the node is not found (line 155-164), the task status has already been updated to Pipelined even though the node operations cannot be performed.

Similarly, when the node is found, node.AddTask is called (line 156) which may silently fail and return early without actually adding the task (e.g., if the task already exists on the node). In this case:

  1. The job thinks the task is Pipelined
  2. The task.NodeName is set to hostname (line 152)
  3. But the task may not actually be in the node's task list if AddTask failed silently

This creates an inconsistent state where the session's view doesn't match reality. The error is collected in errInfos and returned, but the partial state changes (job status update) are not reverted.

	job, found := s.ssn.Jobs[task.Job]
	if found {
		job.UpdateTaskStatus(task, api.Pipelined)
	} else {
		err := fmt.Errorf("Failed to find Job <%s> in Session <%s> index when pipeline.",
			task.Job, s.ssn.UID)
		klog.Errorf("%v", err)
		errInfos = append(errInfos, err)
	}

	task.NodeName = hostname
	task.EvictionOccurred = evictionOccurred

	if node, found := s.ssn.Nodes[hostname]; found {
		node.AddTask(task)
		klog.V(3).Infof("After pipelined Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
			task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
	} else {
		err := fmt.Errorf("Failed to find Node <%s> in Session <%s> index when pipeline.",
			hostname, s.ssn.UID)
		klog.Errorf("%v", err)
		errInfos = append(errInfos, err)
	}
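
A sketch of the kind of revert this comment has in mind (illustrative only, not code from this PR; it assumes the job lookup succeeded, and originalStatus is a hypothetical local variable):

	// Sketch: remember the original status and revert it if the node-side update
	// cannot be performed, so the job and node views stay consistent.
	originalStatus := task.Status
	job.UpdateTaskStatus(task, api.Pipelined)

	task.NodeName = hostname

	if node, found := s.ssn.Nodes[hostname]; found {
		node.AddTask(task)
	} else {
		job.UpdateTaskStatus(task, originalStatus) // undo the partial state change
		errInfos = append(errInfos, fmt.Errorf(
			"failed to find Node <%s> in Session <%s> when pipeline", hostname, s.ssn.UID))
	}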

pkg/scheduler/framework/session.go:811

  • Same issue as in Session.Pipeline: when the job is found, UpdateTaskStatus is called unconditionally (line 794) and both task.Pod.Spec.NodeName (line 789) and task.NodeName (line 801) are set. If the node is not found (lines 803-811), an error is returned AFTER these state changes. If node.AddTask silently fails (line 804), no error is returned.

This creates inconsistent state where the job and task think allocation succeeded, but the node doesn't actually have the task. The problem is compounded by the dispatch logic (lines 822-830) which may be called based on incorrect state assumptions.

	job, found := ssn.Jobs[task.Job]
	if found {
		job.UpdateTaskStatus(task, api.Allocated)
	} else {
		klog.Errorf("Failed to find Job <%s> in Session <%s> index when binding.",
			task.Job, ssn.UID)
		return fmt.Errorf("failed to find job %s", task.Job)
	}

	task.NodeName = hostname

	if node, found := ssn.Nodes[hostname]; found {
		node.AddTask(task)
		klog.V(3).Infof("After allocated Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
			task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
	} else {
		klog.Errorf("Failed to find Node <%s> in Session <%s> index when binding.",
			hostname, ssn.UID)
		return fmt.Errorf("failed to find node %s", hostname)
	}

pkg/scheduler/framework/statement.go:258

  • Same issue as in Pipeline: when Allocate is called and the job is found, UpdateTaskStatus is called unconditionally (line 240), but if the node is not found (lines 249-258) or node.AddTask silently fails (line 250), the task status has already been updated to Allocated even though the allocation didn't fully succeed.

Specifically:

  1. The job thinks the task is Allocated
  2. task.Pod.Spec.NodeName is set (line 235)
  3. task.NodeName is set (line 248)
  4. But if node.AddTask fails silently (e.g., task already exists or insufficient resources for Binding status), the node doesn't actually have the task

This creates an inconsistent state. While errors are collected and returned, the partial state changes are not reverted, violating atomicity.

	job, found := s.ssn.Jobs[task.Job]
	if found {
		job.UpdateTaskStatus(task, api.Allocated)
	} else {
		err := fmt.Errorf("Failed to find Job <%s> in Session <%s> index when allocating.",
			task.Job, s.ssn.UID)
		klog.Errorf("%v", err)
		errInfos = append(errInfos, err)
	}

	task.NodeName = hostname
	if node, found := s.ssn.Nodes[hostname]; found {
		node.AddTask(task)
		klog.V(3).Infof("After allocated Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
			task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
	} else {
		err := fmt.Errorf("Failed to find Node <%s> in Session <%s> index when allocating.",
			hostname, s.ssn.UID)
		klog.Errorf("%v", err)
		errInfos = append(errInfos, err)
	}

pkg/scheduler/framework/session.go:773

  • In Session.Pipeline, when the job is found, UpdateTaskStatus is called unconditionally (line 756) and task.NodeName is set (line 763), but if the node is not found (lines 765-773), an error is returned AFTER these state changes have already been made. Similarly, if node.AddTask silently fails (line 766), no error is returned at all.

This creates inconsistent state:

  1. If the node is not found: job status is Pipelined, task.NodeName is set, but an error is returned and the node doesn't have the task
  2. If node.AddTask fails silently: job status is Pipelined, task.NodeName is set, the function returns success (nil), but the node doesn't actually have the task

The inconsistency is more severe here than in Statement.Pipeline because there's no mechanism to rollback these changes when errors occur.

	job, found := ssn.Jobs[task.Job]
	if found {
		job.UpdateTaskStatus(task, api.Pipelined)
	} else {
		klog.Errorf("Failed to find Job <%s> in Session <%s> index when pipeline.",
			task.Job, ssn.UID)
		return fmt.Errorf("failed to find job %s when pipeline", task.Job)
	}

	task.NodeName = hostname

	if node, found := ssn.Nodes[hostname]; found {
		node.AddTask(task)
		klog.V(3).Infof("After pipelined Task <%v/%v> to Node <%v>: idle <%v>, used <%v>, releasing <%v>",
			task.Namespace, task.Name, node.Name, node.Idle, node.Used, node.Releasing)
	} else {
		klog.Errorf("Failed to find Node <%s> in Session <%s> index when pipeline.",
			hostname, ssn.UID)
		return fmt.Errorf("failed to find node %s", hostname)
	}


@Aman-Cool
Contributor

Hey @hzxuzhonghu, thanks for being open to reverting that! I appreciate you taking another look.

I'm seeing similar concerns in a few other places in this PR where removing error returns might cause issues. The main thing I'm worried about is how this affects the atomicity fix in #5010; if operations like Evict() and Pipeline() can't signal failures, we can't properly roll back when preemption attempts fail.

Would you be open to discussing this here in the PR? Maybe we can find a way to simplify error handling without losing the ability to detect failures and maintain consistency?

The specific concern is: when preemption tries to evict victims and then pipeline the preemptor, if pipeline fails we need to know it failed so we can discard those evictions. Without error returns, we have no way to detect the failure and make that decision.

Happy to explain more about what I'm seeing if it helps!

if err := ni.AddTask(ti); err != nil {
// This should never happen if task removal was successful,
// because only possible error during task addition is when task is still on a node.
klog.Fatalf("Failed to add Task <%s,%s> to Node <%s> during task update",
Member Author

I think we need to revisit this place; fatal should not be used here.

Contributor

We should just return err here instead of klog.Fatalf().

Member Author

Yeah, but this would add an error return value, which would then require undo/error handling in many places.

As the comment says, this cannot happen at all.

Signed-off-by: Zhonghu Xu <[email protected]>
@hzxuzhonghu
Member Author

I'm seeing similar concerns in a few other places in this PR where removing error returns might cause issues. The main thing I'm worried about is how this affects the atomicity fix in #5010; if operations like Evict() and Pipeline() can't signal failures, we can't properly roll back when preemption attempts fail.

Please take a closer look: all I have done here is remove redundant errors (ones that can never be returned).

Signed-off-by: Zhonghu Xu <[email protected]>
@Aman-Cool
Contributor

Removing these error returns breaks what #5010 is fixing.

The fix depends on detecting pipeline failures:

err := tempStmt.Pipeline(preemptor, node)
if err != nil {
    tempStmt.Discard()  // throw away evictions
} else {
    stmt.Merge(tempStmt)  // commit evictions
}

Without the error signal, we can't tell if pipeline succeeded or failed. We'd be guessing whether to commit evictions. When we guess wrong, victims get evicted even though the preemptor was never scheduled; that's the bug.

I know you're trying to simplify things, but we need that error signal for the transactional pattern to work.

Contributor

@hajnalmt hajnalmt left a comment

I want to add my comment, which I elaborated on here:

#5010 (review)

I feel that this change is really risky; instead of deleting the error handling, we should move in the direction of returning more errors and handling them properly everywhere.

// UpdateTaskStatus is used to update task's status in a job.
// If error occurs both task and job are guaranteed to be in the original state.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) error {
if err := validateStatusUpdate(task.Status, status); err != nil {
Contributor

validateStatusUpdate is currently empty, but I think it is needed functionality; we just haven't implemented it yet, since nobody has been willing to implement the state restrictions for it.
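
For context, a hypothetical sketch of what such validation might enforce (not existing code; the allowed transitions are purely illustrative):

// Hypothetical sketch of the state restrictions referred to here; the actual
// allowed transitions would need to be agreed on by the maintainers.
var allowedTransitions = map[TaskStatus][]TaskStatus{
	Pending:   {Allocated, Pipelined},
	Pipelined: {Allocated, Pending},
	Allocated: {Binding, Pending},
}

func validateStatusUpdate(oldStatus, newStatus TaskStatus) error {
	for _, allowed := range allowedTransitions[oldStatus] {
		if allowed == newStatus {
			return nil
		}
	}
	return fmt.Errorf("invalid task status transition from %v to %v", oldStatus, newStatus)
}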

Member Author

Not sure; I always prefer cleaning up unused stuff to keep the code more readable.

if err := ni.AddTask(ti); err != nil {
// This should never happen if task removal was successful,
// because only possible error during task addition is when task is still on a node.
klog.Fatalf("Failed to add Task <%s,%s> to Node <%s> during task update",
Contributor

We should just return err here instead of klog.Fatalf().

@hzxuzhonghu
Member Author

I feel that this change is really risky; instead of deleting the error handling, we should move in the direction of returning more errors and handling them properly everywhere.

I don't actually change any behavior, and this does not conflict with the direction of the fix.

It just removes unused return values and keeps the error handling simpler.

@hzxuzhonghu
Member Author

Without the error signal, we can't tell if pipeline succeeded or failed. We'd be guessing whether to commit evictions. When we guess wrong, victims get evicted even though the preemptor was never scheduled; that's the bug.

@Aman-Cool Please don't comment without looking; node_info.go and job_info.go are the main changes, which remove errors that are never returned. How could that influence commit or discard?

Member Author

@hzxuzhonghu hzxuzhonghu left a comment

The cleanup is very simple; you only need to take a look at job_info.go and node_info.go.

if !found {
klog.Warningf("failed to find task <%v/%v> on host <%v>",
ti.Namespace, ti.Name, ni.Name)
return nil
Member Author

I should note for reviewers: this is not an error, just a warning, e.g. the task simply does not belong to this node at all.

// UpdateTaskStatus is used to update task's status in a job.
// If error occurs both task and job are guaranteed to be in the original state.
func (ji *JobInfo) UpdateTaskStatus(task *TaskInfo, status TaskStatus) error {
if err := validateStatusUpdate(task.Status, status); err != nil {
Member Author

Legacy stuff

}

// DeleteTaskInfo is used to delete a task from a job
func (ji *JobInfo) DeleteTaskInfo(ti *TaskInfo) error {
Member Author

There is never an error here at all.

Signed-off-by: Zhonghu Xu <[email protected]>
@Aman-Cool
Contributor

@hzxuzhonghu I did look at node_info.go and job_info.go. The concern isn't whether these specific functions return errors today; it's that removing error returns from operations like Evict() and Pipeline() (which you're also changing) breaks the ability to make commit/discard decisions in #5010.

The Merge/Discard pattern needs error signals to know whether operations succeeded or failed. Without them, we can't maintain atomicity in the preemption flow.
