Skip to content

[Jobs] Implement graceful cancel for managed jobs#8772

Open
kyuds wants to merge 11 commits intomasterfrom
kyuds/job-graceful-cancel
Open

[Jobs] Implement graceful cancel for managed jobs#8772
kyuds wants to merge 11 commits intomasterfrom
kyuds/job-graceful-cancel

Conversation

@kyuds
Copy link
Collaborator

@kyuds kyuds commented Feb 2, 2026

Resolves #8519

--graceful flag for managed jobs. The flag for down and stop is already implemented. We reuse a lot of the actual "graceful" handling mechanism from that pr.

Tested:

  • regular cancel
  • graceful cancel
  • graceful cancel w timeout (timeout=1 helps to identify that flush didn't complete properly, as it was preempted)

Yaml for testing:

resources:
  cloud: aws
  cpus: 2+

file_mounts:
  /mnt/data:
    name: skypilot-graceful-test-kyuds
    mode: MOUNT_CACHED

run: |
  set -ex
  echo "Starting to write files..."

  # Write files continuously (simulates a long-running task)
  for i in $(seq 1 100); do
    dd if=/dev/urandom of=/mnt/data/file_$i.bin bs=1M count=10
    echo "Written file_$i.bin (10MB)"
    sleep 2
  done

  echo "Done writing files"

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @kyuds, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the job cancellation mechanism by adding a 'graceful' option. This feature ensures that when a managed job is cancelled, any cached data is safely uploaded before termination, preventing potential data loss. The changes involve updating the command-line interface, the Python SDK, the job server's core logic, and the underlying communication protocols to support this new, more robust cancellation behavior.

Highlights

  • Graceful Job Cancellation: Introduced a new --graceful option for the sky jobs cancel command, allowing users to cancel managed jobs while ensuring that MOUNT_CACHED data is fully uploaded to preserve data integrity.
  • Graceful Timeout: Added a --graceful-timeout option to specify a maximum duration for the graceful cancellation process.
  • API and SDK Integration: Integrated the graceful and graceful_timeout parameters across the job cancellation SDK, server-side logic, and gRPC service, including checks for API server version compatibility.
  • Version Bumps: Updated MANAGED_JOBS_VERSION to 16 and API_VERSION to 34 to reflect the new feature and API changes.
  • Protobuf Schema Update: Modified the CancelJobsRequest protobuf definition to include the new graceful and graceful_timeout fields.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a good start for adding the --graceful cancel feature for managed jobs. The new options are correctly plumbed through the CLI, SDK, and server components. However, the core implementation seems to be missing, as indicated by a TODO in sky/jobs/utils.py where the new parameters are deleted. I've left a few comments on this and other minor issues, such as an incorrect version comment and a log level that might be too subtle for an important warning.

@kyuds
Copy link
Collaborator Author

kyuds commented Feb 2, 2026

/gemini review

@skypilot-org skypilot-org deleted a comment from gemini-code-assist bot Feb 2, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a --graceful cancellation feature for managed jobs. The implementation cleverly uses the msg parameter of asyncio.Task.cancel() to pass the graceful shutdown signal to the job's exception handler. The changes are well-integrated across the CLI, client SDK, server, and controller components. My review focuses on improving the robustness and clarity of this new feature. I've identified a couple of areas for improvement: an inconsistency in handling graceful cancellation for pool-based jobs and a broad exception handler that could obscure potential issues. With the suggested changes, this will be a solid addition.

@kyuds kyuds changed the title [wip][Jobs] Add managed jobs --graceful cancel feature [Jobs] Add managed jobs --graceful cancel feature Feb 2, 2026
@kyuds kyuds changed the title [Jobs] Add managed jobs --graceful cancel feature [Jobs] Implement graceful cancel for managed jobs Feb 2, 2026
@kyuds kyuds requested a review from kevinmingtarja February 2, 2026 23:09
@kyuds
Copy link
Collaborator Author

kyuds commented Feb 3, 2026

/smoke-test

@kyuds
Copy link
Collaborator Author

kyuds commented Feb 3, 2026

/quicktest-core

@kyuds
Copy link
Collaborator Author

kyuds commented Feb 3, 2026

same across multiple PRs: only azure bucket smoke tests are failing

@kyuds
Copy link
Collaborator Author

kyuds commented Feb 11, 2026

/smoke-test

@kyuds
Copy link
Collaborator Author

kyuds commented Feb 11, 2026

/quicktest-core

@kyuds kyuds requested a review from SeungjinYang February 11, 2026 18:42
Comment on lines +860 to +883
# Send the signal to the jobs controller.
signal_file = (pathlib.Path(
managed_job_constants.SIGNAL_FILE_PREFIX.format(job_id)))
# Filelock is needed to prevent race condition between signal
# check/removal and signal writing.
with filelock.FileLock(str(signal_file) + '.lock'):
with signal_file.open('w', encoding='utf-8') as f:
f.write(UserSignal.CANCEL.value)
f.flush()
if graceful:
logger.warning(f'Job {job_id} is on legacy controller, '
'graceful shutdown not supported.')
else:
# New controller process.
try:
signal_file = pathlib.Path(
managed_job_constants.CONSOLIDATED_SIGNAL_PATH, f'{job_id}')
signal_file.touch()
if graceful:
content = 'graceful'
if graceful_timeout is not None:
content = f'graceful:{graceful_timeout}'
signal_file.write_text(content, encoding='utf-8')
else:
signal_file.touch()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need a filelock now that we're writing something to signal_file, similar to the if block above?

graceful_timeout = int(
content.split(':')[1])
except (ValueError, IndexError):
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We hardcode 'graceful' here and in cancel_jobs_by_id(). Could we define it as a const, and maybe also extract this to smth like decode_cancel_signal() and move it to sky/jobs/utils.py?

Comment on lines +1978 to +1979
except Exception: # pylint: disable=broad-except
content = ''
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do a logger.warning/debug here?

Comment on lines +1989 to +1990
except (ValueError, IndexError):
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too

Comment on lines +1855 to +1856
logger.info(f'Job {job_id} graceful cancel: '
f'graceful={graceful}, timeout={graceful_timeout}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we turn this down to debug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Storage] Add --graceful flag to sky down/sky jobs cancel to wait for MOUNT_CACHED uploads

2 participants