
[Offloading] Support Disk Offloading #2373

Open
kylesayrs wants to merge 4 commits into main from kylesayrs/support-disk-offloading

Conversation

@kylesayrs
Collaborator

Purpose

  • Support disk offloading for very large models
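Disk offloading keeps most parameters on disk and materializes them only when a layer needs them, so peak memory is bounded by the working set rather than the full model. The sketch below illustrates the core idea with a minimal, stdlib-only parameter store; it is not llm-compressor's actual implementation, and every name in it is illustrative.

```python
# Minimal sketch of the disk-offloading idea, independent of llm-compressor.
# All names here are illustrative, not the library's actual API.
import pickle
import tempfile
from pathlib import Path


class DiskOffloadedParams:
    """Map-like store that keeps each parameter in its own file on disk."""

    def __init__(self, offload_dir: str):
        self.dir = Path(offload_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def offload(self, name: str, value) -> None:
        # Write the parameter to disk so the in-memory copy can be dropped.
        with open(self.dir / f"{name}.pkl", "wb") as f:
            pickle.dump(value, f)

    def load(self, name: str):
        # Materialize the parameter only when it is actually needed.
        with open(self.dir / f"{name}.pkl", "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    store = DiskOffloadedParams(tmp)
    store.offload("layers.0.weight", [0.1, 0.2, 0.3])
    print(store.load("layers.0.weight"))  # [0.1, 0.2, 0.3]
```

In the real library, the on-disk format and dispatch are handled by accelerate/compressed-tensors rather than pickled files, but the access pattern is the same.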

Prerequisites

Examples

  • Add examples/disk_offloading/qwen3_example.py
  • Add examples/disk_offloading/kimi_k2_example.py

Changes

Required

  • Remove the post-processing step that called remove_dispatch
    • Previously, this was needed to avoid conflicts between dispatch_for_sequential and dispatch_for_generation.
    • Now the two functions are directly compatible: you no longer need to remove one function's dispatch to use the other
  • Add to_accelerate to save_pretrained_wrapper
    • This converts the model to accelerate offloading before saving
    • This ensures the best compatibility with save_pretrained and reduces excess memory usage that could otherwise cause GPU/CPU OOMs
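The save-path change can be pictured as a thin wrapper that converts the model's offload state before delegating to the underlying save method. The following is a minimal sketch of that pattern; to_accelerate, Model, and save_pretrained here are stand-ins, not the real compressed_tensors or transformers APIs.

```python
# Sketch of the wrapper pattern: convert offload state, then save.
# to_accelerate and Model are illustrative stand-ins; the real conversion
# lives in compressed_tensors.offload.
def to_accelerate(model):
    # Stand-in for the real conversion; here we just flip a flag.
    model.offload_format = "accelerate"
    return model


class Model:
    def __init__(self):
        self.offload_format = "compressed-tensors"
        self.saved_with = None

    def save_pretrained(self, path):
        # Record how the model was saved, standing in for serialization.
        self.saved_with = (path, self.offload_format)


def save_pretrained_wrapper(model, path):
    # Ensure accelerate-style offloading before saving, mirroring the
    # change to compressed_tensors_utils.py described in this PR.
    to_accelerate(model)
    model.save_pretrained(path)


m = Model()
save_pretrained_wrapper(m, "out/")
print(m.saved_with)  # ('out/', 'accelerate')
```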

Hardening / Preparatory Changes

  • During oneshot preprocessing, convert with from_accelerate if possible. This guards against users who load their model outside of the load_offloaded_model context
  • Remove the offload_device argument from dispatch_for_sequential to avoid a deprecation warning
    • dispatch_for_sequential now always respects the device the model was loaded on
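The preprocessing guard amounts to: detect accelerate-style offloading and convert it before quantization begins, so oneshot behaves the same regardless of how the model was loaded. A minimal sketch, with is_accelerate_offloaded and from_accelerate as illustrative stand-ins for the real compressed_tensors helpers:

```python
# Illustrative guard in the style of the pre_process change; the detection
# and conversion helpers here are stand-ins, not the library's API.
def is_accelerate_offloaded(model) -> bool:
    return getattr(model, "offload_format", None) == "accelerate"


def from_accelerate(model):
    # Stand-in conversion from accelerate offloading to
    # compressed-tensors offloading.
    model.offload_format = "compressed-tensors"
    return model


def pre_process(model):
    # Convert from accelerate offloading if possible, guarding against
    # models loaded outside the load_offloaded_model context.
    if is_accelerate_offloaded(model):
        from_accelerate(model)
    return model


class LoadedModel:  # minimal stand-in for a model loaded via accelerate
    offload_format = "accelerate"


m = pre_process(LoadedModel())
print(m.offload_format)  # compressed-tensors
```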

Testing

  • Ran Qwen/Qwen3-0.6B example to completion
  • [IN PROGRESS] Run unsloth/Kimi-K2-Instruct-0905-BF16 example to completion

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for disk offloading, enabling the quantization of very large models that exceed available CPU or GPU memory. The changes streamline the offloading process by removing redundant dispatch handling, integrating accelerate offloading conversion during model saving, and ensuring proper offloading conversion during preprocessing. This significantly enhances the ability to work with memory-intensive models by leveraging disk storage efficiently.

Highlights

  • Removed remove_dispatch post-processing: The remove_dispatch call, previously used to prevent conflicts between dispatch_for_sequential and dispatch_for_generation, has been eliminated as these functions are now directly compatible.
  • Integrated to_accelerate into save_pretrained_wrapper: The save_pretrained_wrapper now converts the model to accelerate offloading before saving, improving compatibility with save_pretrained and reducing memory usage.
  • Added from_accelerate conversion during preprocessing: Models loaded with accelerate offloading are now converted to compressed-tensors offloading during oneshot preprocessing, ensuring proper handling even if not loaded via load_offloaded_model.
  • Simplified dispatch_for_sequential: The offload_device argument has been removed from dispatch_for_sequential to avoid deprecation warnings, as the function now implicitly respects the model's loaded device.
  • Introduced disk offloading examples: New examples (kimi_k2_example.py, qwen3_example.py) demonstrate how to use disk offloading for large model quantization.


Changelog
  • examples/disk_offloading/kimi_k2_example.py
    • Added a new example script demonstrating disk offloading for the unsloth/Kimi-K2-Instruct-0905-BF16 model using NVFP4 quantization.
  • examples/disk_offloading/qwen3_example.py
    • Added a new example script demonstrating disk offloading for the Qwen/Qwen3-0.6B model with NVFP4 quantization, including an emulation of limited CPU memory.
  • src/llmcompressor/entrypoints/utils.py
    • Removed the import and call to compressed_tensors.utils.remove_dispatch.
    • Imported compressed_tensors.offload.from_accelerate.
    • Added a call to from_accelerate in pre_process to convert accelerate offloaded models to compressed-tensors offloading.
  • src/llmcompressor/pipelines/sequential/helpers.py
    • Modified the dispatch_for_sequential function signature to make offload_device optional.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Updated the call to dispatch_for_sequential to no longer pass the offload_device argument.
  • src/llmcompressor/transformers/compression/compressed_tensors_utils.py
    • Imported compressed_tensors.offload.to_accelerate.
    • Added a call to to_accelerate within save_pretrained_wrapper to convert the model to accelerate offloading before saving.
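The dispatch_for_sequential signature change follows a common deprecation pattern: accept the old argument optionally, warn if it is passed, and ignore it in favor of the device the model was loaded on. A sketch under those assumptions (the function body here is illustrative, not the actual pipeline code):

```python
# Illustrative deprecated-optional-argument pattern for the
# dispatch_for_sequential signature change described in this PR.
import warnings

_UNSET = object()  # sentinel so that explicitly passing None still warns


def dispatch_for_sequential(model, offload_device=_UNSET):
    if offload_device is not _UNSET:
        warnings.warn(
            "offload_device is deprecated; the device the model was "
            "loaded on is respected instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... dispatch logic: respect the device the model was loaded on ...
    return model
```

Callers that still pass offload_device get a DeprecationWarning; callers using the new signature (as pipeline.py now does) see no warning.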
Activity
  • No human activity (comments, reviews, progress updates) was provided in the context.

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for disk offloading for large models, a valuable feature. The changes are well-structured and include updates to handle model state conversion for offloading, modifications to the saving process, and the addition of helpful example scripts. My review focuses on a few minor areas for improvement to enhance code clarity and remove redundancy. Specifically, I've pointed out a redundant argument in the new example files and an unused variable resulting from the refactoring. Addressing these points will make the code cleaner and more maintainable. Overall, this is a solid contribution.

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs force-pushed the kylesayrs/support-disk-offloading branch from 84a7f80 to 43bdec4 on February 17, 2026 17:54
@kylesayrs kylesayrs marked this pull request as ready for review February 17, 2026 17:59
@kylesayrs kylesayrs added the ready When a PR is ready for review label Feb 17, 2026
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs removed the documentation Improvements or additions to documentation label Feb 17, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026
@HDCharles HDCharles removed the documentation Improvements or additions to documentation label Feb 17, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026

Labels

documentation Improvements or additions to documentation ready When a PR is ready for review
