
[Offloading] Support Disk Offloading #2373

Open
kylesayrs wants to merge 4 commits into main from kylesayrs/support-disk-offloading

Conversation

@kylesayrs
Collaborator

Purpose

  • Support disk offloading for very large models
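Disk offloading keeps most parameters on disk and materializes them only when a layer needs them, so peak memory is bounded by the working set rather than the full model. The sketch below illustrates the core idea with a minimal, stdlib-only parameter store; it is not llm-compressor's actual implementation, and every name in it is illustrative.

```python
# Minimal sketch of the disk-offloading idea, independent of llm-compressor.
# All names here are illustrative, not the library's actual API.
import pickle
import tempfile
from pathlib import Path


class DiskOffloadedParams:
    """Map-like store that keeps each parameter in its own file on disk."""

    def __init__(self, offload_dir: str):
        self.dir = Path(offload_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def offload(self, name: str, value) -> None:
        # Write the parameter to disk so the in-memory copy can be dropped.
        with open(self.dir / f"{name}.pkl", "wb") as f:
            pickle.dump(value, f)

    def load(self, name: str):
        # Materialize the parameter only when it is actually needed.
        with open(self.dir / f"{name}.pkl", "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    store = DiskOffloadedParams(tmp)
    store.offload("layers.0.weight", [0.1, 0.2, 0.3])
    print(store.load("layers.0.weight"))  # [0.1, 0.2, 0.3]
```

In the real library, the on-disk format and dispatch are handled by accelerate/compressed-tensors rather than pickled files, but the access pattern is the same.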

Prerequisites

Examples

  • Add examples/disk_offloading/qwen3_example.py
  • Add examples/disk_offloading/kimi_k2_example.py

Changes

Required

  • Remove the post-processing step that called remove_dispatch
    • Previously, this was needed to avoid conflicts between dispatch_for_sequential and dispatch_for_generation.
    • Now the two functions are directly compatible: you no longer need to remove one function's dispatch to use the other
  • Add to_accelerate to save_pretrained_wrapper
    • This converts the model to accelerate offloading before saving
    • This ensures the best compatibility with save_pretrained and reduces excess memory usage that could otherwise cause GPU/CPU OOMs
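The save-path change can be pictured as a thin wrapper that converts the model's offload state before delegating to the underlying save method. The following is a minimal sketch of that pattern; to_accelerate, Model, and save_pretrained here are stand-ins, not the real compressed_tensors or transformers APIs.

```python
# Sketch of the wrapper pattern: convert offload state, then save.
# to_accelerate and Model are illustrative stand-ins; the real conversion
# lives in compressed_tensors.offload.
def to_accelerate(model):
    # Stand-in for the real conversion; here we just flip a flag.
    model.offload_format = "accelerate"
    return model


class Model:
    def __init__(self):
        self.offload_format = "compressed-tensors"
        self.saved_with = None

    def save_pretrained(self, path):
        # Record how the model was saved, standing in for serialization.
        self.saved_with = (path, self.offload_format)


def save_pretrained_wrapper(model, path):
    # Ensure accelerate-style offloading before saving, mirroring the
    # change to compressed_tensors_utils.py described in this PR.
    to_accelerate(model)
    model.save_pretrained(path)


m = Model()
save_pretrained_wrapper(m, "out/")
print(m.saved_with)  # ('out/', 'accelerate')
```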

Hardening / Preparatory Changes

  • During oneshot preprocessing, convert with from_accelerate if possible. This guards against users who load their model outside of the load_offloaded_model context
  • Remove the offload_device argument from dispatch_for_sequential to avoid a deprecation warning
    • dispatch_for_sequential now always respects the device the model was loaded on
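The preprocessing guard amounts to: detect accelerate-style offloading and convert it before quantization begins, so oneshot behaves the same regardless of how the model was loaded. A minimal sketch, with is_accelerate_offloaded and from_accelerate as illustrative stand-ins for the real compressed_tensors helpers:

```python
# Illustrative guard in the style of the pre_process change; the detection
# and conversion helpers here are stand-ins, not the library's API.
def is_accelerate_offloaded(model) -> bool:
    return getattr(model, "offload_format", None) == "accelerate"


def from_accelerate(model):
    # Stand-in conversion from accelerate offloading to
    # compressed-tensors offloading.
    model.offload_format = "compressed-tensors"
    return model


def pre_process(model):
    # Convert from accelerate offloading if possible, guarding against
    # models loaded outside the load_offloaded_model context.
    if is_accelerate_offloaded(model):
        from_accelerate(model)
    return model


class LoadedModel:  # minimal stand-in for a model loaded via accelerate
    offload_format = "accelerate"


m = pre_process(LoadedModel())
print(m.offload_format)  # compressed-tensors
```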

Testing

  • Ran Qwen/Qwen3-0.6B example to completion
  • [IN PROGRESS] Run unsloth/Kimi-K2-Instruct-0905-BF16 example to completion

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for disk offloading, enabling the quantization of very large models that exceed available CPU or GPU memory. The changes streamline the offloading process by removing redundant dispatch handling, integrating accelerate offloading conversion during model saving, and ensuring proper offloading conversion during preprocessing. This significantly enhances the ability to work with memory-intensive models by leveraging disk storage efficiently.

Highlights

  • Removed remove_dispatch post-processing: The remove_dispatch call, previously used to prevent conflicts between dispatch_for_sequential and dispatch_for_generation, has been eliminated as these functions are now directly compatible.
  • Integrated to_accelerate into save_pretrained_wrapper: The save_pretrained_wrapper now converts the model to accelerate offloading before saving, improving compatibility with save_pretrained and reducing memory usage.
  • Added from_accelerate conversion during preprocessing: Models loaded with accelerate offloading are now converted to compressed-tensors offloading during oneshot preprocessing, ensuring proper handling even if not loaded via load_offloaded_model.
  • Simplified dispatch_for_sequential: The offload_device argument has been removed from dispatch_for_sequential to avoid deprecation warnings, as the function now implicitly respects the model's loaded device.
  • Introduced disk offloading examples: New examples (kimi_k2_example.py, qwen3_example.py) demonstrate how to use disk offloading for large model quantization.


Changelog
  • examples/disk_offloading/kimi_k2_example.py
    • Added a new example script demonstrating disk offloading for the unsloth/Kimi-K2-Instruct-0905-BF16 model using NVFP4 quantization.
  • examples/disk_offloading/qwen3_example.py
    • Added a new example script demonstrating disk offloading for the Qwen/Qwen3-0.6B model with NVFP4 quantization, including an emulation of limited CPU memory.
  • src/llmcompressor/entrypoints/utils.py
    • Removed the import and call to compressed_tensors.utils.remove_dispatch.
    • Imported compressed_tensors.offload.from_accelerate.
    • Added a call to from_accelerate in pre_process to convert accelerate offloaded models to compressed-tensors offloading.
  • src/llmcompressor/pipelines/sequential/helpers.py
    • Modified the dispatch_for_sequential function signature to make offload_device optional.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Updated the call to dispatch_for_sequential to no longer pass the offload_device argument.
  • src/llmcompressor/transformers/compression/compressed_tensors_utils.py
    • Imported compressed_tensors.offload.to_accelerate.
    • Added a call to to_accelerate within save_pretrained_wrapper to convert the model to accelerate offloading before saving.
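The dispatch_for_sequential signature change follows a common deprecation pattern: accept the old argument optionally, warn if it is passed, and ignore it in favor of the device the model was loaded on. A sketch under those assumptions (the function body here is illustrative, not the actual pipeline code):

```python
# Illustrative deprecated-optional-argument pattern for the
# dispatch_for_sequential signature change described in this PR.
import warnings

_UNSET = object()  # sentinel so that explicitly passing None still warns


def dispatch_for_sequential(model, offload_device=_UNSET):
    if offload_device is not _UNSET:
        warnings.warn(
            "offload_device is deprecated; the device the model was "
            "loaded on is respected instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... dispatch logic: respect the device the model was loaded on ...
    return model
```

Callers that still pass offload_device get a DeprecationWarning; callers using the new signature (as pipeline.py now does) see no warning.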
Activity
  • No human activity (comments, reviews, progress updates) was provided in the context.

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for disk offloading for large models, a valuable feature. The changes are well-structured and include updates to handle model state conversion for offloading, modifications to the saving process, and the addition of helpful example scripts. My review focuses on a few minor areas for improvement to enhance code clarity and remove redundancy. Specifically, I've pointed out a redundant argument in the new example files and an unused variable resulting from the refactoring. Addressing these points will make the code cleaner and more maintainable. Overall, this is a solid contribution.

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs force-pushed the kylesayrs/support-disk-offloading branch from 84a7f80 to 43bdec4 on February 17, 2026 17:54
@kylesayrs kylesayrs marked this pull request as ready for review February 17, 2026 17:59
@kylesayrs kylesayrs added the ready When a PR is ready for review label Feb 17, 2026
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs removed the documentation Improvements or additions to documentation label Feb 17, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026
@HDCharles HDCharles removed the documentation Improvements or additions to documentation label Feb 17, 2026
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 17, 2026

Labels

documentation Improvements or additions to documentation ready When a PR is ready for review
