[GPTQ][ddp] enabling DDP for GPTQ #2333

Open

HDCharles wants to merge 14 commits into main from 94_ddp_api

Conversation

@HDCharles
Collaborator

@HDCharles HDCharles commented Feb 6, 2026

After the changes in vllm-project/compressed-tensors#572, vllm-project/compressed-tensors#534, and #2340, we're ready to start rolling out DDP implementations of various modifiers.

API:

The API we've landed on attempts to maintain the normal flow, with only the minimal changes necessary to enable DDP:

  1. the user will call torchrun --nproc_per_node=<num_threads> script.py to start the script
  2. the user will initialize the distributed context (they can use the init_dist helper to do this)
  3. the user will load the model using the new context manager, setting the device map as outlined here (for most users this will be "auto_offload")
  4. (optional) the user can partition the dataset at load time using get_rank_partition, or load as normal and let oneshot partition the data later (the latter loads one copy of the dataset into CPU memory per rank, which may be onerous)
from compressed_tensors.offload import load_offloaded_model, init_dist

init_dist()
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto_offload")
...
ds = load_dataset(
    DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)
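
For illustration, here is a minimal sketch of a rank-partitioning helper along the lines of get_rank_partition, assuming it returns a datasets split-slice string (e.g. "train[0:256]") and that the last rank absorbs any remainder; the body is illustrative, not the shipped implementation:

import torch.distributed as dist

def get_rank_partition(split: str, num_samples: int) -> str:
    """Return the slice of `split` this rank should load (illustrative only)."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1

    per_rank = num_samples // world_size
    start = rank * per_rank
    # the last rank picks up any remainder so every sample is covered
    end = num_samples if rank == world_size - 1 else start + per_rank
    return f"{split}[{start}:{end}]"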

Implementation

Adding DDP support to GPTQ was relatively straightforward, though optimizing it for speed was a bit trickier. There are four steps, sketched in code below:

  1. assign each module to the rank that will compress it
  2. for each module, send the Hessian information from all other ranks to its assigned rank
  3. each rank compresses the modules it was assigned
  4. broadcast the final quantized values to all ranks
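
A minimal sketch of this flow, assuming an in-memory Hessian dict keyed by module, asynchronous collectives, and placeholder parameter names standing in for the real _GPTQ_Q_PARAMS (illustrative only, not the actual GPTQModifier code):

import torch.distributed as dist

def distributed_compress(modules, hessians, num_samples, compress_single_module):
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # 1. assign each module to a rank (the real code load-balances; see below)
    assignment = {module: i % world_size for i, module in enumerate(modules)}

    # 2. reduce each Hessian onto its assigned rank, asynchronously
    #    (sample counts are handled similarly in the real implementation)
    pending = [
        dist.reduce(hessians[module], dst=assignment[module],
                    op=dist.ReduceOp.SUM, async_op=True)
        for module in modules
    ]
    for work in pending:
        work.wait()

    # 3. each rank compresses only the modules assigned to it
    for module in modules:
        if assignment[module] == rank:
            compress_single_module(module, hessians[module], num_samples[module])

    # 4. broadcast the quantized parameters from the owning rank to all ranks
    pending = []
    for module in modules:
        for name in ("weight", "weight_scale", "weight_zero_point"):  # placeholders
            tensor = getattr(module, name, None)
            if tensor is not None:
                pending.append(dist.broadcast(tensor, src=assignment[module],
                                              async_op=True))
    for work in pending:
        work.wait()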

Step 1 required the largest optimization: without any load balancing, we ran into situations where one rank could be doing twice as much work as another. We therefore implemented basic load balancing with time estimation, which seems to be working well in practice. The other major optimization was using asynchronous ops for rank-to-rank communication. Before these optimizations, 2-rank GPTQ was as fast as 1-rank GPTQ for llama3-8B; afterward it results in a 27% speedup despite this being a relatively small model.
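
For reference, the load balancing can be done with a simple greedy bin packing over estimated per-module compression times; a sketch under that assumption (the PR adds a greedy_bin_packing utility, but this body is illustrative rather than the actual code):

from typing import Dict, Hashable, TypeVar

T = TypeVar("T", bound=Hashable)

def greedy_bin_packing(costs: Dict[T, float], num_bins: int) -> Dict[T, int]:
    """Assign each item to the currently least-loaded bin, largest first.

    costs maps an item (e.g. a module) to an estimated compression time;
    the result maps each item to a bin (rank) index.
    """
    loads = [0.0] * num_bins
    assignment: Dict[T, int] = {}
    # placing the most expensive items first keeps the bins balanced
    for item, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        bin_idx = min(range(num_bins), key=lambda i: loads[i])
        assignment[item] = bin_idx
        loads[bin_idx] += cost
    return assignment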

TODO insert benchmarks here

GPTQ Changes

While validating the numerical accuracy of the DDP technique, we noticed that accuracy improved significantly with each rank added. After some debugging we realized this was because the existing Hessian calculation was accumulating floating point error. By rewriting the Hessian calculation to sum the intermediate Hessians and only divide by num_samples at the end, we improved the GSM8K evaluation from (.67, .66) to (.71, .71). You can reproduce these results here
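
Roughly, the fix replaces a per-batch running average with an unnormalized sum that is divided exactly once before quantization. A sketch, assuming the standard GPTQ Hessian H = (2/N) * sum(x^T x) and illustrative variable names:

import torch

def accumulate_hessian(H: torch.Tensor, inp: torch.Tensor, num_samples: int):
    """Add one calibration batch's unnormalized contribution, 2 * X^T X.

    Dividing by the total sample count only once, at quantization time,
    avoids the repeated rescaling (and the accumulated rounding error)
    of a running-average update.
    """
    inp = inp.to(torch.float32)   # accumulate in fp32 for stability
    H += 2.0 * inp.t() @ inp      # inp has shape (batch, hidden_dim)
    return H, num_samples + inp.shape[0]

# at quantization time (illustrative):
# hessian = H / num_samples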

TODO remove test code and add an example script

@github-actions

github-actions bot commented Feb 6, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for running GPTQ quantization with Distributed Data Parallel (DDP). The core changes involve distributing the quantization workload across multiple processes, where each process handles a portion of the model's modules. It also includes necessary adjustments for dataset partitioning, model loading, and saving in a distributed environment, aiming to enable more efficient quantization of large language models.

Highlights

  • Distributed GPTQ Quantization: The GPTQModifier has been updated to support Distributed Data Parallel (DDP) quantization, allowing multiple ranks to collectively quantize model modules by distributing the workload, reducing Hessians, and broadcasting results.
  • Dataset Partitioning for DDP: A new utility function get_rank_partition was added to src/llmcompressor/datasets/utils.py to enable partitioning of datasets across DDP ranks, ensuring each rank processes a distinct subset of calibration samples.
  • Distributed Model Loading and Saving: Modifications were made to save_pretrained_wrapper to handle DDP environments, ensuring only the main rank saves the model and temporarily disabling offloading during the save process. A new ct_offload context manager was introduced to patch from_pretrained for distributed loading.
  • Dynamic Device Assignment: The get_main_device utility now dynamically assigns CUDA or XPU devices based on the current DDP rank, improving device utilization in distributed setups.
  • DDP Test Case: A new test file test_ddp.py has been added to demonstrate and validate the GPTQ quantization process within a DDP environment, including distributed model loading, dataset preparation, and quantized model saving.

Changelog
  • src/llmcompressor/datasets/__init__.py
    • Exports the newly added get_rank_partition function.
  • src/llmcompressor/datasets/utils.py
    • Adds get_rank_partition function to divide a dataset split into partitions for each DDP rank, ensuring the last rank gets any remainder.
  • src/llmcompressor/modifiers/quantization/gptq/base.py
    • Imports torch.distributed as dist for DDP functionalities.
    • Refactors compress_modules to support both non-distributed and distributed quantization workflows.
    • Introduces a distributed logic where ranks collectively quantize modules: each rank processes a subset of modules, reduces Hessians, and then broadcasts quantization parameters.
    • Extracts the single module compression logic into a new compress_single_module method for reusability.
  • src/llmcompressor/transformers/compression/compressed_tensors_utils.py
    • Imports OffloadCache for managing offloaded tensors.
    • Modifies save_pretrained_wrapper to ensure only rank 0 performs the model saving in a distributed setup (see the sketch after this changelog).
    • Adds a context manager OffloadCache.disable_onloading() around the original_save_pretrained call to prevent materializing the entire model on device during saving.
  • src/llmcompressor/utils/dev.py
    • Updates get_main_device to use the specific CUDA/XPU device corresponding to the current DDP rank, rather than always 0.
  • test_ddp.py
    • Adds a new test file for GPTQ with DDP.
    • Includes init_dist for DDP setup and is_ddp utility.
    • Defines convert_to_ct_offload to handle model offloading based on device_map configurations.
    • Implements patch_from_pretrained to modify from_pretrained behavior for distributed loading, adjusting device_map for each rank.
    • Introduces ct_offload context manager to wrap from_pretrained calls, manage DDP initialization, and apply patching/cleanup.
    • Demonstrates loading a model, partitioning a dataset using get_rank_partition, applying GPTQModifier, and saving the quantized model in a DDP environment.
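
As a rough illustration of the two save-time changes above, assuming torch.distributed is already initialized when appropriate and that OffloadCache.disable_onloading behaves as described in this changelog (a sketch, not the actual wrapper):

import torch.distributed as dist

def save_pretrained_ddp(model, save_directory, original_save_pretrained, OffloadCache):
    """Save only from rank 0, without onloading the full model (illustrative)."""
    is_main = not dist.is_initialized() or dist.get_rank() == 0
    if is_main:
        # keep offloaded tensors offloaded while serializing
        with OffloadCache.disable_onloading():
            original_save_pretrained(model, save_directory)
    if dist.is_initialized():
        dist.barrier()  # keep other ranks in step while rank 0 writes to disk
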
Activity
  • This pull request is marked as a Proof of Concept (PoC) for GPTQ with DDP, indicating it's an initial implementation.
  • The author notes that there are 'some specifics to work through as apis are updated in compressed tensors', suggesting potential future refinements or dependencies.
  • A test plan is provided, instructing users to run torchrun --nproc_per_node=2 test_ddp.py to validate the changes.

@HDCharles HDCharles added the enhancement (New feature or request) and gptq (For any PR / issue related to GPTQ support) labels Feb 6, 2026
@mergify
Contributor

mergify bot commented Feb 6, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Proof of Concept for GPTQ with Distributed Data Parallel (DDP). The changes are mainly in the GPTQ modifier to handle distributed computation of Hessians and quantization. My review has identified a critical bug in the non-distributed path that would lead to incomplete quantization, as well as a high-severity issue in the new distributed logic that could cause a runtime error. I've also provided suggestions to improve code clarity and remove redundant or temporary code sections.


medium

The key for module is already removed from self._num_samples on line 283 using pop(). This second call to pop() is redundant and can be removed.

@mergify mergify bot removed the quality-failed label Feb 9, 2026

@HDCharles HDCharles changed the title [GPTQ][ddp] PoC for GPTQ with DDP [GPTQ][ddp] enabling DDP for GPTQ Feb 18, 2026
@HDCharles HDCharles added the ready (When a PR is ready for review) and dist (Work pertaining to distributed work) labels Feb 18, 2026
@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 18, 2026
module=module,
quant_args=quant_args,
hessians_dict=self._hessians,
hessian=self._hessians[module] / self._num_samples[module],
Collaborator

Consider passing num_samples as an arg to quantize_weight

Collaborator Author

i'm unsure why this was implemented by passing entire dicts originally. Seems like i'd rather make the function more explicit on what its acting on i.e. what i have here.

Collaborator

@kylesayrs kylesayrs Feb 18, 2026

I agree that not passing a dictionary is cleaner, but it comes at a memory cost since we cannot "move" the hessian memory. This is an instance where I feel like (better behavior) is preferable to (cleaner code)

Collaborator

Spoke offline, we agreed to pop the value from the dict.
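
For clarity, the agreed change amounts to roughly the following (illustrative, not the exact diff):

# pop so the Hessian buffer is released as soon as this module is compressed
num_samples = self._num_samples.pop(module)
hessian = self._hessians.pop(module) / num_samples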


Collaborator

@kylesayrs kylesayrs left a comment


Just small programming nits, otherwise looks excellent

Comment on lines +340 to +343
if rank == target_rank:
wait_for_comms(pending_comms)
self._hessians[module] = H
self._num_samples[module] = n
Collaborator

I don't fully understand this logic. Why is this step needed? Can't you just write to the memory address directly using

for module in module_list:
    h_comm = dist.reduce(
        self._hessians[module],
        op=dist.ReduceOp.SUM,
        dst=target_rank,
        async_op=True
    )

    pending_comms.append(h_comm)

wait_for_comms(pending_comms)

This way seems to maximize throughput more than how it's written now, right?

Collaborator

It seems like you take a similar approach when broadcasting.


# Broadcast each tensor asynchronously
# note: update in place, since compress_module_list updated the offload
for tensor in to_broadcast:
Collaborator

What's the benefit of splitting into two for loops? Why not just write

for module in module_list:
    for attr in _GPTQ_Q_PARAMS:
        if (tensor := getattr(module, attr, None)) is not None:
            pending_comms.append(dist.broadcast(tensor, ...))


T = TypeVar("T", bound=Hashable)


def greedy_bin_packing(
Collaborator

Beautiful

Comment on lines +72 to +73
torch.cuda.reset_peak_memory_stats()
start_time = time.time()
Collaborator

Did you mean to keep these?
