Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces initial support for running GPTQ quantization with Distributed Data Parallel (DDP). The core changes involve distributing the quantization workload across multiple processes, where each process handles a portion of the model's modules. It also includes necessary adjustments for dataset partitioning, model loading, and saving in a distributed environment, aiming to enable more efficient quantization of large language models.
Code Review
This pull request introduces a Proof of Concept for GPTQ with Distributed Data Parallel (DDP). The changes are mainly in the GPTQ modifier to handle distributed computation of Hessians and quantization. My review has identified a critical bug in the non-distributed path that would lead to incomplete quantization, as well as a high-severity issue in the new distributed logic that could cause a runtime error. I've also provided suggestions to improve code clarity and remove redundant or temporary code sections.
Review comments on src/llmcompressor/transformers/compression/compressed_tensors_utils.py (outdated, resolved).
Force-pushed from 8bd98ff to 15776e2.
Force-pushed from 9698027 to 2f84e2e.
some specifics to work through as APIs are updated in compressed-tensors. Signed-off-by: HDCharles <[email protected]>
Force-pushed from e15e615 to 80389cc.
```diff
  module=module,
  quant_args=quant_args,
- hessians_dict=self._hessians,
+ hessian=self._hessians[module] / self._num_samples[module],
```
Consider passing num_samples as an arg to quantize_weight
I'm unsure why this was originally implemented by passing entire dicts. It seems like I'd rather make the function more explicit about what it's acting on, i.e. what I have here.
I agree that not passing a dictionary is cleaner, but it comes at a memory cost since we cannot "move" the hessian memory. This is an instance where I feel like (better behavior) is preferable to (cleaner code)
Spoke offline, we agreed to pop the value from the dict.
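For reference, a minimal sketch of that resolution; the exact call site is assumed for illustration and isn't copied from the final diff:

```python
# Pop the Hessian and sample count so their memory can be released once the
# module has been quantized, while still passing an explicit tensor.
hessian = self._hessians.pop(module) / self._num_samples.pop(module)
```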
Force-pushed from 80389cc to 7c1e74b.
kylesayrs left a comment:

Just small programming nits, otherwise looks excellent
```python
if rank == target_rank:
    wait_for_comms(pending_comms)
    self._hessians[module] = H
    self._num_samples[module] = n
```
I don't fully understand this logic. Why is this step needed? Can't you just write to the memory address directly using
```python
for module in module_list:
    h_comm = dist.reduce(
        self._hessians[module],
        op=dist.ReduceOp.SUM,
        dst=target_rank,
        async_op=True
    )
    pending_comms.append(h_comm)
wait_for_comms(pending_comms)
```

This way seems to maximize throughput more than how it's written now, right?
It seems like you do a similar approach when broadcasting.
```python
# Broadcast each tensor asynchronously
# note: update in place, since compress_module_list updated the offload
for tensor in to_broadcast:
```
What's the benefit of splitting into two for loops? Why not just write
```python
for module in module_list:
    for attr in _GPTQ_Q_PARAMS:
        if (tensor := getattr(module, attr, None)) is not None:
            pending_comms.append(dist.broadcast(tensor, ...))
```
```python
T = TypeVar("T", bound=Hashable)


def greedy_bin_packing(
```
```python
torch.cuda.reset_peak_memory_stats()
start_time = time.time()
```
Did you mean to keep these?
After the changes in vllm-project/compressed-tensors#572, vllm-project/compressed-tensors#534, and #2340, we're ready to start rolling out DDP implementations of various modifiers.
API:
The API we've landed on attempts to maintain the normal flow, with the minimal changes necessary to enable DDP.
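The thread doesn't reproduce the API example, so the following is only a guess at what "maintaining the normal flow" might look like: the usual oneshot recipe, launched once per GPU (e.g. via torchrun), with the DDP coordination handled inside the modifier per the summary above. Nothing here is taken from the PR's actual example.

```python
# Hypothetical usage sketch (not from the PR): launch with
#   torchrun --nproc_per_node=<num_gpus> run_gptq_ddp.py
# and run the usual oneshot recipe; rank setup and calibration-data sharding
# are assumed to be handled inside the GPTQ modifier.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```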
Implementation
Adding the DDP process to GPTQ was relatively straightforward, though optimizing it for speed was a bit trickier. There are 4 steps.
Step 1 required the largest optimization: without any load balancing, we ran into situations where one rank could be doing twice as much work as another. Thus we implemented basic load balancing and time estimation, which seem to be working well in practice. The other major optimization was using asynchronous ops for thread-to-thread communication. Before these optimizations, 2-thread GPTQ was as fast as 1-thread GPTQ for llama3-8B; afterward it results in a 27% speedup despite llama3-8B being a relatively small model.
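Putting the pieces from the review threads together, a rough per-rank sketch of that flow is below. The helper names (greedy_bin_packing, wait_for_comms, compress_module_list, _GPTQ_Q_PARAMS) appear in the diffs above, but the surrounding glue, argument names, and partitioning details are assumed for illustration and aren't the PR's actual code.

```python
import torch.distributed as dist

def gptq_ddp_sketch(modules, hessians, rank, world_size):
    # 1. Partition modules across ranks, balancing estimated GPTQ cost per
    #    rank so no rank ends up with ~2x the work of another.
    partitions = greedy_bin_packing(modules, num_bins=world_size,
                                    cost_fn=estimate_cost)  # names assumed

    # 2. Sum each module's partial Hessian (computed from each rank's shard
    #    of the calibration data) onto the rank that owns that module, using
    #    async reduces so the transfers overlap.
    pending_comms = []
    for target_rank, module_list in enumerate(partitions):
        for module in module_list:
            pending_comms.append(dist.reduce(
                hessians[module], op=dist.ReduceOp.SUM,
                dst=target_rank, async_op=True))
    wait_for_comms(pending_comms)

    # 3. Each rank quantizes only the modules it owns.
    compress_module_list(partitions[rank])

    # 4. Broadcast the quantized parameters back so every rank ends up with
    #    the complete quantized model.
    pending_comms = []
    for src_rank, module_list in enumerate(partitions):
        for module in module_list:
            for attr in _GPTQ_Q_PARAMS:
                tensor = getattr(module, attr, None)
                if tensor is not None:
                    pending_comms.append(dist.broadcast(
                        tensor, src=src_rank, async_op=True))
    wait_for_comms(pending_comms)
```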
TODO insert benchmarks here
GPTQ Changes
While validating numerical accuracy of the DDP technique, we noticed that accuracy improved significantly for each thread added. After some debugging, we realized this was because the existing Hessian calculation was accumulating floating point error. By rewriting the Hessian calculation to sum the intermediate Hessians and only divide by num_samples at the end, we improved the GSM8K evaluation from (.67, .66) to (.71, .71). You can repro these results here.
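As a rough illustration of the difference, assuming the Hessian is built from calibration activations as H ≈ (2/n) Σ xᵀx; the function names and the exact form of the old running-average update are illustrative, not the PR's code:

```python
import torch

def hessian_running_average(xs):
    # Running-average style accumulation (illustrative of the old behavior):
    # the partial Hessian is rescaled on every batch, and each rescale rounds
    # in low precision, so floating point error grows with the batch count.
    H, n = None, 0
    for x in xs:  # x: (batch, hidden)
        b = x.shape[0]
        if H is None:
            H = torch.zeros(x.shape[1], x.shape[1], dtype=x.dtype)
        H *= n / (n + b)
        n += b
        H += (2.0 / n) * (x.T @ x)
    return H

def hessian_sum_then_divide(xs):
    # Fix described in the PR: sum the raw outer products and divide by the
    # sample count once at the end. Mathematically equivalent, numerically
    # more stable.
    H, n = None, 0
    for x in xs:
        if H is None:
            H = torch.zeros(x.shape[1], x.shape[1], dtype=x.dtype)
        H += 2.0 * (x.T @ x)
        n += x.shape[0]
    return H / n
```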
TODO remove test code and add an example script