[Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier

See [[RFC] [Distributed] Sequential Onloading with Data-Parallel Calibration and Weight-Parallel Optimization #2180](https://github.com/vllm-project/llm-compressor/issues/2180)

Any module that performs activation quantization should use the following update strategy. First, each rank calculates minimum and maximum values over its local data subset. Then, these min/max statistics are reduced across all ranks. Finally, each rank uses the reduced min/max statistics to calculate and update qparams for each parameter. While this work is technically duplicated across ranks, the workload of calculating qparams is very small and almost certainly costs less than an additional distributed reduce of the final parameters would.
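The three steps above can be sketched without a real process group. This is a minimal illustration, not the llm-compressor implementation: ranks are simulated as plain Python lists, and `MinMax`, `local_minmax`, `all_reduce_minmax`, and `calculate_qparams` are hypothetical names. In a real run the simulated reduce would presumably be a pair of `torch.distributed.all_reduce` calls with `ReduceOp.MIN` and `ReduceOp.MAX`. The qparam formula assumes 8-bit asymmetric quantization, which the issue does not specify.

```python
from dataclasses import dataclass


@dataclass
class MinMax:
    min_val: float
    max_val: float


def local_minmax(activations):
    """Step 1: each rank observes only its own data subset."""
    return MinMax(min(activations), max(activations))


def all_reduce_minmax(per_rank_stats):
    """Step 2: simulated collective (assumption: stands in for
    all_reduce with MIN on the minimums and MAX on the maximums)."""
    return MinMax(
        min(s.min_val for s in per_rank_stats),
        max(s.max_val for s in per_rank_stats),
    )


def calculate_qparams(stats, num_bits=8):
    """Step 3: every rank derives identical qparams from the reduced stats.
    Assumes asymmetric quantization over a range that includes zero."""
    qmin, qmax = 0, 2**num_bits - 1
    lo = min(stats.min_val, 0.0)
    hi = max(stats.max_val, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


# Two simulated ranks, each holding a different slice of calibration data.
rank_data = [[-1.0, 0.5, 2.0], [-3.0, 0.25, 1.0]]

local_stats = [local_minmax(data) for data in rank_data]
global_stats = all_reduce_minmax(local_stats)

# The qparam calculation is duplicated on every rank instead of being
# computed once and communicated again.
qparams_per_rank = [calculate_qparams(global_stats) for _ in rank_data]
```

Because every rank starts from the same reduced statistics, the duplicated calculation yields the same scale and zero point everywhere, so no second collective is needed to synchronize the qparams.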

To avoid excessive distributed communication, the reduce and qparam update steps should be done in batches or at the end of each sequential layer, rather than after every calibration sample.
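One way to picture the deferred reduce is the sketch below, again with simulated ranks and hypothetical names (`LocalObserver`, `reduce_once`, and the module names `q_proj` and `k_proj` are illustrative only). Each rank accumulates running min/max values locally over all of its calibration batches, and a single reduce runs at the end of the layer. Packing each module's statistics as `[min, -max]` so that one MIN reduction covers both is an optional trick assumed here, not something the issue prescribes.

```python
class LocalObserver:
    """Accumulates running min/max on one rank with no communication."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def update(self, batch):
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))


def reduce_once(observers_per_rank):
    """Simulate a single collective covering every module in the layer.

    Each rank packs [min, -max] per module into one flat buffer, so an
    elementwise MIN across ranks reduces minimums and maximums together.
    """
    names = sorted(observers_per_rank[0])
    packed = [
        [v for n in names for v in (obs[n].min_val, -obs[n].max_val)]
        for obs in observers_per_rank
    ]
    reduced = [min(column) for column in zip(*packed)]
    return {n: (reduced[2 * i], -reduced[2 * i + 1]) for i, n in enumerate(names)}


# Calibration batches seen by each rank, keyed by (hypothetical) module name.
batches_per_rank = [
    {"q_proj": [[0.1, -0.4], [0.9, 0.2]], "k_proj": [[1.0, 2.0], [-2.0, 0.5]]},
    {"q_proj": [[-1.5, 0.3], [0.0, 0.7]], "k_proj": [[0.5, 3.0], [1.0, 1.5]]},
]

observers_per_rank = []
for rank_batches in batches_per_rank:
    observers = {name: LocalObserver() for name in rank_batches}
    for name, batches in rank_batches.items():
        for batch in batches:
            observers[name].update(batch)  # local only, no reduce here
    observers_per_rank.append(observers)

# One reduce at the end of the layer instead of one per batch per module.
global_minmax = reduce_once(observers_per_rank)
```

In this toy setup each rank processes four batches across two modules, yet only one simulated collective runs; the qparam update would then follow once per module from `global_minmax`.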

<img width="1279" height="823" alt="Image" src="https://github.com/user-attachments/assets/c8faff38-85e4-42d2-83f8-2b66c563aea7" />

[Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier #2220
