See [RFC] [Distributed] Sequential Onloading with Data-Parallel Calibration and Weight-Parallel Optimization #2180
Any module that performs activation quantization should use the following update strategy. First, each rank calculates minimum and maximum values over its local data subset. Then, these min/max statistics are reduced across all ranks. Finally, each rank uses the reduced min/max statistics to calculate and update the qparams for each parameter. While this work is technically duplicated across ranks, calculating qparams is very cheap, almost certainly cheaper than an additional distributed operation to share the final parameters.
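The three steps above can be sketched in a single process, with plain lists standing in for ranks. This is a minimal illustration, not the proposed implementation: `local_min_max`, `all_reduce_min_max`, and `compute_qparams` are hypothetical names, the asymmetric 8-bit formula is an assumed qparam scheme, and in a real run the reduce step would presumably be a `torch.distributed.all_reduce` with `ReduceOp.MIN` and `ReduceOp.MAX`.

```python
from typing import List, Tuple


def local_min_max(values: List[float]) -> Tuple[float, float]:
    """Step 1: statistics over this rank's local data subset."""
    return min(values), max(values)


def all_reduce_min_max(per_rank: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Step 2: stand-in for an all-reduce (MIN over mins, MAX over maxes)."""
    return min(s[0] for s in per_rank), max(s[1] for s in per_rank)


def compute_qparams(min_val: float, max_val: float, num_bits: int = 8) -> Tuple[float, int]:
    """Step 3: assumed asymmetric scheme; runs identically on every rank."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Make sure zero is representable in the quantized range.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


# Three simulated ranks, each holding a different slice of calibration data.
rank_data = [[-1.0, 0.5], [0.2, 3.0], [-2.0, 1.0]]
local_stats = [local_min_max(d) for d in rank_data]
reduced = all_reduce_min_max(local_stats)
# Every rank computes qparams from the same reduced statistics.
qparams_per_rank = [compute_qparams(*reduced) for _ in rank_data]
```

Because every rank starts from the same reduced statistics, the resulting qparams agree across ranks without a second communication step.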
To avoid excessive distributed communication, the reduce and qparam update steps should be done in batches, or at the end of each sequential layer.
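One way to read this is that each rank keeps a running local min/max over many calibration batches and synchronizes only once per layer. The sketch below simulates that in a single process; `LocalObserver` and `sync_observers` are hypothetical names, and the sync function stands in for what would be a single pair of distributed MIN/MAX reductions.

```python
from typing import List


class LocalObserver:
    """Accumulates running min/max on one rank without any communication."""

    def __init__(self) -> None:
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch: List[float]) -> None:
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))


def sync_observers(observers: List[LocalObserver]) -> int:
    """Stand-in for one reduce at the end of a layer; returns rounds used."""
    global_min = min(o.min_val for o in observers)
    global_max = max(o.max_val for o in observers)
    for o in observers:
        o.min_val, o.max_val = global_min, global_max
    return 1


# Two simulated ranks, two calibration batches each.
batches_per_rank = [
    [[0.1, 0.4], [-0.3, 0.2]],
    [[1.5, 0.0], [0.7, -0.9]],
]
observers = [LocalObserver() for _ in batches_per_rank]
for obs, batches in zip(observers, batches_per_rank):
    for batch in batches:
        obs.observe(batch)  # local only, no communication per batch

rounds = sync_observers(observers)  # single sync after all batches
```

Here four batches are observed but only one synchronization is performed, instead of one per batch.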
