feat: implement column parallel for lm head to improve performance. #1145
wxh571001500 wants to merge 3 commits into jd-opensource:main from
Conversation
Code Review
This pull request refactors the LmHead to utilize ColumnParallelLinearImpl and introduces support for vocabulary padding to ensure alignment during tensor parallel operations. The review feedback identifies several critical issues: a regression caused by hardcoding quantization arguments in a general-purpose linear layer constructor, memory inefficiencies when sharding padded tensors, and a potential bug in state dict lookups using incorrect keys. Additionally, there is a recommendation to deduplicate the vocabulary padding calculation logic into a shared utility to improve maintainability.
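The deduplication the review asks for could look like the following minimal sketch of a shared utility. The helper name, signature, and default alignment value are illustrative assumptions, not code from this PR:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shared helper: round the vocabulary size up to a multiple of
// world_size * alignment so that every tensor-parallel rank receives an
// equally sized shard. Name and default alignment are assumptions.
inline int64_t compute_padded_vocab_size(int64_t vocab_size,
                                         int64_t world_size,
                                         int64_t alignment = 64) {
  const int64_t multiple = world_size * alignment;
  return (vocab_size + multiple - 1) / multiple * multiple;
}
```

For example, with `vocab_size = 128256`, `world_size = 8`, and the assumed alignment of 64, this yields a padded vocabulary of 128512 rows, i.e. 16064 rows per rank.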
xllm/core/layers/common/linear.cpp
Outdated
      /*bias=*/false,
      /*gather_output=*/true,
-     context.get_quant_args(),
+     QuantArgs{},
Hardcoding QuantArgs{} in this constructor is a regression. This constructor is a general-purpose helper for ColumnParallelLinearImpl and should respect the quantization settings provided in the ModelContext. By forcing it to be unquantized, you break quantization for any other model layers that might use this constructor. Instead, keep this constructor generic and have the LmHead class explicitly pass QuantArgs{} when it instantiates the linear layer.
Suggested change:
-     QuantArgs{},
+     context.get_quant_args(),
Because LmHead does not support quantization.
-     QuantArgs{},  // do not use quantization for lm_head!
-     context.get_parallel_args().tp_group_,
-     context.get_tensor_options())) {}
+     : ModuleHolder(std::make_shared<ColumnParallelLinearImpl>(context)) {}
Since lm_head must remain unquantized for numerical stability, you should explicitly pass QuantArgs{} to the ColumnParallelLinearImpl constructor here, rather than relying on a modified generic constructor in linear.cpp.
    : ModuleHolder(std::make_shared<ColumnParallelLinearImpl>(
          context.get_model_args().hidden_size(),
          context.get_model_args().vocab_size(),
          /*bias=*/false,
          /*gather_output=*/true,
          QuantArgs{},  // Explicitly disable quantization for lm_head
          context.get_parallel_args().tp_group_,
          context.get_tensor_options())) {}

The corresponding constructor function has been modified.
  if (padded_vocab_size > tensor.size(0)) {
    mutable_tensor = pad_vocab_tensor(tensor, padded_vocab_size);
    mutable_tensor =
        shard_padded_tensor(mutable_tensor, dim, rank, world_size);
This implementation is highly memory-inefficient. It allocates a full padded vocabulary tensor on every rank before sharding it. For large vocabularies (e.g., 128k), this causes a significant and unnecessary host memory spike. When sharding on the vocabulary dimension (dim == 0), you should calculate the specific row range required for the current rank and only slice/pad that portion.
if (padded_vocab_size > tensor.size(0)) {
  if (dim == 0) {
    // Sharding on the vocabulary dimension: compute only this rank's row
    // range instead of materializing the full padded tensor.
    int64_t shard_size = padded_vocab_size / world_size;
    int64_t start_idx = rank * shard_size;
    int64_t end_idx = (rank + 1) * shard_size;
    if (start_idx >= tensor.size(0)) {
      // This rank's shard lies entirely in the padded region: all zeros.
      mutable_tensor = torch::zeros({shard_size, tensor.size(1)}, tensor.options());
    } else {
      auto valid_part = tensor.slice(0, start_idx, std::min(end_idx, tensor.size(0)));
      if (valid_part.size(0) < shard_size) {
        // Shard straddles the padding boundary: copy the valid rows, zero the rest.
        mutable_tensor = torch::zeros({shard_size, tensor.size(1)}, tensor.options());
        mutable_tensor.slice(0, 0, valid_part.size(0)).copy_(valid_part);
      } else {
        mutable_tensor = valid_part.clone();
      }
    }
  } else {
    // Non-vocab dimensions are unaffected by padding size; keep the old path.
    mutable_tensor = pad_vocab_tensor(tensor, padded_vocab_size);
    mutable_tensor = shard_padded_tensor(mutable_tensor, dim, rank, world_size);
  }
}
xllm/core/layers/common/linear.cpp
Outdated
      /*bias=*/false,
      /*gather_output=*/true,
-     context.get_quant_args(),
+     QuantArgs{},
Why change context.get_quant_args() to QuantArgs{}?
Because LmHead does not support quantization.
Comments have been added.
LmHead performance improved by 3%.