
[Observers] Allow for case when weight shape and block size are not evenly divisible #2283

Open
brian-dellabetta wants to merge 19 commits into main from bdellabe/observer-incompatible-block-size

Conversation

Collaborator

@brian-dellabetta brian-dellabetta commented Jan 23, 2026

SUMMARY:
Update observer logic for block strategy when weight shape is not divisible by block size

Prerequisite:

TEST PLAN:

Run the script below with:

  • VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 VLLM_USE_DEEP_GEMM=1 for flashinfer
  • VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=1 for deepgemm
  • VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=0 for default
if __name__ == "__main__":
    from vllm import LLM, SamplingParams

    prompts = ["The Swiss Alps are", "Brad Marchand is", "The Toronto Maple Leafs are"]

    # Sampling parameters (temperature > 0, so decoding is stochastic, not greedy)
    sampling_params = SamplingParams(
        temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10
    )
    # FP8 block-quantized model, exercising the block observer path
    llm = LLM(
        "bdellabe/DeepSeek-V2-Lite-FP8-BLOCK",
        max_model_len=4096,
        enforce_eager=True,
    )
    output = llm.generate(prompts, sampling_params)
    for out in output:
        print(out.outputs[0].text)

    print("COMPLETE")

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @brian-dellabetta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the observer logic within the llmcompressor library, specifically for block quantization strategies. It introduces a mechanism to gracefully handle scenarios where the dimensions of a weight tensor are not perfectly divisible by the specified block size. By implementing a padding strategy that uses the median values of existing blocks, the system can now process such tensors without error, ensuring broader applicability of the block strategy. Additionally, error handling has been improved by converting assert False statements into explicit ValueError exceptions.

Highlights

  • Robust Block Strategy Handling: Implemented padding logic in the _flatten_weight function to correctly handle cases where the weight tensor's dimensions are not evenly divisible by the specified block size, preventing errors in block quantization.
  • Median Value Padding: The padding mechanism utilizes the median values of existing blocks to fill the newly added rows and columns, aiming to maintain data integrity and statistical properties.
  • Improved Error Handling: Replaced assert False statements with raise ValueError across _flatten_weight, _flatten_activation, and _flatten_attention functions for unknown strategies, providing more explicit and catchable error messages.
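
For intuition, here is a minimal sketch of the padding described in the highlights above, assuming a 2-D weight tensor. The name pad_to_block_multiple and the block_h/block_w parameters are illustrative, not the PR's actual _flatten_weight code, and a single global median stands in for the PR's per-block fill:

import torch

def pad_to_block_multiple(weight: torch.Tensor, block_h: int, block_w: int) -> torch.Tensor:
    # Distance from each dimension to the next multiple of the block size.
    pad_r = (block_h - weight.shape[0] % block_h) % block_h
    pad_c = (block_w - weight.shape[1] % block_w) % block_w
    if pad_r == 0 and pad_c == 0:
        return weight
    # Fill the new rows/columns with the median so the padding does not skew
    # per-block min/max statistics (a global median keeps this sketch short;
    # the PR derives the fill from the blocks being padded).
    padded = torch.full(
        (weight.shape[0] + pad_r, weight.shape[1] + pad_c),
        weight.median().item(),
        dtype=weight.dtype,
        device=weight.device,
    )
    padded[: weight.shape[0], : weight.shape[1]] = weight
    return padded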


Contributor

mergify bot commented Jan 23, 2026

The quality checks have failed. Please run make style and make quality in the
root directory to address the lint failures. You will need the dev optional
install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Signed-off-by: Brian Dellabetta <[email protected]>
@mergify mergify bot removed the quality-failed label Jan 23, 2026
Contributor

mergify bot commented Jan 23, 2026

The quality checks have failed. Please run make style and make quality in the
root directory to address the lint failures. You will need the dev optional
install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces logic to handle cases where weight shapes are not evenly divisible by the block size for block quantization. The core idea of padding the tensor is correct, but the implementation in _flatten_weight has several critical bugs related to padding calculation, tensor creation, and the padding fill logic itself. I've provided a detailed comment with a suggested fix that uses a more standard and robust approach with torch.nn.functional.pad. The other changes to replace assert False with raise ValueError are a good improvement.
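
For context, a torch.nn.functional.pad version of the same idea could look roughly like this; it is a sketch under the same illustrative block_h/block_w assumptions as above, not the reviewer's actual suggestion:

import torch
import torch.nn.functional as F

def pad_with_f_pad(weight: torch.Tensor, block_h: int, block_w: int) -> torch.Tensor:
    pad_r = (block_h - weight.shape[0] % block_h) % block_h
    pad_c = (block_w - weight.shape[1] % block_w) % block_w
    # F.pad's pad tuple is (left, right, top, bottom) for the last two
    # dims; pad only on the right/bottom with a constant fill value.
    return F.pad(weight, (0, pad_c, 0, pad_r), mode="constant", value=weight.median().item())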

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta
Collaborator Author

/gemini review

Signed-off-by: Brian Dellabetta <[email protected]>
@mergify mergify bot removed the quality-failed label Jan 26, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to handle cases where weight shapes are not evenly divisible by the block size in block quantization. This is achieved by padding the tensors with the mean of the values in the block being padded, which is a sensible approach to avoid distorting quantization parameters. The changes include a new helper function for padding and updates to the weight flattening logic. The accompanying tests are thorough and cover various scenarios. My main feedback is on the complexity of the new padding function, for which I've suggested a minor refactoring to improve readability.
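
To see why the padding matters downstream, here is a hedged sketch of the kind of per-block reduction a block observer performs once both dimensions are multiples of the block size (names illustrative, not the library's API):

import torch

def per_block_absmax(padded: torch.Tensor, block_h: int, block_w: int) -> torch.Tensor:
    rows, cols = padded.shape
    # Tile into (n_blocks_h, n_blocks_w, block_h, block_w); this reshape
    # only works because padding made rows/cols divisible by the block size.
    tiles = padded.reshape(rows // block_h, block_h, cols // block_w, block_w).permute(0, 2, 1, 3)
    # One abs-max per block, e.g. as input to FP8 scale computation.
    return tiles.abs().amax(dim=(-2, -1))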

@brian-dellabetta
Collaborator Author

Closing in favor of #2290

brian-dellabetta and others added 6 commits February 17, 2026 18:27
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta marked this pull request as ready for review February 18, 2026 16:26
@brian-dellabetta brian-dellabetta added the ready When a PR is ready for review label Feb 18, 2026
Collaborator

@HDCharles HDCharles left a comment


seems fine

Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>