[Bug]: GPTQ shows poor Throughput #2382

@namgyu-youn

Description

⚙️ Your current environment

The output of `python collect_env.py`:
Operating System: `Linux-5.15.0-97-generic-x86_64-with-glibc2.35`
Python Version: `3.10.15 (main, Oct 19 2024, 16:16:22) [GCC 11.4.0]`
llm-compressor Version: `None`
compressed-tensors Version: `0.13.0`
transformers Version: `4.57.3`
torch Version: `2.9.1+cu128`
CUDA Devices: `['NVIDIA A100 80GB PCIe MIG 1g.10gb']`
AMD Devices: `None`
NPU Devices: `None`

Quantizing a 1.2B LLM shows poor throughput/perplexity. We used AWQ for the MLP layers and GPTQ for the attention blocks. A throughput of 10 requests/s or higher is expected, but we are getting only 5.

Perplexity:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.0098|±  |0.0044|
|     |       |strict-match    |     5|exact_match||0.0000|±  |0.0000|

Throughput: 5.01 requests/s, 5774.56 total tokens/s, 641.62 output tokens/s
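For reference, the throughput figures above are just counts divided by wall-clock time; a minimal sketch of the arithmetic (the elapsed time and counts below are illustrative values back-solved to match the figures, not taken from the actual benchmark):

```python
def throughput(num_requests: int, total_tokens: int, output_tokens: int, elapsed_s: float):
    """Return (requests/s, total tokens/s, output tokens/s) for a benchmark run."""
    return (
        num_requests / elapsed_s,
        total_tokens / elapsed_s,
        output_tokens / elapsed_s,
    )

# Illustrative counts over an assumed 100 s run, chosen to reproduce the numbers above.
req_s, tot_s, out_s = throughput(501, 577_456, 64_162, 100.0)
print(f"{req_s:.2f} requests/s, {tot_s:.2f} total tokens/s, {out_s:.2f} output tokens/s")
# → 5.01 requests/s, 5774.56 total tokens/s, 641.62 output tokens/s
```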

🐛 Describe the bug

AWQ+GPTQ shows poor throughput (less than half of the expected value).

🛠️ Steps to reproduce

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

tokenizer = AutoTokenizer.from_pretrained(
    "LGAI-EXAONE/EXAONE-4.0-1.2B", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "LGAI-EXAONE/EXAONE-4.0-1.2B", dtype=torch.bfloat16, trust_remote_code=True
)

# dataset: https://huggingface.co/datasets/LGAI-EXAONE/MANTA-1M
# (prepare_calibration_data is a local helper that tokenizes the dataset; not shown)
calibration_ds = prepare_calibration_data(tokenizer)

recipe = [
    # AWQModifier: owns MLP layers exclusively.
    AWQModifier(
        targets=[
            "re:.*gate_proj$", "re:.*up_proj$", "re:.*down_proj$",
        ],
        ignore=["lm_head"],
        scheme="W4A16",
        duo_scaling="both",
        n_grid=40,
        mappings=[
            # MLP: post-attention layernorm → gate/up
            AWQMapping(
                "re:.*post_attention_layernorm$",
                ["re:.*gate_proj$", "re:.*up_proj$"],
            ),
            # MLP: up_proj output → down_proj input
            AWQMapping(
                "re:.*up_proj$",
                ["re:.*down_proj$"],
            ),
        ],
    ),
    # GPTQ for Attention Layer
    GPTQModifier(
        targets=[
            "re:.*q_proj$", "re:.*k_proj$",
            "re:.*v_proj$", "re:.*o_proj$",
        ],
        ignore=["embed_tokens", "lm_head"],
        scheme="W4A16",
        dampening_frac=0.01,
        kv_cache_scheme={
            "num_bits": 8, "type": "float",
            "strategy": "tensor", "dynamic": False, "symmetric": True,
        },
    ),
]

oneshot(
    model=model,
    dataset=calibration_ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=512,
    shuffle_calibration_samples=False,
)
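
When debugging a mixed recipe like this, one quick sanity check is that both modifiers actually landed on their intended layers, by inspecting the `quantization_config` written into the saved model's `config.json`. Below is a small sketch assuming the compressed-tensors `config_groups` layout; the dict at the bottom is a synthetic example mirroring the recipe above, not the actual saved config:

```python
import json

def summarize_quant_config(qcfg: dict) -> dict:
    """Group quantized layer targets by config group, with their weight settings."""
    summary = {}
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights", {})
        summary[name] = {
            "targets": group.get("targets", []),
            "num_bits": weights.get("num_bits"),
            "type": weights.get("type"),
        }
    return summary

# Synthetic config mirroring the AWQ (MLP) + GPTQ (attention) recipe above.
example = {
    "config_groups": {
        "group_0": {
            "targets": ["re:.*q_proj$", "re:.*k_proj$", "re:.*v_proj$", "re:.*o_proj$"],
            "weights": {"num_bits": 4, "type": "int"},
        },
        "group_1": {
            "targets": ["re:.*gate_proj$", "re:.*up_proj$", "re:.*down_proj$"],
            "weights": {"num_bits": 4, "type": "int"},
        },
    }
}
print(json.dumps(summarize_quant_config(example), indent=2))
```

With a real checkpoint, the same function can be pointed at `json.load(open("path/to/config.json"))["quantization_config"]` to confirm that the attention and MLP projections ended up in separate W4A16 groups.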
