⚙️ Your current environment
The output of `python collect_env.py`:
Operating System: `Linux-5.15.0-97-generic-x86_64-with-glibc2.35`
Python Version: `3.10.15 (main, Oct 19 2024, 16:16:22) [GCC 11.4.0]`
llm-compressor Version: `None`
compressed-tensors Version: `0.13.0`
transformers Version: `4.57.3`
torch Version: `2.9.1+cu128`
CUDA Devices: `['NVIDIA A100 80GB PCIe MIG 1g.10gb']`
AMD Devices: `None`
NPU Devices: `None`
Quantizing a 1.2B LLM shows poor throughput and perplexity. We used AWQ for the MLP layers and GPTQ for the attention blocks. A throughput of 10 requests/s or higher is expected, but we are getting only 5.
- Checkpoint: https://huggingface.co/namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4
- Benchmark results:
Perplexity:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.0098|± |0.0044|
| | |strict-match | 5|exact_match|↑ |0.0000|± |0.0000|
Throughput: 5.01 requests/s, 5774.56 total tokens/s, 641.62 output tokens/s
🐛 Describe the bug
AWQ+GPTQ shows poor throughput (less than half of expected value).
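For context, the three throughput metrics reported above are plain wall-clock ratios; a minimal sketch (hypothetical helper, not part of any benchmark harness) of how they are derived:

```python
def throughput_stats(num_requests, total_tokens, output_tokens, elapsed_s):
    """Derive requests/s, total tokens/s, and output tokens/s from raw counts."""
    return (
        num_requests / elapsed_s,   # requests/s
        total_tokens / elapsed_s,   # total tokens/s
        output_tokens / elapsed_s,  # output tokens/s
    )

# Counts consistent with the run above, over a hypothetical 100 s window.
req_s, tok_s, out_s = throughput_stats(501, 577456, 64162, 100.0)
```

With these counts the helper reproduces the reported 5.01 requests/s, 5774.56 total tokens/s, and 641.62 output tokens/s.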
🛠️ Steps to reproduce
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

tokenizer = AutoTokenizer.from_pretrained(
"LGAI-EXAONE/EXAONE-4.0-1.2B", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
"LGAI-EXAONE/EXAONE-4.0-1.2B", dtype=torch.bfloat16, trust_remote_code=True
)
# dataset: https://huggingface.co/datasets/LGAI-EXAONE/MANTA-1M
# prepare_calibration_data is a user-defined helper (not shown) that
# tokenizes samples from the dataset above.
calibration_ds = prepare_calibration_data(tokenizer)
recipe = [
# AWQModifier: owns MLP layers exclusively.
AWQModifier(
targets=[
"re:.*gate_proj$", "re:.*up_proj$", "re:.*down_proj$",
],
ignore=["lm_head"],
scheme="W4A16",
duo_scaling="both",
n_grid=40,
mappings=[
# MLP: post-attention layernorm → gate/up
AWQMapping(
"re:.*post_attention_layernorm$",
["re:.*gate_proj$", "re:.*up_proj$"],
),
# MLP: up_proj output → down_proj input
AWQMapping(
"re:.*up_proj$",
["re:.*down_proj$"],
),
],
),
# GPTQ for Attention Layer
GPTQModifier(
targets=[
"re:.*q_proj$", "re:.*k_proj$",
"re:.*v_proj$", "re:.*o_proj$",
],
ignore=["embed_tokens", "lm_head"],
scheme="W4A16",
dampening_frac=0.01,
kv_cache_scheme={
"num_bits": 8, "type": "float",
"strategy": "tensor", "dynamic": False, "symmetric": True,
},
),
]
oneshot(
model=model,
dataset=calibration_ds,
recipe=recipe,
max_seq_length=512,
num_calibration_samples=512,
shuffle_calibration_samples=False,
)
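For intuition about what the `W4A16` scheme in the recipe does to each weight group, here is a pure-Python sketch of per-group symmetric 4-bit quantization. It is illustrative only; the real implementation lives in compressed-tensors and operates on packed torch tensors, and the names below are made up for this sketch:

```python
def quantize_w4a16_group(weights, group_size=128):
    """Per-group symmetric 4-bit quantization of a flat list of weights.

    Each group of `group_size` weights shares one fp scale; values are
    rounded to the signed int4 range [-8, 7]. Activations stay in 16-bit,
    hence "W4A16".
    """
    qweights, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric scheme: scale maps the group's max magnitude to 7.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qweights, scales


def dequantize(qweights, scales, group_size=128):
    """Reconstruct approximate fp weights from int4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]


w = [0.05 * i - 1.0 for i in range(256)]  # toy weights, two groups of 128
qw, s = quantize_w4a16_group(w)
w_hat = dequantize(qw, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# Round-trip error is bounded by half a quantization step per group.
assert max_err <= max(s) / 2 + 1e-9
```

The round-trip error bound is why group size matters: smaller groups give each scale fewer outliers to absorb, at the cost of more scale metadata.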