Skip to content

Latest commit

 

History

History
43 lines (27 loc) · 2.05 KB

File metadata and controls

43 lines (27 loc) · 2.05 KB

KLAWQ: KL-Aware Weight Quantization for Edge AI

A novel post-training quantization framework that enhances GPTQ by integrating KL divergence for better accuracy preservation when deploying Large Language Models on edge devices. 1

Core Innovation

KLAWQ extends GPTQ by adding a KL divergence term to align quantized model outputs with the original model's distribution: 2

L(Q) = L_MSE(Q) + β * L_KL(Q)

The algorithm modifies the Hessian computation as H_tot = H + βA, where A is the KL Hessian matrix. 3

Key Components

  • Configuration: Hyperparameters (β, τ) in KLAWQ/gptqmodel/quantization/config.py 4
  • Core Algorithm: KL Hessian computation in KLAWQ/kl-aware-quant/quantization/gptq.py 5
  • Quantization Engine: Low-level operations in KLAWQ/kl-aware-quant/quantization/quantizer.py 6
  • Analysis Notebooks: Experimental validation in kl-hessian-gptq-*.ipynb files 7

Quick Start

  1. Clone and Setup: 8

    git clone https://github.com/ha405/Compression-Framework-for-EdgeAI
    cd Compression-Framework-for-EdgeAI
  2. Install Dependencies: Install PyTorch, transformers, and other requirements from requirements.txt

  3. Run Quantization: Use the Jupyter notebooks for experimentation or integrate the KLAWQ modules directly

Results

Experiments on GPT-2 at 8-bit precision demonstrate improved perplexity scores compared to vanilla GPTQ while maintaining post-training quantization efficiency. 9

Notes

The framework is built on a comprehensive infrastructure stack including PyTorch >=2.4.1, transformers >=4.51.2, and FastAPI for model serving. The project structure shows a modular design with separate components for adapter functionality, model definitions, and processing loops, though the core KLAWQ innovation is concentrated in the quantization modules. 10

Wiki pages you might want to explore: