# TorchAO

PyTorch-native library for quantization, sparsity, and low-precision training. Works with `torch.compile()` and `FSDP2`.

## Quick Reference

```python
# Quantize a model to int4
from torchao.quantization import quantize_, Int4WeightOnlyConfig
quantize_(model, Int4WeightOnlyConfig(group_size=32))

# Float8 dynamic quantization
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Per-layer configs (different quantization per module)
from torchao.quantization import FqnToConfig
quantize_(model, FqnToConfig({
    "layers.0.attn": Int4WeightOnlyConfig(),
    "layers.0.mlp": Float8DynamicActivationFloat8WeightConfig(),
}))

# Filter specific layers
quantize_(model, Int4WeightOnlyConfig(), filter_fn=lambda mod, fqn: "mlp" in fqn)

# QAT (prepare, train, then convert)
import torch
from torchao.quantization import Int8DynamicActivationIntxWeightConfig, PerGroup
from torchao.quantization.qat import QATConfig
base_config = Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, weight_granularity=PerGroup(32))
quantize_(model, QATConfig(base_config, step="prepare"))
# ... train ...
quantize_(model, QATConfig(base_config, step="convert"))

# Float8 training (H100/B200 required)
from torchao.float8 import convert_to_float8_training
convert_to_float8_training(model)
```
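As the intro notes, quantized models compose with `torch.compile()`. A minimal sketch of a typical inference flow (`model` and `example_input` are placeholders, not part of the library):

```python
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

quantize_(model, Int4WeightOnlyConfig(group_size=32))  # quantize in place first
model = torch.compile(model)                           # then compile the quantized model
output = model(example_input)                          # run inference as usual
```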

## Config Classes

All configs inherit from `AOBaseConfig`. Defined in `torchao/quantization/quant_api.py`:

| Config | Description |
|--------|-------------|
| `Int4WeightOnlyConfig` | int4 weight-only (most common for inference) |
| `Int8WeightOnlyConfig` | int8 weight-only |
| `Int8DynamicActivationInt8WeightConfig` | int8 weights + int8 dynamic activations |
| `Int8DynamicActivationIntxWeightConfig` | int8 activations + arbitrary int weight width |
| `Float8WeightOnlyConfig` | float8 weight-only |
| `Float8DynamicActivationFloat8WeightConfig` | float8 weights + float8 dynamic activations |
| `Float8DynamicActivationInt4WeightConfig` | float8 activations + int4 weights |
| `IntxWeightOnlyConfig` | arbitrary bit-width for edge/ExecuTorch |
| `FqnToConfig` | map module names to different configs for per-layer quantization |

### Granularity

Controls how many elements share a quantization scale. Import from `torchao.quantization` (see the sketch after this list):
- `PerTensor` - one scale for the whole tensor
- `PerRow` / `PerAxis` - one scale per row/axis (recommended for float8)
- `PerGroup(group_size)` - one scale per group (e.g., group_size=32 for int4)
- `PerBlock` - one scale per block
- `PerToken` - one scale per token (for activations)
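A minimal sketch of swapping granularity on the same float8 config (`model_coarse` and `model_fine` are placeholder models):

```python
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerTensor,
    PerRow,
)

# Coarsest: one scale shared by the entire weight tensor
quantize_(model_coarse, Float8DynamicActivationFloat8WeightConfig(granularity=PerTensor()))

# Finer: one scale per row, the recommended setting for float8
quantize_(model_fine, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```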

### Prototype configs (in `torchao/prototype/mx_formats/`)
- `MXDynamicActivationMXWeightConfig` - MXFP8/MXFP4 (H100/B200/MI350x)
- `NVFP4DynamicActivationNVFP4WeightConfig` - NVIDIA FP4 (B200 Blackwell only)
- `NVFP4WeightOnlyConfig` - NVFP4 weight-only (B200 Blackwell only)
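A hedged sketch of applying one of these prototype configs, assuming it is importable from `torchao.prototype.mx_formats` and that the default constructor arguments are usable (verify against the module; prototype APIs may change without notice):

```python
from torchao.quantization import quantize_
# Assumed import path, based on the prototype directory named above.
from torchao.prototype.mx_formats import MXDynamicActivationMXWeightConfig

# MXFP8/MXFP4 dynamic quantization; requires H100/B200/MI350x per the note above.
quantize_(model, MXDynamicActivationMXWeightConfig())
```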

## Stable vs Prototype

- **Stable** (`torchao/quantization/`, `torchao/float8/`, `torchao/sparsity/`, `torchao/optim/`): API stability guaranteed. Breaking changes go through a deprecation cycle.
- **Prototype** (`torchao/prototype/`): experimental features; API may change without notice. Includes: `mx_formats/` (MXFP8, MXFP4, NVFP4), `moe_training/` (MoE mixed-precision), `awq/`, `hqq/`, `autoround/`, `quantized_training/`.

## Architecture and Contributing

For architecture details, tensor subclass design, and contributor guides, see the in-repo docs:
- [Quantization Overview](docs/source/contributing/quantization_overview.rst) - full stack walkthrough, tensor subclasses, quantization flows
- [Contributor Guide](docs/source/contributing/contributor_guide.rst) - how to add tensors, kernels, configs
- [Workflows Matrix](docs/source/workflows/index.md) - dtype x hardware status table

The same files render at https://docs.pytorch.org/ao/main/contributing/index.html

## Deprecated APIs

Do not use or recommend these:
- `AffineQuantizedTensor` (AQT) in `torchao/dtypes/` - old v1 system, being removed. New tensor types inherit from `TorchAOBaseTensor` in `torchao/utils.py`
- `autoquant()` - deleted
- Layout registration system (`PlainLayout`, `Float8Layout`, `TensorCoreTiledLayout`, etc.) - deleted
- `TorchAODType` - deprecated
- `change_linear_weights_to_int4_woqtensors` - deleted; use `quantize_(model, Int4WeightOnlyConfig())`

## Development

```bash
# Setup
USE_CPP=0 pip install -e . --no-build-isolation   # CPU-only
USE_CUDA=1 pip install -e . --no-build-isolation  # With CUDA

# Lint (ruff v0.11.6, rules: F and I)
ruff check --fix && ruff format .

# Test (mirrors source structure)
pytest test/quantization/test_quant_api.py
pytest test/float8/
pytest test/prototype/mx_formats/
```