
Releases: pytorch/ao

v0.3.1

26 Jun 20:36

Highlights

We are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, the MX format, an FP6 dtype and bitpacking, 2:4 sparsity accelerated training, and benchmarking infra for llama2/llama3 models.

quantize API (#256)

We added a tensor subclass based quantization API; see the docs and README for usage details. This API is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.
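
As a quick illustration (mirroring the migration snippets in the deprecation section below), this is how the new API is applied on torch 2.4+; the toy model here is only for demonstration:

import torch
from torchao.quantization import quantize, int8_weight_only

# a stand-in model; any model with nn.Linear layers works the same way
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# swap the Linear weights for int8 weight-only quantized tensor subclasses
quantize(model, int8_weight_only())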

Accelerated training with 2:4 sparsity (#184)

You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels prune each 4x4 sub-tile to be 2:4 sparse in both directions, so both the forward and the backward pass are handled when training. We see a 1.3x speedup for the MLP layers of ViT-L across a forward and backward pass.
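
The sketch below is not the torchao training API from #184; it only illustrates the underlying mechanism using PyTorch core's semi-structured sparse support: prune a weight to a 2:4 pattern, compress it, and run an accelerated matmul (requires a CUDA GPU with sparse tensor core support, e.g. Ampere or newer).

import torch
from torch.sparse import to_sparse_semi_structured

# Keep the 2 largest-magnitude values in every contiguous group of 4 (2:4 pattern).
def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)
    drop = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool).scatter_(1, drop, False)
    return (groups * mask).reshape(w.shape)

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

w_sparse = to_sparse_semi_structured(prune_2_to_4(w))  # compressed 2:4 representation
y = torch.mm(w_sparse, x)  # dispatches to the accelerated sparse matmul

The actual training kernels from xFormers additionally maintain the 2:4 pattern in the transposed direction, which is what makes the backward pass accelerable as well.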

MX support (#264)

We added prototype support for the MX format, with a reference native PyTorch implementation of training and inference primitives for MX-accelerated matrix multiplications. The MX numerical formats are new low-precision formats that were recently accepted into the OCP spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
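
A rough sketch of what using the prototype looks like; the module path and method names below are assumptions based on the prototype's layout under torchao/prototype/mx_formats and may differ between versions:

import torch
# assumed import path; see torchao/prototype/mx_formats for the actual location
from torchao.prototype.mx_formats.mx_tensor import MXTensor

x = torch.randn(32, 32, device="cuda", dtype=torch.bfloat16)
# cast to MX: fp8 elements sharing one scale per block of 32 values
x_mx = MXTensor.to_mx(x, torch.float8_e4m3fn, block_size=32)
x_hp = x_mx.to_dtype(torch.bfloat16)  # dequantize back to high precision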

Benchmarking (#276, #374)

We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See torchao/_models/llama/benchmarks.sh for more details.

🌟 💥 Community Contributions 🌟 💥

FP6 support (#279, #283, #358)

@gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel, with support for torch.compile. Benchmark results show a 2.3x speedup over the BF16 baseline for meta-llama/Llama-2-7b-chat-hf.

Bitpacking (#307, #282)

@vayuda, @melvinebenezer, @CoffeeVampir3, and @andreaskoepf added support for packing/unpacking lower-bit dtypes, leveraging torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.
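
To make the idea concrete, here is a generic (not torchao-specific) example of the technique: four 2-bit values are packed into each uint8 byte with shift/mask ops, and torch.compile fuses the elementwise ops into a single kernel.

import torch

@torch.compile
def pack_uint2(x: torch.Tensor) -> torch.Tensor:
    # x: uint8 values in [0, 3], numel divisible by 4
    x = x.reshape(-1, 4)
    return x[:, 0] | (x[:, 1] << 2) | (x[:, 2] << 4) | (x[:, 3] << 6)

@torch.compile
def unpack_uint2(packed: torch.Tensor) -> torch.Tensor:
    shifts = torch.tensor([0, 2, 4, 6], device=packed.device, dtype=torch.uint8)
    return ((packed.unsqueeze(-1) >> shifts) & 0x3).reshape(-1)

vals = torch.randint(0, 4, (16,), dtype=torch.uint8)
assert torch.equal(unpack_uint2(pack_uint2(vals)), vals)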

FP8 split-gemm kernel (#263)

Added the FP8 split-gemm kernel written by @AdnanHoque to torchao, with speedups over the cuBLAS kernel for batch sizes <= 16.

BC Breaking

Deprecations

  • Deprecate top level quantization APIs #344

1. int8 weight only quantization

apply_weight_only_int8_quant(model) or change_linear_weights_to_int8_woqtensors(model)

-->

# for torch 2.4+
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors
change_linear_weights_to_int8_woqtensors(model)

2. int8 dynamic quantization

apply_dynamic_quant(model) or change_linear_weights_to_int8_dqtensors(model)

-->

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# for torch 2.4+
from torchao.quantization import quantize, int8_dynamic_activation_int8_weight
quantize(model, int8_dynamic_activation_int8_weight())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
change_linear_weights_to_int8_dqtensors(model)

3. int4 weight only quantization

change_linear_weights_to_int4_woqtensors(model)

-->

# for torch 2.4+
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)

New Features

  • Add quantize #256
  • Add a prototype of MX format training and inference #264
  • [FP6-LLM] Port splitK map from DeepSpeed #283
  • Improve FP6-LLM 2+4bit weight splitting + user API #279
  • Bitpacking #291
  • training acceleration via runtime semi-structured sparsity #184
  • Bitpackingv2 #307
  • Add FP6-LLM doc and move FP6-LLM to prototype #358
  • Added first bits of Uint2Tensor and BitnetTensor #282

Improvements

  • Improve primitives for FP6 quant #248
  • Extract eval code from GPTQ for more general usage #275
  • Factor out the specific configurations to helper functions #286
  • Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout #278
  • Graceful handling of cpp extensions #296
  • Refactor int8 dynamic quantization with call to quantize #294
  • [NF4][FSDP] return contiguous quantization_factor #298
  • Refactor int4 and int8 weight only quantization to use quantize #301
  • Adding a quick way for users to test model eval for hf models #328
  • Wrap torch.ops.quantized_decomposed to improve import errors #310
  • [NF4Tensor] Switch to save for backward since are now a tensor input #323
  • Refactor rest of tinygemm quant primitive ops #321
  • Move some util functions from quantization.utils to torchao.utils #337
  • Clean up FP6-LLM #304
  • Move quant ops to utils.py #331
  • FP6-LLM clean up (again) #339
  • Improving hf_eval.py #342
  • Generalize Model Size Code #364
  • Minor upgrades to bit pack #347
  • Factor out dispatch and layout registration table #360
  • Add register_apply_tensor_subclass #366
  • Refactor custom FPx cast #363
  • Remove all dependencies except torch #369
  • Enable a test for loading state_dict with tensor subclasses #389
  • 073 scripts for benchmarks #372
  • Add WOQ int8 test with Inductor Freeze #362
  • Benchmarking updates for semi-structured sparse training #398
  • add FSDP QLoRA test and revert failing PR #403
  • Refactor the API for quant method argument for quantize function #400
  • eval script fixes #414

Bug Fixes

  • Fixed the HQQ import skip #262
  • fixing autoquant bug #265
  • Fix eval import after #275 #290
  • Fixed f-string printing of NF4Tensors #297
  • Check and fix dequantize_affine is idempotent #309
  • Update old pretrained TorchVision API in ao tutorials (#313) #314
  • Fix dimension issues for int4 weight only quant path #330
  • Fix compile in hf_eval.py #341
  • task_list to tasks in hf_eval #343
  • fixing peak memory stats for benchmark #353
  • Fix inductor config BC change #382
  • fixing scripts #395

Performance

  • FP8 splitgemm user defined triton kernel #263
  • sparse benchmarking numbers #303
  • Fix FP6-LLM benchmark #312
  • Adding Llama to TorchAO #276
  • Generalize Model Size Code #364
  • eval script for llama #374
  • 077 autoquant gpt fast #361

Docs

  • add static folder for images + fix links #271
  • Fix Readme and remove unused kernel #270
  • Kernel docs #274
  • Quantization Docstrings #273
  • Add AffineQuantizedTensor based workflow doc and examples #277
  • Add AUTOQUANT_CACHE docs for reusing the same quantization plan #329
  • Update nightly build instructions #334
  • add link to benchmarking script #355
  • New README #392
  • Minor README updates #401
  • Add quantize to ...

v0.2.0

20 May 20:52
f0f00ce

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core recently shipped a new custom op registration mechanism via torch.library, with the benefit that custom ops compose with as many PyTorch subsystems as possible, most notably NOT graph breaking with torch.compile().

We've added documentation on how to register your own custom ops (https://github.com/pytorch/ao/tree/main/torchao/csrc), and if you learn better via example you can follow PR #135 to add your own custom ops to torchao.

Most notably, these instructions were leveraged by @gau-nernst to integrate new custom ops for fp6 support (#223).

One key benefit of integrating your kernels directly into torchao is that, thanks to our manylinux GPU support, we can ensure that the CPU/CUDA kernels you've added work on as many devices and CUDA versions as possible (#176).
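
For readers who want a self-contained feel for the mechanism, here is a minimal, hypothetical Python-side sketch (the "mylib" namespace and the op itself are made up, and torch.library.custom_op requires torch 2.4+); torchao's own CPU/CUDA ops are registered from C++ (see the torchao/csrc docs linked above):

import torch

@torch.library.custom_op("mylib::scale_add", mutates_args=())
def scale_add(x: torch.Tensor, scale: float) -> torch.Tensor:
    return x * scale + 1.0

# The fake (meta) implementation tells the compiler the output shape/dtype,
# which is what lets the op trace through torch.compile without a graph break.
@scale_add.register_fake
def _(x, scale):
    return torch.empty_like(x)

@torch.compile(fullgraph=True)
def f(x):
    return scale_add(x, 2.0)

print(f(torch.randn(4)))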

A lot of prototype and community contributions

@jeromeku was our community champion merging support for

  1. GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
  2. DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly: https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
  3. Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound workloads, showing 4x speedups over tinygemm for larger batch sizes such as 512: https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support, showing up to 8x speedups over an fp16 baseline for small-batch-size inference (#223).

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP (#150), most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the ones done by Answer.ai: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
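
As a rough sketch of why torch.chunk() matters here (to_nf4 lives in torchao.dtypes; the block sizes shown are assumed defaults):

import torch
from torchao.dtypes import to_nf4

weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# FSDP2 shards a parameter by chunking it along dim 0, so an NF4Tensor that
# implements torch.chunk can be sharded like any other parameter.
shards = torch.chunk(nf4_weight, 8)
print(type(shards[0]), shards[0].shape)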

BC breaking

Deprecations

New Features

Improvements

  • FSDP2 support for NF4Tensor (#118, #150, #207)
  • Add save/load of int8 weight only quantized model (#122)
  • Add int_scaled_mm on CPU (#121)
  • Add cpu and gpu in int4wo and int4wo-gptq quantizer (#131)
  • Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (#146, #226, #213)
  • Remove is_gpt_fast specialization from GTPQ (#172)
  • Common benchmark and profile utils (#238)

Bug fixes

  • Fix padding in GPTQ (#119, #120)
  • Fix Int8DynActInt4WeightLinear module swap (#151)
  • Fix NF4Tensor.to to use device kwarg (#158)
  • Fix quantize_activation_per_token_absmax perf regression (#253)

Performance

  • Chunk NF4Tensor construction to reduce memory spike (#196)
  • Fix intmm benchmark script (#141)

Docs

CI

Not user facing

Security

Untopiced

  • Version bumps (#125, #234)
  • Don't import _C in fbcode (#218)

New Contributors

Full Changelog: v0.2.0...v0.2.1

We were able to close about half of the tasks planned for 0.2.0; the rest will spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We want to follow a monthly release cadence until further notice.

TorchAO 0.1.0: First Release

04 Apr 23:18

Highlights

We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository for hosting architecture optimization techniques such as quantization and sparsity, along with performance kernels for different backends such as CUDA and CPU. In this release we added support for a few quantization techniques such as int4 weight-only GPTQ quantization, nf4 dtype support for QLoRA, and sparsity features like WandaSparsifier. We also added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.

Note: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. But we welcome you to try out our APIs and offerings and provide any feedback on your experience.

torchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.

New Features

Quantization

  • Added tensor subclass based quantization APIs: change_linear_weights_to_int8_dqtensors, change_linear_weights_to_int8_woqtensors and change_linear_weights_to_int4_woqtensors (#1); see the usage sketch after this list
  • Added module based quantization APIs for int8 dynamic and weight only quantization: apply_weight_only_int8_quant and apply_dynamic_quant (#1)
  • Added module swap versions of int4 weight only quantization, Int4WeightOnlyQuantizer and Int4WeightOnlyGPTQQuantizer, used in TorchTune (#119, #116)
  • Added int8 dynamic activation and int4 weight quantization, Int8DynActInt4WeightQuantizer and Int8DynActInt4WeightGPTQQuantizer, used in ExecuTorch (#74) (available in torch 2.3.0 and later)
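
The tensor subclass APIs above are applied in place, as in this sketch (toy model for illustration; the same calls appear in the 0.3.1 deprecation notes above):

import torch
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
# swap each nn.Linear weight for an int8 weight-only quantized tensor subclass
change_linear_weights_to_int8_woqtensors(model)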

Sparsity

  • Added WandaSparsifier, which prunes weights using both weight and activation information (#22)

Kernels

  • Added autotuner for int mm Triton kernels (#41)

dtypes

  • Added nf4 tensor subclass and nf4 linear (#37, #40, #62)
  • Added uint4 dtype tensor subclass (#13)

Improvements

  • Set up GitHub workflow for regression testing (#50)
  • Set up GitHub workflow for the torchao-nightly release (#54)

Documentation

  • Added tutorials for quantizing a vision transformer model (#60)
  • Added tutorials on how to add an op for the nf4 tensor (#54)

Notes

  • We are still debugging an accuracy problem with Int8DynActInt4WeightGPTQQuantizer
  • Save and load does not work well for tensor subclass based APIs yet
  • We will consolidate the tensor subclass and module swap based quantization APIs later
  • The uint4 tensor subclass is going to be merged into PyTorch core in the future
  • Quantization ops in quant_primitives.py will be deduplicated with similar quantize/dequantize ops in PyTorch later