Commit cfb11b6: Merge branch 'main' into dedup-1
2 parents: 0464b30 + 808a397

60 files changed (+7305, -78 lines)

.github/workflows/doc_build.yml

Lines changed: 2 additions & 9 deletions
```diff
@@ -43,16 +43,9 @@ jobs:
           python -m pip install -e .
           cd docs
           python -m pip install -r requirements.txt
-      - name: Get the torchtune version
-        run: |
-          # Get the github.ref_name and save into the
-          # REF_NAME variable. This will be passed in
-          # conf.py to display the version in the
-          # site dropdown
-          GITHUB_REF=${{ github.ref }}
-          TORCHAO_VERSION_DOCS="${GITHUB_REF}"
-          echo "$TORCHAO_VERSION_DOCS"
       - name: Build docs
+        env:
+          TORCHAO_VERSION_DOCS: ${{ github.ref }}
        run: |
           cd docs
           make html
```
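The removed step set `TORCHAO_VERSION_DOCS` inside its own shell, so the variable died with that step and never reached the Sphinx build; declaring it under `env:` on the Build docs step exposes it to the whole `run:` block. A minimal sketch of how a `conf.py` could consume it (the variable name comes from the diff above; the parsing logic is an illustrative assumption, not the repository's actual code):

```python
import os

# Set by the workflow: TORCHAO_VERSION_DOCS: ${{ github.ref }}
# e.g. "refs/heads/main" or "refs/tags/v0.1.0"
ref = os.environ.get("TORCHAO_VERSION_DOCS", "")

# Illustrative assumption: use the last path component as the
# version shown in the site dropdown, falling back to "main"
version = ref.rsplit("/", 1)[-1] if ref else "main"
release = version
```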
Lines changed: 40 additions & 0 deletions (new workflow file)

```diff
@@ -0,0 +1,40 @@
+name: PyTorch CUDA Nightly Smoke Test
+
+on:
+  schedule:
+    # 6 am PST every day
+    - cron: "0 14 * * *"
+  workflow_dispatch:
+
+concurrency:
+  group: regression_test-${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+env:
+  HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
+
+jobs:
+  test:
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - name: CUDA Nightly
+            runs-on: linux.g5.12xlarge.nvidia.gpu
+            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu121'
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.1"
+
+
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: ${{ matrix.runs-on }}
+      gpu-arch-type: ${{ matrix.gpu-arch-type }}
+      gpu-arch-version: ${{ matrix.gpu-arch-version }}
+      script: |
+        python -m pip install --upgrade pip
+        pip install ${{ matrix.torch-spec }}
+        pip install -r requirements.txt
+        pip install -r dev-requirements.txt
+        python setup.py install
+        pytest test --verbose -s
```
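A note on the schedule rather than the change itself: `0 14 * * *` fires at 14:00 UTC, which matches the "6 am PST" comment only during standard time; in daylight-saving months the job lands at 7 am PDT. A quick way to sanity-check such cron comments:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# The cron "0 14 * * *" fires at 14:00 UTC every day
fire_utc = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)
print(fire_utc.astimezone(ZoneInfo("America/Los_Angeles")))
# 2024-01-15 06:00:00-08:00, i.e. 6 am PST in winter
```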

.github/workflows/regression_test.yml

Lines changed: 31 additions & 30 deletions
```diff
@@ -22,44 +22,45 @@ jobs:
       matrix:
         include:
           - name: CUDA 2.2.2
-            runs-on: 4-core-ubuntu-gpu-t4
+            runs-on: linux.g5.12xlarge.nvidia.gpu
             torch-spec: 'torch==2.2.2'
-          - name: CUDA 2.3 RC
-            runs-on: 4-core-ubuntu-gpu-t4
-            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/test/cu121'
-          - name: CUDA Nightly
-            runs-on: 4-core-ubuntu-gpu-t4
-            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu121'
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.1"
+          - name: CUDA 2.3
+            runs-on: linux.g5.12xlarge.nvidia.gpu
+            torch-spec: 'torch==2.3.0'
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.1"
+          - name: CUDA 2.4.0.dev20240421
+            runs-on: linux.g5.12xlarge.nvidia.gpu
+            torch-spec: '--pre torch==2.4.0.dev20240421+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121'
+            gpu-arch-type: "cuda"
+            gpu-arch-version: "12.1"
           - name: CPU 2.2.2
-            runs-on: 32-core-ubuntu
+            runs-on: linux.4xlarge
             torch-spec: 'torch==2.2.2 --index-url https://download.pytorch.org/whl/cpu'
-          - name: CPU 2.3 RC
-            runs-on: 32-core-ubuntu
-            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/test/cpu'
+            gpu-arch-type: "cpu"
+            gpu-arch-version: ""
+          - name: CPU 2.3
+            runs-on: linux.4xlarge
+            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu'
+            gpu-arch-type: "cpu"
+            gpu-arch-version: ""
           - name: Nightly CPU
-            runs-on: 32-core-ubuntu
+            runs-on: linux.4xlarge
             torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cpu'
-
-    runs-on: ${{ matrix.runs-on }}
-    steps:
-    - uses: actions/checkout@v2
+            gpu-arch-type: "cpu"
+            gpu-arch-version: ""
 
-    - name: Set up Python
-      uses: actions/setup-python@v2
-      with:
-        python-version: '3.9'
-
-    - name: Install dependencies
-      run: |
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: ${{ matrix.runs-on }}
+      gpu-arch-type: ${{ matrix.gpu-arch-type }}
+      gpu-arch-version: ${{ matrix.gpu-arch-version }}
+      script: |
         python -m pip install --upgrade pip
         pip install ${{ matrix.torch-spec }}
         pip install -r requirements.txt
         pip install -r dev-requirements.txt
-
-    - name: Install package
-      run: |
-        pip install .
-
-    - name: Run tests
-      run: |
+        python setup.py install
         pytest test --verbose -s
```
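Each job interpolates its matrix entry directly into `pip install ${{ matrix.torch-spec }}`, so the six configurations above reduce to the following install commands (values copied from the matrix; the loop is only an illustration of the expansion):

```python
# torch-spec values from the regression test matrix above
matrix = [
    ("CUDA 2.2.2", "torch==2.2.2"),
    ("CUDA 2.3", "torch==2.3.0"),
    ("CUDA 2.4.0.dev20240421",
     "--pre torch==2.4.0.dev20240421+cu121 "
     "--index-url https://download.pytorch.org/whl/nightly/cu121"),
    ("CPU 2.2.2", "torch==2.2.2 --index-url https://download.pytorch.org/whl/cpu"),
    ("CPU 2.3", "torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu"),
    ("Nightly CPU", "--pre torch --index-url https://download.pytorch.org/whl/nightly/cpu"),
]

for name, spec in matrix:
    print(f"{name}: pip install {spec}")
```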

README.md

Lines changed: 102 additions & 34 deletions
````diff
@@ -1,65 +1,133 @@
 # torchao: PyTorch Architecture Optimization
 
-**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an github issue**
+[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](discord.gg/cudamode)
+
+This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)
 
 ## Introduction
-torchao is a PyTorch native library for optimizing your models using lower precision dtypes, techniques like quantization and sparsity and performant kernels.
+`torchao` is a PyTorch library for quantization and sparsity.
 
 ## Get Started
-To try out our APIs, you can check out API examples in [quantization](./torchao/quantization) (including `autoquant`), [sparsity](./torchao/sparsity), [dtypes](./torchao/dtypes).
 
-## Installation
-**Note: this library makes liberal use of several new features in pytorch, its recommended to use it with the current nightly or latest stable version of PyTorch.**
+### Installation
+`torchao` makes liberal use of several new features in pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch.
 
-1. From PyPI:
+Stable Release
 ```Shell
 pip install torchao
 ```
 
-2. From Source:
+Nightly Release
+```Shell
+pip install torchao-nightly
+```
+
+From source
 
 ```Shell
-git clone https://github.com/pytorch-labs/ao
+git clone https://github.com/pytorch/ao
 cd ao
-pip install -e .
+python setup.py develop
+```
+
+### Quantization
+
+```python
+import torch
+import torchao
+
+# inductor settings which improve torch.compile performance for quantized modules
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+torch._inductor.config.use_mixed_mm = True
+
+# Plug in your model and example input
+model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
+input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
+
+# perform autoquantization
+torchao.autoquant(model, (input))
+
+# compile the model to recover performance
+model = torch.compile(model, mode='max-autotune')
+model(input)
 ```
 
-## Key Features
-The library provides
-1. Support for lower precision [dtypes](./torchao/dtypes) such as nf4, uint4 that are torch.compile friendly
-2. [Quantization algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
-* Int8 dynamic activation quantization
-* Int8 and int4 weight-only quantization
-* Int8 dynamic activation quantization with int4 weight quantization
-* [GPTQ](https://arxiv.org/abs/2210.17323) and [Smoothquant](https://arxiv.org/abs/2211.10438)
-* High level `autoquant` API and kernel auto tuner targeting SOTA performance across varying model shapes on consumer/enterprise GPUs.
-3. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
-4. Integration with other PyTorch native libraries like [torchtune](https://github.com/pytorch/torchtune) and [ExecuTorch](https://github.com/pytorch/executorch)
+### Sparsity
+
+```python
+import torch
+from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
+from torch.ao.pruning import WeightNormSparsifier
+
+# bfloat16 CUDA model
+model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)
+
+# Accuracy: Finding a sparse subnetwork
+sparse_config = []
+for name, mod in model.named_modules():
+    if isinstance(mod, torch.nn.Linear):
+        sparse_config.append({"tensor_fqn": f"{name}.weight"})
+
+sparsifier = WeightNormSparsifier(sparsity_level=1.0,
+                                  sparse_block_shape=(1,4),
+                                  zeros_per_block=2)
+
+# attach FakeSparsity
+sparsifier.prepare(model, sparse_config)
+sparsifier.step()
+sparsifier.squash_mask()
+# now we have dense model with sparse weights
+
+# Performance: Accelerated sparse inference
+for name, mod in model.named_modules():
+    if isinstance(mod, torch.nn.Linear):
+        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
+```
+
+To learn more try out our APIs, you can check out API examples in
+* [quantization](./torchao/quantization)
+* [sparsity](./torchao/sparsity)
+* [dtypes](./torchao/dtypes)
+
+
+## Supported Features
+1. [Quantization algorithms](./torchao/quantization)
+   - Int4 weight-only quantization TODO: Where is this?
 
+   - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
+   - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
+   - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
+   - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
+2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
+3. Support for lower precision [dtypes](./torchao/dtypes) such as
+   - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
+   - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
+4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
+   - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
+   - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads
 
 ## Our Goals
-torchao embodies PyTorch’s design philosophy [details](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:
 
-* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
-  * For example, for QLoRA for new dtypes support
-* Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
-* Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
+* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are however limits to what a compiler can do so we don't shy away from writing our custom CUDA/Triton kernels
+* Composability with `FSDP`: The new support for FSDP per parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
+* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
 * Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
-* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.
+* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices
 
-## Interoperability with PyTorch Libraries
+## Integrations
 
-torchao has been integrated with other repositories to ease usage
+torchao has been integrated with other libraries including
 
-* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) is integrated with 8 and 4 bit weight-only quantization techniques with and without GPTQ.
-* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) is integrated with GPTQ for both 8da4w (int8 dynamic activation, with int4 weight) and int4 weight only quantization.
+* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8 and 4 bit weight-only quantization techniques with optional support for GPTQ
+* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization.
+* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference
 
 ## Success stories
-Our kernels have has been used to achieve SOTA inference performance on
+Our kernels have been used to achieve SOTA inference performance on
 
-1. Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
-2. Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
-3. Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)
+* Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
+* Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
+* Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)
 
 ## License
````
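One detail worth calling out in the new README's sparsity example: `WeightNormSparsifier(sparsity_level=1.0, sparse_block_shape=(1,4), zeros_per_block=2)` zeroes two of every four contiguous weights, which is exactly the 2:4 semi-structured pattern that `to_sparse_semi_structured` accelerates. A minimal sketch for verifying that invariant after `squash_mask()` (the helper name is ours, not part of the diff):

```python
import torch

def check_2_to_4(weight: torch.Tensor) -> bool:
    """Return True if every contiguous block of 4 weights has >= 2 zeros."""
    blocks = weight.detach().reshape(-1, 4)
    return bool(((blocks == 0).sum(dim=1) >= 2).all())

# After sparsifier.squash_mask() in the README snippet, each Linear
# weight should pass this check before to_sparse_semi_structured is applied:
# assert check_2_to_4(model[0].weight)
```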
