
Conversation

vkuzo (Contributor) commented Sep 3, 2025

Summary:

Short-term fix for #2932. If torchao was built without CUDA capability 10.0 (sm100a) support, such as in our CI, this ensures that:
a. only callsites which actually use the mxfp8 dim1 kernel see the error message. Using NVFP4 no longer hits this error.
b. the error message points to the GitHub issue for more info on the workaround (for now, build from source).
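The deferred-error pattern described in (a) and (b) can be sketched roughly as follows. This is a minimal sketch; `mxfp8_cuda`, `mxfp8_quantize_dim1`, and `nvfp4_quantize` are hypothetical stand-ins, not torchao's actual internals.

```python
# Sketch of the fix: capture the import failure once, but only raise at
# callsites that actually need the mxfp8 dim1 kernel. All names below
# are hypothetical placeholders, not torchao's real module layout.
_import_error = None
try:
    import mxfp8_cuda  # hypothetical compiled extension; absent in CPU-only builds
except ImportError as e:
    mxfp8_cuda = None
    _import_error = e

def mxfp8_quantize_dim1(x):
    # Only mxfp8 dim1 callsites hit this message, and it points at the
    # tracking issue for the workaround.
    if mxfp8_cuda is None:
        raise ImportError(
            "torchao was built without the mxfp8 dim1 CUDA kernel; "
            "see https://github.com/pytorch/ao/issues/2932 for the "
            "workaround (build from source)."
        ) from _import_error
    return mxfp8_cuda.quantize_dim1(x)

def nvfp4_quantize(x):
    # NVFP4 never touches the dim1 kernel, so it no longer hits the error.
    return x  # placeholder for the real NVFP4 path
```

With this shape, importing the module never raises; only calling the dim1 kernel on a build that lacks it does.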

Test Plan:

  1. hardcode the mxfp8 kernel to not be built, by editing this check in setup.py:
     https://github.com/pytorch/ao/blob/85557135c93d3429320a4a360c0ee9cb49f84a00/setup.py#L641
     (`if mxfp8_src_files_exist and build_for_sm100a:`)
  2. build torchao from source, verify torchao/prototype does not have any .so files
  3. run nvfp4 tests, verify they now pass: pytest test/prototype/mx_formats/test_nvfp4_tensor.py -s -x
  4. run mxfp8 linear tests, verify the new error message is displayed for dim1 kernel tests: pytest test/prototype/mx_formats/test_mx_linear.py -s -x -k test_linear_eager_vs_hp
  5. undo the change in (1), rebuild torchao, verify all mx tests pass: pytest test/prototype/mx_formats/ -s -x
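Step (1) above can be done by hardcoding the setup.py gate off, e.g. by appending `and False` to the condition. A sketch, with placeholder values standing in for the real checks at the linked line:

```python
# Sketch of hardcoding the setup.py build gate off for step (1).
# The two flags are placeholders for the real checks in setup.py.
mxfp8_src_files_exist = True  # placeholder: real check scans for kernel sources
build_for_sm100a = True       # placeholder: real check inspects target archs

# Original condition: `if mxfp8_src_files_exist and build_for_sm100a:`
# Appending `and False` skips the kernel build regardless of the checks.
if mxfp8_src_files_exist and build_for_sm100a and False:
    kernel_built = True
else:
    kernel_built = False
```

Remember to revert this edit before step (5), then rebuild so the kernel is compiled again.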

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2933

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 4 Pending

As of commit c3f0e65 with merge base 8555713:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 3, 2025
@vkuzo vkuzo added the topic: bug fix Use this tag for PRs that fix bugs label Sep 3, 2025
@vkuzo vkuzo merged commit f35ae41 into main Sep 3, 2025
18 of 20 checks passed
vkuzo added a commit that referenced this pull request Sep 3, 2025