
Conversation

@Xia-Weiwen (Collaborator) commented Jul 8, 2025

Summary
This PR adds Float8OpaqueTensor for dynamic float8 activation, float8 weight quantization on x86 CPU.
It adds:

  • A new tensor subclass: Float8OpaqueTensor
  • Two new ops: float8_linear_prepack_cpu and float8_linear_cpu
  • CPP kernels for the two new ops

The kernels compute the FP8 GEMM using the ISA supported by the host CPU.
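A minimal usage sketch of the feature (the Float8PackingFormat import path and the float8_packing_format argument are assumptions based on this PR's file list and the review comments below):

```python
# Minimal usage sketch; import paths for the new enum are assumptions.
import torch
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerRow

# Assumed export location of the enum added in float8_packing_format.py
from torchao.quantization import Float8PackingFormat

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
quantize_(
    model,
    Float8DynamicActivationFloat8WeightConfig(
        granularity=PerRow(),
        float8_packing_format=Float8PackingFormat.OPAQUE,  # CPU opaque layout
    ),
)
with torch.no_grad():
    y = model(torch.randn(16, 1024))  # dispatches to the new float8_linear_cpu op
```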

Test plan

pytest -sv test/quantization/quantize_/workflows/float8/test_float8_opaque_tensor.py

pytorch-bot (bot) commented Jul 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2505

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5e75764 with merge base a951643:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Jul 8, 2025
@Xia-Weiwen added the topic: new feature and cpu labels Jul 8, 2025
@Xia-Weiwen (Collaborator, Author) commented:

Hi @chunyuan-w @mingfeima, could you please review this PR? Thanks.

@chunyuan-w commented:

> Should we move the conversion vec code to this file? https://github.com/pytorch/pytorch/blob/cd995bfb2aac8891465809be3ce29543bd524287/aten/src/ATen/cpu/vec/vec512/vec512_float8.h
> Similar to this PR: pytorch/pytorch#152417

@Xia-Weiwen (Collaborator, Author) replied:

Thanks for the comment. If we move it to PyTorch, one problem is that we would need to check whether the function is available at compile time. We can do it step by step; for now it might be better to keep it here.

@Xia-Weiwen Xia-Weiwen requested a review from chunyuan-w July 11, 2025 10:10
@jerryzh168 (Contributor) commented:

  1. I think we want to add this with the new design.
  2. I feel this should probably be added to prototype first; if we get wider adoption, we can move it to the official API.

@Xia-Weiwen Xia-Weiwen requested review from Copilot and removed request for chunyuan-w September 9, 2025 06:38
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Add support for dynamic float8 act float8 weight on CPU [CPU] Add Float8OpaqueTensor for dynamic float8 act float8 weight on CPU Sep 9, 2025
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Add Float8OpaqueTensor for dynamic float8 act float8 weight on CPU [CPU] Add Float8OpaqueTensor for dynamic float8 act float8 weight Sep 9, 2025

@Xia-Weiwen Xia-Weiwen requested a review from Copilot September 9, 2025 13:52
@Copilot (Contributor) left a comment:

Pull Request Overview

Adds Float8OpaqueTensor for dynamic float8 activation and float8 weight quantization on x86 CPU. This introduces a CPU-optimized tensor subclass that uses an opaque (reordered) memory layout for better performance on supported CPU ISAs; a sketch of how the new ops compose follows the list below.

  • Adds Float8OpaqueTensor subclass with reordered memory layout for CPU optimization
  • Implements two new CPU operators: float8_linear_prepack_cpu and float8_linear_cpu
  • Extends Float8DynamicActivationFloat8WeightConfig to support opaque packing format
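A rough sketch of how the two new ops are meant to compose (the op names are from this PR; the argument lists below are assumptions, not the signatures defined in ops.py):

```python
# Illustrative composition of the two new ops. The argument lists are
# assumptions inferred from the op names; see ops.py for the real schemas.
import torch

def fp8_linear_via_new_ops(x, weight_fp8, weight_scale, bias=None):
    # One-time prepack: reorder the FP8 weight into the opaque blocked layout
    packed_weight = torch.ops.torchao.float8_linear_prepack_cpu(
        weight_fp8, weight_scale
    )
    # Per-call: dynamically quantize the activation to FP8 and run the GEMM
    return torch.ops.torchao.float8_linear_cpu(x, packed_weight, bias)
```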

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| float8_packing_format.py | Defines the Float8PackingFormat enum with PLAIN and OPAQUE options |
| float8_opaque_tensor.py | New Float8OpaqueTensor subclass implementation with CPU optimizations |
| quant_api.py | Extends the config to support the opaque packing format and CPU device checks |
| ops.py | Adds float8_linear_prepack_cpu and float8_linear_cpu operator definitions |
| float8_linear.cpp | CPU kernel implementation for the float8 linear operations |
| observer.py | Adds PerGroup support to get_block_size |
| __init__.py files | Updates module exports |
| test_float8_opaque_tensor.py | Comprehensive test suite for the new functionality |


@mingfeima left a comment:

LGTM now, just some minor places to address.

@Xia-Weiwen (Collaborator, Author) commented:

Hi @jerryzh168, could you please review this PR again? Thanks.

@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review September 18, 2025 01:36
@Xia-Weiwen (Collaborator, Author) commented:

Hi @jerryzh168, could you please review this PR again? Thanks.

```diff
         block_size[granularity.axis] = 1
         return tuple(block_size)
-    elif isinstance(granularity, PerRow):
+    elif isinstance(granularity, (PerRow, PerToken)):
```
@jerryzh168 (Contributor) commented:

maybe create a separate PR to merge these two `def get_block_size(` implementations, and move it to torchao/quantization/utils?

@jerryzh168 (Contributor) commented Sep 19, 2025:

seems like the pt2e one is not used that much, so it should be relatively easy to remove

@Xia-Weiwen (Collaborator, Author) replied:

Sure. I will do it in another PR.
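For context, a condensed sketch of what a merged get_block_size might look like (the granularity classes are from torchao.quantization.granularity; the exact body is an assumption, not the code under review):

```python
from torchao.quantization.granularity import (
    PerAxis, PerGroup, PerRow, PerTensor, PerToken,
)

def get_block_size(input_shape, granularity):
    # Block size = shape of the sub-tensor that shares one quantization scale.
    if isinstance(granularity, PerTensor):
        return input_shape                      # one scale for the whole tensor
    if isinstance(granularity, PerAxis):
        block_size = list(input_shape)
        block_size[granularity.axis] = 1        # one scale per slice along axis
        return tuple(block_size)
    if isinstance(granularity, PerGroup):
        # groups of `group_size` elements along the last dimension
        return (*([1] * (len(input_shape) - 1)), granularity.group_size)
    if isinstance(granularity, (PerRow, PerToken)):
        # one scale per row/token: block spans the last dimension
        return (*([1] * (len(input_shape) - 1)), input_shape[-1])
    raise ValueError(f"Unsupported granularity: {granularity}")
```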

```python
(act_granularity, weight_granularity) = _normalize_granularity(
    base_config.granularity
)
assert act_granularity == weight_granularity and isinstance(
```
@jerryzh168 (Contributor) commented:

split fake quant changes to a separate file?

@Xia-Weiwen (Collaborator, Author) replied:

Hi @jerryzh168, this file is about fake quant. Do you mean a separate file or a separate PR?

@Xia-Weiwen (Collaborator, Author) added:

And this PR needs this change for fake quant because it moves the checks out of _normalize_granularity. The same applies to similar changes elsewhere.

@jerryzh168 (Contributor) replied:

sorry, separate PR
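To make the point concrete, a sketch of the pattern this PR moves to, where _normalize_granularity only normalizes and each caller validates the pair it gets back (the function body is an assumption based on the diff above):

```python
from torchao.quantization.granularity import PerRow, PerTensor

def _normalize_granularity(granularity):
    # Normalize a single granularity or an (act, weight) pair into a pair;
    # validation is intentionally left to the caller.
    if granularity is None:
        return (PerTensor(), PerTensor())
    if isinstance(granularity, (list, tuple)):
        act, weight = granularity
        return (act, weight)
    return (granularity, granularity)

# Caller-side check, as in the fake-quant diff above:
act_granularity, weight_granularity = _normalize_granularity(PerRow())
assert act_granularity == weight_granularity and isinstance(
    act_granularity, (PerTensor, PerRow)
), "activation and weight granularity must match"
```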

```python
kernel_preference: KernelPreference = KernelPreference.AUTO
set_inductor_config: bool = True
version: int = 2
packing_format: Float8PackingFormat = Float8PackingFormat.PLAIN
```
@jerryzh168 (Contributor) commented:

nit: packing_format --> float8_packing_format

@Xia-Weiwen (Collaborator, Author) replied:

Thanks. Updated.
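For reference, a sketch of what the Float8PackingFormat enum described in the file summary might look like (the base class and string values are assumptions):

```python
from enum import Enum

class Float8PackingFormat(str, Enum):
    # Standard layout for float8 weights
    PLAIN = "plain"
    # CPU-specific reordered layout that users should not introspect directly
    OPAQUE = "opaque"

# After the rename above, the config field would read:
#   float8_packing_format: Float8PackingFormat = Float8PackingFormat.PLAIN
```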

```python
    return x


class TestDynamicFloat8Linear(TestCase):
```
@jerryzh168 (Contributor) commented:

nit: TestFloat8OpaqueTensor?

```python
@common_utils.parametrize("x_dim", [2, 3])
@common_utils.parametrize("bias", [True, False])
@common_utils.parametrize("bs", [1, 128])
def test_dynamic_float8_linear_per_tensor_cpu(
```
@jerryzh168 (Contributor) commented:

this is per tensor activation? might be good to clarify

```python
with torch.no_grad():
    quantize_(
        m,
        get_config([PerRow(), PerGroup(group_size)]),
```
@jerryzh168 (Contributor) commented Sep 19, 2025:

why are these tests not combined into the same one? seems all of them are very similar
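A sketch of the suggested consolidation: one test parametrized over the granularity pairs instead of several near-identical bodies (ToyLinearModel and get_config are assumed helpers from the test file, as in the diff above):

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
)
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerGroup, PerRow, PerTensor


class TestFloat8OpaqueTensor(TestCase):
    @parametrize("x_dim", [2, 3])
    @parametrize("bias", [True, False])
    @parametrize("bs", [1, 128])
    @parametrize(
        "granularity",
        [
            [PerTensor(), PerTensor()],
            [PerRow(), PerRow()],
            [PerRow(), PerGroup(64)],
        ],
    )
    def test_dynamic_float8_linear_cpu(self, x_dim, bias, bs, granularity):
        m = ToyLinearModel(bias=bias).eval()  # assumed helper from the test file
        with torch.no_grad():
            quantize_(m, get_config(granularity))  # get_config as in the diff above
            # ... compare the quantized output against the reference here


instantiate_parametrized_tests(TestFloat8OpaqueTensor)
```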

```python
        ]
    ],
) -> Tuple[FP8Granularity, FP8Granularity]:
    supported_granularities = (PerTensor, PerRow, PerGroup)
```
@jerryzh168 (Contributor) commented:

is this only supported for CPU? I think we also rely on this in the CUDA path, and it will be surprising if it says supported here but errors out somewhere else
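One way to make the constraint explicit, sketched here with illustrative names (not from the PR): gate the supported granularities on the device so a PerGroup config fails fast on non-CPU paths.

```python
from torchao.quantization.granularity import PerGroup, PerRow, PerTensor

def _check_granularity_supported(granularity, device_type: str) -> None:
    # PerGroup is only implemented by the CPU kernels added in this PR
    supported = (PerTensor, PerRow)
    if device_type == "cpu":
        supported = supported + (PerGroup,)
    if not isinstance(granularity, supported):
        raise ValueError(
            f"{type(granularity).__name__} granularity is not supported on {device_type}"
        )
```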


@jerryzh168 (Contributor) left a comment:

looks mostly fine; I'd suggest splitting the PR into:

  • get_block_size changes
  • add a normalize_granularity for CPU in quantization/quantize_/workflows/float8/, since CUDA doesn't support PerBlock yet
  • a PR to add the float8 linear op, which probably also needs some tests for the op itself
  • then the PR for Float8Tensor

@Xia-Weiwen (Collaborator, Author) commented:

> looks mostly fine; I'd suggest splitting the PR into:
>
>   • get_block_size changes
>   • add a normalize_granularity for CPU in quantization/quantize_/workflows/float8/, since CUDA doesn't support PerBlock yet
>   • a PR to add the float8 linear op, which probably also needs some tests for the op itself
>   • then the PR for Float8Tensor

Thanks. I will split it into multiple PRs.

@Xia-Weiwen Xia-Weiwen closed this Sep 26, 2025
@Xia-Weiwen Xia-Weiwen deleted the float8_da8w8 branch September 26, 2025 06:13