[CPU] Add Float8OpaqueTensor for dynamic float8 act float8 weight #2505
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2505
Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 5e75764 with merge base a951643. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Hi @chunyuan-w @mingfeima Could you please review this PR? Thanks.
Should we move the conversion vec code to this file? https://github.com/pytorch/pytorch/blob/cd995bfb2aac8891465809be3ce29543bd524287/aten/src/ATen/cpu/vec/vec512/vec512_float8.h Similar to this PR: pytorch/pytorch#152417
Thanks for the comment. If we move it to PyTorch, a problem might be that we need to check if the function is available at compile time. We may do it step by step, and for now it might be better to keep it here.
Pull Request Overview
Adds Float8OpaqueTensor for dynamic float8 activation and weight quantization on X86 CPU. This introduces a CPU-optimized tensor subclass that uses opaque memory layout for better performance on supported CPU ISAs.
- Adds Float8OpaqueTensor subclass with reordered memory layout for CPU optimization
- Implements two new CPU operators: float8_linear_prepack_cpu and float8_linear_cpu
- Extends Float8DynamicActivationFloat8WeightConfig to support opaque packing format
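For orientation, here is a minimal usage sketch of the new config surface. The import path of the enum and the `float8_packing_format` field name are assumptions based on the file summary and the review discussion below, not verbatim from the PR:

```python
import torch
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)
# Assumed import path, mirroring the new file added in this PR:
from torchao.quantization.quantize_.workflows.float8.float8_packing_format import (
    Float8PackingFormat,
)

model = torch.nn.Sequential(torch.nn.Linear(128, 256)).eval()
quantize_(
    model,
    Float8DynamicActivationFloat8WeightConfig(
        granularity=PerRow(),
        # OPAQUE selects the CPU-optimized reordered weight layout
        float8_packing_format=Float8PackingFormat.OPAQUE,
    ),
)
```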
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
float8_packing_format.py | Defines Float8PackingFormat enum with PLAIN and OPAQUE options |
float8_opaque_tensor.py | New Float8OpaqueTensor subclass implementation with CPU optimizations |
quant_api.py | Extends config to support opaque packing format and CPU device checks |
ops.py | Adds float8_linear_prepack_cpu and float8_linear_cpu operator definitions |
float8_linear.cpp | CPU kernel implementation for float8 linear operations |
observer.py | Adds PerGroup support to get_block_size function |
`__init__.py` files | Updates module exports |
test_float8_opaque_tensor.py | Comprehensive test suite for new functionality |
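For reference, a hedged sketch of what the `Float8PackingFormat` enum in float8_packing_format.py likely looks like; the actual definition may differ in detail:

```python
from enum import Enum

class Float8PackingFormat(str, Enum):
    # plain: the default layout used by the existing plain float8 tensor path
    PLAIN = "plain"
    # opaque: a CPU-specific reordered layout; the exact format is an internal
    # detail of the CPU kernels and not meant to be inspected by users
    OPAQUE = "opaque"
```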
Resolved review threads (outdated):
- torchao/quantization/quantize_/workflows/float8/float8_packing_format.py
- torchao/quantization/quantize_/workflows/float8/float8_opaque_tensor.py
…tensor.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
LGTM now, just some minor places to address.
Hi @jerryzh168 Could you please review this PR again? Thanks.
```diff
         block_size[granularity.axis] = 1
         return tuple(block_size)
-    elif isinstance(granularity, PerRow):
+    elif isinstance(granularity, (PerRow, PerToken)):
```
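For context, a simplified sketch of the block-size computation being discussed. This mirrors the behavior shown in the diff above, not the exact torchao implementation, which also handles other granularities such as PerAxis:

```python
from typing import Tuple

from torchao.quantization.granularity import PerGroup, PerRow, PerTensor, PerToken

def get_block_size(input_shape: Tuple[int, ...], granularity) -> Tuple[int, ...]:
    if isinstance(granularity, PerTensor):
        # one scale for the whole tensor
        return input_shape
    elif isinstance(granularity, (PerRow, PerToken)):
        # one scale per row/token: the block spans only the last dimension
        return (1,) * (len(input_shape) - 1) + (input_shape[-1],)
    elif isinstance(granularity, PerGroup):
        # one scale per group of `group_size` elements along the last dimension
        return (1,) * (len(input_shape) - 1) + (granularity.group_size,)
    raise ValueError(f"Unsupported granularity: {granularity}")
```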
maybe create a separate PR to merge these two: `def get_block_size(` at ao/torchao/quantization/pt2e/observer.py (line 1783 in 18dbe87) and this one, and move it to torchao/quantization/utils?
seems like the pt2e one is not used that much, so should be relatively easy to remove
Sure. I will have another PR to do it.
```python
(act_granularity, weight_granularity) = _normalize_granularity(
    base_config.granularity
)
assert act_granularity == weight_granularity and isinstance(
```
split fake quant changes to a separate file?
Hi @jerryzh168 This file is about fake quant. Do you mean a separate file or a separate PR?
And this PR needs this change for fake quant because it moves the checks from inside `_normalize_granularity` out to the call sites. The same applies to similar changes elsewhere.
sorry, separate PR
torchao/quantization/quant_api.py (outdated):
```python
kernel_preference: KernelPreference = KernelPreference.AUTO
set_inductor_config: bool = True
version: int = 2
packing_format: Float8PackingFormat = Float8PackingFormat.PLAIN
```
nit: `packing_format` --> `float8_packing_format`
Thanks. Updated.
```python
    return x


class TestDynamicFloat8Linear(TestCase):
```
nit: `TestFloat8OpaqueTensor`?
@common_utils.parametrize("x_dim", [2, 3]) | ||
@common_utils.parametrize("bias", [True, False]) | ||
@common_utils.parametrize("bs", [1, 128]) | ||
def test_dynamic_float8_linear_per_tensor_cpu( |
this is per tensor activation? might be good to clarify
```python
with torch.no_grad():
    quantize_(
        m,
        get_config([PerRow(), PerGroup(group_size)]),
```
why are these tests not combined into the same one? seems all of them are very similar
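One possible consolidation, sketched under the assumption that the tests differ only in the granularity passed to `get_config`; `ToyLinearModel` and the other helper names are illustrative, not from the PR:

```python
@common_utils.parametrize(
    "granularity",
    [PerTensor(), PerRow(), [PerRow(), PerGroup(128)]],
)
@common_utils.parametrize("x_dim", [2, 3])
@common_utils.parametrize("bias", [True, False])
def test_dynamic_float8_linear_cpu(self, granularity, x_dim, bias):
    # ToyLinearModel and get_config are assumed helpers from this test suite
    m = ToyLinearModel(bias=bias).eval()
    with torch.no_grad():
        quantize_(m, get_config(granularity))
        # ... run the quantized model and compare against a reference
```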
```python
        ]
    ],
) -> Tuple[FP8Granularity, FP8Granularity]:
    supported_granularities = (PerTensor, PerRow, PerGroup)
```
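For readers following the thread, a hedged sketch of the normalization step under discussion. The behavior is inferred from the diff; the real function's signature and checks differ, and per-backend validation is what the reviewer is asking about:

```python
from typing import Tuple

from torchao.quantization.granularity import PerGroup, PerRow, PerTensor

def _normalize_granularity(granularity) -> Tuple:
    # None defaults to per-tensor; a single granularity is shared by
    # activation and weight; a pair is returned as-is. Per-backend checks
    # (e.g. what the CUDA vs. CPU kernels accept) live at the call sites.
    if granularity is None:
        granularity = PerTensor()
    if isinstance(granularity, (PerTensor, PerRow, PerGroup)):
        return (granularity, granularity)
    act_granularity, weight_granularity = granularity
    return (act_granularity, weight_granularity)
```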
is this only supported for CPU? I think we also rely on this in the cuda path, and it will be surprising if it says supported here but error out somewhere else
looks mostly fine, I'd suggest to split the PR into:
- the `get_block_size` changes
- adding a `normalize_granularity` for CPU in quantization/quantize_/workflows/float8/? (since CUDA doesn't support PerBlock yet)
- a PR to add the float8 linear op? (probably also needs some tests for the op itself)
- then the PR for Float8Tensor
Thanks. I will split it into multiple PRs.
We split this PR into the following smaller ones:
Closing this one.
Summary
This PR adds Float8OpaqueTensor for dynamic float8 activation and float8 weight quantization on X86 CPU. It adds:
- the Float8OpaqueTensor subclass
- the float8_linear_prepack_cpu and float8_linear_cpu ops
The kernel computes FP8 GEMM with the supported CPU ISAs.
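At the op level, the flow is roughly as follows. This is a sketch: the argument lists are assumptions inferred from the PR description, and the authoritative definitions live in ops.py and float8_linear.cpp in this PR:

```python
import torch

# Illustrative inputs (shapes are arbitrary); per-row weight scales
weight = torch.randn(256, 128)
weight_scale = weight.abs().amax(dim=1, keepdim=True) / torch.finfo(torch.float8_e4m3fn).max
fp8_weight = (weight / weight_scale).to(torch.float8_e4m3fn)
activation = torch.randn(4, 128)
bias = torch.randn(256)

# One-time prepack: reorder the float8 weight into the opaque CPU layout.
# (Assumed signature.)
packed_weight = torch.ops.torchao.float8_linear_prepack_cpu(fp8_weight, weight_scale)

# Per-call compute: dynamically quantize the activation to float8, then run
# the FP8 GEMM with the best kernel for the detected CPU ISA. (Assumed signature.)
out = torch.ops.torchao.float8_linear_cpu(activation, packed_weight, weight_scale, bias)
```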
Test plan