
Conversation

EnricoDeg
Contributor

Proposed changes

Summary:

  • Modify the gridwise implementation to work with convolution (grid descriptors are no longer created internally but are passed in from the device level)
  • Add device-level implementations: DeviceGroupedConvBwdWeight_Wmma_CShuffleV3, DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3 and DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
  • Add a device implementation of batched GEMM with multiple D tensors (needed for the explicit-GEMM path of conv bwd weight)
  • Adapt the existing explicit-GEMM device implementation to work with both the XDL and WMMA implementations of batched GEMM with multiple Ds
  • Add support for occupancy-based split-k for the one-stage and two-stage implementations of grouped conv bwd weight (a rough sketch of the occupancy idea follows this summary)
  • Create instances
  • Add examples
  • Remove old instances (they don't support split-k)
  • Add tests for bwd weight scale

The implementations are based on CShuffleV3, but the functionality is the same as the XDL versions.
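
The occupancy-based split-k mentioned in the summary is essentially about picking the split factor so that the launched grid saturates the device. The sketch below only illustrates that idea; the names (choose_split_k, tiles_per_gemm, num_cus, waves_per_cu) are hypothetical and are not the PR's actual API, which lives in the device-level implementations.

#include <algorithm>
#include <cstdint>

// Hypothetical helper: pick split_k so that the number of workgroups roughly
// saturates the device. Illustration of the occupancy-based idea only.
inline uint32_t choose_split_k(uint32_t tiles_per_gemm,   // M x N tile count of one GEMM
                               uint32_t gemm_batch,       // number of GEMMs launched together
                               uint32_t k_iterations,     // reduction length in K tiles
                               uint32_t num_cus,          // compute units on the device
                               uint32_t waves_per_cu = 4) // target occupancy per CU
{
    const uint32_t target_workgroups = num_cus * waves_per_cu;
    const uint32_t base_workgroups   = tiles_per_gemm * gemm_batch;
    if(base_workgroups == 0 || base_workgroups >= target_workgroups)
        return 1; // already enough parallelism, keep a single accumulation pass
    // Split the K dimension just enough to reach the occupancy target,
    // but never beyond the number of available K iterations.
    const uint32_t wanted = (target_workgroups + base_workgroups - 1) / base_workgroups;
    return std::min(wanted, std::max<uint32_t>(k_iterations, 1));
}

With split_k > 1 the partial results still have to be combined, typically either via atomic adds (one-stage) or a separate reduction pass (two-stage), which is presumably why both variants are listed above.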

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

…/conv_bwd_weight_wmma'

Convolution bwd weight device implementation

See merge request amd/ai/composable_kernel!38
 - rdna3 compilation error
 - gridwise layouts (need to be correct to ensure that CheckValidity() works correctly)
…re/conv_bwd_weight_wmma'

Grouped conv: Instances and example bwd weight

See merge request amd/ai/composable_kernel!47
Device implementation of explicit gemm for grouped conv bwd weight

See merge request amd/ai/composable_kernel!52
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces WMMA (Wave Matrix Multiply Accumulate) support for grouped convolution backward weight operations, adding an alternative to the existing XDL implementation for hardware that exposes WMMA instead of XDL (gfx11/gfx12).

  • Adds complete WMMA device-level implementations for grouped convolution backward weight operations
  • Introduces batched GEMM with multiple D tensors for explicit GEMM-based convolution implementations (a sketch of the usual dimension mapping follows this list)
  • Extends support for occupancy-based split-k optimization to both one-stage and two-stage implementations
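
For context on the explicit-GEMM bullet above, the usual dimension mapping for backward-weight convolution as a batched GEMM looks roughly as follows. This is an illustrative sketch only; the struct and function names are made up, and the PR expresses the same mapping through its grid/tensor descriptors.

#include <cstdint>

// Illustration: how grouped conv bwd-weight is commonly phrased as a batched GEMM.
// dW[g, k, c, y, x] accumulates dOut[n, ho, wo, g, k] * In[n, hi, wi, g, c]
// over n and the output spatial positions, so the reduction dimension is large
// and is the natural candidate for split-k.
struct ConvBwdWeightAsGemm
{
    uint64_t batch; // G: one GEMM per group
    uint64_t m;     // K: output channels
    uint64_t n;     // C * Y * X: input channels times filter taps
    uint64_t k;     // N * Ho * Wo: reduction dimension, shared across split-k partials
};

inline ConvBwdWeightAsGemm map_conv2d_bwd_weight(uint64_t G, uint64_t N, uint64_t K,
                                                 uint64_t C, uint64_t Y, uint64_t X,
                                                 uint64_t Ho, uint64_t Wo)
{
    return ConvBwdWeightAsGemm{G, K, C * Y * X, N * Ho * Wo};
}

The "multiple D" part refers to the extra epilogue operands (for example the bilinear and scale tensors exercised by the new tests) fused into this batched GEMM.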

Reviewed Changes

Copilot reviewed 128 out of 128 changed files in this pull request and generated 6 comments.

Summary per file:

  • test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight_scale.cpp: new comprehensive test suite for grouped convolution backward weight scale operations
  • test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight_bilinear.cpp: enhanced test implementation with improved error threshold calculations for split-k operations
  • test/grouped_convnd_bwd_weight/test_grouped_convnd_bwd_weight.cpp: simplified test configuration by removing GPU architecture-specific constraints
  • test/grouped_convnd_bwd_weight/CMakeLists.txt: updated build configuration to include new test executables and reorganize dependencies
  • profiler/src/CMakeLists.txt: reorganized device instance dependencies in the profiler build system
  • library/src/tensor_operation_instance/gpu/grouped_convnd_bwd_weight/explicit_xdl/: function name updates to distinguish XDL implementations from WMMA variants
  • library/src/tensor_operation_instance/gpu/grouped_convnd_bwd_weight/explicit_wmma/: new WMMA-based explicit GEMM implementations for fp16 and bf16 data types
  • library/src/tensor_operation_instance/gpu/grouped_conv*d_bwd_weight/wmma/: comprehensive WMMA device implementations for 1D, 2D, and 3D grouped convolutions
  • library/include/ck/library/tensor_operation_instance/gpu/: updated header files with new WMMA instance declarations and factory methods



using AccDataType = float;
float max_accumulated_value =
*std::max_element(wei_host.mData.begin(),wei_host.mData.end());

Copilot AI Sep 29, 2025


Missing space after comma in std::max_element call. Should be 'wei_host.mData.begin(), wei_host.mData.end()'.

Suggested change:
- *std::max_element(wei_host.mData.begin(),wei_host.mData.end());
+ *std::max_element(wei_host.mData.begin(), wei_host.mData.end());


using AccDataType = float;
float max_accumulated_value =
*std::max_element(wei_host.mData.begin(),wei_host.mData.end());

Copilot AI Sep 29, 2025


Missing space after comma in std::max_element call. Should be 'wei_host.mData.begin(), wei_host.mData.end()'.

Suggested change:
- *std::max_element(wei_host.mData.begin(),wei_host.mData.end());
+ *std::max_element(wei_host.mData.begin(), wei_host.mData.end());

ConvBwdWeightDefault,
BlockGemmPipelineScheduler::Intrawave,
BlockGemmPipelineVersion::v1>{});
;

Copilot AI Sep 29, 2025


Stray semicolon after function call. This should be removed.

Suggested change:
- ;

PassThrough,
PassThrough>>>& instances)
{


Copilot AI Sep 29, 2025


Extra blank line before add_device_operation_instances call. Should be removed for consistency.

Suggested change: remove the blank line.

PassThrough>>>& instances);

void add_device_grouped_conv3d_bwd_weight_wmma_gndhwc_gkzyxc_gndhwk_f16_1x1s1p0_instances(
void add_device_grouped_conv3d_bwd_weight_wmma_ndhwgc_gkzyxc_ndhwgk_f16_instances(

Copilot AI Sep 29, 2025


[nitpick] Function declaration ordering is inconsistent. Two-stage functions should be grouped together after the regular instance functions for better readability and maintainability.


PassThrough,
PassThrough,
PassThrough>>>& instances);


Copilot AI Sep 29, 2025


[nitpick] Function declaration ordering is inconsistent. Two-stage functions should be grouped together after the regular instance functions for better readability and maintainability.


Contributor

bartekxk left a comment


Review will be continued

1, // CShuffleMRepeatPerShuffle
1, // CShuffleNRepeatPerShuffle
S<1, 32, 1, 4>, // CShuffleBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock
128 / (sizeof(WeiDataType) * CHAR_BIT)>; // CShuffleBlockTransferScalarPerVector_NPerBlock
Contributor


Can we avoid extending the GNHWC layout family? Please change this example to NHWGC.

ck::tensor_layout::convolution::GNDHWC>>,
ck::tuple_element_t<NDimSpatial - 1,
ck::Tuple<ck::tensor_layout::convolution::GKXC,
ck::tensor_layout::convolution::GKYXC,
Contributor


Please change this example to NHWGC
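
For illustration, a hedged sketch of what the requested NHWGC-family selection could look like, mirroring the tuple_element_t pattern in the quoted snippet; the layout type names come from ck::tensor_layout::convolution, but the exact code in the example may differ:

// Sketch: select the NHWGC-family layouts instead of the GNHWC family,
// keeping the same NDimSpatial-indexed selection as in the quoted example.
using InLayout  = ck::tuple_element_t<NDimSpatial - 1,
                                      ck::Tuple<ck::tensor_layout::convolution::NWGC,
                                                ck::tensor_layout::convolution::NHWGC,
                                                ck::tensor_layout::convolution::NDHWGC>>;
using WeiLayout = ck::tuple_element_t<NDimSpatial - 1,
                                      ck::Tuple<ck::tensor_layout::convolution::GKXC,
                                                ck::tensor_layout::convolution::GKYXC,
                                                ck::tensor_layout::convolution::GKZYXC>>;
using OutLayout = ck::tuple_element_t<NDimSpatial - 1,
                                      ck::Tuple<ck::tensor_layout::convolution::NWGK,
                                                ck::tensor_layout::convolution::NHWGK,
                                                ck::tensor_layout::convolution::NDHWGK>>;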

ck::utils::get_absolute_threshold<InDataType, WeiDataType, AccDataType>(
max_accumulated_value / num_accums_split_k,
num_accums / num_accums_split_k);

Contributor


Please use the default check_err tolerances for split_k == 1, because the calculated threshold could hide some errors.
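
A sketch of what this could look like in the test, assuming ck::utils::check_err and the threshold helpers quoted above (get_relative_threshold is assumed to be available alongside the quoted get_absolute_threshold); the variable names such as wei_device and pass are placeholders and may differ from the PR:

// Sketch only: keep the default (tight) check_err tolerances when there is a single
// accumulation pass, and use the relaxed, computed thresholds only for split_k > 1.
bool pass = true;
if(split_k == 1)
{
    pass = ck::utils::check_err(wei_device.mData, wei_host.mData);
}
else
{
    const auto atol = ck::utils::get_absolute_threshold<InDataType, WeiDataType, AccDataType>(
        max_accumulated_value / num_accums_split_k, num_accums / num_accums_split_k);
    const auto rtol = ck::utils::get_relative_threshold<InDataType, WeiDataType, AccDataType>(
        num_accums / num_accums_split_k);
    pass = ck::utils::check_err(
        wei_device.mData, wei_host.mData, "Error: incorrect results!", rtol, atol);
}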

#if defined(__gfx11__)
// gfx11 does not support *_atomic_pk_add_f16/bf16 instructions
using e_data_type = remove_cvref_t<remove_pointer_t<decltype(karg.p_e_grid)>>;
if constexpr(!(EGlobalMemoryDataOperation == InMemoryDataOperationEnum::AtomicAdd &&
Contributor


Doesn't such a condition cause register spills? I remember an issue like that with if constexpr inside a global (kernel) function.
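
To make the condition under discussion concrete, here is a self-contained illustration (not the PR's code) of the compile-time check: on gfx11 the kernel must not instantiate an epilogue that needs packed f16/bf16 atomic adds. The enum and type names below are local stand-ins for the CK ones.

#include <type_traits>

// Stand-ins for the CK types referenced in the quoted snippet.
enum class InMemoryDataOperationEnum { Set, AtomicAdd };
struct half_t;  // stand-in for ck::half_t
struct bhalf_t; // stand-in for ck::bhalf_t

// gfx11 lacks *_atomic_pk_add_f16/bf16, so AtomicAdd on f16/bf16 output is not viable there.
template <InMemoryDataOperationEnum Op, typename EDataType>
constexpr bool is_viable_on_gfx11()
{
    return !(Op == InMemoryDataOperationEnum::AtomicAdd &&
             (std::is_same_v<EDataType, half_t> || std::is_same_v<EDataType, bhalf_t>));
}

static_assert(is_viable_on_gfx11<InMemoryDataOperationEnum::Set, half_t>());
static_assert(!is_viable_on_gfx11<InMemoryDataOperationEnum::AtomicAdd, bhalf_t>());

Guarding the whole kernel body with such an if constexpr is what raises the register-spill question; one alternative is to reject these configurations in the host-side IsSupportedArgument check instead.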

#if(defined(__gfx11__) || defined(__gfx12__))
#if defined(__gfx11__)
// gfx11 does not support *_atomic_pk_add_f16/bf16 instructions
using e_data_type = remove_cvref_t<remove_pointer_t<decltype(karg.p_e_grid)>>;
Contributor


Suggested change:
- using e_data_type = remove_cvref_t<remove_pointer_t<decltype(karg.p_e_grid)>>;
+ using EDataType = remove_cvref_t<remove_pointer_t<decltype(karg.p_e_grid)>>;
