Conversation

@AviralGoelAMD (Collaborator) commented Sep 26, 2025

This PR extends unit test coverage to grouped_gemm_multi_d when persistent == true in GemmConfig.

This PR should only be merged when #2933 is merged into develop. It is ready for review though.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged. -> Yes, if Extend Grouped GEMM with MultiD (Single & Double Shared Memory) feature to use persistent kernel option #2933 is merged.

Discussion

lalala-sh and others added 22 commits September 30, 2025 00:02
* fix wp gemm bug when permuteN is false

* code clean

---------

Co-authored-by: valarLip <340077269@qq.com>
…ine (#2934)

* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.
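
A minimal sketch of the validation idea in a hypothetical test harness (none of these names are from the actual test): since the appendkv kernel mutates q in place, every rerun has to start from an untouched copy of q, otherwise the reference check compares against a q that was rotary-embedded more than once.

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>
#include <vector>

void rerun_with_fresh_q(float* q_dev, const std::vector<float>& q_host, int repeats)
{
    const std::size_t bytes = q_host.size() * sizeof(float);
    for(int r = 0; r < repeats; ++r)
    {
        // Reset the device-side q buffer before launching the kernels again.
        if(hipMemcpy(q_dev, q_host.data(), bytes, hipMemcpyHostToDevice) != hipSuccess)
            return;
        // ... launch appendkv + fwd kernels here, then compare against the host reference ...
    }
}
```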

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds. Because
warps progress at different speeds, one warp can rewrite values that another
warp is still reading.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
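
The following HIP kernel is illustrative only (it is not the ck_tile pipeline code) and shows the race class described above: the same LDS buffer is read in one phase and rewritten in the next, so a barrier is needed between the two phases.

```cpp
#include <hip/hip_runtime.h>

__global__ void combine_partial_lse(const float* lse_acc, float* lse_out, int num_splits)
{
    // Assumes num_splits <= 256 and blockDim.x >= num_splits.
    __shared__ float lse_acc_lds[256];

    // Phase 1: stage the per-split partial values in LDS, then let every thread
    // read the whole staged range.
    if(threadIdx.x < num_splits)
        lse_acc_lds[threadIdx.x] = lse_acc[blockIdx.x * num_splits + threadIdx.x];
    __syncthreads();

    float acc = 0.f;
    for(int s = 0; s < num_splits; ++s)
        acc += lse_acc_lds[s];

    // Without this barrier, thread 0 could reach Phase 2 and overwrite
    // lse_acc_lds[0] while threads in other warps are still reading it above.
    __syncthreads();

    // Phase 2: the same LDS buffer is reused for the combined value.
    if(threadIdx.x == 0)
    {
        lse_acc_lds[0]      = acc;
        lse_out[blockIdx.x] = lse_acc_lds[0];
    }
}
```
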
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with a reference or pointer) in Philox;
    * "scalarize" seed and offset; they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler produce more optimal code and reduce register spilling.
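
A short sketch of the idea, with made-up helper names: extract the 32-bit halves of a 64-bit Philox seed or offset with shifts and masks instead of type punning through a reinterpret_cast pointer or reference.

```cpp
#include <cstdint>

inline void split_seed(uint64_t seed, uint32_t& lo, uint32_t& hi)
{
    // Plain integer arithmetic: no aliasing issues, easy for the compiler to keep in registers.
    lo = static_cast<uint32_t>(seed & 0xFFFFFFFFu);
    hi = static_cast<uint32_t>(seed >> 32);
}
```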

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true); this version also supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similar to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values; seed and offset are
64-bit, so they get truncated.
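
A sketch of the fix described above (the helper name is hypothetical): since the builtin broadcasts a 32-bit value from the first active lane, a 64-bit seed or offset is split into two halves, broadcast separately, and recombined instead of being truncated.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

__device__ inline uint64_t read_first_lane_u64(uint64_t v)
{
    const uint32_t lo = __builtin_amdgcn_readfirstlane(static_cast<uint32_t>(v));
    const uint32_t hi = __builtin_amdgcn_readfirstlane(static_cast<uint32_t>(v >> 32));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```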

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects an unsigned int, but a float was passed (implicitly converted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
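
An illustration of this bug class (not the actual CK code): passing a float where an unsigned int parameter is expected performs a value conversion (e.g. 1.5f becomes 1u), not a reinterpretation of the bits, so the wrong word would be stored. Reinterpreting the bits explicitly preserves the payload.

```cpp
#include <cstdint>

inline uint32_t float_bits(float x)
{
    return __builtin_bit_cast(uint32_t, x); // bit pattern preserved, no value conversion
}

// buffer_store<4>(..., float_bits(value), ...)  // rather than passing `value` itself
```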

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
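
A hypothetical illustration of the selection change (types and names are invented): pick the fallback tile with the largest bm0 that is actually available, rather than assuming a tile with bm0 == 128 always exists.

```cpp
#include <algorithm>
#include <vector>

struct TileDesc { int bm0; int bn0; };

inline const TileDesc* pick_fallback_tile(const std::vector<TileDesc>& tiles)
{
    const auto it = std::max_element(
        tiles.begin(), tiles.end(),
        [](const TileDesc& a, const TileDesc& b) { return a.bm0 < b.bm0; });
    return it == tiles.end() ? nullptr : &*it;
}
```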

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because the smallest tile is now
selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but the API must still compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd

This change ensures that the files selected for clang-format validation are exactly the ones tracked by the git repo under test. This protects against a known issue where the repo being tested contained "stray files" from a previous test.
* Change the return type of run_gemm_combinations in the basic tests

* Change the return type of run_gemm_combinations in the universal tests

* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4

* Add universal GEMM test for fp8 x pk_i4

* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.

* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>

* No need for utility in test_ck_tile_elementwise_1d

* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8

* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant

* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union

* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced

* Convert from float to bf16 during compilation rather than using magic values
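
A minimal sketch, with invented names, of a constexpr-friendly float-to-bf16 truncation: union-based type punning is not usable in constant expressions, but __builtin_bit_cast is, which lets tables such as the pk_int4-to-bf16 lookup be generated at compile time instead of hard-coding magic values.

```cpp
#include <cstdint>

constexpr uint16_t float_to_bf16_trunc_bits(float x)
{
    // bf16 keeps the upper 16 bits of the IEEE-754 binary32 representation.
    return static_cast<uint16_t>(__builtin_bit_cast(uint32_t, x) >> 16);
}

static_assert(float_to_bf16_trunc_bits(1.0f) == 0x3F80, "bf16 bit pattern of 1.0f");
```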

* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8

* Comment out the basic test for fp16 x pk_i4 as it does not pass

* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>

* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8

* Add basic and universal GEMM tests for bf8 x pk_i4

* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now

* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now

* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp

* Use explicit macros for enabling and disabling the constexpr lookup-based converters

* Fix two failing tests

* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant

* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16

* On ROCm 7.0.1 we need an explicit cast from uint16_t to bf16_t

* Grouped Conv Bwd Data index calculation optimizations

* fixes

* refactor instances

* gfx12 fixes

* temporarily disable splitK for gfx12

Root cause: AK1 and BK1 may differ in the class template, so we need to calculate k0 per block separately when ksplit is not 1.
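
Illustrative numbers only (not taken from the kernel): with the usual K = K0 * K1 decomposition, a single shared K0-per-block is only valid when AK1 == BK1; otherwise each operand needs its own value.

```cpp
constexpr int KPerBlock   = 64;
constexpr int AK1         = 8;
constexpr int BK1         = 4;
constexpr int AK0PerBlock = KPerBlock / AK1; // 8
constexpr int BK0PerBlock = KPerBlock / BK1; // 16
```
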
* fix: tf32: fix build failure for all supported targets

* new fix code
…2884)

* [CK][Examples] Extending support for rdna3/4 in the following examples:
-example_gemm_xdl_splitk_reduce_multi_d_fp16
-example_gemm_xdl_splitk_reduce_multi_d_bf16
-example_gemm_xdl_splitk_reduce_bf16A_i8B
-example_gemm_xdl_splitk_reduce_bfp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
-example_gemm_add_multiply_xdl_fp16
-example_complex_contraction_bilinear_xdl_fp32
-example_grouped_gemm_lower_triangle_scale_softmax_gemm_permute_xdl_fp16
-example_batched_gemm_bias_e_permute_xdl_fp16
-example_gemm_xdl_fp16
-example_gemm_xdl_fp16_av2
-example_gemm_xdl_wavelet_fp16
-example_gemm_add_add_fastgelu_xdl_bf16
-example_gemm_add_add_fastgelu_xdl_fp16
-example_gemm_add_add_fastgelu_xdl_fp32
-example_grouped_gemm_xdl_fp32
-example_grouped_gemm_xdl_fp16
-example_grouped_gemm_xdl_bf16
-example_cgemm_xdl_bf16
-example_cgemm_xdl_fp16

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

---------

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
* hot fix check eid range

* fix clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* initial commit

* remove extra files

* fixing errors

* updated README file with the mapping of different quantization types to different configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

* initial commit

* debugging

* working fp8 for init constant

* fp8 working with all inits

* updated block level code with comments

* changing the loop iter

* debugging

* debugging

* debugging

* code fix

* code clean up

* clang formatted

* Add comment

* code cleanup

* clang formatted

* merge conflicts fixes

* applying the latest int4 changes to the pipeline

* fixing test code for updated traits

* Adding gtest

* review comments addressed

* addressing review comments

* remove c++20 code

* added flush cache changes

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>
Addition of initial CK Tile Stream-K example for bf16 and fp16. These
examples are minimal. As more functionality and gtests are added for
Stream-K (coming in future PRs), these examples will be expanded.
The following changes were made:
- Clean-up of variable naming
- Addition of README
- Removal of num_cu and occupancy args; such options are meant for
  testing purposes and should not be exposed to the user
- Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class
  since we only support one pipeline at the moment.