Conversation

@bartekxk
Contributor

Proposed changes

Add Grouped Conv Bwd Weight Direct Load implementation and instances

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, if the test takes more than 30 seconds to run
  • I have added inline documentation that enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

if constexpr(DirectLoad)
{
    return make_naive_tensor_descriptor(
        make_tuple(AK0Number, Number<NPerBlock>{}, AK1Number),
Contributor

I believe that BK0Number, BK1Number and such are the proper ones here, since it's the B layout

Contributor Author

You are right, thanks
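
For context, a minimal sketch of the agreed fix, assuming the B-side constants BK0Number, BK1Number and NPerBlock from the surrounding gridwise GEMM; the stride tuple shown is the usual CK (K0, N, K1) block-descriptor layout and is not copied from the PR:

// Sketch of the agreed fix: build the B-block descriptor from the B-side
// compile-time constants instead of the A-side AK0Number/AK1Number used above.
// I1 is the compile-time constant 1 that CK kernels commonly define.
if constexpr(DirectLoad)
{
    return make_naive_tensor_descriptor(
        make_tuple(BK0Number, Number<NPerBlock>{}, BK1Number),
        make_tuple(Number<NPerBlock>{} * BK1Number, BK1Number, I1));
}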

@aosewski requested a review from Copilot on January 26, 2026 at 14:47
Copilot AI left a comment

Pull request overview

This PR adds a direct load implementation for grouped convolution backward weight operations, introducing a hardware-optimized memory transfer path for gfx950 devices. The implementation uses the ThreadGroupTensorSliceTransfer_DirectLoad mechanism, with specific handling for the F16 and BF16 data types.

Changes:

  • Added direct load instances for F16 and BF16 grouped conv backward weight operations
  • Introduced DirectLoad and LdsScalarLoad template parameters throughout the pipeline
  • Implemented device-specific validation to restrict direct load to the gfx950 architecture (a sketch of such a check follows this list)
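
As a rough illustration of the last point, CK device operations normally gate unsupported configurations inside IsSupportedArgument; the sketch below assumes a DirectLoad flag on the device op and uses CK's ck::get_device_name() host utility, but it is not code taken from the PR:

// Illustrative sketch only: reject DirectLoad instances on anything but gfx950.
static bool IsSupportedArgument(const Argument& arg)
{
    if constexpr(DirectLoad)
    {
        // Direct (global -> LDS) loads are only exercised on gfx950 here;
        // the exact placement of this check in the PR is an assumption.
        if(ck::get_device_name() != "gfx950")
        {
            return false;
        }
    }
    // ... the usual data-type / vector-load / layout checks on arg would follow ...
    return true;
}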

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file

  • device_grouped_conv2d_bwd_weight_xdl_nhwgc_gkyxc_nhwgk_f16_direct_load.cpp: instantiates the F16 direct load device operations for grouped conv backward weight
  • device_grouped_conv2d_bwd_weight_xdl_nhwgc_gkyxc_nhwgk_bf16_direct_load.cpp: instantiates the BF16 direct load device operations for grouped conv backward weight
  • CMakeLists.txt: registers the new F16 and BF16 direct load source files in the build system
  • grouped_convolution_backward_weight_xdl.inc: declares the F16 and BF16 direct load instance functions (see the sketch after this list)
  • grouped_convolution_backward_weight.hpp: integrates the direct load instances into the instance factory
  • device_grouped_conv_bwd_weight_v3_xdl_instance.hpp: defines the F16 and BF16 direct load instance configurations with the DirectLoad flag set to true
  • gridwise_gemm_xdl_cshuffle_conv_v3.hpp: adds direct load conditional logic with specialized block descriptors and transfer handling
  • device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp: implements direct load support with gfx950 validation and group merging logic
  • thread_group_tensor_slice_transfer_direct_load.hpp: removes the destination vector dimension constraint for direct load
  • blockwise_gemm_pipeline_xdlops_v1.hpp: adds the LdsScalarLoad parameter to the direct load pipeline
  • blockwise_gemm_pipeline_xdlops_selector.hpp: adds the LdsScalarLoad selection logic with validation
  • blockwise_gemm_pipeline_xdlops_base.hpp: implements the scalar load logic for LDS transfers when enabled
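
For orientation, instance functions declared in CK's .inc headers usually follow the pattern sketched below; the function name and template arguments of the direct load variants are assumptions for illustration and are not taken from the PR (F16, NHWGC, GKYXC, NHWGK and PassThrough are the usual CK aliases provided by the library headers):

#include <memory>
#include <vector>

// Hypothetical declaration sketch mirroring the existing (non direct load)
// grouped conv bwd-weight instance functions in CK.
void add_device_grouped_conv2d_bwd_weight_xdl_nhwgc_gkyxc_nhwgk_f16_direct_load_instances(
    std::vector<std::unique_ptr<DeviceGroupedConvBwdWeight<2,           // spatial dims
                                                           NHWGC,       // input layout
                                                           GKYXC,       // weight layout
                                                           NHWGK,       // output layout
                                                           F16,         // input data type
                                                           F16,         // weight data type
                                                           F16,         // output data type
                                                           PassThrough, // input elementwise op
                                                           PassThrough, // weight elementwise op
                                                           PassThrough  // output elementwise op
                                                           >>>& instances);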


Comment on lines +87 to +109
#if defined(__gfx950__)
    DispatchSplitKHack<GridwiseGemm,
                       AGridDesc_AK0_M_K1,
                       BGridDesc_BK0_N_K1,
                       CGridDesc_MBlock_MPerBlock_NBlock_NPerBlock,
                       HasMainKBlockLoop,
                       CGlobalMemoryDataOperation,
                       TailNum>(karg.p_a_grid + a_batch_offset + split_k_offset_a,
                                karg.p_b_grid + b_batch_offset + split_k_offset_b,
                                karg.p_c_grid + e_batch_offset,
                                p_shared,
                                karg,
                                a_grid_desc_ak0_m_ak1,
                                b_grid_desc_bk0_n_bk1,
                                c_grid_desc_mblock_mperblock_nblock_nperblock,
                                k_idx * num_k_per_block,
                                gridDim.y,
                                split_k_offset_hack);
#endif
}
else
{
    DispatchSplitKHack<GridwiseGemm,
Collaborator

What's the difference? Both invocations seem to pass an identical parameter set.

Contributor Author

I need to disable this on other archs
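
In other words, both branches intentionally pass the same arguments; the difference is that the DirectLoad branch is wrapped in #if defined(__gfx950__) so its body is compiled out on every other target. A minimal sketch of that pattern, with launch_kernel as a hypothetical stand-in for the real dispatch:

#include <utility>

// Hypothetical stand-in for the real kernel dispatch; name and signature are
// illustrative only, not from the PR.
template <typename... Args>
__device__ void launch_kernel(Args&&...);

template <bool DirectLoad, typename... Args>
__device__ void dispatch(Args&&... args)
{
    if constexpr(DirectLoad)
    {
#if defined(__gfx950__)
        // Direct-load body: emitted only when compiling for gfx950.
        launch_kernel(std::forward<Args>(args)...);
#endif
        // On any other target this branch has an empty body, so the
        // direct-to-LDS instructions never reach the compiler; the host-side
        // support check keeps such instances from being selected there.
    }
    else
    {
        // Regular LDS-staging path, compiled for every supported target.
        launch_kernel(std::forward<Args>(args)...);
    }
}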
