Congma/ck tile/preshuffle b #3645

CongMa13 · 2026-01-24T00:01:43Z

This PR improves preshuffle_b performance.

Introduces a constraint on the per-thread load size along the K dimension from global memory.
Each thread now loads either:
- 16 bytes (a single dwordx4 instruction), or
- Exactly the K required by the MFMA instruction when 16 bytes is inadequate.
In the 16-byte mode, data from one dwordx4 load can be consumed by one or multiple MFMA instructions.
In the MFMA-K mode, multiple dwordx4 loads may be consumed by a single MFMA instruction (e.g., f8_16x16x128 on gfx950).
Both modes are tuned to deliver optimal performance.
Adds a helper function get_k_warp_tile_for_preshuffle_b to compute the per-lane load size.
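
The two load modes described above can be sketched as a small compile-time calculation. This is only an illustration under stated assumptions, not the actual `get_k_warp_tile_for_preshuffle_b` from this PR: the names `mfma_k_per_lane`, `per_lane_k_elems` and `dwordx4_loads_per_lane` are hypothetical, and the sketch assumes a 64-lane wavefront whose lanes are split evenly across the N and K dimensions of the B operand.

```cpp
#include <cstddef>

// Hypothetical sketch; names and signatures are NOT taken from CK Tile.
constexpr std::size_t kDwordx4Bytes = 16; // bytes moved by one dwordx4 load

// K elements a single lane must hold for one MFMA, assuming the wave's lanes
// are divided evenly between the N and K dimensions of the B operand.
constexpr std::size_t mfma_k_per_lane(std::size_t mfma_n,
                                      std::size_t mfma_k,
                                      std::size_t wave_size = 64)
{
    return mfma_k / (wave_size / mfma_n);
}

// Per-lane load size along K, in elements: 16 bytes' worth of elements,
// or the MFMA's per-lane K when 16 bytes is not enough.
constexpr std::size_t per_lane_k_elems(std::size_t elem_bytes, std::size_t k_per_lane)
{
    return (kDwordx4Bytes / elem_bytes >= k_per_lane) ? kDwordx4Bytes / elem_bytes
                                                      : k_per_lane;
}

// Number of dwordx4 loads a lane issues to fetch k_elems elements.
constexpr std::size_t dwordx4_loads_per_lane(std::size_t elem_bytes, std::size_t k_elems)
{
    return k_elems * elem_bytes / kDwordx4Bytes;
}

// 16-byte mode: fp16 (2 bytes) with a 16x16x16 MFMA needs 4 elements per lane,
// so one dwordx4 load (8 elements) can feed two MFMA instructions.
static_assert(per_lane_k_elems(2, mfma_k_per_lane(16, 16)) == 8, "fp16 16x16x16");
static_assert(dwordx4_loads_per_lane(2, 8) == 1, "one load per lane");

// MFMA-K mode: f8 (1 byte) with a 16x16x128 MFMA needs 32 elements per lane,
// so two dwordx4 loads are consumed by a single MFMA instruction.
static_assert(per_lane_k_elems(1, mfma_k_per_lane(16, 128)) == 32, "f8 16x16x128");
static_assert(dwordx4_loads_per_lane(1, 32) == 2, "two loads per lane");
```

Under these assumptions, the choice between the two modes reduces to taking the larger of "elements in 16 bytes" and "elements one MFMA needs per lane". How the real helper turns that into `K_Warp_Tile` is not shown here.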

Copilot

Pull request overview

This PR refactors how K_Warp_Tile is chosen for preshuffle_b to better align per-lane global-memory load sizes with MFMA K requirements, aiming to improve preshuffle_b performance across architectures.

Changes:

- Added get_k_warp_tile_for_preshuffle_b and updated multiple configs (tests/examples) to use it for preshuffle-B kernels.
- Simplified B-shuffle host reference layouts and adjusted warp-lane factoring in tensor_shuffle_utils.hpp.
- Updated WP pipeline policy KB-per-load computation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle_util.hpp | Switches grouped preshuffle tests to the new `get_k_warp_tile_for_preshuffle_b` helper. |
| test/ck_tile/gemm_weight_preshuffle/test_gemm_pipeline_util.hpp | Updates test configs to compute `K_Warp_Tile` via helper logic and adds needed include. |
| test/ck_tile/gemm_multi_abd/test_gemm_multi_abd_util.hpp | Removes local `get_k_warp_tile` helper and relies on shared header. |
| test/ck_tile/gemm_block_scale/test_gemm_quant_fixtures.hpp | Removes duplicated `K_Warp_Tile` derivation helpers and centralizes on shared header. |
| test/ck_tile/gemm_block_scale/test_gemm_quant_base.hpp | Derives `K_Warp_Tile` dynamically based on whether preshuffle-B is enabled. |
| include/ck_tile/ops/gemm/pipeline/wp_pipeline_agmem_bgmem_creg_base_policy.hpp | Alters how KB-per-load is computed for the weight preshuffle pipeline policy. |
| include/ck_tile/ops/gemm/pipeline/tile_gemm_shape.hpp | Introduces `get_k_warp_tile_for_preshuffle_b`. |
| include/ck_tile/host/tensor_shuffle_utils.hpp | Updates host reference shuffling for B to use warp-lane factoring (k-lane-per-warp). |
| example/ck_tile/38_block_scale_gemm/gemm_utils.hpp | Updates example configs to use `get_k_warp_tile_for_preshuffle_b`. |
| example/ck_tile/17_grouped_gemm/quant_grouped_gemm_config.hpp | Updates grouped GEMM quant config to use new preshuffle-B K sizing. |
| example/ck_tile/17_grouped_gemm/grouped_gemm.hpp | Updates grouped GEMM preshuffle configs (incl. WMMA variant) to use new helper. |
| example/ck_tile/03_gemm/gemm_weight_preshuffle.cpp | Prints CLI help on argument-parse failure. |
| example/ck_tile/03_gemm/gemm_utils.hpp | Updates GEMM preshuffle configs to use `get_k_warp_tile_for_preshuffle_b`. |

CongMa13 and others added 4 commits January 22, 2026 12:36

[CK TILE] Add new function get_k_warp_tile_for_preshuffle_b

080fa14

[CK TILE] simplify function GetKBPerLoad

dc83e28

[CK TILE] Update get_k_warp_tile_for_preshuffle_b for MI350

109bfa1

[CK TILE] Apply get_k_warp_tile_for_preshuffle_b in examples and tests

bc91bb7

CongMa13 requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 24, 2026 00:01

CongMa13 requested a review from Copilot January 24, 2026 00:01

Copilot started reviewing on behalf of CongMa13 January 24, 2026 00:02

Copilot AI reviewed Jan 24, 2026

CongMa13 added 5 commits January 26, 2026 16:26

[CK TIEL] Fix a const type qualifier error

70bd8f8

[CK TIEL] Fix type error

89d4d51

[CK TILE] set proper K_Warp_Tile for quant gemm tests

6ba8427

[CK TILE] disable tests on gfx950

ed0eadb

Merge remote-tracking branch 'upstream/develop' into congma/ck_tile/p…

0fa0fdc

…reshuffle_b

CongMa13 force-pushed the congma/ck_tile/preshuffle_b branch from 3e91c74 to 0fa0fdc January 27, 2026 16:03
