[CK_TILE] Stream-K XCD remapping #3652

arai713 · 2026-01-26T18:32:17Z

Proposed changes

This PR adds support for XCD remapping, as detailed in this document. On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. We remap workgroup ids so that each XCD works on contiguous tiles, which improves locality and the cache hit rate. The remapping is done by a function, taken from this PR, that computes the contiguous mapping; we have added it to the StreamKTilePartitioner. This requires minimal changes to the Stream-K algorithm: the only change is a remap at the point where the workgroups are partitioned. Improving cache hits in this way should help close the performance gaps seen with the default scheduling. Unit tests have been added to verify the function in isolation. The optimization is not specific to Stream-K GEMM and can be applied to other GEMM kernels.

Note: This only applies to gfx942, since that architecture introduces XCDs.

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

This change adds a function that remaps block ids from their original round-robin assignment to a contiguous layout across XCDs. The function is added to the StreamKTilePartitioner and called in the operator() functions. Unit tests verify the correctness of the function on minimal arrays. These changes should improve locality and the cache hit rate, and therefore overall performance.
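For readers following the description, here is a minimal sketch of what such a remap can look like. It is not the code in this PR: the name `remap_xcd_sketch`, the `index_t` alias, and the assumption that block `b` is scheduled on XCD `b % num_xcds` are illustrative only. The parameter names `block_1d_id` and `total_num_tiles` follow the doc comment quoted in the review below.

```cpp
#include <cassert>
#include <cstdint>

using index_t = std::int32_t; // stand-in for ck_tile::index_t (assumption)

// Hypothetical sketch: map a round-robin-scheduled block id to a contiguous id.
// Assumes block b runs on XCD (b % num_xcds).
inline index_t remap_xcd_sketch(index_t block_1d_id,
                                index_t total_num_tiles,
                                index_t num_xcds = 8)
{
    assert(num_xcds > 0 && block_1d_id >= 0 && block_1d_id < total_num_tiles);

    // Largest number of blocks any single XCD receives.
    const index_t blocks_per_xcd = (total_num_tiles + num_xcds - 1) / num_xcds;

    // Number of XCDs that receive blocks_per_xcd blocks; the rest receive one fewer.
    const index_t rem       = total_num_tiles % num_xcds;
    const index_t tall_xcds = (rem == 0) ? num_xcds : rem;

    const index_t xcd      = block_1d_id % num_xcds; // XCD under round-robin
    const index_t local_id = block_1d_id / num_xcds; // position within that XCD

    if(xcd < tall_xcds)
        return xcd * blocks_per_xcd + local_id;

    // Skip past the "tall" XCDs, then past the shorter XCDs before this one.
    return tall_xcds * blocks_per_xcd + (xcd - tall_xcds) * (blocks_per_xcd - 1) + local_id;
}
```

With 16 blocks and 8 XCDs, blocks 0 and 8 (both on XCD 0 under round-robin) map to ids 0 and 1, so that XCD ends up working on adjacent tiles.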

ecamartins · 2026-01-27T23:28:56Z

include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_tile_partitioner.hpp

+     * @param NUM_XCDS          number of XCDs
+     * @return index_t  The id after XCD remap
+     */
+    CK_TILE_HOST_DEVICE index_t RemapXCD(index_t block_1d_id,


For the sake of keeping things consistent, can we use lower-case snake case for this function name?

I think there is a ticket to update all casing in this class, but for now, I think it might be best to keep the style the same within the file.

ecamartins · 2026-01-27T23:29:22Z

include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_tile_partitioner.hpp

+     *
+     * @param block_1d_id       grid 1D id
+     * @param total_num_tiles   size of the 1D grid
+     * @param NUM_XCDS          number of XCDs


Can we make this param lower-case snake case as well?

ecamartins · 2026-01-27T23:37:14Z

include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp

        index_t dp_ctas     = kargs.tile_partitioner.get_dp_ctas();
        bool is_dp_ctas     = block_idx < kargs.tile_partitioner.get_dp_ctas();

+        block_idx = kargs.tile_partitioner.RemapXCD(block_idx, grid_size);


Since this change should only apply to gfx942 and gfx950, we likely need some kind of logic to determine whether we want to apply the remap or not. Also, depending on the variant of gfx942 there may be fewer than 8 XCDs, so I think some additional logic will be required to determine what value for num_xcds we want to use (rather than solely relying on the default of 8).

We want to avoid using preprocessor macros, so depending on what is available in the HIP API, maybe we can query the number of XCDs from the device?

If not, we may need to consider other options. Perhaps we could consider some logic that follows a similar pattern to: include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_coherency.hpp.
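To make the suggestion concrete, here is a rough sketch of what runtime gating could look like. Everything in it is hypothetical: `XcdRemapChoice` and `choose_xcd_remap` are made-up names, and how `queried_num_xcds` would be obtained is left open because it depends on what the HIP API exposes. The architecture string is assumed to have the same shape as the `gcnArchName` field of `hipDeviceProp_t`.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical result type: whether to remap, and with how many XCDs.
struct XcdRemapChoice
{
    bool enabled;
    int num_xcds;
};

// Hypothetical helper. arch_name is expected to look like the gcnArchName
// string HIP reports (e.g. "gfx942:sramecc+:xnack-"). queried_num_xcds is
// whatever XCD count the runtime can provide; pass 0 when it is unknown.
inline XcdRemapChoice choose_xcd_remap(std::string_view arch_name, int queried_num_xcds)
{
    // Keep only the target name, dropping feature flags after the first ':'.
    const std::string_view target = arch_name.substr(0, arch_name.find(':'));

    const bool has_xcds = (target == "gfx942" || target == "gfx950");
    if(!has_xcds)
        return {false, 1}; // leave block ids untouched on other architectures

    // Assumption: fall back to 8 only when no count could be queried.
    const int num_xcds = (queried_num_xcds > 0) ? queried_num_xcds : 8;
    return {num_xcds > 1, num_xcds};
}
```

The choice is made at runtime on the host, with no preprocessor macros, and the resulting `num_xcds` could then be passed to the remap instead of relying on the default.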

ecamartins · 2026-01-27T23:37:53Z

include/ck_tile/ops/gemm/kernel/streamk_gemm/streamk_gemm_kernel.hpp

-        for(index_t tile_idx = block_idx; tile_idx < kargs.tile_partitioner.get_dp_tiles();
-            tile_idx += kargs.tile_partitioner.get_grid())
+        block_idx =
+            kargs.tile_partitioner.RemapXCD(block_idx, grid_size)


Same comment about how we shouldn't do this on all architectures. (See above)

ecamartins · 2026-01-27T23:38:58Z

test/ck_tile/gemm_streamk/test_streamk_tile_partitioner_common.hpp

    EXPECT_EQ(tile_local_cta_idx, expected_tile_local_cta_idx);
 }

+template <typename GemmShape>


Can we also add a test for when we don't use the default number of XCDs?
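As an illustration of the kind of check meant here, the sketch below verifies two properties for an arbitrary XCD count: the remap is a permutation of the block ids, and the blocks that round-robin scheduling places on one XCD land on one contiguous range. It uses a stand-in remap (`remap_for_test`, a hypothetical name) rather than the function in this PR, and plain `bool` results instead of the GoogleTest macros used in the test file.

```cpp
#include <cassert>
#include <vector>

// Stand-in remap, repeated here so the check is self-contained (hypothetical).
inline int remap_for_test(int block_id, int total, int num_xcds)
{
    const int per_xcd = (total + num_xcds - 1) / num_xcds;
    const int rem     = total % num_xcds;
    const int tall    = (rem == 0) ? num_xcds : rem;
    const int xcd     = block_id % num_xcds;
    const int local   = block_id / num_xcds;
    if(xcd < tall)
        return xcd * per_xcd + local;
    return tall * per_xcd + (xcd - tall) * (per_xcd - 1) + local;
}

// Returns true if every remapped id is unique and in [0, total), and the
// blocks belonging to each XCD map to consecutive ids.
inline bool remap_is_contiguous_permutation(int total, int num_xcds)
{
    std::vector<bool> seen(total, false);
    for(int xcd = 0; xcd < num_xcds; ++xcd)
    {
        int prev = -1;
        // Blocks xcd, xcd + num_xcds, ... are the ones round-robin puts on this XCD.
        for(int b = xcd; b < total; b += num_xcds)
        {
            const int r = remap_for_test(b, total, num_xcds);
            if(r < 0 || r >= total || seen[r])
                return false; // out of range or duplicate id
            if(prev != -1 && r != prev + 1)
                return false; // gap inside one XCD's range
            seen[r] = true;
            prev    = r;
        }
    }
    return true;
}
```

Calling it with counts such as 4 or 6, and with grid sizes that do not divide evenly, exercises the path where some XCDs receive one block fewer than others.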

arai713 self-assigned this Jan 26, 2026

ecamartins reviewed Jan 27, 2026
