[gfx1250] Add cluster launch support with TDM multicast bandwidth tests by jli-melchior · Pull Request #750 · ROCm/FlyDSL

jli-melchior · 2026-06-26T02:15:45Z

Due to a change in repo write permissions, the original PR has been moved to this new one.

Motivation

gfx1250 (MI450) introduces cluster launch and TDM (Tensor Data Mover) hardware features for inter-workgroup async DMA and L2 multicast. The existing mgpuLaunchClusterKernel in FlyDSL runtime uses #ifdef hipLaunchAttributeClusterDimension for conditional compilation, but this is an enum value, not a preprocessor macro, making the guard unreliable. Additionally, there are no end-to-end tests or performance benchmarks for cluster launch and TDM multicast.

Technical Details

Runtime fix (lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp):

Replace #ifdef hipLaunchAttributeClusterDimension with #if defined(HIP_VERSION) && (HIP_VERSION >= 70200000)
Simplify error handling: remove the cluster=(1,1,1) fallback logic — hipDrvLaunchKernelEx should work correctly on ROCm 7.2+
The #else branch for HIP < 7.2 retains the fallback to hipModuleLaunchKernel with no behavioral change
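To illustrate why the original guard was unreliable, here is a minimal sketch with hypothetical stand-in names (demoLaunchAttributeClusterDimension and DEMO_HIP_VERSION are not HIP identifiers). It shows that #ifdef cannot see an enumerator, while a version macro is visible to the preprocessor:

```cpp
// Hypothetical stand-in for a HIP header: the attribute ID is an enumerator,
// not a preprocessor macro.
enum DemoLaunchAttributeID { demoLaunchAttributeClusterDimension = 4 };

// #ifdef only tests preprocessor macros, so this branch is skipped even though
// the enumerator above is declared.
#ifdef demoLaunchAttributeClusterDimension
constexpr bool kIfdefSawEnumerator = true;
#else
constexpr bool kIfdefSawEnumerator = false;
#endif

// A version macro is a real preprocessor symbol, so a guard on it works.
// DEMO_HIP_VERSION is an illustrative value, not taken from any HIP header.
#define DEMO_HIP_VERSION 70200000
#if defined(DEMO_HIP_VERSION) && (DEMO_HIP_VERSION >= 70200000)
constexpr bool kVersionGuardTaken = true;
#else
constexpr bool kVersionGuardTaken = false;
#endif

static_assert(!kIfdefSawEnumerator, "#ifdef never sees an enumerator");
static_assert(kVersionGuardTaken, "version macro guard is evaluated as expected");
```

With the #ifdef form, the cluster launch branch would be compiled out on every HIP version, regardless of whether the header declares the enumerator.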

Cluster launch tests (tests/unit/test_cluster_launch_gfx1250.py):

vec_add smoke test: verifies hipDrvLaunchKernelEx + cluster dims end-to-end correctness
cluster_barrier test: verifies cluster_barrier() cross-WG synchronization

TDM multicast correctness tests (tests/unit/test_tdm_mcast_add_gfx1250.py):

TDM 2D async load + LDS add + buffer store, parametrized over cluster configs: (2,1), (1,2), (2,2)
Bandwidth comparison benchmark (@pytest.mark.benchmark)

TDM bandwidth benchmark (tests/perf/bench_tdm_bandwidth_gfx1250.py):

Three modes: read-only (pure TDM HBM read BW), unique (TDM load+add+store R/W BW), multicast (cluster multicast L2→LDS throughput)
L2 flush + CUDA event timing + IQR median, sweeping multiple grid and cluster configurations
Default --mode all runs all modes in sequence
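The PR does not spell out what "IQR median" means in the benchmark. One plausible reading, sketched here as an assumption, is to discard timing samples outside the Tukey fences (1.5 × IQR beyond the quartiles) and report the median of the rest. The helper names below are hypothetical and are not taken from bench_tdm_bandwidth_gfx1250.py:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Linearly interpolated quantile of a sorted, non-empty vector (q in [0, 1]).
inline double quantileSorted(const std::vector<double> &sorted, double q) {
  const double pos = q * static_cast<double>(sorted.size() - 1);
  const std::size_t lo = static_cast<std::size_t>(pos);
  const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
  return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Assumed outlier policy: keep samples within [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
// then return the median of the kept samples. Returns 0.0 for empty input.
inline double iqrFilteredMedian(std::vector<double> samplesMs) {
  if (samplesMs.empty()) return 0.0;
  std::sort(samplesMs.begin(), samplesMs.end());
  const double q1 = quantileSorted(samplesMs, 0.25);
  const double q3 = quantileSorted(samplesMs, 0.75);
  const double iqr = q3 - q1;
  std::vector<double> kept;  // stays sorted because the input is sorted
  for (double v : samplesMs) {
    if (v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr) kept.push_back(v);
  }
  return quantileSorted(kept, 0.5);
}

// Convert bytes moved in a given time (milliseconds) to TB/s, with 1 TB = 1e12 bytes.
inline double bandwidthTBps(double bytes, double milliseconds) {
  return bytes / (milliseconds * 1e-3) / 1e12;
}
```

For example, with samples {1.0, 1.0, 1.1, 0.9, 1.0, 50.0} ms the 50 ms outlier falls outside the fences and the reported time is the median of the remaining five samples.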

Cluster multicast GEMM (tests/unit/test_cluster_mcast_gemm_gfx1250.py):

WMMA GEMM + TDM multicast prototype, currently @pytest.mark.skip (JIT compilation hangs with cluster params, deferred to a follow-up PR)

Test Plan

python -m pytest tests/unit/test_cluster_launch_gfx1250.py -v --tb=short — cluster launch smoke + barrier
python -m pytest tests/unit/test_tdm_mcast_add_gfx1250.py -v --tb=short — TDM multicast correctness
python tests/perf/bench_tdm_bandwidth_gfx1250.py — three-mode bandwidth benchmark
Verify mgpuLaunchClusterKernel does not affect existing kernel launches on non-gfx1250 architectures

Test Result

Tested on gfx1250 hardware:

Cluster launch: vec_add and barrier tests pass
TDM multicast: correctness verified for (2,1), (1,2), (2,2) cluster configs
Bandwidth benchmark: read-only mode reaches ~19.9 TB/s at 256x256 grid (90.3% of 22 TB/s peak); unique mode reaches ~16 TB/s (72.8%)

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…tests Replace the broken #ifdef on the enum hipLaunchAttributeClusterDimension in mgpuLaunchClusterKernel with a HIP_VERSION check, matching the CUDA-side CUDA_VERSION pattern. Add cluster launch tests (vec_add smoke + cluster_barrier) that exercise the hipDrvLaunchKernelEx runtime path end-to-end. TDM multicast GEMM test split to a separate file with @pytest.mark.skip (JIT compilation hangs with cluster params, deferred to another PR).

Add test_tdm_mcast_add_gfx1250.py exercising TDM async DMA loads with cluster multicast masks and elementwise add. Includes correctness tests parametrized over cluster configs and a bandwidth comparison test. Add bench_tdm_load_gfx1250.py with three modes:
- read-only: pure TDM HBM read bandwidth (no store)
- unique: TDM load + add + store with unique tiles per WG
- shared: GEMM-like shared tiles with cluster multicast throughput

…tion for cluster launch The HIP_VERSION >= 70200000 threshold could not be verified from public ROCm releases — hipLaunchAttributeClusterDimension is absent from the public ROCm 7.2 headers. Use check_cxx_source_compiles to detect the API at build time, matching the approach used by CK and Tensile.

…le-time check Replace the CMake check_cxx_source_compiles / HIP_HAS_CLUSTER_LAUNCH compile-time guard with Triton-style runtime detection via dlsym. This ensures the same wheel/.so works across all HIP versions:
- dlsym(RTLD_DEFAULT, "hipDrvLaunchKernelEx") resolves the symbol at runtime; result is cached in a static variable (single lookup).
- Hardcode hipLaunchAttributeClusterDimension == 4 and write cluster dims through the pad[64] union field to avoid compile-time dependency on HIP headers that define the enum/struct.
- Fall back to hipModuleLaunchKernel with a warning when the symbol is unavailable and cluster > 1 is requested.
- Link ${CMAKE_DL_LIBS} for dlsym; remove check_cxx_source_compiles and target_compile_definitions from CMakeLists.txt.
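The runtime detection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in FlyRocmRuntimeWrappers.cpp: lookupSymbol, clusterLaunchEntry, LaunchPath, and chooseLaunchPath are hypothetical names, and the choice to take the cluster path whenever the symbol resolves is an assumption. Only the dlsym call on "hipDrvLaunchKernelEx", the static cache, and the warn-and-fall-back behavior come from the commit message.

```cpp
#include <cstdio>
#include <dlfcn.h>  // dlsym, RTLD_DEFAULT (g++ on glibc defines _GNU_SOURCE by default)

// Returns the address of `name` if any already-loaded library exports it,
// or nullptr otherwise.
inline void *lookupSymbol(const char *name) { return dlsym(RTLD_DEFAULT, name); }

// Hypothetical helper: resolve the cluster launch entry point once and cache it.
// Function-local statics are initialized exactly once (thread-safe since C++11).
inline void *clusterLaunchEntry() {
  static void *cached = lookupSymbol("hipDrvLaunchKernelEx");
  return cached;
}

enum class LaunchPath { ClusterEx, PlainFallback };

// Assumed policy: use the extended launch when the symbol resolved; otherwise
// fall back to the plain launch and warn if a cluster larger than 1 was requested.
inline LaunchPath chooseLaunchPath(void *entry, unsigned cx, unsigned cy, unsigned cz) {
  if (entry != nullptr) return LaunchPath::ClusterEx;
  if (cx * cy * cz > 1) {
    std::fprintf(stderr,
                 "warning: hipDrvLaunchKernelEx unavailable; ignoring cluster dims "
                 "(%u,%u,%u)\n", cx, cy, cz);
  }
  return LaunchPath::PlainFallback;
}
```

A caller would evaluate chooseLaunchPath(clusterLaunchEntry(), cx, cy, cz) before dispatch. On older glibc the binary must link libdl, which is what ${CMAKE_DL_LIBS} provides.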

jli-melchior and others added 5 commits June 17, 2026 04:00

Merge branch 'main' into feat/gfx1250-cluster-launch

e6295b0

jli-melchior mentioned this pull request Jun 26, 2026

[gfx1250] Add cluster launch support with TDM multicast bandwidth tests #699

Open

jli-melchior requested review from aoli26, coderfeli and sjfeng1999 and removed request for coderfeli June 26, 2026 02:22

coderfeli approved these changes Jun 27, 2026


coderfeli merged commit 5cb28f6 into ROCm:main Jun 27, 2026
9 checks passed

jli-melchior self-assigned this Jun 30, 2026
