[gfx1250] Add cluster launch support with TDM multicast bandwidth tests #750
Merged
…tests
Replace the broken #ifdef on the enum value hipLaunchAttributeClusterDimension in mgpuLaunchClusterKernel with a HIP_VERSION check, matching the CUDA-side CUDA_VERSION pattern. Add cluster launch tests (vec_add smoke + cluster_barrier) that exercise the hipDrvLaunchKernelEx runtime path end-to-end. Split the TDM multicast GEMM test into a separate file marked @pytest.mark.skip (JIT compilation hangs with cluster params; the fix is deferred to another PR).
Add test_tdm_mcast_add_gfx1250.py exercising TDM async DMA loads with cluster multicast masks and elementwise add. Includes correctness tests parametrized over cluster configs and a bandwidth comparison test.
Add bench_tdm_load_gfx1250.py with three modes:
- read-only: pure TDM HBM read bandwidth (no store)
- unique: TDM load + add + store with unique tiles per WG
- shared: GEMM-like shared tiles with cluster multicast throughput
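As a rough illustration of the arithmetic behind a bandwidth figure such as the read-only mode's, here is a minimal sketch. The helper names `readOnlyBytes` and `bandwidthGBps` are hypothetical, and the byte accounting (tiles × elements × element size, with 1 GB = 1e9 bytes) is an assumption; the benchmark script may count traffic differently.

```cpp
#include <cstdint>

// Hypothetical accounting for a read-only run: every tile is read once.
// This is an assumption of the sketch, not taken from the benchmark script.
constexpr std::uint64_t readOnlyBytes(std::uint64_t numTiles,
                                      std::uint64_t elemsPerTile,
                                      std::uint64_t bytesPerElem) {
  return numTiles * elemsPerTile * bytesPerElem;
}

// Effective bandwidth in GB/s, taking 1 GB as 1e9 bytes.
constexpr double bandwidthGBps(std::uint64_t bytesMoved, double seconds) {
  return static_cast<double>(bytesMoved) / seconds / 1e9;
}
```

For the unique and shared modes the numerator would also have to account for stores and for tiles delivered by multicast rather than read from HBM, which this sketch does not attempt to model.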
…tion for cluster launch
The HIP_VERSION >= 70200000 threshold could not be verified from public ROCm releases: hipLaunchAttributeClusterDimension is absent from the public ROCm 7.2 headers. Use check_cxx_source_compiles to detect the API at build time, matching the approach used by CK and Tensile.
…le-time check
Replace the CMake check_cxx_source_compiles / HIP_HAS_CLUSTER_LAUNCH
compile-time guard with Triton-style runtime detection via dlsym.
This ensures the same wheel/.so works across all HIP versions:
- dlsym(RTLD_DEFAULT, "hipDrvLaunchKernelEx") resolves the symbol at
runtime; result is cached in a static variable (single lookup).
- Hardcode hipLaunchAttributeClusterDimension == 4 and write cluster
dims through the pad[64] union field to avoid compile-time dependency
on HIP headers that define the enum/struct.
- Fall back to hipModuleLaunchKernel with a warning when the symbol is
unavailable and cluster > 1 is requested.
- Link ${CMAKE_DL_LIBS} for dlsym; remove check_cxx_source_compiles
and target_compile_definitions from CMakeLists.txt.
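The runtime-detection approach in this commit message can be sketched roughly as follows. This is not the code from FlyRocmRuntimeWrappers.cpp: `MirrorLaunchAttribute`, `resolveSymbol`, `clusterLaunchSymbol`, `chooseLaunchPath`, and `makeClusterAttr` are hypothetical names, the struct is a simplified stand-in for the HIP attribute type (whose real layout lives in the HIP headers), and taking the plain path when no cluster is requested is an assumption of the sketch. Only the symbol name, the hardcoded value 4, and the 64-byte `pad` field come from the text above.

```cpp
#include <dlfcn.h>   // dlsym, RTLD_DEFAULT (may need -ldl on older glibc)
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical, simplified mirror of a launch attribute: an id plus an
// opaque 64-byte payload. The real layout is defined by the HIP headers.
struct MirrorLaunchAttribute {
  unsigned id;
  union { char pad[64]; } val;
};

// Value the commit message hardcodes for hipLaunchAttributeClusterDimension.
constexpr unsigned kClusterDimensionAttrId = 4;

// Resolve a symbol from the libraries already loaded into the process.
inline void *resolveSymbol(const char *name) {
  return dlsym(RTLD_DEFAULT, name);
}

// Single cached lookup: a function-local static is initialised only once.
inline void *clusterLaunchSymbol() {
  static void *sym = resolveSymbol("hipDrvLaunchKernelEx");
  return sym;
}

enum class LaunchPath { ClusterEx, Plain };

// Choose the launch path from the resolved symbol and the cluster size.
// Assumption: the plain path is also used when no cluster is requested.
inline LaunchPath chooseLaunchPath(void *sym, unsigned clusterSize) {
  if (sym != nullptr && clusterSize > 1) return LaunchPath::ClusterEx;
  if (sym == nullptr && clusterSize > 1)
    std::fprintf(stderr, "warning: cluster launch unavailable, falling back\n");
  return LaunchPath::Plain;
}

// Write the three cluster dimensions into the opaque payload bytes, so no
// HIP header that declares the cluster-dimension member is needed.
inline MirrorLaunchAttribute makeClusterAttr(unsigned x, unsigned y, unsigned z) {
  assert(x >= 1 && y >= 1 && z >= 1);  // each dimension must be at least 1
  MirrorLaunchAttribute attr{};        // zero-initialise id and payload
  attr.id = kClusterDimensionAttrId;
  const unsigned dims[3] = {x, y, z};
  static_assert(sizeof(dims) <= sizeof(attr.val.pad), "payload too small");
  std::memcpy(attr.val.pad, dims, sizeof(dims));
  return attr;
}
```

Passing the resolved pointer into `chooseLaunchPath` rather than calling `clusterLaunchSymbol()` inside it keeps the decision logic testable on machines without a HIP runtime loaded.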
coderfeli
approved these changes
Jun 27, 2026
Due to a change in repo write permissions, the original PR was moved to this new one.
Motivation
gfx1250 (MI450) introduces cluster launch and TDM (Tensor Data Mover) hardware features for inter-workgroup async DMA and L2 multicast. The existing mgpuLaunchClusterKernel in the FlyDSL runtime uses #ifdef hipLaunchAttributeClusterDimension for conditional compilation, but this name is an enum value, not a preprocessor macro, and #ifdef only tests macros, which makes the guard unreliable. Additionally, there are no end-to-end tests or performance benchmarks for cluster launch and TDM multicast.
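The pitfall with the guard can be reproduced in isolation. The sketch below uses a hypothetical stand-in enum (`DemoLaunchAttributeID`, with an arbitrary enumerator value) rather than the HIP headers, and assumes no macro of the same name is defined anywhere:

```cpp
// Hypothetical stand-in for an enum declared in a vendor header.
enum DemoLaunchAttributeID { demoLaunchAttributeClusterDimension = 4 };

// The preprocessor runs before enums exist, so it only sees macros.
#ifdef demoLaunchAttributeClusterDimension
constexpr bool kGuardTaken = true;   // reached only if a #define of that name exists
#else
constexpr bool kGuardTaken = false;  // enumerators are invisible to #ifdef
#endif

// The enumerator is declared and usable, yet the guard was not taken.
static_assert(demoLaunchAttributeClusterDimension == 4, "enumerator is visible to the compiler");
static_assert(!kGuardTaken, "#ifdef does not see enumerators");
```

A guard of this shape silently compiles out the guarded code whenever the header exposes the name only as an enumerator, which is why a check that does not depend on the preprocessor seeing the name is needed.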
Technical Details
Runtime fix (lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp):
Cluster launch tests (tests/unit/test_cluster_launch_gfx1250.py):
TDM multicast correctness tests (tests/unit/test_tdm_mcast_add_gfx1250.py):
TDM bandwidth benchmark (tests/perf/bench_tdm_bandwidth_gfx1250.py):
throughput)
Cluster multicast GEMM (tests/unit/test_cluster_mcast_gemm_gfx1250.py):
Test Plan
Test Result
Tested on gfx1250 hardware:
Submission Checklist