[None][feat] Stack PRs for sweep perfing #13205

Draft
Wanli-Jiang wants to merge 34 commits into NVIDIA:main from Wanli-Jiang:user/williamj/stack-pr-v2

Conversation

@Wanli-Jiang
Collaborator

Don't review, don't merge

Stack PRs

#13064 flashinfer upgrade
#12731 Using SSD/SSU kernel from flashinfer
#13032 Optimize nemotron-h from python level
#13103 Optimize causal_conv1d prefill and decode kernels
#12884 Add CUTEDSL moe backend for nemotron-h

#12246 Refactor the routing part in trtllmgen
#13186 Update the deepseek routing

#13151 Fix mamba slot leak for aggregate mode

yihwang-nv and others added 30 commits April 19, 2026 21:22
Bump flashinfer-python dependency to 0.6.8rc1.
Also update nvidia-cutlass-dsl from 4.3.4 to 4.4.2 (flashinfer 0.6.8rc1 requires nvidia-cutlass-dsl >= 4.4.2).
Updated version pins in requirements.txt, security_scanning/pyproject.toml,
security_scanning/poetry.lock, and ATTRIBUTIONS-Python.md.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Force-reinstall nvidia-cutlass-dsl and nvidia-cutlass-dsl-libs-base in
the Docker build to replace the stale 4.3.5 from the base image with
4.4.2. Add nvidia-cutlass-dsl>=4.4.2 to constraints.txt.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Point current_image_tags.properties to the CI tritondevel images
built from PR NVIDIA#13064 (flashinfer + nvidia-cutlass-dsl upgrade).

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The DLFW base image (pytorch:26.02-py3) ships nvidia-cutlass-dsl 4.3.5.
When pip upgrades to 4.4.2 in-place, it corrupts shared namespace dirs.
Add explicit uninstall + rm -rf cleanup before tensorrt_llm wheel install.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
This reverts commit 0dd321a59fc067e5fd3124f1fb5c6b8aba1d7ad3.
pip installs dependency packages (nvidia-cutlass-dsl-libs-base) before
uninstalling the old meta-wheel (nvidia-cutlass-dsl).  Since both write
to the same nvidia_cutlass_dsl/ directory, the uninstall step removes
files that the deps just installed, breaking the package.

Add scripts/clean_site_packages.py that uninstalls known problematic
packages and removes leftover site-packages fragments before install.
Call it from test_pip_install.py before both wheel and editable installs.

This avoids Docker image changes — the cleanup runs at CI test time.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…k Test List

The Check Test List CI stage runs pip install before calling
clean_site_packages.py, causing nvidia-cutlass-dsl namespace corruption
that breaks tensorrt_llm imports during pytest collection.

- Add clean_site_packages.py call before pip install in launchTestListCheck
- Guard cute_dsl_custom_ops imports with IS_CUTLASS_DSL_AVAILABLE in
  fused_moe_cute_dsl.py to prevent ImportError when cutlass-dsl is unavailable

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
- Revert IS_CUTLASS_DSL_AVAILABLE guard in fused_moe_cute_dsl.py
- Add clean_site_packages.py call in slurm_install.sh before pip install
  to prevent nvidia-cutlass-dsl namespace corruption on Multi-GPU jobs
- Update CI image tags to PR-13064 staging images with flashinfer 0.6.8rc1
  and nvidia-cutlass-dsl 4.4.2

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The new CI Docker images ship with a clean nvidia-cutlass-dsl install,
making the pre-install cleanup script unnecessary. The DLFW image path
already has its own inline cleanup in L0_Test.groovy.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…ass-dsl 4.4.2

The three blockscaled cute_dsl kernels each monkey-patched
cutlass.utils.PersistentTileSchedulerParams.__init__ and
StaticPersistentTileScheduler._get_cluster_work_idx_with_fastdivmod
with a TODO to remove once nvidia-cutlass-dsl 4.4 was released.

cutlass-dsl 4.4.2 now natively provides public raster_along_m,
cluster_shape_major_fdd/cluster_shape_minor_fdd, and the fastdivmod
work-idx helper. The stale hook stored _raster_along_m (private) and
cluster_shape_m_fdd/_n_fdd, which the library's own
__extract_mlir_values__ does not know about, producing:

  AttributeError: 'PersistentTileSchedulerParams' object has no
  attribute 'raster_along_m'. Did you mean: '_raster_along_m'?

on all CUTEDSL/DENSEGEMM MoE tests on B200.

Deletes all three hook blocks; every private-name reference was
confined to the hook itself.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
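The failure mode this commit removes can be reproduced in miniature: a stale monkey-patch stores a private attribute while the library's own code now reads the public name. The class and attribute names below are stand-ins for the real cutlass-dsl classes, kept only to mirror the shape of the bug.

```python
# Illustrative reproduction of the stale-hook bug: the 4.4.2 library sets
# and reads the public `raster_along_m`, but the old monkey-patched
# __init__ only stores the private `_raster_along_m`.
class PersistentTileSchedulerParams:      # stand-in for the cutlass class
    def __init__(self, raster_along_m):
        self.raster_along_m = raster_along_m   # native 4.4.2 behavior


def stale_hook_init(self, raster_along_m):
    self._raster_along_m = raster_along_m      # old private name only


params = PersistentTileSchedulerParams(True)
assert params.raster_along_m is True           # unpatched: works

# With the stale hook applied, the public attribute the library reads
# (e.g. from __extract_mlir_values__) is missing -> AttributeError.
PersistentTileSchedulerParams.__init__ = stale_hook_init
broken = PersistentTileSchedulerParams(True)
assert not hasattr(broken, "raster_along_m")
```

Deleting the hooks outright is safe here because, per the commit, every private-name reference was confined to the hooks themselves.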
…r cutlass-dsl 4.4.2

nvidia-cutlass-dsl 4.4.2 split _make_sync_object across sm90/sm100.
PipelineAsync._make_sync_object (sm90) now rejects Blackwell ops like
TCGen05Mma and ClcLoad, tripping "Invalid PipelineOp specified" in
every custom pipeline defined here (PipelineTmaUmma, PipelineUmmaAsync,
PipelineCpAsyncUmma), which all pair a producer/consumer with
TCGen05Mma.

The expanded factory lives on the sm100 PipelineTmaUmma. Alias it and
route the six call sites through the alias so both pre-4.4 and 4.4.2
versions of cutlass-dsl work.

Verified on B300 with cutlass-dsl 4.4.2:
test_nvfp4_gather_grouped_gemm_swiglu_blackwell[1024-1-1-256] PASSED.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…lass-dsl 4.4.2

PipelineAsync.producer_tail is decorated with @dsl_user_op in
cutlass-dsl 4.4.2 and forwards loc/ip kwargs to producer_acquire. The
overrides in custom_pipeline.py did not accept these, raising
DSLRuntimeError in test_fp4_linear_cute_dsl. Add loc=None, ip=None
keyword-only parameters to producer_acquire, producer_commit,
consumer_release, and producer_tail across PipelineTmaUmma,
PipelineUmmaAsync, and PipelineCpAsyncUmma, and thread them through to
the inner sync_object and cute.arch calls.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
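The signature change can be illustrated with a minimal sketch (names are stand-ins for the custom_pipeline.py classes): the overrides gain keyword-only loc/ip parameters and thread them to the inner calls, so a decorated caller that forwards loc/ip no longer raises.

```python
# Sketch of the loc/ip threading fix: producer_tail forwards loc/ip
# kwargs (as the @dsl_user_op-decorated 4.4.2 version does), so every
# override in the chain must accept and pass them along.
class SyncObject:                     # stand-in for the inner sync object
    def acquire(self, state, *, loc=None, ip=None):
        return ("acquired", state, loc, ip)


class PipelineTmaUmma:                # stand-in for the custom pipeline
    def __init__(self):
        self.sync_object = SyncObject()

    def producer_acquire(self, state, *, loc=None, ip=None):
        # Before the fix this signature lacked loc/ip, so the kwargs
        # forwarded by producer_tail raised at call time.
        return self.sync_object.acquire(state, loc=loc, ip=ip)

    def producer_tail(self, state, *, loc=None, ip=None):
        return self.producer_acquire(state, loc=loc, ip=ip)
```

Making the parameters keyword-only matches how the decorator forwards them and keeps positional call sites unchanged.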
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
May need to run pip install nvidia-cutlass-dsl-libs-base==4.4.2 --force-reinstall --no-deps in the Docker image.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Add NemotronHPuzzleConfig.
* Extend routing kernel for MoE.
* Extend workspace allocation.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Remove restricted condition in mamba2_mixer.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Enable more c++ routing combinations.
* Update mamba tensor operations.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Decode:
- Add fast-path kernel for seqlen=1 with zero loops and fast math
- Increase thread block from 64 to 128 to halve block count
- Compile-time specialize conv_state_indices and silu branches

Prefill:
- Use 128 threads for varlen with long sequences
- Enable VecLoad for varlen BS=1 (seq_start=0 is always aligned)
- Move conv_state save before main loop, removing 80 lines of
  complex cross-chunk state extraction from smem_exchange
- Compile-time specialize conv_state_indices and silu branches

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
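The "compile-time specialize" idea in this commit (done with CUDA templates in the real kernels) can be shown in Python terms: instead of branching on conv_state_indices and silu inside the hot loop, build one variant per combination up front and dispatch once. The kernel bodies below are trivial stand-ins, not the actual conv1d math.

```python
# Python-level analogue of compile-time branch specialization: one
# pre-built variant per (use_indices, use_silu) combination, selected
# once per launch rather than branched per element.
import math


def _make_variant(use_indices: bool, use_silu: bool):
    def kernel(x: float) -> float:
        # Stand-in for the gather-by-indices path.
        y = x + (1.0 if use_indices else 0.0)
        if use_silu:
            # SiLU: y * sigmoid(y) == y / (1 + exp(-y)).
            y = y / (1.0 + math.exp(-y))
        return y
    return kernel


# Built once; the flags never reach the inner loop.
_VARIANTS = {
    (i, s): _make_variant(i, s) for i in (False, True) for s in (False, True)
}


def conv1d_update(x: float, use_indices: bool, use_silu: bool) -> float:
    return _VARIANTS[(use_indices, use_silu)](x)
```

In CUDA the same shape is typically a `template <bool HasIndices, bool UseSilu>` kernel with a small dispatch switch on the host side, letting the compiler delete the dead branches.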
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
The gather grouped-GEMM kernel now supports both SwiGLU (is_gated=True)
and Relu2 (is_gated=False), so the "swiglu" naming is inaccurate. Rename
the kernel file, Runner class, and torch ops to use "act_fusion", and
generalize the top-level/class/forward docstrings to describe both
activations. SwiGLU-specific code comments inside the is_gated=True
branch are kept verbatim.

No behavioral change: the non-gather and dense-FC1 SwiGLU kernels (which
do not have an is_gated path) are untouched.

Renames:
- blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
    -> blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
- Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner
    -> Sm100BlockScaledContiguousGatherGroupedGemmActFusionRunner
- trtllm::cute_dsl_nvfp4_gather_grouped_gemm_swiglu_blackwell[_multi_b]
    -> trtllm::cute_dsl_nvfp4_gather_grouped_gemm_act_fusion_blackwell[_multi_b]

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
ChristinaZ and others added 3 commits April 20, 2026 00:08
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
- Filter all per-draft-len padding dummy IDs (CUDA_GRAPH_DUMMY_REQUEST_ID
  - runtime_draft_len), not just the sentinel, so the dummy never takes
  a permanent Mamba slot.

- In update_mamba_states, slice self.state_indices by the real generation
  count; in get_state_indices, allocate padding slots only from
  mamba_cache_free_blocks. Prevents writes from aliasing slots owned by
  live requests outside the current batch.

- Release Phase-1 CUDA-graph pools before the final KV cache allocation.

Adds regression tests for the slot-borrowing paths.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
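The first bullet's filtering rule can be sketched as follows. The sentinel value and draft-length bound here are illustrative placeholders, not the real tensorrt_llm constants; the point is the range check: any ID of the form sentinel minus runtime_draft_len is padding and must never claim a permanent Mamba slot.

```python
# Hedged sketch of per-draft-len dummy-ID filtering. The real ID scheme
# lives in tensorrt_llm; the sentinel and bound below are stand-ins.
CUDA_GRAPH_DUMMY_REQUEST_ID = 1 << 62   # illustrative sentinel value
MAX_DRAFT_LEN = 8                       # illustrative upper bound


def is_dummy_request(request_id: int, max_draft_len: int = MAX_DRAFT_LEN) -> bool:
    """True for the sentinel and every per-draft-len padding ID
    (sentinel - runtime_draft_len), not just the sentinel itself."""
    low = CUDA_GRAPH_DUMMY_REQUEST_ID - max_draft_len
    return low <= request_id <= CUDA_GRAPH_DUMMY_REQUEST_ID


def filter_real_requests(request_ids):
    """Keep only real requests when assigning permanent Mamba slots."""
    return [r for r in request_ids if not is_dummy_request(r)]
```

Filtering the whole range rather than equality-testing the sentinel is what closes the leak: padding requests for every runtime draft length are excluded, so none of them can alias a slot owned by a live request.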
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/stack-pr-v2 branch from da7a28c to 7b2fd39 Compare April 21, 2026 06:24