[None][feat] Stack PRs for sweep perfing #13205

Draft
Wanli-Jiang wants to merge 34 commits into NVIDIA:main from Wanli-Jiang:user/williamj/stack-pr-v2

Conversation

@Wanli-Jiang
Collaborator

Don't review, don't merge

Stack PRs

#13064 flashinfer upgrade
#12731 Using SSD/SSU kernel from flashinfer
#13032 Optimize nemotron-h from python level
#13103 Optimize causal_conv1d prefill and decode kernels
#12884 Add CUTEDSL moe backend for nemotron-h

#12246 Refactor the routing part in trtllmgen
#13186 Update the deepseek routing

#13151 Fix mamba slot leak for aggregate mode

yihwang-nv and others added 30 commits April 19, 2026 21:22
Bump flashinfer-python dependency to 0.6.8rc1.
Also update nvidia-cutlass-dsl from 4.3.4 to 4.4.2 (flashinfer 0.6.8rc1 requires nvidia-cutlass-dsl >= 4.4.2).
Updated version pins in requirements.txt, security_scanning/pyproject.toml,
security_scanning/poetry.lock, and ATTRIBUTIONS-Python.md.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Force-reinstall nvidia-cutlass-dsl and nvidia-cutlass-dsl-libs-base in
the Docker build to replace the stale 4.3.5 from the base image with
4.4.2. Add nvidia-cutlass-dsl>=4.4.2 to constraints.txt.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Point current_image_tags.properties to the CI tritondevel images
built from PR NVIDIA#13064 (flashinfer + nvidia-cutlass-dsl upgrade).

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The DLFW base image (pytorch:26.02-py3) ships nvidia-cutlass-dsl 4.3.5.
When pip upgrades to 4.4.2 in-place, it corrupts shared namespace dirs.
Add explicit uninstall + rm -rf cleanup before tensorrt_llm wheel install.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
This reverts commit 0dd321a59fc067e5fd3124f1fb5c6b8aba1d7ad3.
pip installs dependency packages (nvidia-cutlass-dsl-libs-base) before
uninstalling the old meta-wheel (nvidia-cutlass-dsl).  Since both write
to the same nvidia_cutlass_dsl/ directory, the uninstall step removes
files that the deps just installed, breaking the package.

Add scripts/clean_site_packages.py that uninstalls known problematic
packages and removes leftover site-packages fragments before install.
Call it from test_pip_install.py before both wheel and editable installs.

This avoids Docker image changes — the cleanup runs at CI test time.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…k Test List

The Check Test List CI stage runs pip install before calling
clean_site_packages.py, causing nvidia-cutlass-dsl namespace corruption
that breaks tensorrt_llm imports during pytest collection.

- Add clean_site_packages.py call before pip install in launchTestListCheck
- Guard cute_dsl_custom_ops imports with IS_CUTLASS_DSL_AVAILABLE in
  fused_moe_cute_dsl.py to prevent ImportError when cutlass-dsl is unavailable

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
- Revert IS_CUTLASS_DSL_AVAILABLE guard in fused_moe_cute_dsl.py
- Add clean_site_packages.py call in slurm_install.sh before pip install
  to prevent nvidia-cutlass-dsl namespace corruption on Multi-GPU jobs
- Update CI image tags to PR-13064 staging images with flashinfer 0.6.8rc1
  and nvidia-cutlass-dsl 4.4.2

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The new CI Docker images ship with a clean nvidia-cutlass-dsl install,
making the pre-install cleanup script unnecessary. The DLFW image path
already has its own inline cleanup in L0_Test.groovy.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…ass-dsl 4.4.2

The three blockscaled cute_dsl kernels each monkey-patched
cutlass.utils.PersistentTileSchedulerParams.__init__ and
StaticPersistentTileScheduler._get_cluster_work_idx_with_fastdivmod
with a TODO to remove once nvidia-cutlass-dsl 4.4 was released.

cutlass-dsl 4.4.2 now natively provides public raster_along_m,
cluster_shape_major_fdd/cluster_shape_minor_fdd, and the fastdivmod
work-idx helper. The stale hook stored _raster_along_m (private) and
cluster_shape_m_fdd/_n_fdd, which the library's own
__extract_mlir_values__ does not know about, producing:

  AttributeError: 'PersistentTileSchedulerParams' object has no
  attribute 'raster_along_m'. Did you mean: '_raster_along_m'?

on all CUTEDSL/DENSEGEMM MoE tests on B200.

Deletes all three hook blocks; every private-name reference was
confined to the hook itself.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
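The failure mode this commit removes can be reproduced in miniature: a stale monkey-patch stores a private attribute while the library's own code now reads the public name. The class and attribute names below are stand-ins for the real cutlass-dsl classes, kept only to mirror the shape of the bug.

```python
# Illustrative reproduction of the stale-hook bug: the 4.4.2 library sets
# and reads the public `raster_along_m`, but the old monkey-patched
# __init__ only stores the private `_raster_along_m`.
class PersistentTileSchedulerParams:      # stand-in for the cutlass class
    def __init__(self, raster_along_m):
        self.raster_along_m = raster_along_m   # native 4.4.2 behavior


def stale_hook_init(self, raster_along_m):
    self._raster_along_m = raster_along_m      # old private name only


params = PersistentTileSchedulerParams(True)
assert params.raster_along_m is True           # unpatched: works

# With the stale hook applied, the public attribute the library reads
# (e.g. from __extract_mlir_values__) is missing -> AttributeError.
PersistentTileSchedulerParams.__init__ = stale_hook_init
broken = PersistentTileSchedulerParams(True)
assert not hasattr(broken, "raster_along_m")
```

Deleting the hooks outright is safe here because, per the commit, every private-name reference was confined to the hooks themselves.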
…r cutlass-dsl 4.4.2

nvidia-cutlass-dsl 4.4.2 split _make_sync_object across sm90/sm100.
PipelineAsync._make_sync_object (sm90) now rejects Blackwell ops like
TCGen05Mma and ClcLoad, tripping "Invalid PipelineOp specified" in
every custom pipeline defined here (PipelineTmaUmma, PipelineUmmaAsync,
PipelineCpAsyncUmma), which all pair a producer/consumer with
TCGen05Mma.

The expanded factory lives on the sm100 PipelineTmaUmma. Alias it and
route the six call sites through the alias so both pre-4.4 and 4.4.2
versions of cutlass-dsl work.

Verified on B300 with cutlass-dsl 4.4.2:
test_nvfp4_gather_grouped_gemm_swiglu_blackwell[1024-1-1-256] PASSED.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…lass-dsl 4.4.2

PipelineAsync.producer_tail is decorated with @dsl_user_op in
cutlass-dsl 4.4.2 and forwards loc/ip kwargs to producer_acquire. The
overrides in custom_pipeline.py did not accept these, raising
DSLRuntimeError in test_fp4_linear_cute_dsl. Add loc=None, ip=None
keyword-only parameters to producer_acquire, producer_commit,
consumer_release, and producer_tail across PipelineTmaUmma,
PipelineUmmaAsync, and PipelineCpAsyncUmma, and thread them through to
the inner sync_object and cute.arch calls.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
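The signature change can be illustrated with a minimal sketch (names are stand-ins for the custom_pipeline.py classes): the overrides gain keyword-only loc/ip parameters and thread them to the inner calls, so a decorated caller that forwards loc/ip no longer raises.

```python
# Sketch of the loc/ip threading fix: producer_tail forwards loc/ip
# kwargs (as the @dsl_user_op-decorated 4.4.2 version does), so every
# override in the chain must accept and pass them along.
class SyncObject:                     # stand-in for the inner sync object
    def acquire(self, state, *, loc=None, ip=None):
        return ("acquired", state, loc, ip)


class PipelineTmaUmma:                # stand-in for the custom pipeline
    def __init__(self):
        self.sync_object = SyncObject()

    def producer_acquire(self, state, *, loc=None, ip=None):
        # Before the fix this signature lacked loc/ip, so the kwargs
        # forwarded by producer_tail raised at call time.
        return self.sync_object.acquire(state, loc=loc, ip=ip)

    def producer_tail(self, state, *, loc=None, ip=None):
        return self.producer_acquire(state, loc=loc, ip=ip)
```

Making the parameters keyword-only matches how the decorator forwards them and keeps positional call sites unchanged.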
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
May need to run pip install nvidia-cutlass-dsl-libs-base==4.4.2 --force-reinstall --no-deps in the Docker image.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Add NemotronHPuzzleConfig.
* Extend routing kernel for MoE.
* Extend workspace allocation.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Remove restricted condition in mamba2_mixer.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Enable more c++ routing combinations.
* Update mamba tensor operations.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Decode:
- Add fast-path kernel for seqlen=1 with zero loops and fast math
- Increase thread block from 64 to 128 to halve block count
- Compile-time specialize conv_state_indices and silu branches

Prefill:
- Use 128 threads for varlen with long sequences
- Enable VecLoad for varlen BS=1 (seq_start=0 is always aligned)
- Move conv_state save before main loop, removing 80 lines of
  complex cross-chunk state extraction from smem_exchange
- Compile-time specialize conv_state_indices and silu branches

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
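The "compile-time specialize" idea in this commit (done with CUDA templates in the real kernels) can be shown in Python terms: instead of branching on conv_state_indices and silu inside the hot loop, build one variant per combination up front and dispatch once. The kernel bodies below are trivial stand-ins, not the actual conv1d math.

```python
# Python-level analogue of compile-time branch specialization: one
# pre-built variant per (use_indices, use_silu) combination, selected
# once per launch rather than branched per element.
import math


def _make_variant(use_indices: bool, use_silu: bool):
    def kernel(x: float) -> float:
        # Stand-in for the gather-by-indices path.
        y = x + (1.0 if use_indices else 0.0)
        if use_silu:
            # SiLU: y * sigmoid(y) == y / (1 + exp(-y)).
            y = y / (1.0 + math.exp(-y))
        return y
    return kernel


# Built once; the flags never reach the inner loop.
_VARIANTS = {
    (i, s): _make_variant(i, s) for i in (False, True) for s in (False, True)
}


def conv1d_update(x: float, use_indices: bool, use_silu: bool) -> float:
    return _VARIANTS[(use_indices, use_silu)](x)
```

In CUDA the same shape is typically a `template <bool HasIndices, bool UseSilu>` kernel with a small dispatch switch on the host side, letting the compiler delete the dead branches.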
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
The gather grouped-GEMM kernel now supports both SwiGLU (is_gated=True)
and Relu2 (is_gated=False), so the "swiglu" naming is inaccurate. Rename
the kernel file, Runner class, and torch ops to use "act_fusion", and
generalize the top-level/class/forward docstrings to describe both
activations. SwiGLU-specific code comments inside the is_gated=True
branch are kept verbatim.

No behavioral change: the non-gather and dense-FC1 SwiGLU kernels (which
do not have an is_gated path) are untouched.

Renames:
- blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
    -> blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
- Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner
    -> Sm100BlockScaledContiguousGatherGroupedGemmActFusionRunner
- trtllm::cute_dsl_nvfp4_gather_grouped_gemm_swiglu_blackwell[_multi_b]
    -> trtllm::cute_dsl_nvfp4_gather_grouped_gemm_act_fusion_blackwell[_multi_b]

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
ChristinaZ and others added 3 commits April 20, 2026 00:08
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
- Filter all per-draft-len padding dummy IDs (CUDA_GRAPH_DUMMY_REQUEST_ID
  - runtime_draft_len), not just the sentinel, so the dummy never takes
  a permanent Mamba slot.

- In update_mamba_states, slice self.state_indices by the real generation
  count; in get_state_indices, allocate padding slots only from
  mamba_cache_free_blocks. Prevents writes from aliasing slots owned by
  live requests outside the current batch.

- Release Phase-1 CUDA-graph pools before the final KV cache allocation.

Adds regression tests for the slot-borrowing paths.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
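The first bullet's filtering rule can be sketched as follows. The sentinel value and draft-length bound here are illustrative placeholders, not the real tensorrt_llm constants; the point is the range check: any ID of the form sentinel minus runtime_draft_len is padding and must never claim a permanent Mamba slot.

```python
# Hedged sketch of per-draft-len dummy-ID filtering. The real ID scheme
# lives in tensorrt_llm; the sentinel and bound below are stand-ins.
CUDA_GRAPH_DUMMY_REQUEST_ID = 1 << 62   # illustrative sentinel value
MAX_DRAFT_LEN = 8                       # illustrative upper bound


def is_dummy_request(request_id: int, max_draft_len: int = MAX_DRAFT_LEN) -> bool:
    """True for the sentinel and every per-draft-len padding ID
    (sentinel - runtime_draft_len), not just the sentinel itself."""
    low = CUDA_GRAPH_DUMMY_REQUEST_ID - max_draft_len
    return low <= request_id <= CUDA_GRAPH_DUMMY_REQUEST_ID


def filter_real_requests(request_ids):
    """Keep only real requests when assigning permanent Mamba slots."""
    return [r for r in request_ids if not is_dummy_request(r)]
```

Filtering the whole range rather than equality-testing the sentinel is what closes the leak: padding requests for every runtime draft length are excluded, so none of them can alias a slot owned by a live request.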
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
@Wanli-Jiang Wanli-Jiang force-pushed the user/williamj/stack-pr-v2 branch from da7a28c to 7b2fd39 Compare April 21, 2026 06:24