[None][feat] Stack PRs for sweep perfing #13205
Draft
Wanli-Jiang wants to merge 34 commits into NVIDIA:main
Conversation
Bump flashinfer-python dependency to 0.6.8rc1. Also update nvidia-cutlass-dsl from 4.3.4 to 4.4.2 (flashinfer requires nvidia-cutlass-dsl >=4.4.2). Updated version pins in requirements.txt, security_scanning/pyproject.toml, security_scanning/poetry.lock, and ATTRIBUTIONS-Python.md. Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Force-reinstall nvidia-cutlass-dsl and nvidia-cutlass-dsl-libs-base in the Docker build to replace the stale 4.3.5 from the base image with 4.4.2. Add nvidia-cutlass-dsl>=4.4.2 to constraints.txt. Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Point current_image_tags.properties to the CI tritondevel images built from PR NVIDIA#13064 (flashinfer + nvidia-cutlass-dsl upgrade). Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The DLFW base image (pytorch:26.02-py3) ships nvidia-cutlass-dsl 4.3.5. When pip upgrades to 4.4.2 in-place, it corrupts shared namespace dirs. Add explicit uninstall + rm -rf cleanup before tensorrt_llm wheel install. Signed-off-by: Yihan Wang <yihwang@nvidia.com>
This reverts commit 0dd321a59fc067e5fd3124f1fb5c6b8aba1d7ad3.
pip installs dependency packages (nvidia-cutlass-dsl-libs-base) before uninstalling the old meta-wheel (nvidia-cutlass-dsl). Since both write to the same nvidia_cutlass_dsl/ directory, the uninstall step removes files that the deps just installed, breaking the package.
Add scripts/clean_site_packages.py that uninstalls known problematic packages and removes leftover site-packages fragments before install. Call it from test_pip_install.py before both wheel and editable installs. This avoids Docker image changes; the cleanup runs at CI test time.
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
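As a rough illustration, a cleanup helper of the kind this commit describes could look like the sketch below. The package and directory names come from the commit message, but the function shape, parameters, and module layout are assumptions, not the script actually merged.

```python
"""Hypothetical sketch of a clean_site_packages.py-style helper."""
import shutil
import subprocess
import sys
import sysconfig
from pathlib import Path

# Packages whose wheels overlap in the same namespace directory (per the
# commit message); pip's install/uninstall ordering alone cannot fix this.
PROBLEM_PACKAGES = ["nvidia-cutlass-dsl", "nvidia-cutlass-dsl-libs-base"]
# Directory fragments both wheels write into.
LEFTOVER_DIRS = ["nvidia_cutlass_dsl"]


def clean_site_packages(packages=PROBLEM_PACKAGES, leftover_dirs=LEFTOVER_DIRS):
    """Uninstall known-problematic packages, then delete stale fragments."""
    removed = []
    for pkg in packages:
        # -y skips confirmation; check=False tolerates an absent package.
        subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg],
                       check=False)
    site = Path(sysconfig.get_paths()["purelib"])
    for name in leftover_dirs:
        fragment = site / name
        if fragment.exists():
            shutil.rmtree(fragment)
            removed.append(str(fragment))
    return removed
```

Running this before `pip install` guarantees no half-removed namespace directory survives into the fresh install.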
…k Test List

The Check Test List CI stage runs pip install before calling clean_site_packages.py, causing nvidia-cutlass-dsl namespace corruption that breaks tensorrt_llm imports during pytest collection.
- Add clean_site_packages.py call before pip install in launchTestListCheck
- Guard cute_dsl_custom_ops imports with IS_CUTLASS_DSL_AVAILABLE in fused_moe_cute_dsl.py to prevent ImportError when cutlass-dsl is unavailable
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
- Revert IS_CUTLASS_DSL_AVAILABLE guard in fused_moe_cute_dsl.py
- Add clean_site_packages.py call in slurm_install.sh before pip install to prevent nvidia-cutlass-dsl namespace corruption on Multi-GPU jobs
- Update CI image tags to PR-13064 staging images with flashinfer 0.6.8rc1 and nvidia-cutlass-dsl 4.4.2
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
The new CI Docker images ship with a clean nvidia-cutlass-dsl install, making the pre-install cleanup script unnecessary. The DLFW image path already has its own inline cleanup in L0_Test.groovy. Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…ass-dsl 4.4.2

The three blockscaled cute_dsl kernels each monkey-patched cutlass.utils.PersistentTileSchedulerParams.__init__ and StaticPersistentTileScheduler._get_cluster_work_idx_with_fastdivmod, with a TODO to remove once nvidia-cutlass-dsl 4.4 was released. cutlass-dsl 4.4.2 now natively provides public raster_along_m, cluster_shape_major_fdd/cluster_shape_minor_fdd, and the fastdivmod work-idx helper.

The stale hook stored _raster_along_m (private) and cluster_shape_m_fdd/_n_fdd, which the library's own __extract_mlir_values__ does not know about, producing:

AttributeError: 'PersistentTileSchedulerParams' object has no attribute 'raster_along_m'. Did you mean: '_raster_along_m'?

on all CUTEDSL/DENSEGEMM MoE tests on B200. Deletes all three hook blocks; every private-name reference was confined to the hook itself.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
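The failure mode described above can be shown with a toy reproduction. The class and attribute names mirror the commit message, but the bodies are stand-ins, not cutlass-dsl code: the stale hook stores only the private name, while the library's extractor reads the public one.

```python
class PersistentTileSchedulerParams:
    """Stand-in for the cutlass-dsl class; 4.4.2 stores the public name."""
    def __init__(self, raster_along_m):
        self.raster_along_m = raster_along_m


def stale_hook_init(self, raster_along_m):
    # The old monkey-patched __init__ stored only a private name.
    self._raster_along_m = raster_along_m


def extract_values(params):
    # Stand-in for __extract_mlir_values__, which reads the public name.
    return params.raster_along_m


# Construct an instance the way the stale hook would have initialized it.
patched = PersistentTileSchedulerParams.__new__(PersistentTileSchedulerParams)
stale_hook_init(patched, True)
try:
    extract_values(patched)
    err = ""
except AttributeError as exc:
    err = str(exc)  # mentions the missing public 'raster_along_m'
```

With the hooks deleted, the library's own `__init__` runs and `extract_values` sees the public attribute it expects.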
…r cutlass-dsl 4.4.2

nvidia-cutlass-dsl 4.4.2 split _make_sync_object across sm90/sm100. PipelineAsync._make_sync_object (sm90) now rejects Blackwell ops like TCGen05Mma and ClcLoad, tripping "Invalid PipelineOp specified" in every custom pipeline defined here (PipelineTmaUmma, PipelineUmmaAsync, PipelineCpAsyncUmma), which all pair a producer/consumer with TCGen05Mma.

The expanded factory lives on the sm100 PipelineTmaUmma. Alias it and route the six call sites through the alias so both pre-4.4 and 4.4.2 versions of cutlass-dsl work.

Verified on B300 with cutlass-dsl 4.4.2: test_nvfp4_gather_grouped_gemm_swiglu_blackwell[1024-1-1-256] PASSED.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
…lass-dsl 4.4.2

PipelineAsync.producer_tail is decorated with @dsl_user_op in cutlass-dsl 4.4.2 and forwards loc/ip kwargs to producer_acquire. The overrides in custom_pipeline.py did not accept these, raising DSLRuntimeError in test_fp4_linear_cute_dsl.

Add loc=None, ip=None keyword-only parameters to producer_acquire, producer_commit, consumer_release, and producer_tail across PipelineTmaUmma, PipelineUmmaAsync, and PipelineCpAsyncUmma, and thread them through to the inner sync_object and cute.arch calls.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
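The kwarg-threading fix amounts to the pattern below. The method names come from the commit message, but the bodies are illustrative stubs rather than the real pipeline code: each override gains keyword-only loc/ip parameters and passes them down unchanged.

```python
class PipelineTmaUmma:
    """Stub showing the keyword-only loc/ip threading, not the real class."""

    def producer_acquire(self, state, *, loc=None, ip=None):
        # In the real code loc/ip are threaded to sync_object / cute.arch
        # calls; here we just return them so the forwarding is visible.
        return ("acquire", state, loc, ip)

    def producer_tail(self, state, *, loc=None, ip=None):
        # @dsl_user_op in cutlass-dsl 4.4.2 injects loc/ip when calling
        # this method, so the override must accept and forward them.
        return self.producer_acquire(state, loc=loc, ip=ip)


pipe = PipelineTmaUmma()
```

Because the parameters default to None and are keyword-only, pre-4.4 callers that pass neither argument keep working unchanged.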
Might need to pip install nvidia-cutlass-dsl-libs-base==4.4.2 --force-reinstall --no-deps in docker Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Add NemotronHPuzzleConfig.
* Extend routing kernel for MoE.
* Extend workspace allocation.
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Remove restricted condition in mamba2_mixer. Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
* Enable more c++ routing combinations.
* Update mamba tensor operations.
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Decode:
- Add fast-path kernel for seqlen=1 with zero loops and fast math
- Increase thread block from 64 to 128 to halve block count
- Compile-time specialize conv_state_indices and silu branches

Prefill:
- Use 128 threads for varlen with long sequences
- Enable VecLoad for varlen BS=1 (seq_start=0 is always aligned)
- Move conv_state save before main loop, removing 80 lines of complex cross-chunk state extraction from smem_exchange
- Compile-time specialize conv_state_indices and silu branches

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
The gather grouped-GEMM kernel now supports both SwiGLU (is_gated=True)
and Relu2 (is_gated=False), so the "swiglu" naming is inaccurate. Rename
the kernel file, Runner class, and torch ops to use "act_fusion", and
generalize the top-level/class/forward docstrings to describe both
activations. SwiGLU-specific code comments inside the is_gated=True
branch are kept verbatim.
No behavioral change: the non-gather and dense-FC1 SwiGLU kernels (which
do not have an is_gated path) are untouched.
Renames:
- blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
-> blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
- Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner
-> Sm100BlockScaledContiguousGatherGroupedGemmActFusionRunner
- trtllm::cute_dsl_nvfp4_gather_grouped_gemm_swiglu_blackwell[_multi_b]
-> trtllm::cute_dsl_nvfp4_gather_grouped_gemm_act_fusion_blackwell[_multi_b]
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
- Filter all per-draft-len padding dummy IDs (CUDA_GRAPH_DUMMY_REQUEST_ID - runtime_draft_len), not just the sentinel, so the dummy never takes a permanent Mamba slot.
- In update_mamba_states, slice self.state_indices by the real generation count; in get_state_indices, allocate padding slots only from mamba_cache_free_blocks. Prevents writes from aliasing slots owned by live requests outside the current batch.
- Release Phase-1 CUDA-graph pools before the final KV cache allocation.

Adds regression tests for the slot-borrowing paths.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Don't review, don't merge
Stack PRs
#13064 flashinfer upgrade
#12731 Using SSD/SSU kernel from flashinfer
#13032 Optimize nemotron-h from python level
#13103 Optimize causal_conv1d prefill and decode kernels
#12884 Add CUTEDSL moe backend for nemotron-h
#12246 Refactor the routing part in trtllmgen
#13186 Update the deepseek routing
#13151 Fix mamba slot leak for aggregate mode