
[https://nvbugs/6085022][fix] Synchronize warmup skip decisions across ranks#13177

Open
chenfeiz0326 wants to merge 1 commit into NVIDIA:main from chenfeiz0326:chenfeiz/fix-warmup-oom-multi-rank

Conversation

@chenfeiz0326 (Collaborator) commented Apr 18, 2026

Synchronize warmup skip decisions across ranks to prevent hang

When one rank hits OOM during warmup (e.g., rank 0 with extra orchestrator overhead on a disaggregated CTX server), it catches the exception and moves on. However, other ranks remain stuck in collective operations inside the forward pass, eventually timing out and crashing with CUDA launch failures.

This fix adds `_all_ranks_warmup_ready()`, which uses an allreduce to synchronize the warmup decision across all ranks:

  1. Ensures all ranks have a valid warmup batch before proceeding
  2. Pre-checks free GPU memory on all ranks before large warmups to prevent OOM that would leave other ranks hanging in collective ops

This specifically fixes the DSV3.2 CTX server crash (TP=4, max_num_tokens=32784) where rank 0 OOMs during the 32K-token warmup forward pass while ranks 1-3 hang at the next collective op for 5 minutes before crashing.
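The skip decision described above can be sketched as a small simulation (names and the per-token memory estimate below are hypothetical; the real helper in tensorrt_llm/_torch/pyexecutor/model_engine.py performs the reduction via self.dist.allreduce with ReduceOp.MIN):

```python
# Simulated cross-rank warmup readiness check. The collective is modeled
# with a plain min() here; the real code uses an allreduce with a MIN op
# so every rank arrives at the same decision.

BYTES_PER_TOKEN_ESTIMATE = 2 * 1024 * 1024  # invented conservative estimate


def local_warmup_ready(has_batch: bool, free_mem: int, num_tokens: int) -> int:
    """Return 1 if this rank can safely run the warmup forward pass, else 0."""
    if not has_batch:
        return 0  # no valid warmup batch on this rank
    min_free = num_tokens * BYTES_PER_TOKEN_ESTIMATE
    return 1 if free_mem >= min_free else 0


def all_ranks_warmup_ready(per_rank_states) -> bool:
    """Warmup runs only if every rank reports ready (allreduce-MIN semantics)."""
    return min(local_warmup_ready(*s) for s in per_rank_states) > 0


# Rank 0 (extra orchestrator overhead) is short on free memory, so all
# ranks agree to skip the 32K-token warmup together instead of rank 0
# OOM-ing while ranks 1-3 hang in the next collective op.
states = [
    (True, 1 << 30, 32784),   # rank 0: ~1 GiB free, not enough
    (True, 80 << 30, 32784),  # rank 1
    (True, 80 << 30, 32784),  # rank 2
    (True, 80 << 30, 32784),  # rank 3
]
print(all_ranks_warmup_ready(states))  # -> False: everyone skips together
```

Because MIN over {0, 1} is 0 whenever any rank reports 0, a single struggling rank causes a coordinated skip rather than a one-sided OOM.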

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced warmup process for distributed GPU execution with synchronized readiness checks across all ranks.
    • Added GPU memory availability validation before warmup to prevent out-of-memory errors in multi-GPU deployments.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chenfeiz0326 chenfeiz0326 requested a review from a team as a code owner April 18, 2026 15:01
@chenfeiz0326 chenfeiz0326 requested a review from dongxuy04 April 18, 2026 15:01
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 18, 2026

📝 Walkthrough

Updated distributed communicator imports to include ReduceOp and modified the warmup process to perform cross-rank readiness synchronization. Added a new helper method that checks local readiness (considering GPU memory availability) and performs an allreduce operation to ensure all ranks are prepared before proceeding.

Changes

Cohort: Distributed Warmup Synchronization
File(s): tensorrt_llm/_torch/pyexecutor/model_engine.py
Summary: Updated imports to include ReduceOp, modified _general_warmup to perform cross-rank readiness checks, and added an _all_ranks_warmup_ready helper that validates local readiness (including a GPU memory pre-check with conservative per-token estimates) and synchronizes across all ranks using an allreduce with a MIN operation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description lacks the required template structure; it is missing dedicated 'Description' and 'Test Coverage' sections with substantive content, and the checklist is not completed. Resolution: add a clear 'Description' section explaining the issue and solution, a 'Test Coverage' section listing relevant tests, and complete the PR checklist items to match the template format.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly summarizes the main change (synchronizing warmup skip decisions across ranks), with the NVBugs ticket properly formatted and the [fix] type specified.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment (Contributor)


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 839-846: The readiness/memory check guarded by "if is_ready and
num_tokens > 256:" can leave smaller warmups vulnerable to OOMs that hang other
ranks; update the logic so the free_mem/min_free check
(torch.cuda.mem_get_info(), computation of min_free) runs for every warmup shape
instead of only when num_tokens > 256 — i.e., remove or change the >256 gate and
ensure the same free_mem < min_free handling is executed before calling
_general_warmup() (or invoke this memory check inside the warmup loop that
iterates warmup shapes) so that each call to _general_warmup() verifies
sufficient memory per-rank and fails locally if insufficient.
- Around line 854-860: The warmup sync currently passes a CUDA tensor into
self.dist.allreduce (ready_val = self.dist.allreduce(torch.tensor(...,
device='cuda'), op=ReduceOp.MIN)) which unnecessarily couples the barrier to GPU
memory and breaks MPIDist expectations; change the call to pass a plain Python
int (e.g., 1 or 0) into self.dist.allreduce and then interpret the returned
scalar to set all_ready (use the returned value > 0). Update references around
ready_val, self.dist.allreduce, ReduceOp.MIN and all_ready to handle a scalar
return instead of calling .item(), mirroring how other call sites (e.g.,
resource_manager.py) pass scalars and how TorchDist.allreduce/MPIDist.allreduce
accept base types.
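Both review points above can be folded into one sketch (all names are hypothetical; FakeDist stands in for the real TorchDist/MPIDist communicators, and free memory is passed in explicitly so the example runs without a GPU, whereas the real code would read it from torch.cuda.mem_get_info()):

```python
# Hypothetical sketch combining both review suggestions:
#  1. the free-memory headroom check runs for EVERY warmup shape
#     (no `num_tokens > 256` gate), and
#  2. the cross-rank barrier exchanges a plain Python int, so the
#     barrier itself cannot fail under GPU memory pressure.

BYTES_PER_TOKEN = 2 * 1024 * 1024  # invented conservative estimate


class FakeDist:
    """Stand-in communicator: folds this rank's scalar into recorded
    peer contributions, mimicking an allreduce with a MIN op."""

    def __init__(self, peer_values):
        self.peer_values = peer_values  # what the other ranks contributed

    def allreduce(self, value, op=min):
        return op([value, *self.peer_values])


def run_warmups(dist, warmup_shapes, free_mem):
    """Return the shapes actually warmed up on this rank."""
    warmed = []
    for num_tokens in warmup_shapes:
        # 1. Per-shape local headroom check, applied to every shape.
        is_ready = free_mem >= num_tokens * BYTES_PER_TOKEN
        # 2. Scalar MIN-allreduce: proceed only if every rank is ready.
        all_ready = dist.allreduce(1 if is_ready else 0, op=min) > 0
        if all_ready:
            warmed.append(num_tokens)  # real code would run the forward pass
    return warmed


# Peers always ready; this rank has ~4 GiB free, so the 32K-token shape
# is skipped by everyone instead of OOM-ing here and hanging the peers.
dist = FakeDist(peer_values=[1, 1, 1])
print(run_warmups(dist, [1, 256, 2048, 32784], free_mem=4 << 30))
# -> [1, 256, 2048]
```

Passing a scalar keeps the readiness barrier decoupled from allocator state, which is exactly the failure mode the barrier is meant to guard against.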

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: aa406598-e26f-4b9c-bf69-88b2343ce020

📥 Commits

Reviewing files that changed from the base of the PR and between 66d7711 and 6ea76b9.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py


Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 force-pushed the chenfeiz/fix-warmup-oom-multi-rank branch from 6ea76b9 to ff05569 Compare April 18, 2026 15:33


1 participant