[https://nvbugs/6085022][fix] Synchronize warmup skip decisions across ranks #13177
Open
chenfeiz0326 wants to merge 1 commit into NVIDIA:main from
Conversation
Contributor
📝 Walkthrough
Updated distributed communicator imports to include
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Contributor
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 839-846: The readiness/memory check guarded by "if is_ready and
num_tokens > 256:" can leave smaller warmups vulnerable to OOMs that hang other
ranks; update the logic so the free_mem/min_free check
(torch.cuda.mem_get_info(), computation of min_free) runs for every warmup shape
instead of only when num_tokens > 256 — i.e., remove or change the >256 gate and
ensure the same free_mem < min_free handling is executed before calling
_general_warmup() (or invoke this memory check inside the warmup loop that
iterates warmup shapes) so that each call to _general_warmup() verifies
sufficient memory per-rank and fails locally if insufficient.
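The per-shape check this comment asks for can be sketched as follows. This is a hypothetical, CPU-only illustration: `mem_get_info` and `estimate_min_free` are mocked stand-ins for `torch.cuda.mem_get_info()` and the real `min_free` computation in `model_engine.py`, which this sketch does not reproduce:

```python
# Hypothetical sketch: run the free-memory pre-check for EVERY warmup
# shape, not only when num_tokens > 256.

def mem_get_info():
    # Stand-in for torch.cuda.mem_get_info(); returns (free, total) bytes.
    return (4 * 1024**3, 8 * 1024**3)

def estimate_min_free(num_tokens):
    # Hypothetical estimate of bytes a warmup of this size needs.
    return num_tokens * 1024**2

def warmup_shapes_to_run(shapes):
    """Return only the shapes this rank has enough free memory for."""
    runnable = []
    for num_tokens in shapes:
        free_mem, _total = mem_get_info()
        min_free = estimate_min_free(num_tokens)
        # Check every shape, so smaller warmups cannot OOM either.
        if free_mem >= min_free:
            runnable.append(num_tokens)
    return runnable

print(warmup_shapes_to_run([256, 2048, 32784]))  # the 32K warmup is skipped
```

With the memory check inside the loop, each call to `_general_warmup()` would be guarded per rank rather than only for the largest shapes.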
- Around line 854-860: The warmup sync currently passes a CUDA tensor into
self.dist.allreduce (ready_val = self.dist.allreduce(torch.tensor(...,
device='cuda'), op=ReduceOp.MIN)) which unnecessarily couples the barrier to GPU
memory and breaks MPIDist expectations; change the call to pass a plain Python
int (e.g., 1 or 0) into self.dist.allreduce and then interpret the returned
scalar to set all_ready (use the returned value > 0). Update references around
ready_val, self.dist.allreduce, ReduceOp.MIN and all_ready to handle a scalar
return instead of calling .item(), mirroring how other call sites (e.g.,
resource_manager.py) pass scalars and how TorchDist.allreduce/MPIDist.allreduce
accept base types.
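The scalar pattern this comment describes can be illustrated with the collective mocked out so it runs anywhere; in the real code the call would be `self.dist.allreduce(flag, op=ReduceOp.MIN)` on a plain Python int, and the mock below only simulates what MIN over the ranks' flags would return:

```python
from typing import List

def allreduce_min(local_value: int, all_rank_values: List[int]) -> int:
    # Mocked collective: stands in for
    # self.dist.allreduce(local_value, op=ReduceOp.MIN), which per the
    # review comment accepts a plain Python int -- no CUDA tensor and
    # no .item() call needed for a readiness flag.
    return min(all_rank_values)

# Each rank contributes 1 if ready, 0 otherwise.
rank_flags = [1, 0, 1, 1]          # hypothetical: rank 1 is not ready
ready_val = allreduce_min(1, rank_flags)
all_ready = ready_val > 0          # interpret the returned scalar directly
print(all_ready)                   # False: one rank vetoes the warmup
```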
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: aa406598-e26f-4b9c-bf69-88b2343ce020
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py
…s ranks to prevent hang

When one rank hits OOM during warmup (e.g., rank 0 with extra orchestrator overhead on a disaggregated CTX server), it catches the exception and moves on. However, other ranks remain stuck in collective operations inside the forward pass, eventually timing out and crashing with CUDA launch failures.

This fix adds _all_ranks_warmup_ready(), which uses allreduce to synchronize the warmup decision across all ranks:

1. Ensures all ranks have a valid warmup batch before proceeding
2. Pre-checks free GPU memory on all ranks before large warmups to prevent OOM that would leave other ranks hanging in collective ops

This specifically fixes the DSV3.2 CTX server crash (TP=4, max_num_tokens=32784) where rank 0 OOMs during the 32K-token warmup forward pass while ranks 1-3 hang at the next collective op for 5 minutes before crashing.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Force-pushed from 6ea76b9 to ff05569
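The all-ranks decision described in the commit message can be simulated without a GPU or MPI. `all_ranks_warmup_ready` below is a hypothetical mock of the PR's `_all_ranks_warmup_ready()`: the MIN-allreduce is replaced by a plain `min` over simulated per-rank readiness flags, so a single not-ready rank makes every rank skip the warmup together:

```python
def all_ranks_warmup_ready(per_rank_ready):
    """Mocked _all_ranks_warmup_ready(): each rank contributes 1/0 and a
    MIN-allreduce means any single not-ready rank vetoes the warmup."""
    reduced = min(1 if ready else 0 for ready in per_rank_ready)
    return reduced > 0

# TP=4 scenario from the bug: rank 0 would OOM on the 32K-token warmup.
ranks_ready = [False, True, True, True]
if all_ranks_warmup_ready(ranks_ready):
    print("all ranks run the warmup forward pass")
else:
    # Every rank skips together, so none is left hanging in a collective.
    print("warmup skipped on all ranks")
```

Because all ranks see the same reduced value, they take the same branch, which is what prevents ranks 1-3 from blocking in the next collective while rank 0 has already bailed out.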
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.