
[https://nvbugs/6085022][fix] Synchronize warmup skip decisions across ranks#13177

Open
chenfeiz0326 wants to merge 1 commit into NVIDIA:main from chenfeiz0326:chenfeiz/fix-warmup-oom-multi-rank

Conversation

@chenfeiz0326 (Collaborator) commented Apr 18, 2026

Synchronize warmup skip decisions across ranks to prevent hang

When one rank hits OOM during warmup (e.g., rank 0 with extra orchestrator overhead on a disaggregated CTX server), it catches the exception and moves on. However, other ranks remain stuck in collective operations inside the forward pass, eventually timing out and crashing with CUDA launch failures.

This fix adds `_all_ranks_warmup_ready()`, which uses an allreduce to synchronize the warmup decision across all ranks:

  1. Ensures all ranks have a valid warmup batch before proceeding
  2. Pre-checks free GPU memory on all ranks before large warmups to prevent OOM that would leave other ranks hanging in collective ops

This specifically fixes the DSV3.2 CTX server crash (TP=4, max_num_tokens=32784) where rank 0 OOMs during the 32K-token warmup forward pass while ranks 1-3 hang at the next collective op for 5 minutes before crashing.
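The skip decision described above can be sketched as a small simulation (names and the per-token memory estimate below are hypothetical; the real helper in tensorrt_llm/_torch/pyexecutor/model_engine.py performs the reduction via self.dist.allreduce with ReduceOp.MIN):

```python
# Simulated cross-rank warmup readiness check. The collective is modeled
# with a plain min() here; the real code uses an allreduce with a MIN op
# so every rank arrives at the same decision.

BYTES_PER_TOKEN_ESTIMATE = 2 * 1024 * 1024  # invented conservative estimate


def local_warmup_ready(has_batch: bool, free_mem: int, num_tokens: int) -> int:
    """Return 1 if this rank can safely run the warmup forward pass, else 0."""
    if not has_batch:
        return 0  # no valid warmup batch on this rank
    min_free = num_tokens * BYTES_PER_TOKEN_ESTIMATE
    return 1 if free_mem >= min_free else 0


def all_ranks_warmup_ready(per_rank_states) -> bool:
    """Warmup runs only if every rank reports ready (allreduce-MIN semantics)."""
    return min(local_warmup_ready(*s) for s in per_rank_states) > 0


# Rank 0 (extra orchestrator overhead) is short on free memory, so all
# ranks agree to skip the 32K-token warmup together instead of rank 0
# OOM-ing while ranks 1-3 hang in the next collective op.
states = [
    (True, 1 << 30, 32784),   # rank 0: ~1 GiB free, not enough
    (True, 80 << 30, 32784),  # rank 1
    (True, 80 << 30, 32784),  # rank 2
    (True, 80 << 30, 32784),  # rank 3
]
print(all_ranks_warmup_ready(states))  # -> False: everyone skips together
```

Because MIN over {0, 1} is 0 whenever any rank reports 0, a single struggling rank causes a coordinated skip rather than a one-sided OOM.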

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced warmup process for distributed GPU execution with synchronized readiness checks across all ranks.
    • Added GPU memory availability validation before warmup to prevent out-of-memory errors in multi-GPU deployments.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chenfeiz0326 chenfeiz0326 requested a review from a team as a code owner April 18, 2026 15:01
@chenfeiz0326 chenfeiz0326 requested a review from dongxuy04 April 18, 2026 15:01
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 18, 2026

📝 Walkthrough

Updated distributed communicator imports to include ReduceOp and modified the warmup process to perform cross-rank readiness synchronization. Added a new helper method that checks local readiness (considering GPU memory availability) and performs an allreduce operation to ensure all ranks are prepared before proceeding.

Changes

Cohort: Distributed Warmup Synchronization
File(s): tensorrt_llm/_torch/pyexecutor/model_engine.py
Summary: Updated imports to include ReduceOp, modified _general_warmup to perform cross-rank readiness checks, and added an _all_ranks_warmup_ready helper that validates local readiness (including a GPU memory pre-check with conservative per-token estimates) and synchronizes across all ranks using an allreduce with a MIN operation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description lacks the required template structure; it is missing dedicated 'Description' and 'Test Coverage' sections with substantive content, and the checklist is not completed. Resolution: add a clear 'Description' section explaining the issue and solution, a 'Test Coverage' section listing relevant tests, and complete the PR checklist items to match the template format.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly summarizes the main change (synchronizing warmup skip decisions across ranks), with the NVBugs ticket properly formatted and the [fix] type specified.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment (Contributor)


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 839-846: The readiness/memory check guarded by "if is_ready and
num_tokens > 256:" can leave smaller warmups vulnerable to OOMs that hang other
ranks; update the logic so the free_mem/min_free check
(torch.cuda.mem_get_info(), computation of min_free) runs for every warmup shape
instead of only when num_tokens > 256 — i.e., remove or change the >256 gate and
ensure the same free_mem < min_free handling is executed before calling
_general_warmup() (or invoke this memory check inside the warmup loop that
iterates warmup shapes) so that each call to _general_warmup() verifies
sufficient memory per-rank and fails locally if insufficient.
- Around line 854-860: The warmup sync currently passes a CUDA tensor into
self.dist.allreduce (ready_val = self.dist.allreduce(torch.tensor(...,
device='cuda'), op=ReduceOp.MIN)) which unnecessarily couples the barrier to GPU
memory and breaks MPIDist expectations; change the call to pass a plain Python
int (e.g., 1 or 0) into self.dist.allreduce and then interpret the returned
scalar to set all_ready (use the returned value > 0). Update references around
ready_val, self.dist.allreduce, ReduceOp.MIN and all_ready to handle a scalar
return instead of calling .item(), mirroring how other call sites (e.g.,
resource_manager.py) pass scalars and how TorchDist.allreduce/MPIDist.allreduce
accept base types.
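Both review points above can be folded into one sketch (all names are hypothetical; FakeDist stands in for the real TorchDist/MPIDist communicators, and free memory is passed in explicitly so the example runs without a GPU, whereas the real code would read it from torch.cuda.mem_get_info()):

```python
# Hypothetical sketch combining both review suggestions:
#  1. the free-memory headroom check runs for EVERY warmup shape
#     (no `num_tokens > 256` gate), and
#  2. the cross-rank barrier exchanges a plain Python int, so the
#     barrier itself cannot fail under GPU memory pressure.

BYTES_PER_TOKEN = 2 * 1024 * 1024  # invented conservative estimate


class FakeDist:
    """Stand-in communicator: folds this rank's scalar into recorded
    peer contributions, mimicking an allreduce with a MIN op."""

    def __init__(self, peer_values):
        self.peer_values = peer_values  # what the other ranks contributed

    def allreduce(self, value, op=min):
        return op([value, *self.peer_values])


def run_warmups(dist, warmup_shapes, free_mem):
    """Return the shapes actually warmed up on this rank."""
    warmed = []
    for num_tokens in warmup_shapes:
        # 1. Per-shape local headroom check, applied to every shape.
        is_ready = free_mem >= num_tokens * BYTES_PER_TOKEN
        # 2. Scalar MIN-allreduce: proceed only if every rank is ready.
        all_ready = dist.allreduce(1 if is_ready else 0, op=min) > 0
        if all_ready:
            warmed.append(num_tokens)  # real code would run the forward pass
    return warmed


# Peers always ready; this rank has ~4 GiB free, so the 32K-token shape
# is skipped by everyone instead of OOM-ing here and hanging the peers.
dist = FakeDist(peer_values=[1, 1, 1])
print(run_warmups(dist, [1, 256, 2048, 32784], free_mem=4 << 30))
# -> [1, 256, 2048]
```

Passing a scalar keeps the readiness barrier decoupled from allocator state, which is exactly the failure mode the barrier is meant to guard against.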

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: aa406598-e26f-4b9c-bf69-88b2343ce020

📥 Commits

Reviewing files that changed from the base of the PR and between 66d7711 and 6ea76b9.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py


Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 force-pushed the chenfeiz/fix-warmup-oom-multi-rank branch from 6ea76b9 to ff05569 Compare April 18, 2026 15:33


1 participant