@@ -57,6 +57,7 @@ class NVLinkOneSided(Communication):

# Single shared workspace/memory across the process
_WORKSPACE: dict | None = None
_WORKSPACE_INIT_FAILED: bool = False
Collaborator
This solution is a duplicate of #13172; we may close this PR instead. We can unwaive 6084764 in #13172.


# MetaInfo indices - initialized from C++ constants
FLAG_VAL_OFFSET_INDEX = None
@@ -168,6 +169,14 @@ def __init__(
transfer (halves NVLink bandwidth usage, output precision is preserved).
Corresponds to model_config.use_low_precision_moe_combine.
"""
# Skip if workspace initialization previously failed to avoid repeated
# MnnvlMemory allocations that leak CUDA physical memory (held alive by
# exception traceback references), which can exhaust GPU memory.
if NVLinkOneSided._WORKSPACE_INIT_FAILED:
raise RuntimeError(
"NVLinkOneSided workspace initialization previously failed; skipping retry."
)

super().__init__(mapping)

if self.mapping.world_size != self.ep_size:
@@ -229,15 +238,26 @@ def __init__(
f"NVLinkOneSided: Allocating workspace with size {self.workspace_size_per_rank} bytes."
f"ep_rank: {self.ep_rank}, ep_size: {self.ep_size}, top_k: {self.top_k}, max_num_tokens_per_rank: {self.max_num_tokens_per_rank}"
)
mnnvl_mem = MnnvlMemory(mapping, self.workspace_size_per_rank)
workspace = mnnvl_mem.as_torch_strided_tensor(torch.uint8)
metainfo = torch.ops.trtllm.moe_a2a_initialize(
workspace,
self.ep_rank,
self.ep_size,
self.max_num_tokens_per_rank,
self.eplb_stats_num_experts,
)
mnnvl_mem = None
workspace = None
try:
mnnvl_mem = MnnvlMemory(mapping, self.workspace_size_per_rank)
workspace = mnnvl_mem.as_torch_strided_tensor(torch.uint8)
metainfo = torch.ops.trtllm.moe_a2a_initialize(
workspace,
self.ep_rank,
self.ep_size,
self.max_num_tokens_per_rank,
self.eplb_stats_num_experts,
)
except Exception:
# Release CUDA physical memory immediately to prevent leak.
# Without explicit cleanup, MnnvlMemory objects stay alive
# (held by exception traceback references) until GC runs.
workspace = None
mnnvl_mem = None
NVLinkOneSided._WORKSPACE_INIT_FAILED = True
raise
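The cleanup comments in this hunk hinge on a CPython detail: a propagating exception's traceback references the raising frame, and the frame keeps its locals (here `mnnvl_mem` and `workspace`) alive for as long as someone upstream holds the exception. A minimal, self-contained sketch of that mechanism, where `Resource`, `run`, and `held_exc` are hypothetical stand-ins and not part of TensorRT-LLM:

```python
import weakref


class Resource:
    """Hypothetical stand-in for a large allocation such as a workspace buffer."""


held_exc = None  # simulates an upstream caller that keeps the exception alive


def run(clear_before_reraise):
    """Allocate, fail, and optionally drop the local reference before re-raising."""
    global held_exc
    res = Resource()
    probe = weakref.ref(res)  # lets us observe whether `res` was freed
    try:
        try:
            raise RuntimeError("simulated init failure")
        except RuntimeError:
            if clear_before_reraise:
                res = None  # release the frame-local reference first
            raise
    except RuntimeError as exc:
        held_exc = exc  # exc.__traceback__ pins this frame and its locals
    return probe


leaky = run(clear_before_reraise=False)
print(leaky() is not None)  # True: the held traceback still references `res`
clean = run(clear_before_reraise=True)
print(clean() is None)      # True: clearing the local let the object be freed
```

Setting `workspace = None` and `mnnvl_mem = None` before `raise` is the same move as `res = None` above: it guarantees the CUDA allocation is released immediately instead of surviving inside a retained traceback until garbage collection.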
Comment on lines +253 to +260
Contributor

⚠️ Potential issue | 🟠 Major

Narrow exception handling to only catch expected initialization failures.

`except Exception` is too broad and can permanently disable NVLinkOneSided for unrelated errors. Since the MNNVL initialization operations (`MnnvlMemory`, `as_torch_strided_tensor`, `torch.ops.trtllm.moe_a2a_initialize`) are expected to raise `RuntimeError` on initialization failure, catch `RuntimeError` specifically. Setting the class-level `_WORKSPACE_INIT_FAILED` flag prevents any future initialization attempts, so it should be set only for actual initialization errors, not incidental exceptions.

Proposed change
-            except Exception:
+            except RuntimeError:
                 # Release CUDA physical memory immediately to prevent leak.
                 # Without explicit cleanup, MnnvlMemory objects stay alive
                 # (held by exception traceback references) until GC runs.
                 workspace = None
                 mnnvl_mem = None
                 NVLinkOneSided._WORKSPACE_INIT_FAILED = True
                 raise
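Taken together, the narrowed handler and the class-level flag form a fail-fast pattern: an expected initialization error trips the flag and blocks further (leak-prone) retries, while unrelated exceptions propagate without disabling the backend. A minimal sketch under those assumptions, where `Comm` and `allocate` are hypothetical stand-ins for the actual NVLinkOneSided API:

```python
class Comm:
    """Fail-fast initialization: retry only until the first *expected* failure."""

    _INIT_FAILED = False  # class-level, shared across all construction attempts

    def __init__(self, allocate):
        if Comm._INIT_FAILED:
            raise RuntimeError(
                "workspace initialization previously failed; skipping retry")
        workspace = None
        try:
            workspace = allocate()  # stand-in for MnnvlMemory + moe_a2a_initialize
        except RuntimeError:
            # Expected init failure: drop the local reference before re-raising
            # so the traceback does not keep the allocation alive, then latch
            # the flag to prevent further leaking retries.
            workspace = None
            Comm._INIT_FAILED = True
            raise
        self.workspace = workspace


def failing_alloc():
    raise RuntimeError("simulated allocation failure")


try:
    Comm(failing_alloc)  # first attempt: the real error propagates
except RuntimeError as e:
    print(e)             # simulated allocation failure
try:
    Comm(failing_alloc)  # second attempt: blocked by the latched flag
except RuntimeError as e:
    print(e)             # workspace initialization previously failed; skipping retry
```

Note that with the narrowed handler, an unexpected error from `allocate` (say, a `TypeError`) would propagate without latching the flag, so a transient unrelated bug cannot permanently disable construction.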

NVLinkOneSided._WORKSPACE = {
"workspace_size_per_rank": self.workspace_size_per_rank,
"max_num_tokens_per_rank": self.max_num_tokens_per_rank,