
[https://nvbugs/6071070][fix] Add K2.5 DISAGG Gen Only EPLB Cases into CI#13185

Merged
chenfeiz0326 merged 2 commits into NVIDIA:main from chenfeiz0326:chenfeiz/fix-bug-6071070
Apr 20, 2026

Conversation

@chenfeiz0326 (Collaborator) commented Apr 19, 2026

…nking EPLB disagg configs on dep16

Root cause: with EPLB num_slots=384 on EP=16, each GPU stores 24 expert weight sets (384/16) versus 16 without EPLB, consuming ~10 GiB extra per GPU. Combined with NVLink MoE communication buffers (MnnvlMemory, 512 MiB/layer × 58 layers), the original config (max_batch_size=256, cuda_graph bs=256, fraction=0.8) leaves insufficient memory for CUDA graph autotuner warmup, causing OOM during model initialization.
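The arithmetic in the root cause can be sketched as follows (the 256-expert total and the per-set weight size are assumptions inferred back from the stated ~10 GiB figure, not measured values):

```python
# Back-of-the-envelope memory math for the EPLB gen-worker OOM.
NUM_SLOTS = 384      # EPLB expert slots (num_slots=384)
NUM_EXPERTS = 256    # assumed physical expert count (16 sets/GPU * EP16)
EP = 16              # expert-parallel degree

sets_with_eplb = NUM_SLOTS // EP        # 24 expert weight sets per GPU
sets_without_eplb = NUM_EXPERTS // EP   # 16 sets per GPU without EPLB
extra_sets = sets_with_eplb - sets_without_eplb

# ~10 GiB extra per GPU across the extra sets -> inferred per-set size
gib_per_set = 10 / extra_sets           # assumption: ~1.25 GiB per set

# NVLink MoE communication buffers (MnnvlMemory): 512 MiB/layer over 58 layers
nvlink_buf_gib = 512 * 58 / 1024

print(sets_with_eplb, extra_sets, round(nvlink_buf_gib, 1))  # -> 24 8 29.0
```

Together the extra expert weights and the communication buffers account for tens of GiB per GPU, which is why the KV cache fraction and CUDA graph batch sizes had to shrink.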

Fix: Reduce gen worker memory pressure while keeping CUDA graphs enabled:

  • max_batch_size: 256 -> 128
  • max_num_tokens: 256 -> 128
  • cuda_graph_config.max_batch_size: 256 -> 32
  • kv_cache_config.free_gpu_memory_fraction: 0.8 -> 0.45
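The four changes above can be sketched as a gen-worker YAML fragment (the key nesting here is illustrative — only the values come from this PR; see the actual perf-sanity files for the exact layout):

```yaml
# Illustrative gen-worker fragment; nesting assumed, values from this PR.
max_batch_size: 128            # was 256
max_num_tokens: 128            # was 256
cuda_graph_config:
  max_batch_size: 32           # was 256; smaller graphs cut warmup memory
kv_cache_config:
  free_gpu_memory_fraction: 0.45   # was 0.8; leaves headroom for EPLB weights
                                   # and NVLink MoE buffers
```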

Validated on lyris GB200 cluster: server starts successfully, EPLB loads all 384 slots, CUDA graphs captured, benchmark passes (636s).

Summary by CodeRabbit

  • Chores
    • Updated performance testing configurations for generation worker parameters and memory allocation settings to optimize resource management during testing scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai bot (Contributor) commented Apr 19, 2026

📝 Walkthrough

Walkthrough

Two performance sanity test configuration files for GB200 with Kimi models are updated. Generation worker batch size, token limits, and CUDA graph configuration are reduced (max_batch_size from 256 to 128, cuda_graph_config.max_batch_size from 256 to 32). KV cache memory fraction is lowered from 0.8 to 0.45.

Changes

Performance Test Configuration
  • tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml

Reduced max_batch_size and max_num_tokens from 256 to 128 in the generation worker config. Decreased cuda_graph_config.max_batch_size from 256 to 32. Adjusted kv_cache_config.free_gpu_memory_fraction from 0.8 to 0.45.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: the PR description includes root cause analysis, specific configuration changes, and validation results, but the Description and Test Coverage template sections are empty. Resolution: fill in the 'Description' section explaining the issue and solution, and the 'Test Coverage' section listing the relevant tests that validate these configuration changes.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check — ✅ Passed: the title references adding K2.5 DISAGG Gen Only EPLB cases to CI, which aligns with the PR objective of fixing a gen-worker OOM issue by updating configuration files for K2.5 disagg deployments.



Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment


🧹 Nitpick comments (2)
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml (1)

50-63: Consider adding a short inline rationale for these tuned limits.

A brief note near this block (EPLB-384 + EP16 + CUDA graph warmup memory headroom) would reduce accidental regressions in future perf-config edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml`
around lines 50 - 63, Add a short inline rationale comment immediately above or
beside the tuned fields (e.g., max_batch_size, max_num_tokens,
tensor_parallel_size, moe_expert_parallel_size,
cuda_graph_config.enable_padding/max_batch_size, and
kv_cache_config.free_gpu_memory_fraction) explaining why these limits were
chosen (EPLB-384 headroom for CUDA-graph warmup, EP16 for expert parallelism,
and kv cache memory headroom) so future editors understand the constraints and
won’t regress perf by changing them unintentionally.
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml (1)

50-63: Optional: centralize shared disagg gen tuning across K2/K2.5 configs.

Since both files carry identical memory-tuning knobs, a shared template/source-of-truth would help prevent config drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml`
around lines 50 - 63, The disaggregated K2/K2.5 configs repeat identical
memory-tuning keys (e.g., max_batch_size, max_num_tokens, tensor_parallel_size,
moe_expert_parallel_size, cuda_graph_config.enable_padding,
cuda_graph_config.max_batch_size, kv_cache_config.enable_block_reuse,
kv_cache_config.free_gpu_memory_fraction); centralize these into a shared
template or include (e.g., a common YAML anchor or a shared config snippet) and
have the K2/K2.5 YAMLs import or reference that single source so changes to keys
like max_batch_size or kv_cache_config.free_gpu_memory_fraction are made once
and prevent drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0a0a4151-e724-4200-9aef-9e8247cfcab6

📥 Commits

Reviewing files that changed from the base of the PR and between fc9d130 and 6c76346.

📒 Files selected for processing (2)
  • tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml
  • tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml

@chenfeiz0326 force-pushed the chenfeiz/fix-bug-6071070 branch 3 times, most recently from 85ef99f to 877e2c1, April 19, 2026 09:21
@chenfeiz0326 chenfeiz0326 requested review from a team as code owners April 19, 2026 09:43
@chenfeiz0326 changed the title to [https://nvbugs/6071070][fix] Add K2.5 DISAGG Gen Only EPLB Cases into CI, Apr 19, 2026
…nfigs

EPLB causes OOM on gen worker during CUDA graph capture for Kimi K2/K2.5
disaggregated deployments. With num_slots=384 on EP=16, each GPU stores
24 expert weight sets (50% more than without EPLB), consuming ~10 GiB
extra per GPU. Combined with NVLink MoE communication buffers, this
leaves insufficient memory for CUDA graph autotuner warmup.

Tested workarounds (reducing cuda_graph max_batch_size to 32, fraction
to 0.45) make the server functional but significantly limit serving
capacity. Disabling EPLB is the better choice until the memory budget
is resolved.

Removed load_balancer config from gen worker moe_config in all 6 files:
- K2.5 dep16 eplb384, K2.5 dep32 eplb384, K2.5 dep32 eplb416
- K2 dep16 eplb384, K2 dep32 eplb384, K2 dep32 eplb416

Renamed config files from eplbXXX to eplb0 and updated test list references.
Uncommented K2.5 gen_only disagg test cases in CI test lists.
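A sketch of what removing the load balancer from each gen-worker config looks like (the nesting of moe_config/load_balancer is an assumption based on the key names above; num_slots comes from this PR):

```yaml
# Before (EPLB enabled; illustrative nesting):
moe_config:
  load_balancer:
    num_slots: 384    # eplb384 variants; 416 for the eplb416 variants

# After (EPLB disabled; files renamed from eplbXXX to eplb0):
moe_config: {}        # load_balancer section removed entirely
```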

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 force-pushed the chenfeiz/fix-bug-6071070 branch from 4122106 to a9f4069, April 19, 2026 09:56
@chenfeiz0326 (Collaborator, Author)

/bot run --disable-fail-fast --stage-list "GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9"

@tensorrt-cicd (Collaborator)

PR_Github #44176 [ run ] triggered by Bot. Commit: a9f4069 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #44176 [ run ] completed with state SUCCESS. Commit: a9f4069
/LLM/main/L0_MergeRequest_PR pipeline #34603 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 (Collaborator, Author)

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"

@tensorrt-cicd (Collaborator)

PR_Github #44287 [ run ] triggered by Bot. Commit: 77b03ae Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #44287 [ run ] completed with state FAILURE. Commit: 77b03ae
/LLM/main/L0_MergeRequest_PR pipeline #34708 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chenfeiz0326 (Collaborator, Author)

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"

@tensorrt-cicd (Collaborator)

PR_Github #44314 [ run ] triggered by Bot. Commit: 77b03ae Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #44314 [ run ] completed with state FAILURE. Commit: 77b03ae
/LLM/main/L0_MergeRequest_PR pipeline #34733 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chenfeiz0326 (Collaborator, Author)

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"

1 similar comment
@chenfeiz0326 (Collaborator, Author)

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"

@tensorrt-cicd (Collaborator)

PR_Github #44392 [ run ] triggered by Bot. Commit: 77b03ae Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #44392 [ run ] completed with state SUCCESS. Commit: 77b03ae
/LLM/main/L0_MergeRequest_PR pipeline #34807 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@chenfeiz0326 (Collaborator, Author)

/bot skip --comment "Only add new perf tests, no need to run the whole CI pipeline"

@chenfeiz0326 enabled auto-merge (squash) April 20, 2026 09:47
@tensorrt-cicd (Collaborator)

PR_Github #44433 [ skip ] triggered by Bot. Commit: 77b03ae Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #44433 [ skip ] completed with state SUCCESS. Commit: 77b03ae
Skipping testing for commit 77b03ae

Link to invocation

@chenfeiz0326 merged commit d08817c into NVIDIA:main Apr 20, 2026
5 checks passed
3 participants