[https://nvbugs/6071070][fix] Add K2.5 DISAGG Gen Only EPLB Cases into CI (#13185)
Conversation
📝 Walkthrough
Two performance sanity test configuration files for GB200 with Kimi models are updated. Generation worker batch size, token limits, and CUDA graph configuration are reduced (max_batch_size from 256 to 128, cuda_graph_config.max_batch_size from 256 to 32). KV cache memory fraction is lowered from 0.8 to 0.45.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
🧹 Nitpick comments (2)
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml (1)
50-63: Consider adding a short inline rationale for these tuned limits. A brief note near this block (EPLB-384 + EP16 + CUDA graph warmup memory headroom) would reduce accidental regressions in future perf-config edits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml` around lines 50-63, add a short inline rationale comment immediately above or beside the tuned fields (e.g., max_batch_size, max_num_tokens, tensor_parallel_size, moe_expert_parallel_size, cuda_graph_config.enable_padding/max_batch_size, and kv_cache_config.free_gpu_memory_fraction) explaining why these limits were chosen (EPLB-384 headroom for CUDA-graph warmup, EP16 for expert parallelism, and kv cache memory headroom) so future editors understand the constraints and won't regress perf by changing them unintentionally.

tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml (1)
50-63: Optional: centralize shared disagg gen tuning across K2/K2.5 configs. Since both files carry identical memory-tuning knobs, a shared template/source-of-truth would help prevent config drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml` around lines 50-63, the disaggregated K2/K2.5 configs repeat identical memory-tuning keys (e.g., max_batch_size, max_num_tokens, tensor_parallel_size, moe_expert_parallel_size, cuda_graph_config.enable_padding, cuda_graph_config.max_batch_size, kv_cache_config.enable_block_reuse, kv_cache_config.free_gpu_memory_fraction); centralize these into a shared template or include (e.g., a common YAML anchor or a shared config snippet) and have the K2/K2.5 YAMLs import or reference that single source so changes to keys like max_batch_size or kv_cache_config.free_gpu_memory_fraction are made once and prevent drift.
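The centralization the reviewer suggests could look like the following. This is an illustrative sketch only: the actual key layout of these perf-sanity YAMLs is assumed, cross-file sharing would additionally require include support in the test harness (not verified here), and only the three values discussed in this PR are shown.

```yaml
# Hypothetical sketch of a shared tuning block via a YAML anchor.
# Works within a single file; a cross-file variant needs harness
# support for includes, which is an assumption.
shared_gen_tuning: &gen_tuning
  max_batch_size: 128               # reduced from 256 in this PR
  cuda_graph_config:
    max_batch_size: 32              # reduced from 256 in this PR
  kv_cache_config:
    free_gpu_memory_fraction: 0.45  # lowered from 0.8 in this PR

gen_worker:
  <<: *gen_tuning                   # merge the shared knobs once
  moe_expert_parallel_size: 16      # dep16 variant (placement assumed)
```

Note that the YAML merge key (`<<`) merges mappings shallowly, so nested blocks like cuda_graph_config are taken wholesale from the anchor; overriding a single nested key per file would need a different layout.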
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0a0a4151-e724-4200-9aef-9e8247cfcab6
📒 Files selected for processing (2)
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb384_mtp0_ccb-UCX.yaml
Force-pushed from 85ef99f to 877e2c1
…nfigs

EPLB causes OOM on gen worker during CUDA graph capture for Kimi K2/K2.5 disaggregated deployments. With num_slots=384 on EP=16, each GPU stores 24 expert weight sets (50% more than without EPLB), consuming ~10 GiB extra per GPU. Combined with NVLink MoE communication buffers, this leaves insufficient memory for CUDA graph autotuner warmup.

Tested workarounds (reducing cuda_graph max_batch_size to 32, fraction to 0.45) make the server functional but significantly limit serving capacity. Disabling EPLB is the better choice until the memory budget is resolved.

Removed load_balancer config from gen worker moe_config in all 6 files:
- K2.5 dep16 eplb384, K2.5 dep32 eplb384, K2.5 dep32 eplb416
- K2 dep16 eplb384, K2 dep32 eplb384, K2 dep32 eplb416

Renamed config files from eplbXXX to eplb0 and updated test list references. Uncommented K2.5 gen_only disagg test cases in CI test lists.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Force-pushed from 4122106 to a9f4069
/bot run --disable-fail-fast --stage-list "GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-9"
PR_Github #44176 [ run ] triggered by Bot. Commit:
PR_Github #44176 [ run ] completed with state
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"
PR_Github #44287 [ run ] triggered by Bot. Commit:
PR_Github #44287 [ run ] completed with state
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"
PR_Github #44314 [ run ] triggered by Bot. Commit:
PR_Github #44314 [ run ] completed with state
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"
1 similar comment
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-7,GB200-20_GPUs-5_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-8"
PR_Github #44392 [ run ] triggered by Bot. Commit:
PR_Github #44392 [ run ] completed with state
/bot skip --comment "Only add new perf tests, no need to run the whole CI pipeline"
PR_Github #44433 [ skip ] triggered by Bot. Commit:
PR_Github #44433 [ skip ] completed with state
…nking EPLB disagg configs on dep16
Root cause: With EPLB num_slots=384 on EP=16, each GPU stores 24 expert weight sets (384/16) vs 16 without EPLB, consuming ~10 GiB extra per GPU. Combined with NVLink MoE communication buffers (MnnvlMemory, 512 MiB/layer x 58 layers), the original config (max_batch_size=256, cuda_graph bs=256, fraction=0.8) leaves insufficient memory for CUDA graph autotuner warmup, causing OOM during model initialization.
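The arithmetic in this root-cause note can be sanity-checked with a short back-of-envelope script. All inputs below are figures stated in the commit message; the per-expert-set size is inferred from the stated ~10 GiB total and is an assumption, not a measured value.

```python
# Back-of-envelope check of the EPLB OOM root cause described above.
EP = 16                        # moe_expert_parallel_size on the gen worker
EPLB_SLOTS = 384               # EPLB num_slots
BASELINE_SETS_PER_GPU = 16     # expert weight sets per GPU without EPLB
NVLINK_BUF_MIB_PER_LAYER = 512 # MnnvlMemory buffer per layer
NUM_LAYERS = 58

eplb_sets_per_gpu = EPLB_SLOTS // EP                    # 384/16 = 24
extra_sets = eplb_sets_per_gpu - BASELINE_SETS_PER_GPU  # 8 extra sets per GPU
overhead = extra_sets / BASELINE_SETS_PER_GPU           # 0.5 -> "50% more"
gib_per_set = 10 / extra_sets  # ~10 GiB extra => ~1.25 GiB/set (inferred)
nvlink_buf_gib = NVLINK_BUF_MIB_PER_LAYER * NUM_LAYERS / 1024  # ~29 GiB

print(f"sets/GPU with EPLB: {eplb_sets_per_gpu} (+{overhead:.0%})")
print(f"NVLink MoE buffers: ~{nvlink_buf_gib:.0f} GiB across {NUM_LAYERS} layers")
```

Under these figures, the extra EPLB weights (~10 GiB) plus the NVLink communication buffers (~29 GiB) occupy roughly 39 GiB of fixed overhead per GPU, which is consistent with the original free_gpu_memory_fraction=0.8 leaving no headroom for CUDA graph warmup.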
Fix: Reduce gen worker memory pressure while keeping CUDA graphs enabled: max_batch_size 256 → 128, cuda_graph_config.max_batch_size 256 → 32, kv_cache_config.free_gpu_memory_fraction 0.8 → 0.45.
Validated on lyris GB200 cluster: server starts successfully, EPLB loads all 384 slots, CUDA graphs captured, benchmark passes (636s).
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.