fix: guard DP-imbalance empty micro-batches under dynamic batching #1860
Open
leofan-lab wants to merge 1 commit into THUDM:main from
Conversation
With --use-dynamic-batch-size and variable-length rollouts (common in multi-turn RL), DP ranks can need different micro-batch counts. The all_reduce(MAX, num_microbatches) forces lagging ranks to iterate empty micro-batches, which previously died inside torch.cat / torch.split with no useful context.

Two layered fixes for the same root cause:

1. get_batch: fabricate a 1-token placeholder when `tokens=[]`, with a declared placeholder-for-key schema and a post-fill invariant assert so a future schema addition fails loudly here instead of 400 lines downstream. response_length=0 makes the placeholder a no-op under CP chunking and in sum_of_sample_mean.
2. forward_only: size the reordered output to len(origin_indices), not len(values), so trailing placeholder outputs are dropped. Add invariant asserts for the first-fit bin-packer assumption (real samples at positions [0, N)) and for output-count >= origin count, so future changes that interleave empties or drop real outputs fail locally rather than at some downstream .device access.

Tests: tests/test_dp_imbalance_reorder.py (forward_only reorder), tests/test_dp_imbalance_placeholder.py (get_batch placeholder schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
With `--use-dynamic-batch-size` and variable-length rollouts (common in multi-turn RL with tool calls), DP ranks can need different numbers of micro-batches. The `all_reduce(MAX, num_microbatches)` in `get_data_iterator` forces ranks with fewer real samples to iterate empty micro-batches through the pipeline scheduler, previously producing two crash signatures deep inside post-forward aggregation:

- `torch.cat(): expected a non-empty list of Tensors` (from `get_batch` on empty `tokens`)
- `AttributeError: 'NoneType' object has no attribute 'device'` (from `compute_advantages_and_returns` on trailing `None`s in the reorder output)

Both trace back to the same root cause.
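A minimal sketch of the imbalance, using illustrative names (`pad_to_global_max`, `per_rank_counts`) that stand in for the real `get_data_iterator` logic rather than quoting it:

```python
# Minimal sketch (illustrative names, not slime's actual code) of why the MAX
# all_reduce pads lagging DP ranks with empty micro-batches.
def pad_to_global_max(local_microbatches, per_rank_counts):
    # per_rank_counts stands in for all_reduce(MAX, num_microbatches): with
    # --use-dynamic-batch-size each rank packs a different number of micro-batches,
    # but every rank must step the pipeline scheduler the same number of times.
    global_max = max(per_rank_counts)
    padding = global_max - len(local_microbatches)
    # Pre-fix, these empty micro-batches reached get_batch, where
    # torch.cat([]) raised "expected a non-empty list of Tensors".
    return local_microbatches + [[] for _ in range(padding)]

# e.g. this rank packed 3 real micro-batches while the slowest rank packed 5
padded = pad_to_global_max([["s0", "s1"], ["s2"], ["s3"]], per_rank_counts=[5, 4, 3, 5])
assert len(padded) == 5 and padded[-1] == []
```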
When this fires
Hit on async-retool training runs with `--use-dynamic-batch-size` + DP > 1 once sample-length variance got high enough that some ranks finished their real micro-batches before the collective cap. Does not fire with fixed batch size or DP=1.

Fix
Two layered guards for the same class of bug:
- `get_batch`: fabricate a 1-token placeholder when `tokens=[]`, with a declared `placeholder_for_key` schema and a post-fill invariant assert, so a future schema addition fails loudly at the placeholder site instead of 400 lines downstream. `response_length=0` makes the placeholder a no-op under CP chunking and `sum_of_sample_mean` (0-size tensors, `split_sizes=[0]`).
- `forward_only`: size the reordered output to `len(origin_indices)` (not `len(values)`), dropping trailing placeholder outputs. Add local asserts for the first-fit bin-packer invariant (real samples at `[0, N)`) and output-count >= real-sample-count, so a future change to the partitioner or forward path fails locally rather than at some downstream `.device` access. Both guards are sketched after this list.
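Hedged sketches of the two guards, with illustrative names (`PLACEHOLDER_FOR_KEY`, `fill_empty_microbatch_placeholder`, `reorder_outputs`, `values`, `origin_indices`); the actual `get_batch` / `forward_only` call sites and the real `_fill_empty_microbatch_placeholder` helper differ in detail:

```python
import torch

# Placeholder schema: what one fabricated sample looks like per batch key (illustrative).
PLACEHOLDER_FOR_KEY = {
    "tokens": lambda: torch.zeros(1, dtype=torch.long),  # the 1-token placeholder
    "response_lengths": lambda: 0,                       # 0 => no-op in CP chunking / loss sums
    "loss_masks": lambda: torch.zeros(0, dtype=torch.int),
}

def fill_empty_microbatch_placeholder(batch: dict, keys) -> dict:
    """Fabricate a single placeholder sample for every key of an empty micro-batch."""
    for key in keys:
        assert key in PLACEHOLDER_FOR_KEY, (
            f"no placeholder declared for {key!r}: extend the schema here, "
            "rather than letting torch.cat([]) fail far downstream"
        )
        batch[key] = [PLACEHOLDER_FOR_KEY[key]()]
    # Post-fill invariant: every requested key now carries exactly one sample.
    assert all(len(batch[k]) == 1 for k in keys)
    return batch

def reorder_outputs(values, origin_indices):
    """Map forward outputs back to original sample order, dropping trailing placeholders."""
    n_real = len(origin_indices)
    # First-fit bin-packer invariant: real samples occupy output positions [0, n_real).
    assert len(values) >= n_real, "forward pass returned fewer outputs than real samples"
    reordered = [None] * n_real  # sized by origin_indices, not len(values)
    for out_pos, origin in enumerate(origin_indices):
        reordered[origin] = values[out_pos]
    assert all(v is not None for v in reordered)
    return reordered
```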
Tests

- `tests/test_dp_imbalance_reorder.py` — 5 tests on the `forward_only` reorder, including a `test_fix_vs_pre_fix_behavior_divergence` case that asserts the exact pre-fix trailing-`None` shape.
- `tests/test_dp_imbalance_placeholder.py` — 4 tests on the `get_batch` placeholder schema: known-key fill, per-sample-count invariant, unknown-key default, invariant-assert firing on schema drift. Imports and exercises the real `_fill_empty_microbatch_placeholder` helper (mutation-tested: removing the invariant-assert block breaks `test_placeholder_invariant_fires_on_unfilled_key`).

9/9 pass on a CPU-only dev pod.
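For orientation, a hedged paraphrase of the reorder regression-test shape (not the actual contents of `tests/test_dp_imbalance_reorder.py`; it reuses the `reorder_outputs` sketch above):

```python
# Hedged paraphrase of the reorder regression test; real test names and fixtures differ.
def test_trailing_placeholder_outputs_are_dropped():
    values = ["out_a", "out_b", "out_c", "placeholder_out"]  # 3 real outputs + 1 placeholder
    origin_indices = [2, 0, 1]                               # packed position -> original index
    reordered = reorder_outputs(values, origin_indices)      # helper sketched above
    assert len(reordered) == 3                               # pre-fix: len 4 with a trailing None
    assert reordered == ["out_b", "out_c", "out_a"]
```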