fix: harden retool rollout against multi-turn / retry desync #1861
Merged
zhuzilin merged 1 commit into THUDM:main on May 11, 2026
Keeps sample.rollout_log_probs, loss_masks, response_token_ids, and response length-aligned across turns. Four fixes:
1. Reset stale sample state at entry so a retried (previously aborted) sample doesn't concat new tokens onto old log-probs.
2. Clamp per-turn max_new_tokens to the remaining context budget so total_length can't blow past max_context_length and produce samples larger than the training-side per-partition cap.
3. Abort when sglang returns text without output_token_logprobs. The old fallback retokenized, which grows response_token_ids without matching log-probs; aborting lets the rollout manager re-queue cleanly.
4. Trim post-tool-output overflow. Fix #2 clamps the model's generation, but tool output is appended unconstrained. Trim tokens / loss_masks / log-probs together and mark TRUNCATED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Four small fixes to examples/retool/generate_with_retool.py that protect one invariant across multi-turn tool-calling rollouts: sample.rollout_log_probs, loss_masks, response_token_ids, and response must stay the same length. Any of these getting out of sync crashes slice_log_prob_with_cp in the trainer with a confusing length-mismatch error.
When this fires
Hit during async retool training runs once samples started cycling through the rollout manager's retry path (aborted → re-enqueued) and once tool outputs got large enough to push past the context cap. Four separate ways the invariant broke in practice; four small fixes.
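To make the retry failure concrete, here is a minimal sketch of how stale log-probs break the length invariant. The Sample dataclass, is_aligned, and run_attempt are hypothetical stand-ins written for illustration, not the repository's classes or the actual rollout loop; only the field names are taken from this description.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for the rollout sample, not the repo's class."""
    response_token_ids: list = field(default_factory=list)
    loss_masks: list = field(default_factory=list)
    rollout_log_probs: list = field(default_factory=list)


def is_aligned(sample):
    """True when the per-token fields all have the same length."""
    n = len(sample.response_token_ids)
    return len(sample.loss_masks) == n and len(sample.rollout_log_probs) == n


def run_attempt(sample, new_ids, new_log_probs):
    """Toy rollout turn: token ids are fresh, log-probs are appended in place."""
    sample.response_token_ids = list(new_ids)
    sample.loss_masks = [1] * len(new_ids)
    sample.rollout_log_probs.extend(new_log_probs)


# A retried sample still carrying two log-probs from an aborted attempt.
retried = Sample(rollout_log_probs=[-0.1, -0.2])
run_attempt(retried, [11, 12, 13], [-0.3, -0.4, -0.5])
stale_ok = is_aligned(retried)  # 3 token ids vs 5 log-probs

# The same attempt after clearing the stale list first.
fresh = Sample(rollout_log_probs=[-0.1, -0.2])
fresh.rollout_log_probs = []
run_attempt(fresh, [11, 12, 13], [-0.3, -0.4, -0.5])
fresh_ok = is_aligned(fresh)
```

In this toy version the mismatch only surfaces when the lengths are compared, which mirrors why the real failure shows up late, in the trainer, rather than at rollout time.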
Fixes
1. Reset stale sample state at entry. Retried samples arrive with rollout_log_probs / response / loss_mask still populated from the prior attempt; the main loop appends onto the stale list and desyncs from the fresh response_token_ids. Clear all four fields up front.
2. Clamp per-turn max_new_tokens to the remaining context budget. A turn can otherwise append up to rollout_max_response_len tokens on top of a total already near max_context_length, producing samples larger than the training-side per-partition cap.
3. Abort when sglang returns text without output_token_logprobs. The old fallback retokenized the text, which grows response_token_ids without a matching log-probs entry. Return ABORTED so the rollout manager re-queues the group cleanly instead of poisoning the trainer.
4. Trim post-tool-output overflow. Fix #2 clamps the model's generation, but tool output (e.g. a large print() from code_interpreter) is appended unconstrained. Trim response_token_ids / loss_masks / rollout_log_probs together to keep lengths aligned, re-decode response so the text field matches, and mark the sample TRUNCATED.
Risk
examples/, consumers opt in via --custom-generate-function-path.
ABORTED samples here as a signal to check sglang version / --return-logprob routing.
Tests
None added. examples/retool/ has no unit-test coverage in the repo today; a full integration test would require mocking sglang + the tool sandbox. Each fix's failure mode is described concretely in the commit message and inline comments, and is reproducible by hand with a 0.5B model + retool yaml + a deliberately noisy tool.
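As a possible starting point for lighter-weight coverage, the clamp (fix 2) and trim (fix 4) steps can be sketched as pure functions that need neither sglang nor the tool sandbox. The function names, signatures, and the numbers below are assumptions made for illustration; they are not the code in this PR, and the real budget calculation may account for prompt length differently.

```python
def clamp_max_new_tokens(requested, total_length, max_context_length):
    """Limit one turn's generation to the remaining context budget (never negative)."""
    remaining = max(max_context_length - total_length, 0)
    return min(requested, remaining)


def trim_to_budget(token_ids, loss_masks, log_probs, max_len):
    """Cut the three per-token lists to one shared length.

    Returns the trimmed lists plus a flag saying whether anything was cut,
    which a caller could use to mark the sample TRUNCATED.
    """
    truncated = len(token_ids) > max_len
    return (
        token_ids[:max_len],
        loss_masks[:max_len],
        log_probs[:max_len],
        truncated,
    )


# Generation is clamped to what is left of the context window.
budget = clamp_max_new_tokens(requested=1024, total_length=8000, max_context_length=8192)

# Tool output (mask 0 here) pushed the response past a hypothetical 4-token budget.
ids, masks, lps, truncated = trim_to_budget(
    token_ids=[1, 2, 3, 4, 5],
    loss_masks=[1, 1, 0, 0, 0],
    log_probs=[-0.1, -0.2, 0.0, 0.0, 0.0],
    max_len=4,
)
```

Trimming all three lists in one call is the point of the sketch: slicing them separately is how the lengths drift apart in the first place.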