fix: make Megatron one-shot train() assumptions idempotent across slime rollouts #1857
Open
leofan-lab wants to merge 2 commits into THUDM:main from
Conversation
Slime calls `train()` once per rollout, but Megatron's upstream code asserts `config.no_sync_func is None` on entry, an assumption written for a one-shot trainer. After rollout 1 we've installed `model.no_sync` ourselves, so rollout 2 trips the assert. Replace the assert with `if ... is None: install`. Same first-time behavior, idempotent on re-entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second symptom of the same root cause as the previous commit: Megatron's upstream code assumes a one-shot `train()` call, but slime invokes `train()` per rollout. At the end of rollout N, `disable_forward_pre_hook()` removes the forward pre-hooks from each DDP chunk and clears the chunk's `remove_forward_pre_hook_handles` list. On rollout N+1, slime calls `disable_forward_pre_hook` again; with no hooks left to disable, Megatron's internal `Float16Module` lookup fails with a `KeyError`. Guard the call: only disable if any chunk still has an active hook. Encountered when I enabled `--overlap-param-gather` alongside `--overlap-grad-reduce`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Megatron's upstream `train()` is written for one-shot invocation. Slime invokes `train()` once per rollout, which trips two separate per-rollout idempotency bugs in `slime/backends/megatron_utils/model.py`. This PR fixes both.

Fix 1: `no_sync_func` install (commit 14abe6a)

Megatron asserts `config.no_sync_func is None` on entry. Slime installs `model.no_sync` at the end of rollout 1, so rollout 2 trips the assert. Replace the assert with `if ... is None: install`. Same first-time behavior, idempotent on re-entry. Encountered when I enabled `--overlap-grad-reduce`.
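A minimal sketch of the change, paraphrased rather than copied from the actual `slime/backends/megatron_utils/model.py` diff; the chunk-list handling follows Megatron's usual pattern and may differ in detail:

```python
def install_no_sync_func(config, model):
    """Install model.no_sync as config.no_sync_func, idempotently.

    Paraphrased sketch: `model` is the list of DDP model chunks and
    `config` is the Megatron config object; names follow upstream.
    """
    # Upstream asserted `config.no_sync_func is None`, which holds on
    # rollout 1 but fails on rollout 2 once model.no_sync is installed.
    # Installing only when unset keeps first-time behavior identical
    # and makes re-entry a no-op.
    if config.no_sync_func is None:
        no_sync_func = [chunk.no_sync for chunk in model]
        if len(model) == 1:
            no_sync_func = no_sync_func[0]
        config.no_sync_func = no_sync_func
```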
Fix 2: `disable_forward_pre_hook` re-entry (commit a9243e4)

At the end of rollout N, `disable_forward_pre_hook()` removes forward pre-hooks from each DDP chunk and clears that chunk's `remove_forward_pre_hook_handles` list. On rollout N+1, slime calls `disable_forward_pre_hook` again; with no hooks left to disable, Megatron's internal `Float16Module` lookup fails with a `KeyError`. Guard the call: only disable if any chunk still has an active hook. Encountered when I enabled `--overlap-param-gather` alongside `--overlap-grad-reduce`.
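The guard, again as a paraphrased sketch rather than the exact diff; it assumes each chunk exposes `remove_forward_pre_hook_handles` and a `disable_forward_pre_hook()` method, as described above:

```python
def disable_forward_pre_hook_if_active(model):
    """Call disable_forward_pre_hook() only when hooks remain.

    Paraphrased sketch: `model` is the list of DDP model chunks.
    """
    # After the first disable, each chunk's handle collection is empty;
    # a second call would hit Megatron's Float16Module lookup with no
    # registered hooks and raise KeyError. Skip unless at least one
    # chunk still holds a live handle.
    if any(chunk.remove_forward_pre_hook_handles for chunk in model):
        for chunk in model:
            chunk.disable_forward_pre_hook()
```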
Scope

Both fixes trigger only when the overlap flags are on (`--overlap-grad-reduce`, `--overlap-param-gather`). Without them, the relevant code paths don't execute and the existing assert is unreachable. These changes are therefore no-ops for users who don't enable overlap, and they make overlap usable under slime's multi-rollout model. A sketch of that rollout loop follows.
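For context, an illustrative sketch of the call pattern that re-enters `train()` each rollout; the driver and helper names here are hypothetical stand-ins, not slime's actual API:

```python
# Illustrative only: stand-ins for slime's rollout driver and Megatron's
# train() entry point, which was written as if it runs once per process.
def generate_rollout(rollout_id):      # hypothetical data-collection step
    return [f"sample-{rollout_id}"]

def train(samples):                    # stand-in for Megatron's train()
    ...

for rollout_id in range(4):
    samples = generate_rollout(rollout_id)
    # Re-entered every rollout; without these fixes, rollout 2+ trips the
    # no_sync_func assert and the Float16Module KeyError described above.
    train(samples)
```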