
fix: add fallback for --save-hf when Megatron-Bridge lacks model support#1881

Open
WangHong-yang wants to merge 1 commit into THUDM:main from WangHong-yang:main

Conversation

@WangHong-yang
Contributor

Summary

  • When AutoBridge.from_hf_pretrained raises an exception (e.g. for an unsupported
    architecture), save_hf_model now falls back to reading the just-saved torch_dist
    checkpoint from disk and converting it to HF safetensors via slime's own
    megatron_to_hf handlers.
  • Extract shared checkpoint loading/saving utilities from tools/convert_torch_dist_to_hf.py
    into slime/utils/torch_dist_to_hf.py so both the CLI tool and the runtime fallback reuse
    the same code.

Motivation

Models that are not yet supported by the pinned Megatron-Bridge version (e.g. Qwen3.5),
or that Megatron-Bridge does not support at all (e.g. GLM-5, GLM-5.1), silently fail
on --save-hf. This change makes the export work for any model for which slime already
has a megatron_to_hf handler, without requiring a Bridge update.

How it works

  1. save_hf_model tries the existing Megatron-Bridge path.
  2. On failure, _save_hf_direct runs on rank 0 only:
    • Loads the iter_{rollout_id} torch_dist checkpoint from disk (saved moments earlier).
    • Converts to HF safetensors via slime/utils/torch_dist_to_hf.py.
    • Copies tokenizer/config from --hf-checkpoint.
  3. All other ranks wait at a barrier.
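
The control flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name save_hf_with_fallback and its four parameters are hypothetical stand-ins, with bridge_save representing the existing Megatron-Bridge path, direct_save representing _save_hf_direct, and barrier representing whatever collective the ranks synchronize on. It also assumes every rank sees the same Bridge failure, so that all ranks reach the barrier.

```python
from typing import Callable


def save_hf_with_fallback(
    bridge_save: Callable[[], None],   # hypothetical: wraps the Megatron-Bridge export
    direct_save: Callable[[], None],   # hypothetical: stands in for _save_hf_direct
    barrier: Callable[[], None],       # hypothetical: e.g. a distributed barrier
    rank: int,
) -> str:
    """Try the Bridge path first; on failure, convert on rank 0 and synchronize."""
    try:
        bridge_save()
        return "bridge"
    except Exception:
        # Assumption: the failure (e.g. unsupported architecture) occurs on
        # every rank, so each rank takes this branch and reaches the barrier.
        if rank == 0:
            # Only rank 0 reads the torch_dist checkpoint and writes HF files.
            direct_save()
        # Rank 0 arrives after the conversion; the other ranks wait here.
        barrier()
        return "fallback"
```

Passing the three operations in as callables keeps the sketch independent of torch and Megatron, so the branching can be exercised without a distributed launch.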

Test plan

  • Tested on Qwen3.5-122B-A10B (4x p5en.48xlarge, EP=16, DP=2)
  • Megatron-Bridge fails → fallback triggers → 46 safetensors shards saved (~11 min)
  • Byte-compared online (--save-hf fallback) vs offline (tools/convert_torch_dist_to_hf.py)
    from the same torch_dist checkpoint: 46/46 files identical
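
For reviewers who want to repeat the byte comparison, one possible approach is sketched below. It is not the script used for the test above: compare_shards, its parameters, and the "*.safetensors" glob pattern are assumptions made for illustration, and the two directories are expected to hold the online and offline exports of the same checkpoint.

```python
import filecmp
from pathlib import Path


def compare_shards(online_dir, offline_dir, pattern="*.safetensors"):
    """Return (identical, differing) shard names across two export directories.

    A shard counts as differing if it is missing on either side or if its
    bytes differ. The default glob pattern is an assumption of this sketch.
    """
    online = {p.name: p for p in Path(online_dir).glob(pattern)}
    offline = {p.name: p for p in Path(offline_dir).glob(pattern)}

    identical, differing = [], []
    for name in sorted(set(online) | set(offline)):
        if name in online and name in offline and filecmp.cmp(
            online[name], offline[name], shallow=False  # compare contents, not just stat info
        ):
            identical.append(name)
        else:
            differing.append(name)
    return identical, differing
```

An empty differing list together with an identical list as long as the shard count corresponds to a result like the "46/46 files identical" reported above.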

When Megatron-Bridge doesn't support a model architecture, save_hf_model
now falls back to reading the just-saved torch_dist checkpoint and
converting to HF format via the existing conversion logic.

Extract shared checkpoint loading/saving utilities from
tools/convert_torch_dist_to_hf.py into slime/utils/torch_dist_to_hf.py
so both the CLI tool and the runtime fallback can reuse them.
@WangHong-yang WangHong-yang marked this pull request as ready for review April 30, 2026 02:37