convert: infer Qwen3.5/3.6 mtp_num_hidden_layers from safetensors #8

danielhanchen wants to merge 1 commit into
Conversation
Upstream Qwen3.6 repos (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) ship the MTP block weights under `mtp.*` in the safetensors shards but do not set `mtp_num_hidden_layers` in `config.json`. The current `_Qwen35MtpMixin` reads this hparam from the config to decide whether to emit the nextn metadata key and to extend `block_count`, so when the key is missing the entire MTP block is silently dropped at conversion time. The resulting GGUF fails to load later with `GGML_ASSERT(nextn_predict_layers > 0)` as soon as the user passes `--spec-type mtp`.

Fix by inferring `mtp_num_hidden_layers` from the safetensors weight map (`model.safetensors.index.json`, or the bare `model.safetensors` header for single-shard repos), counting unique `mtp.layers.{N}.` indices. If any `mtp.*` tensors are present the hparam is injected and a warning is logged, after which the existing remap path runs unchanged.

No behavioural change for repos that already set `mtp_num_hidden_layers`.
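Roughly what the detection boils down to; a condensed sketch, where the helper name `infer_mtp_num_hidden_layers` and the logging call are illustrative rather than the exact code in the commit:

```python
import logging
import re

logger = logging.getLogger("convert_hf_to_gguf")

def infer_mtp_num_hidden_layers(weight_map: dict[str, str]) -> int:
    """Count unique mtp.layers.{N}. block indices in a safetensors weight map."""
    indices = {
        int(m.group(1))
        for name in weight_map
        if (m := re.match(r"mtp\.layers\.(\d+)\.", name)) is not None
    }
    return max(indices) + 1 if indices else 0

# Injection guard as described above (sketch of the _Qwen35MtpMixin.__init__ hook):
#
#     if not self.hparams.get("mtp_num_hidden_layers"):
#         n_mtp = infer_mtp_num_hidden_layers(weight_map)
#         if n_mtp > 0:
#             logger.warning("config.json lacks mtp_num_hidden_layers; "
#                            "inferred %d from mtp.* tensors", n_mtp)
#             self.hparams["mtp_num_hidden_layers"] = n_mtp
```

Repos that already set the key never reach the inference path, which is why there is no behavioural change for them.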
For context: this is orthogonal to #6. PR #6 is the C++ runtime rebase of MTP onto …
Closing this. After actually running an end-to-end conversion (which I should have done before opening the PR), the auto-detect turns out to be unnecessary.

Test setup: cached …
Result: …
Output GGUF metadata: …

The HF assert reports I cited were not from "upstream config missing the key". The real root cause on our side was a pipeline bug that pulled the ggml-org master …

Sorry for the noise.
Summary
Upstream Qwen3.6 repos (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) ship the MTP block weights under `mtp.*` in the safetensors shards but do not set `mtp_num_hidden_layers` in `config.json`. The current `_Qwen35MtpMixin` reads this hparam from the config to decide whether to extend `block_count` and emit `nextn_predict_layers`. When the key is missing the entire MTP block is silently dropped at conversion time, producing a GGUF that boots fine for plain inference but fails at load with `GGML_ASSERT(nextn_predict_layers > 0)` as soon as the user passes `--spec-type mtp`.

This is what is happening in HF discussion reports against unsloth and other community uploaders: unsloth/Qwen3.6-27B-GGUF-MTP#1, unsloth/Qwen3.6-27B-GGUF-MTP#2, lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP, llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP-GGUF. Anyone converting an unmodified upstream Qwen3.6 repo today gets an MTP-less GGUF without any warning, and the failure only surfaces on the consumer side.

Fix
Infer `mtp_num_hidden_layers` from the safetensors weight map at the top of `_Qwen35MtpMixin.__init__`:

- Read `model.safetensors.index.json` if present; otherwise peek at the bare `model.safetensors` header for single-shard repos (a loading sketch follows right after this section).
- Count unique `mtp.layers.{N}.` indices and inject `mtp_num_hidden_layers = max(N) + 1` into `self.hparams`.

Everything downstream (the `block_count` extension, `add_nextn_predict_layers`, the `mtp.fc` / `mtp.pre_fc_norm_*` / `mtp.norm` -> per-block `eh_proj` / `enorm` / `hnorm` / `shared_head.norm` fan-out) runs unchanged.

No behavioural change for repos that already set `mtp_num_hidden_layers` (the guard is `if not self.hparams.get(...)`).
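For reviewers, a minimal sketch of the weight-map loading half mentioned in the first bullet; the function name and exact integration point are illustrative. It relies only on the standard safetensors layout: a little-endian uint64 header length followed by that many bytes of JSON header.

```python
import json
import struct
from pathlib import Path

def load_tensor_names(model_dir: Path) -> list[str]:
    """Tensor names from the shard index, or from a bare single-shard header."""
    index_path = model_dir / "model.safetensors.index.json"
    if index_path.is_file():
        # Sharded repo: weight_map maps tensor name -> shard filename.
        index = json.loads(index_path.read_text(encoding="utf-8"))
        return list(index["weight_map"].keys())

    # Single-shard repo: peek at the safetensors header without mmapping
    # the tensors themselves.
    with (model_dir / "model.safetensors").open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is the only non-tensor key in a safetensors header.
    return [name for name in header if name != "__metadata__"]
```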
Test plan

- Unit check of the new detection in `convert_hf_to_gguf.py`: a `weight_map` containing `mtp.layers.0.*` keys correctly returns `mtp_num_hidden_layers = 1`.
- Converted `Qwen/Qwen3.6-27B` and `Qwen/Qwen3.6-35B-A3B` (both unmodified) on the patched branch: the resulting GGUFs carry `qwen35.block_count = 65` / `qwen35moe.block_count = 41`, `nextn_predict_layers = 1`, and the four `blk.{N}.nextn.*` tensors at the MTP slot (a verification sketch is at the end of this description).
- `llama-server --spec-type mtp --spec-draft-n-max 3` loads the resulting files and decodes with non-zero draft acceptance (Qwen3.6-27B around 0.74, Qwen3.6-35B-A3B around 0.78 on a small B200 prompt set).

Followups (not in this PR)

- Replace the `GGML_ASSERT` at `src/models/qwen35_mtp.cpp:8` with a clear error message pointing at the converter-side cause, so consumers of pre-existing broken GGUFs get a useful pointer instead of a stack trace. Happy to send as a separate PR.
- Add `mtp_num_hidden_layers: 1` to the upstream `config.json` files so this converter-side workaround can eventually be dropped.

cc @am17an
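For anyone re-running the test plan, a hedged sketch of the metadata verification referenced above. It uses `GGUFReader` from gguf-py; the scalar field-access pattern is an assumption about gguf-py internals, and the `nextn_predict_layers` key is assumed to carry the same `{arch}.` prefix as the other keys listed in the test plan.

```python
from gguf import GGUFReader  # ships in gguf-py alongside the converter

def check_mtp_metadata(path: str) -> None:
    reader = GGUFReader(path)

    def scalar(key: str) -> int:
        # Assumed access pattern: field.data holds indices into field.parts,
        # and the value part of a scalar KV field is a one-element array.
        field = reader.fields[key]
        return int(field.parts[field.data[0]][0])

    arch_field = reader.fields["general.architecture"]
    arch = bytes(arch_field.parts[arch_field.data[0]]).decode("utf-8")  # e.g. "qwen35"

    print(f"{arch}.block_count =", scalar(f"{arch}.block_count"))
    print(f"{arch}.nextn_predict_layers =", scalar(f"{arch}.nextn_predict_layers"))

    nextn = [t.name for t in reader.tensors if ".nextn." in t.name]
    assert nextn, "MTP block was dropped at conversion time"
    print("nextn tensors:", nextn)
```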
GGML_ASSERTatsrc/models/qwen35_mtp.cpp:8with a clear error message pointing at the converter-side cause, so consumers of pre-existing broken GGUFs get a useful pointer instead of a stack trace. Happy to send as a separate PR.mtp_num_hidden_layers: 1to the upstreamconfig.jsonfiles so this converter-side workaround can eventually be dropped.cc @am17an