convert: infer Qwen3.5/3.6 mtp_num_hidden_layers from safetensors by danielhanchen · Pull Request #8 · am17an/llama.cpp

danielhanchen · 2026-05-12T00:36:42Z

Summary

Upstream Qwen3.6 repos (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) ship the MTP block weights under mtp.* in the safetensors shards but do not set mtp_num_hidden_layers in config.json. The current _Qwen35MtpMixin reads this hparam from the config to decide whether to extend block_count and emit nextn_predict_layers. When the key is missing the entire MTP block is silently dropped at conversion time, producing a GGUF that boots fine for plain inference but fails at load with

GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed

as soon as the user passes --spec-type mtp.

This is what is happening in HF discussion reports against unsloth and other community uploaders: unsloth/Qwen3.6-27B-GGUF-MTP#1, unsloth/Qwen3.6-27B-GGUF-MTP#2, lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP, llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP-GGUF. Anyone converting an unmodified upstream Qwen3.6 repo today gets an MTP-less GGUF without any warning, and the failure only surfaces on the consumer side.

Fix

Infer mtp_num_hidden_layers from the safetensors weight map at the top of _Qwen35MtpMixin.__init__:

Read model.safetensors.index.json if present; otherwise peek at the bare model.safetensors header for single-shard repos.
Count unique mtp.layers.{N}. indices and inject mtp_num_hidden_layers = max(N) + 1 into self.hparams.
Log a warning when the value is inferred so the user is aware.

Everything downstream (block_count extension, add_nextn_predict_layers, the mtp.fc / mtp.pre_fc_norm_* / mtp.norm -> per-block eh_proj / enorm / hnorm / shared_head.norm fan-out) runs unchanged.

No behavioural change for repos that already set mtp_num_hidden_layers (the guard is if not self.hparams.get(...)).

Test plan

Syntax check on convert_hf_to_gguf.py.
Unit-level check of the detection logic against a synthesised weight_map containing mtp.layers.0.* keys: correctly returns mtp_num_hidden_layers = 1.
End-to-end conversion of Qwen/Qwen3.6-27B and Qwen/Qwen3.6-35B-A3B (both unmodified) on the patched branch: resulting GGUFs carry qwen35.block_count = 65 / qwen35moe.block_count = 41, nextn_predict_layers = 1, and the four blk.{N}.nextn.* tensors at the MTP slot.
llama-server --spec-type mtp --spec-draft-n-max 3 loads the resulting files and decodes with non-zero draft acceptance (Qwen3.6-27B around 0.74, Qwen3.6-35B-A3B around 0.78 on a small B200 prompt set).

Followups (not in this PR)

Replace the bare GGML_ASSERT at src/models/qwen35_mtp.cpp:8 with a clear error message pointing at the converter-side cause, so consumers of pre-existing broken GGUFs get a useful pointer instead of a stack trace. Happy to send as a separate PR.
Encourage Qwen to add mtp_num_hidden_layers: 1 to the upstream config.json files so this converter-side workaround can eventually be dropped.

cc @am17an

Upstream Qwen3.6 repos (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) ship the MTP block weights under mtp.* in the safetensors shards but do not set mtp_num_hidden_layers in config.json. The current _Qwen35MtpMixin reads this hparam from the config to decide whether to emit the nextn metadata key and to extend block_count, so when the key is missing the entire MTP block is silently dropped at conversion time. The resulting GGUF fails to load later with GGML_ASSERT(nextn_predict_layers > 0) as soon as the user passes --spec-type mtp. Fix by inferring mtp_num_hidden_layers from the safetensors weight map (model.safetensors.index.json, or the bare model.safetensors header for single-shard repos), counting unique mtp.layers.{N}. indices. If any mtp.* tensors are present the hparam is injected and a warning is logged, after which the existing remap path runs unchanged. No behavioural change for repos that already set mtp_num_hidden_layers.

danielhanchen · 2026-05-12T00:37:36Z

For context: this is orthogonal to #6. PR #6 is the C++ runtime rebase of MTP onto gg/spec-parallel and does not touch convert_hf_to_gguf.py, so the converter-side silent-drop covered here is not addressed there. Happy to retarget the base to gg/spec-parallel if that branch is going to be the canonical landing spot for MTP work going forward; the 25-line diff applies cleanly to either branch since _Qwen35MtpMixin is untouched in #6.

danielhanchen · 2026-05-12T01:02:41Z

Closing this. After actually running an end-to-end conversion (which I should have done before opening the PR) the auto-detect turns out to be unnecessary.

Test setup: cached Qwen/Qwen3.6-27B safetensors, pristine upstream config.json (with mtp_num_hidden_layers: 1 only nested inside text_config, no top-level key), unpatched mtp-clean at 5d5f1b46e.

Result:

INFO:hf-to-gguf:blk.64.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {10240, 5120}
INFO:hf-to-gguf:blk.64.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {5120}

Output GGUF metadata: qwen35.block_count = 65, qwen35.nextn_predict_layers = 1, all 4 nextn tensors present. So the existing text_config to root promotion at convert_hf_to_gguf.py:1021-1023 already covers upstream Qwen3.6 (which always ships as *ForConditionalGeneration with a text_config block).

The HF assert reports I cited were not from "upstream config missing the key". The real root cause on our side was a pipeline bug that pulled the ggml-org master convert_hf_to_gguf.py (no MTP support at all) instead of the local mtp-clean clone, so MTP tensors were dropped at conversion regardless of config. Other uploaders hitting the same assert almost certainly have the same class of issue, not a converter-side gap.

Sorry for the noise.

github-actions Bot added the python label May 12, 2026

danielhanchen closed this May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convert: infer Qwen3.5/3.6 mtp_num_hidden_layers from safetensors#8

convert: infer Qwen3.5/3.6 mtp_num_hidden_layers from safetensors#8
danielhanchen wants to merge 1 commit into
am17an:mtp-cleanfrom
danielhanchen:mtp-auto-detect-from-safetensors

danielhanchen commented May 12, 2026

Uh oh!

danielhanchen commented May 12, 2026

Uh oh!

danielhanchen commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danielhanchen commented May 12, 2026

Summary

Fix

Test plan

Followups (not in this PR)

Uh oh!

danielhanchen commented May 12, 2026

Uh oh!

danielhanchen commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant