
llama + spec: MTP Support #22673

Draft
am17an wants to merge 11 commits into ggml-org:master from am17an:mtp-clean

Conversation

@am17an (Contributor) commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below; typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline. The design decisions I took to get to this stage are described below.
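(For readers new to the technique: MTP speculation follows the standard draft-and-verify loop, where the cheap MTP head proposes a few tokens and the target model scores them all in a single batch, keeping the longest agreeing prefix. Below is a minimal self-contained sketch of that loop; it is illustrative only, not this PR's code, and the two callables stand in for the target model and the MTP head.)

#include <cstdint>
#include <functional>
#include <vector>

using token = int32_t;
using model_fn = std::function<std::vector<token>(const std::vector<token> &, int)>;

// One speculative step: draft k tokens with the MTP head, verify them in a
// single target pass, and keep the longest agreeing prefix plus one token.
std::vector<token> speculative_step(std::vector<token> & ctx, int k,
                                    const model_fn & target, const model_fn & mtp_draft) {
    // 1. draft k tokens cheaply with the single-layer MTP head
    const std::vector<token> draft = mtp_draft(ctx, k);

    // 2. verify all drafts in ONE target forward pass: the target scores
    //    k + 1 positions for roughly the price of one decode step
    std::vector<token> verify = ctx;
    verify.insert(verify.end(), draft.begin(), draft.end());
    const std::vector<token> expect = target(verify, k + 1);

    // 3. accept while the draft matches what the target would have picked,
    //    then always emit one token from the target (fix-up or bonus token)
    std::vector<token> out;
    size_t n_acc = 0;
    while (n_acc < draft.size() && draft[n_acc] == expect[n_acc]) {
        out.push_back(draft[n_acc++]);
    }
    out.push_back(expect[n_acc]);
    ctx.insert(ctx.end(), out.begin(), out.end());
    return out;
}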

Next Steps

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}
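(Sanity check on the aggregate above against the baseline run:

    aggregate_accept_rate = total_draft_accepted / total_draft = 952 / 1319 ≈ 0.7218
    end-to-end speed-up   = wall_s(baseline) / wall_s(mtp)     = 201.07 / 83.8 ≈ 2.40x

which matches the "more than 2x over baseline" claim in the overview.)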

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}
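(Why --spec-draft-n-max 3 still edges out 2 despite the lower per-token acceptance: under the usual, and only approximate, i.i.d. acceptance model for speculative decoding, draft length n with per-token acceptance p yields an expected

    E[tokens per target pass] = (1 - p^(n+1)) / (1 - p)

Plugging in the aggregates above: n = 3, p ≈ 0.722 gives ≈ 2.62 tokens per pass, versus n = 2, p ≈ 0.826 giving ≈ 2.51, consistent with the slightly lower wall time for n-max 3 (83.8 s vs 90.44 s) at the cost of more wasted draft compute.)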

Draft model (Qwen3.5 0.8B) with --spec-draft-n-max 16 and partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing, as well as for the convert_hf_to_gguf.py changes + model definitions, and for writing the bench used for validation against vLLM.

@github-actions bot added labels (model, testing, Nvidia GPU, Vulkan, examples, python, server, ggml) on May 4, 2026
@ngxson (Contributor) commented May 4, 2026

Nice, I think this is a fresh start, better than my WIP #18886 (which I still haven't found the time to continue).

There were some other attempts to add MTP support, but they all heavily relied on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

@ngxson (Contributor) left a comment

(not a review, but opening some discussions)

Comment thread src/llama-memory-recurrent.h
Comment thread src/models/qwen35.cpp

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
@ngxson (Contributor):

nit, but maybe call it n_main_layers, since technically the nextn layer is also a transformer layer

Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
@ngxson (Contributor):

if you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type

@am17an (Contributor, Author):

Yes that seems like the correct way to do this if we want to support MTP in a generic way
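(To make the suggestion concrete, a hypothetical sketch only; these names are illustrative assumptions and not the actual llama.h declarations:)

// Hypothetical: expose the graph type so one model can back two contexts.
enum llama_graph_type_sketch {
    LLAMA_GRAPH_TYPE_DEFAULT, // full decoder stack
    LLAMA_GRAPH_TYPE_MTP,     // only the MTP/NextN head layers
};

struct llama_context_params_sketch {
    // ... the usual context params ...
    enum llama_graph_type_sketch graph_type; // which graph this context builds
};

// usage idea: load the model once, create two contexts over the same weights
//   ctx_main = llama_init_from_model(model, params_with(LLAMA_GRAPH_TYPE_DEFAULT));
//   ctx_mtp  = llama_init_from_model(model, params_with(LLAMA_GRAPH_TYPE_MTP));
// so there is no second weight copy and no override_arch reload of the same GGUF.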

@am17an (Contributor, Author) commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts

@pwilkin (Member) commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

am17an added 4 commits May 4, 2026 20:15
Currently speculative decoding needs to restart from a checkpoint after some draft tokens are not accepted, which leads to some wastage in running the target again. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
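(The bookkeeping behind partial rollback can be pictured as below; this is an illustration of the idea, not the PR's implementation, and gdn_state stands in for whatever the recurrent intermediates actually are:)

#include <cstddef>
#include <vector>

struct gdn_state { std::vector<float> s; }; // opaque recurrent intermediates

struct rollback_buffer {
    std::vector<gdn_state> snaps; // snaps[j] = state after j drafted tokens

    void reset(const gdn_state & base) { snaps.assign(1, base); }
    void push (const gdn_state & st)   { snaps.push_back(st); }

    // if only n_accepted drafts were kept, resume from that exact snapshot
    // instead of replaying the whole window from the last checkpoint
    const gdn_state & restore(size_t n_accepted) const {
        return snaps[n_accepted];
    }
};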
@cmp-nct (Contributor) commented May 4, 2026

In my opinion Qwen 3.6 is the most important thing that has happened in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting?
So MTP is used until ngram is triggered, switching to ngram until rejection and back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing), and MTP fills the gap.

@Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory.

@mbednarek360:

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

@am17an (Contributor, Author) commented May 4, 2026

@cmp-nct I'm not sure, but could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of overall memory used (it's just a single-layer transformer + KV cache, much lighter than draft models)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan for now; if you ask an LLM to add support for that you might get a little further. (Edit: Vulkan and Metal also tested.)

@nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., and possibly prevent some degree of resource contention.

@cturan commented May 4, 2026

Thank you, we are eagerly awaiting this to become stable. Here are automated test results for my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

mode: MTP enabled (Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3)
      prefill avg 665.14 tok/s | generation avg 42.45 tok/s | MTP acceptance 76.0% | loaded VRAM 24.96 GiB
mode: MTP disabled, same GGUF (Qwen3.6-27B-MTP-Q6_K.gguf, no spec)
      prefill avg 1315.46 tok/s | generation avg 22.97 tok/s | MTP acceptance n/a | loaded VRAM 22.47 GiB
mode: existing non-MTP Q6 (Qwen3.6-27B-Q6_K.gguf, no spec)
      prefill avg 1260.12 tok/s | generation avg 22.39 tok/s | MTP acceptance n/a | loaded VRAM 22.59 GiB

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

@am17an (Contributor, Author) commented May 4, 2026

@cturan Thanks for testing. I'm aware of the prefill issue and will work on a fix.

@iiLaurens:

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

@nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try starting from unsloth's quant since they already perform really well, but of course they don't have any MTP layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)
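(In principle that kind of graft is "copy the base file's metadata and tensors, bump the layer count, append the donor's NextN tensors". A rough sketch against ggml's C gguf API follows; the KV key name, the blk.40 prefix, and the exact set of MTP tensors/metadata are assumptions to verify against the real files, and error handling and frees are omitted:)

#include "ggml.h"
#include "gguf.h"
#include <cstdint>
#include <cstring>

bool graft_mtp(const char * base_path, const char * donor_path, const char * out_path) {
    ggml_context * base_ctx = nullptr, * donor_ctx = nullptr;
    gguf_init_params p_base  = { /*no_alloc =*/ false, &base_ctx  };
    gguf_init_params p_donor = { /*no_alloc =*/ false, &donor_ctx };

    gguf_context * base  = gguf_init_from_file(base_path,  p_base);
    gguf_context * donor = gguf_init_from_file(donor_path, p_donor);
    if (!base || !donor) return false;

    gguf_context * out = gguf_init_empty();
    gguf_set_kv(out, base); // copy all metadata from the base quant
    // hypothetical key name -- must match what convert_hf_to_gguf.py writes
    gguf_set_val_u32(out, "qwen35moe.block_count", 41);

    // all base tensors first, then the donor's MTP/NextN block (blk.40.* here)
    for (int64_t i = 0; i < gguf_get_n_tensors(base); ++i) {
        gguf_add_tensor(out, ggml_get_tensor(base_ctx, gguf_get_tensor_name(base, i)));
    }
    for (int64_t i = 0; i < gguf_get_n_tensors(donor); ++i) {
        const char * name = gguf_get_tensor_name(donor, i);
        if (strncmp(name, "blk.40.", 7) == 0) {
            gguf_add_tensor(out, ggml_get_tensor(donor_ctx, name));
        }
    }
    return gguf_write_to_file(out, out_path, /*only_meta =*/ false);
}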

@volkermauel:

Only a quick test run: 1x 5090, qwen3.6-27b, mtp 3, q4_0 quantized, KV also q4_0.

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great Speedup!

@Ktos93 commented May 8, 2026

I have a question, this might have been covered already, the PP performance loss,
Could MTP be turned off for the first request (and then activating it if there's already a valid cache - ie second request onwards)? - with some coding harnesses and similar things that prompt is huge - sometimes over a minute to start getting tokens for the first request, if we can skip it on the initial large prompt that would be nice, then accept the MTP slowdown for the rest of the prompting - since that'll be much smaller.

I had the same idea and cobbled something together. Needs testing though... I have no access to my rig atm.

John-Dekka@e063730

Tested this on Vulkan with RX 9070XT + RX6600, and PP is still ~0.5x, no change at all
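(The gating in that commit presumably amounts to something like the following; this is an assumption about the approach, not the commit's actual code:)

// sketch: only run the MTP draft pass during 1-token decode steps, never
// while a large prompt batch is still being processed, so prefill keeps
// its full speed and drafting starts from the second token onward
static bool mtp_should_draft(int n_batch_tokens, bool prompt_done) {
    return prompt_done && n_batch_tokens == 1;
}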

@seadra commented May 8, 2026

I used convert.py with this model (Q4_KP) for testing.
I'm not seeing any improvements with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ngl 99 -ncmoe 99 -c 262144 -t 12 -tb 12 --no-mmap --spec-type mtp --spec-draft-n-max 3 --parallel 1 on an RTX 3060 12GB with 128GB RAM: token generation slightly decreased, from ~26 tok/s to ~24 tok/s.
I know, large context and aggressive ngl/ncmoe, but this has been performing pretty well for me with this budget setup so far.

probably because that model doesn't have the MTP layers...

Maybe my post wasn't clear, but I did use convert.py to implant the MTP layers from the model given in the top post:

python convert.py Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf Qwen3.6-35BA3B-MTP.gguf Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-MTP-Q4_K_P.gguf

@svilendotorg:

Adding another reproducer for the Windows + CUDA MTP-head crash (same as @DNPMBHC, @manh354, @amirrezasalimi above).

Environment

  • Windows 11, RTX 4090 (CUDA 13.1), MSVC v143 (BuildTools 2022)
  • llama.cpp built from f8c6b03 and 5d5f1b4 (latest PR head) — both crash identically
  • Build flags: -G Ninja -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

Tested MTP ggufs (all crash at the same step)

  • havenoammo/Qwen3.6-27B-MTP-UD-Q4_K_XL (UD-XL graft)
  • havenoammo/Qwen3.6-27B-MTP-UD-Q3_K_XL (UD-XL graft)
  • llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M (native preserved, 15 heads)
  • froggeric/Qwen3.6-27B-Q4_K_M-mtp (graft built specifically for this PR)

All flags confirmed (--spec-type mtp, --spec-draft-n-max 1 and 3, --parallel 1). The same GGUF files load fine via the MMPROJ profile (no --spec-type mtp), so the GGUFs themselves are sound; the crash is strictly in the MTP-head-loading code path. A Linux build at the same commit (f8c6b03) loads havenoammo/...Q4_K_XL MTP without issue; Windows MSVC trips its checked-iterator on what's likely a benign OOB read in the head loader.

Error (identical for all 4 ggufs)

load_backend: loaded RPC backend ...
load_backend: loaded CPU backend ...
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
srv    load_model: failed to load MTP head from '...gguf'
main: exiting due to model loading error

This appears to be a Windows-CUDA-only issue across the MTP-graft ecosystem (havenoammo, llmfan46, froggeric all affected). Haven't tried the Vulkan backend workaround (@manh354) yet — happy to test if it would help triangulate.

@Wetbikeboy2500:

(quoting @svilendotorg's Windows + CUDA MTP-head crash report above)

I ran into this issue and found a similar report of invalid vector subscript. For me, this error occurs when trying to load regular draft models, the eagle PR, the dflash PR, and even this MTP PR. I did find the line that caused the error and did a quick bounds limit on it, which has allowed me to run this PR: #22337 (comment)
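(For context, "invalid vector subscript" is MSVC's checked operator[] aborting on an out-of-range index that a glibc build would typically read past silently. A "quick bounds limit" of the kind described is roughly the following; this is illustrative only, the actual patched line is in the linked comment:)

#include <algorithm>
#include <vector>

// clamp the index instead of reading out of range (assumes v is non-empty);
// this is a band-aid: the proper fix is to compute the index correctly upstream
float read_clamped(const std::vector<float> & v, size_t i) {
    return v[std::min(i, v.size() - 1)];
}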

@manh354 commented May 9, 2026

at what quant? @manh354

@amirrezasalimi I tested IQ4_XS, Q3_K_XL, Q2_K_XL, all achieved 18-22 t/s.

@am17an (Contributor, Author) commented May 9, 2026

@max-krasnyansky Yes I'm thinking for now a simple solution would be to add the ability to load different MTP GGUFs as well as enable MTP inside the main model's GGUF. That provides the maximum flexibility without a lot of effort.

@ManasInd commented May 9, 2026

On a Mac mini M4 Pro, I tested with the Q4_K_M MTP GGUF, but the token rate goes down when using MTP.

Without MTP:
./llama-server --model ./models/Qwen3.6-27B-MTP-Q4_K_M.gguf --host 127.0.0.1
--port 8033
--ctx-size 70000
--gpu-layers all
--parallel 1 --spec-default --temp 0.6 -np 1 --reasoning off

Result: ~12.34 tok/sec

When using MTP :

./llama-server --model ./models/Qwen3.6-27B-MTP-Q4_K_M.gguf --host 127.0.0.1
--port 8033
--ctx-size 70000
--gpu-layers all
--parallel 1 --no-warmup --alias "localtest/qwen-3.6-27b-mtp-q4-k-m-gguf" --spec-type mtp --spec-draft-n-max 3 --temp 0.6 -np 1 --reasoning off

Result: ~8.5 tok/sec

@Scorp1o117:

Hi @am17an, testing MTP with vision/multimodal models reveals a bug: when --spec-type mtp is used together with image input, find_slot enters an infinite loop with "non-consecutive token position" errors, causing OOM.

Reproduced with Qwen3.6-35B-A3B + mmproj on RTX 5070 Ti. Works fine without MTP.

Detailed issue: #22867

Seems like the MTP hook in process_ubatch doesn't handle vision embedding token positions correctly. Great work on MTP overall though! 🎉

@aamsellem:

Tested mtp-clean (5d5f1b4) on RTX 5090 Laptop (sm_120, 24 GB GDDR7) with Qwen 3.6 27B + froggeric/Qwen3.6-27B-Q4_K_M-mtp.gguf and --spec-type mtp --spec-draft-n-max 5 -ctk q4_0 -ctv q4_0 -c 128000 → 65 t/s, 64% draft acceptance.

Going to 262K context OOMs at MTP draft compute buffer alloc on 24 GB cards.

Would adding TurboQuant KV types (turbo3/turbo4 lossless 4.25 bpv, as shipped in TheTom/llama-cpp-turboquant and spiritbuun/buun-llama-cpp) be reasonable as a follow-up to this PR? On RTX 4090 24 GB the combined stack reportedly hits 80 t/s @ 262K with 73% accept. The TurboQuant forks don't have this PR's single-file MTP loader so they fall back to DFlash with separate drafter, which on Qwen 3.6 27B gets ~25-30% accept (z-lab known issue). The single-file MTP path is structurally better for the model.

@chatton2-coles

This comment was marked as off-topic.

@John-Dekka commented May 9, 2026

Tested this on Vulkan with RX 9070XT + RX6600, and PP is still ~0.5x, no change at all

@Ktos93 Would you mind slapping this one on? John-Dekka@ba501ec

My 'ol RX590 gained some prefill speed back.

## baseline clean build 5d5f1b46e (mtp-clean) --spec-type none

prompt eval time =   92947.36 ms /  9935 tokens (    9.36 ms per token,   106.89 tokens per second)
       eval time =    2937.33 ms /    69 tokens (   42.57 ms per token,    23.49 tokens per second)
      total time =   95884.68 ms / 10004 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 10003, truncated = 0
srv  update_slots: all slots are idle


## baseline clean build 5d5f1b46e (mtp-clean) --spec-type mtp

prompt eval time =  114145.30 ms /  9935 tokens (   11.49 ms per token,    87.04 tokens per second)
       eval time =    1738.36 ms /    49 tokens (   35.48 ms per token,    28.19 tokens per second)
      total time =  115883.67 ms /  9984 tokens
draft acceptance rate = 0.76316 (   29 accepted /    38 generated)


## benchmark clean build ba501ec62 (mtp-no-mtp-during-prefill) --spec-type mtp

prompt eval time =  104475.22 ms /  9935 tokens (   10.52 ms per token,    95.09 tokens per second)
       eval time =    1480.65 ms /    34 tokens (   43.55 ms per token,    22.96 tokens per second)
      total time =  105955.87 ms /  9969 tokens
draft acceptance rate = 0.53125 (   17 accepted /    32 generated)

Initial acceptance is lower here, as the MTP state only starts building up after prefill.

@miloslavnosek:

I don't suppose there is a way to move discussions about this implementation to a different place? It's just that it has become very hard to track the actual progress of this PR, because our inboxes are filled with people's random benchmarks and notes unrelated to the code of this pull request.

Don't get me wrong, it is nice to read that e.g. someone with low VRAM can now run an LLM that was not feasible before because of unbearable t/s, but wouldn't GitHub Discussions be a better place?

cmeta added a commit to cmeta/llama.cpp that referenced this pull request May 9, 2026
@ngxson (Contributor) commented May 9, 2026

IMO it's unclear from the PR description, but we should define a fixed list of goals for the current PR. From what I understand, the main goals of this PR are (1) add the initial infrastructure for MTP and (2) make sure that it's expandable in the future, i.e. support future models and allow optimizations via follow-up PRs.

Speed is important, but I don't think it'd be the top priority of the current PR. I assume there are backend improvements that can be done in dedicated PRs.

If @am17an agrees with this, maybe you can update the PR's description? I will also hide off-topic comments to make it a bit easier to keep track.

@IIIIIllllIIIIIlllll commented May 9, 2026

On my Strix Halo, Qwen3.5-122B-A10B-Q4_K_M works well, thank you for your great work!

slot print_timing: id  0 | task 1137 | 
prompt eval time =     367.92 ms /    23 tokens (   16.00 ms per token,    62.51 tokens per second)
       eval time =   15104.08 ms /   489 tokens (   30.89 ms per token,    32.38 tokens per second)
      total time =   15472.00 ms /   512 tokens
draft acceptance rate = 0.91040 (  315 accepted /   346 generated)

@am17an (Contributor, Author) commented May 9, 2026

@ngxson I'm waiting for the refactor to get completed (#22787 #22838) before making changes to this PR. After those are done we will have a clear plan forward to merge (based on @ggerganov's guidance); until then I'm concentrating on those PRs and making sure I understand what needs to change here. So my stance is that this PR should wait for those to get merged, and I can update the PR description to reflect that.

In principle I agree with your goals for this PR, although my initial goal with this PR was to have an absolutely correct and performant version of MTP (like vLLM) for the Qwen3.6 model. IMO speed has to be kept in mind in how we design MTP, since after all it's a speed improvement at the end of the day, but it should not come at the cost of making this too difficult to merge.

@trbom5c

This comment has been minimized.

@vertexodessa commented May 9, 2026

I love the PR, it doubles inference speed in my setup; I'm getting 80+ TPS on an NVidia Tesla V100 (with Q4 quants).
Just wanted to note that MTP mode crashes on startup if used together with the --rerank argument.

@StarWingOwl commented May 9, 2026

  • AMD RX 7900 GRE
  • 64GB
  • Arch Linux
  • Vulkan backend

Tested on both the pr-22673 and mtp-clean branches. The main model loads and offloads correctly, but crashes during inference when it's first used. Any ideas?
Command used: ./llama-server -m Qwen3.6-27B-IQ4_XS-mtp.gguf -fa on -ctk q8_0 -ctv q8_0 -ngl 60 --n-cpu-moe 20 --spec-type mtp --parallel 1
Model used: froggeric/Qwen3.6-27B-IQ4_XS-mtp.gguf

Error after sending the first message :

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 713, batch.n_tokens = 4
slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 708, pos_max = 708, n_tokens = 709, size = 149.626 MiB)
radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
[New LWP 261187]
[New LWP 261186]
[New LWP 261185]
[New LWP 261184]
[New LWP 261183]
[New LWP 261157]
[New LWP 261156]
[New LWP 261155]
[New LWP 261154]
[New LWP 261153]
[New LWP 261152]
[New LWP 261151]
[New LWP 261150]
[New LWP 261149]
[New LWP 261148]
[New LWP 261147]
[New LWP 261146]
[New LWP 261145]
[New LWP 261143]
[New LWP 261142]
[New LWP 261141]
[New LWP 261140]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
⚠️ warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /usr/lib/libvulkan_nouveau.so.
Use `info auto-load python-scripts [REGEXP]' to list them.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f4eefaa0a52 in ?? () from /usr/lib/libc.so.6
#0  0x00007f4eefaa0a52 in ?? () from /usr/lib/libc.so.6
#1  0x00007f4eefa94abc in ?? () from /usr/lib/libc.so.6
#2  0x00007f4eefa94b04 in ?? () from /usr/lib/libc.so.6
#3  0x00007f4eefb05c6f in wait4 () from /usr/lib/libc.so.6
#4  0x00007f4ef494f37b in ggml_print_backtrace () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007f4ef4962729 in ggml_uncaught_exception() () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-base.so.0
#6  0x00007f4eefeb673a in ?? () from /usr/lib/libstdc++.so.6
#7  0x00007f4eefe9a5e9 in std::terminate() () from /usr/lib/libstdc++.so.6
#8  0x00007f4eefeb69f6 in __cxa_throw () from /usr/lib/libstdc++.so.6
#9  0x00007f4ef0a9c7cc in ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence) [clone .cold] () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-vulkan.so.0
#10 0x00007f4ef0bedd37 in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-vulkan.so.0
#11 0x00007f4ef0beeac4 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-vulkan.so.0
#12 0x00007f4ef496c4fb in ggml_backend_sched_graph_compute_async () from /home/user/llama-mtp/llama.cpp/build/bin/libggml-base.so.0
#13 0x00007f4ef46c8260 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/user/llama-mtp/llama.cpp/build/bin/libllama.so.0
#14 0x00007f4ef46d224f in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/user/llama-mtp/llama.cpp/build/bin/libllama.so.0
#15 0x00007f4ef46cfd92 in llama_context::decode(llama_batch const&) () from /home/user/llama-mtp/llama.cpp/build/bin/libllama.so.0
#16 0x00007f4ef46d1d8e in llama_decode () from /home/user/llama-mtp/llama.cpp/build/bin/libllama.so.0
#17 0x00005642e626eabe in server_context_impl::update_slots() ()
#18 0x00005642e63120b1 in server_queue::start_loop(long) ()
#19 0x00005642e61ca63f in main ()
[Inferior 1 (process 261139) detached]
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Queue::submit: ErrorDeviceLost
[1]    261139 IOT instruction (core dumped)  ./llama-server -m Qwen3.6-27B-IQ4_XS-mtp.gguf -fa on

@John-Dekka:

@StarWingOwl : radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.

Try to leave more vram room for mtp. (relax ngl & ncmoe)

@erasmus74:

(quoting @StarWingOwl's Vulkan crash report and backtrace above)

Same here, with pretty much the same setup and the exact same error. Unsure what the cause is; I have been trying to fix it but I am no expert. Vulkan on a 7900 XTX.

@StarWingOwl:

@John-Dekka It worked when I removed n-cpu-moe and halved the -ngl, but I only get about 3 t/s. With the non-MTP model before, I used to get about 9 t/s with Q4_XL; now IQ4_XS-mtp seems to be struggling to even load. Is the performance supposed to be that much worse? What could cause this?

@John-Dekka:

@StarWingOwl: I am a bit puzzled. My historic RX590 runs at 30 tps. I had some trouble with a dual-GPU setup though. Try running on your main GPU only, just for testing.

@StarWingOwl:

@John-Dekka The peaks are 3 tps, and if I let it run it gets even worse (with -ngl 60 and the -dev specified, but no ncmoe). I was curious why this was happening, which is why I came here hoping to resolve it. Could you share your llama-server command? And are you running the Vulkan backend or ROCm? (Sorry to anyone checking in on the progress if the thread is getting too cluttered.)

@aamsellem:

Quick follow-up to my morning comment. Switched the target from froggeric/Qwen3.6-27B-Q4_K_M-mtp.gguf to havenoammo's UD-Q3_K_XL on the same branch (mtp-clean, 5d5f1b4) — same flags, same hardware (RTX 5090 Laptop, sm_120, 24 GB).

--spec-type mtp --spec-draft-n-max 5 -ctk q4_0 -ctv q4_0 -c 262144:

Run   Tokens   Elapsed   t/s
1     2000     25.1 s    79.6
2     2000     26.7 s    74.9
3     2000     26.1 s    76.6

Avg 77 t/s, draft acceptance 74–80% (vs 64% at 128K with the Q4_K_M target). Full 262K native context fits — the Q3_K_XL is 1.9 GB lighter on disk and that's exactly the headroom MTP draft compute buffers need on 24 GB.

What's interesting is the acceptance went UP at longer context, not down. The Unsloth Dynamic quant preserves the MTP head + attention at 6/8-bit while only dense FFN goes to 3-bit, so drafter alignment stays tight even with the smaller target.

Datapoint for the consumer Blackwell 24 GB matrix — dense 27B path on this branch is solid.
