spec : refactor ctx #22787

Open

ggerganov wants to merge 24 commits into master from gg/spec-refactor-ctx

Conversation

@ggerganov
Member

@ggerganov ggerganov commented May 7, 2026

Overview

Refactor the speculative code to use a single unified "draft" llama_context for all sequences. The draft context is synchronized with the main context. This has several advantages:

  • We have a single draft context instead of one per slot (less memory)
  • The draft context evaluates in parallel with the main context. This will enable passing the target embeddings per batch when necessary (next PR)
  • The speculative logic in common/speculative.cpp is significantly simplified
  • The draft models now "see" the actual images (e.g. OCR tasks are much faster)
  • We can now support parallel drafting for multiple sequences (likely in a follow-up PR)
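
As a rough sketch of the idea (not the actual server code; mirroring these particular params from the target context is an assumption for illustration), the unified draft context is just a second llama_context created from the draft model and sized to serve all slots:

// sketch: one draft context shared by all slots, mirroring the target context limits
// (assumes model_dft is the loaded draft model and ctx_tgt is the main target context)
llama_context_params cparams_dft = llama_context_default_params();

cparams_dft.n_ctx     = llama_n_ctx    (ctx_tgt); // same context size as the target
cparams_dft.n_seq_max = llama_n_seq_max(ctx_tgt); // one context serves every sequence/slot
cparams_dft.n_batch   = llama_n_batch  (ctx_tgt);

llama_context * ctx_dft = llama_init_from_model(model_dft, cparams_dft);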

Backwards incompatible changes:

  • No longer support incompatible draft/target vocabs via replacements

Test commands:

llama-server \
  -hf             ggml-org/Qwen3-30B-A3B-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/Qwen3-0.6B-GGUF:Q8_0 \
  --spec-default --spec-draft-n-max 16

llama-server \
  -hf             ggml-org/gemma-4-26B-A4B-it-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/gemma-4-E2B-it-GGUF \
  --spec-default --spec-draft-n-max 16

llama-server \
  -hf             ggml-org/Qwen3.6-27B-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/Qwen3.5-0.8B-GGUF:Q8_0 \
  --spec-default --spec-draft-n-max 16

TODO

  • Clean-up and DRY
  • Update speculative-simple.cpp

@ggerganov ggerganov force-pushed the gg/spec-refactor-ctx branch 2 times, most recently from 1d5b0fe to d719d8a Compare May 7, 2026 13:48
@github-actions github-actions bot added the ggml and Apple Metal labels May 7, 2026
@ggerganov ggerganov marked this pull request as ready for review May 7, 2026 18:42
@ggerganov ggerganov requested review from a team as code owners May 7, 2026 18:42
@ggerganov ggerganov requested a review from am17an May 7, 2026 18:42
@ggerganov ggerganov force-pushed the gg/spec-refactor-ctx branch from 21e83ad to 7e118cd Compare May 7, 2026 18:44
Comment thread tools/server/server-context.cpp Outdated
@github-actions github-actions bot added the python label May 8, 2026
@am17an
Contributor

am17an commented May 8, 2026

Currently the draft model actually doesn't see the vision tokens and can never attend to them, so it's just a silent degradation which lowers the acceptance rate in master right? Not a correctness issue though.

@ggerganov
Member Author

Currently the draft model actually doesn't see the vision tokens and can never attend to them, so it's just a silent degradation which lowers the acceptance rate in master right? Not a correctness issue though.

Yes, on master the draft model processes the "text-only" part of the prompts. So it does not "see" the images. But the processing is technically correct.

With this PR, we now feed the image tokens to the draft context too, so the acceptance increases for these use cases - as long as the draft model is trained to understand the image embeddings. Note that we make an important assumption:

  • The target and draft models share the same vision encoder (i.e. mctx)

For draft-model based speculative decoding, this assumption rarely holds. For example, Qwen3.5 0.8B does not even have vision capabilities, and Gemma4 E2B has a different mmproj from the large Gemma4 models. So the improvement is a bit difficult to observe. The simplest way to test it is to use the same model as both target and draft (e.g. -hf ggml-org/Qwen3.6-27B-GGUF:Q8_0 --spec-draft-hf ggml-org/Qwen3.6-27B-GGUF:Q8_0) and see that the acceptance rate for OCR is basically 100%.

This can be improved by having a separate mctx for the draft, but I don't think it's worth the extra logic. For me, draft-model based speculative decoding is mostly useful for prototyping and verifying the implementation. The proper speculative decoding methods like MTP, Eagle, etc. don't have this problem because their drafting components are trained on the same vision tower as the target.

@pwilkin
Member

pwilkin commented May 8, 2026

Another reason why it's probably not worth adding the extra logic is that image tokens usually constitute a very small part of the model input, unless someone specifically processes only images (like OCR), but then they're probably better off using a dedicated OCR model anyway.

@ggerganov
Member Author

I think this PR is ready to merge. I'll start working on one more refactor of the speculative code to enable the parallel drafting - it will be nice to have this prepared before we proceed with introducing the speculative decoding methods.

Comment thread common/speculative.cpp
Comment thread tools/server/server-context.cpp Outdated
@ggerganov ggerganov requested a review from ngxson as a code owner May 8, 2026 08:55
@am17an
Contributor

am17an commented May 8, 2026

A few other points:

  • do we need to retire --spec-draft-ctx-size? looks like it was already removed
  • the new draft model capability of seeing the vision tokens is not exercised in server tests. Not sure if it's important to cover in this PR, just want to point it out.

Comment thread examples/speculative-simple/speculative-simple.cpp
@ggerganov
Member Author

the new draft model capability of seeing the vision tokens is not exercised in server tests. Not sure if it's important to cover in this PR, just want to point it out.

Yeah, we need to improve spec-decoding tests at some point.

Comment on lines +2908 to +2950
// TODO: avoid restoring the draft context and re-evaluating the drafted tokens when not needed [TAG_SPEC_AVOID_DRAFT_REEVAL]
// for now, always re-evaluate for simplicity
// ref: https://github.com/ggml-org/llama.cpp/pull/22728#issuecomment-4400925384
//
// | spec type   | need re-eval |
// | ---         | ---          |
// | draft model | no           | because the draft model does not use embeddings from the target
// | MTP (std)   | yes          |
// | MTP Gemma4  | no           | because the KV cache is shared
// | Eagle3      | yes          |
// | DFlash      | yes?         |
//
if (ctx_dft) {
    // TODO: update as needed for MTP, Eagle3, etc.
    const bool need_tgt_embd = false;

    if (need_tgt_embd) {
        llama_synchronize(ctx_tgt);
    }

    // the logic here varies depending on the speculative decoding method
    // - some draft contexts require embeddings from the target context, others don't
    // - some draft contexts involve an encoder step to transform the target embeddings to draft embeddings
    // TODO: extract this in a function ?
    {
        // TODO: hook the embeddings from the last target batch here
        if (llama_model_has_encoder(model_dft.get())) {
            //llama_encode(ctx_dft, ...);

            GGML_ABORT("not implemented yet\n");
        }

        const int ret = llama_decode(ctx_dft.get(), batch_view);

        if (ret != 0) {
            SRV_ERR("failed to decode draft batch, ret = %d\n", ret);

            // TODO: handle error
            break;
        }
    }
}

Member Author

@ggerganov ggerganov May 8, 2026


@am17an @ruixiang63 This will be the most important point - here we will have to hook the target embeddings. Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.

Let me know if you see any potential problems at this stage.

This will eventually become a function of the common_speculative API, for example:

bool common_speculative_process(
    common_speculative * spec,
    llama_context * ctx_tgt,
    llama_batch batch,
    ...);

So we will be able to specialize the logic per spec decoding method in common/speculative.cpp. But for now, keep it here until we get something running.
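
For example, a rough sketch of the shape this could take (everything here is hypothetical: the spec->type and spec->ctx_dft fields and the enum values are just illustrations, not the actual common_speculative internals):

// sketch: hypothetical per-method dispatch in common/speculative.cpp
bool common_speculative_process(
        common_speculative * spec,
        llama_context      * ctx_tgt,
        llama_batch          batch) {
    switch (spec->type) { // hypothetical field selecting the spec decoding method
        case COMMON_SPEC_DRAFT_MODEL:
            // the draft model does not need target embeddings - just decode the batch
            return llama_decode(spec->ctx_dft, batch) == 0;
        case COMMON_SPEC_MTP:
        case COMMON_SPEC_EAGLE3:
            // hook the target embeddings (from ctx_tgt) into the draft batch before decoding
            // ...
            return false; // not sketched here
        default:
            return false;
    }
}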


// | DFlash | yes? |
Yes, it does. The KV cache of the DFlash draft model also depends on the compressed target-model features for the verified tokens. See: #22728 (comment)


Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.

I would propose transferring through host memory first. Although this is not fully optimal, it is simple and does not require additional GPU buffers, so we can save GPU VRAM.

Regarding performance, speculative decoding already achieves a significant speedup, and transferring embeddings through host memory is not the bottleneck. Even if we avoided the host-memory round trip, the benefit would likely be negligible. For example, the current Eagle3 and DFlash PRs both transfer via host memory and still show significant speedups.

Especially for the multi-sequence use case, keeping the embeddings directly on the GPU could consume a substantial amount of VRAM.

We need to make sure we transfer ubatch embeddings rather than the full batch embeddings, and consume each ubatch immediately once it arrives in the draft context. This way, we can avoid OOM issues.
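
To make this concrete, a minimal sketch of the host-memory path being proposed (illustrative only; n_ub stands for the number of tokens in the current ubatch, and it assumes the target context was configured to output embeddings for those tokens):

// sketch: transfer the target embeddings of one ubatch to the draft context via host memory
llama_synchronize(ctx_tgt); // wait for the target ubatch to finish

const int     n_embd = llama_model_n_embd(llama_get_model(ctx_tgt));
const float * embd   = llama_get_embeddings(ctx_tgt); // host copy of the requested embeddings

llama_batch batch_dft = llama_batch_init(n_ub, n_embd, 1); // embeddings batch (no token ids)
batch_dft.n_tokens = n_ub;

memcpy(batch_dft.embd, embd, sizeof(float)*n_embd*n_ub);
// ... fill batch_dft.pos / seq_id / logits the same way as for a token batch ...

llama_decode(ctx_dft, batch_dft); // consume immediately, then reuse the buffers for the next ubatch
llama_batch_free(batch_dft);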

Contributor

I think for PP it matters a lot


Right. But with PP, layers are split across different GPUs. For example, suppose Eagle3 needs features from layer 1 and layer 2, where layer 1 is on GPU0 and layer 2 is on GPU1. I assume the draft model will also be placed on each GPU, since it is small and does not make much sense to split it across GPUs.

In this case, how do we obtain the required features? I guess we would also need to go through host memory, right?

Contributor

Yes host memory is fine for now, we can look at optimising it later


I feel option 2 is a more elegant way to handle this. We can also keep the host-copy or device-copy logic behind llama_set_input_embeddings.

llama_decode(ctx_tgt);

ggml_tensor * embd_src = llama_get_pre_norm_embeddings(ctx_tgt);
llama_set_input_embeddings(ctx_dft, embd_src);

llama_decode(ctx_dft);

Contributor

@am17an am17an May 9, 2026

BTW @ggerganov there is a subtlety: we need to store the previous batch's last token embeddings and pair them up with the current batch shifted by 1, because MTP is conditioned via $$h_{t-1}, x_{t}$$. Not sure if this exists in other speculative methods as well, but I think it does. It makes quite a large difference to acceptance at larger contexts, so I think it is important to have this.

Member Author

Hm, yes it's the same. What embeddings do we use for the very first token x_0?

Contributor

I just treat it as 0 for MTP, I think that's a reasonable choice
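
Written out (just restating the convention above, not new behavior), the draft at step $t$ would be conditioned on

$$\big(h^{\text{tgt}}_{t-1},\; x_t\big), \qquad h^{\text{tgt}}_{-1} := 0$$

i.e. the first token is paired with a zero embedding, and the last target embedding of each batch is carried over to condition the first token of the next batch.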

@ruixiang63

ruixiang63 commented May 8, 2026

A question came to mind about the draft enc-dec context: how do we handle the draft model encoder → decoder embedding transfer in the multi-slot case?
Currently the only mechanism I see in the existing API is to use llama_cross, but v_embd is a single host buffer and isn't multi-slot safe - concurrent llama_encode calls across slots would overwrite each other. And llama_cross supports cross_attn, not a side input embedding for the decoder (i.e. the encoder's output as a side input for the decoder).

Should we refactor llama_cross to be per-seq and add side input embedding support, introduce a separate dedicated struct for the draft enc→dec handoff, or implement it independently per method (EAGLE3, DFlash, etc.)? I feel we need to have a unified struct to handle this.

Maybe there’s already a better solution that I’m not aware of? @ggerganov
Maybe we can handle this similarly to extracted target embeddings --> draft context transfer?

Besides this, llama_encode's call scheduling under a multi-seq batched draft is a related open question. Not sure how this handles multi-seq in the draft context.

@ggerganov
Member Author

Currently the only mechanism I see in the existing API is to use llama_cross, but v_embd is a single host buffer and isn't multi-slot safe - concurrent llama_encode calls across slots would overwrite each other. And llama_cross is supported for cross_attn not for side input embedding for decoder (i.e. encoder's output as side input for decoder).

With parallel decoding, we don't make actual parallel calls to llama_encode or llama_decode. Instead, we construct a llama_batch that contains the tokens and embeddings of the parallel sequences, so there is no issue with concurrency. The v_embd buffer can contain data from all sequences without problems, as defined by the token order in the batch.
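
A sketch of what such a combined batch could look like (illustrative; embd_slot, n_tok and pos0 are hypothetical per-slot arrays holding the extracted embeddings, token counts and starting positions):

// sketch: one embedding batch that covers all parallel sequences - a single llama_decode call
llama_batch batch = llama_batch_init(n_tokens_total, n_embd, n_slots);

int i = 0;
for (int s = 0; s < n_slots; ++s) {
    for (int t = 0; t < n_tok[s]; ++t, ++i) {
        memcpy(batch.embd + (size_t) i*n_embd, embd_slot[s] + (size_t) t*n_embd, sizeof(float)*n_embd);

        batch.pos     [i]    = pos0[s] + t;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = s;                  // each slot maps to its own sequence id
        batch.logits  [i]    = t == n_tok[s] - 1;  // sample only from the last token of each slot
    }
}
batch.n_tokens = i;

llama_decode(ctx_dft, batch);
llama_batch_free(batch);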

Comment thread common/speculative.cpp

     if (!vocab_cmpt) {
-        LOG_WRN("the target and draft vocabs are not compatible - tokens will be translated between the two\n");
+        LOG_ERR("%s: the target and draft vocabs are not compatible\n", __func__);

This seems to break EAGLE3, since EAGLE3 uses a reduced draft vocab that is much smaller than the target vocab size. https://github.com/ruixiang63/llama.cpp/blob/4567954ab06b792253582872673326df31e2ac7d/src/llama-arch.h#L564
We could add a branch in common_speculative_are_compatible, e.g.:

if (has_d2t) {
    // skip the size-diff and per-id token-text checks for d2t-equipped drafts
} else {
    // existing checks
}
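
For reference, a hedged sketch of how that branch might sit inside the existing check (has_d2t and the way it is queried are assumptions; the real function also compares vocab types and per-id token text):

// sketch: relax the vocab checks for drafts that ship a d2t (draft-to-target) id map
bool common_speculative_are_compatible(const llama_context * ctx_tgt, const llama_context * ctx_dft) {
    const llama_vocab * vocab_tgt = llama_model_get_vocab(llama_get_model(ctx_tgt));
    const llama_vocab * vocab_dft = llama_model_get_vocab(llama_get_model(ctx_dft));

    const bool has_d2t = false; // TODO: query the draft model for a d2t tensor (hypothetical)

    if (has_d2t) {
        // reduced draft vocab remapped to target ids - skip the size/token-text checks
        return true;
    }

    // existing checks (vocab type, size and per-id token text must match)
    return llama_vocab_n_tokens(vocab_tgt) == llama_vocab_n_tokens(vocab_dft);
}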

Member Author

Yes, we can adjust if needed. The speculative implementations could use their custom implementations of common_speculative_are_compatible().

@am17an am17an mentioned this pull request May 9, 2026
