spec : refactor ctx #22787

Open

ggerganov wants to merge 24 commits into master from gg/spec-refactor-ctx

Conversation

@ggerganov
Member

@ggerganov ggerganov commented May 7, 2026

Overview

Refactor the speculative code to use a single unified "draft" llama_context for all sequences. The draft context is synchronized with the main context. This has several advantages:

  • We have a single draft context instead of one per slot (less memory)
  • The draft context evaluates in parallel with the main context. This will enable passing the target embeddings per batch when necessary (next PR)
  • The speculative logic in common/speculative.cpp is significantly simplified
  • The draft models now "see" the actual images (e.g. OCR tasks are much faster)
  • We can now support parallel drafting for multiple sequences (likely in a follow-up PR)
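
As a rough sketch of the idea (not the actual server code; mirroring these particular params from the target context is an assumption for illustration), the unified draft context is just a second llama_context created from the draft model and sized to serve all slots:

// sketch: one draft context shared by all slots, mirroring the target context limits
// (assumes model_dft is the loaded draft model and ctx_tgt is the main target context)
llama_context_params cparams_dft = llama_context_default_params();

cparams_dft.n_ctx     = llama_n_ctx    (ctx_tgt); // same context size as the target
cparams_dft.n_seq_max = llama_n_seq_max(ctx_tgt); // one context serves every sequence/slot
cparams_dft.n_batch   = llama_n_batch  (ctx_tgt);

llama_context * ctx_dft = llama_init_from_model(model_dft, cparams_dft);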

Backwards incompatible changes:

  • No longer support incompatible draft/target vocabs via replacements

Test commands:

llama-server \
  -hf             ggml-org/Qwen3-30B-A3B-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/Qwen3-0.6B-GGUF:Q8_0 \
  --spec-default --spec-draft-n-max 16

llama-server \
  -hf             ggml-org/gemma-4-26B-A4B-it-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/gemma-4-E2B-it-GGUF \
  --spec-default --spec-draft-n-max 16

llama-server \
  -hf             ggml-org/Qwen3.6-27B-GGUF:Q8_0 \
  --spec-draft-hf ggml-org/Qwen3.5-0.8B-GGUF:Q8_0 \
  --spec-default --spec-draft-n-max 16

TODO

  • Clean-up and DRY
  • Update speculative-simple.cpp

@ggerganov ggerganov force-pushed the gg/spec-refactor-ctx branch 2 times, most recently from 1d5b0fe to d719d8a Compare May 7, 2026 13:48
@github-actions github-actions bot added the ggml and Apple Metal labels May 7, 2026
@ggerganov ggerganov marked this pull request as ready for review May 7, 2026 18:42
@ggerganov ggerganov requested review from a team as code owners May 7, 2026 18:42
@ggerganov ggerganov requested a review from am17an May 7, 2026 18:42
@ggerganov ggerganov force-pushed the gg/spec-refactor-ctx branch from 21e83ad to 7e118cd Compare May 7, 2026 18:44
Comment thread tools/server/server-context.cpp Outdated
@github-actions github-actions bot added the python label May 8, 2026
@am17an
Contributor

am17an commented May 8, 2026

Currently the draft model actually doesn't see the vision tokens and can never attend to them, so it's just a silent degradation which lowers the acceptance rate in master right? Not a correctness issue though.

@ggerganov
Member Author

Currently the draft model actually doesn't see the vision tokens and can never attend to them, so it's just a silent degradation which lowers the acceptance rate in master right? Not a correctness issue though.

Yes, on master the draft model processes the "text-only" part of the prompts. So it does not "see" the images. But the processing is technically correct.

With this PR, we now feed the image tokens to the draft context too, so the acceptance increases for these use cases - as long as the draft model is trained to understand the image embeddings. Note that we make an important assumption:

  • The target and draft models share the same vision encoder (i.e. mctx)

For draft-model based speculative decoding, this assumption rarely holds. For example, Qwen3.5 0.8B does not even have vision capabilities, and Gemma4 E2B has a different mmproj from the large Gemma4 models. So the improvement is a bit difficult to observe. The simplest way to test it is to use the same model as both target and draft (e.g. -hf ggml-org/Qwen3.6-27B-GGUF:Q8_0 --spec-draft-hf ggml-org/Qwen3.6-27B-GGUF:Q8_0) and see that the acceptance rate for OCR is basically 100%.

This can be improved by having a separate mctx for the draft, but I don't think it's worth the extra logic. For me, draft-model based speculative decoding is mostly useful for prototyping and verifying the implementation. The proper speculative decoding methods like MTP, Eagle, etc. don't have this problem because their drafting components are trained on the same vision tower as the target.

@pwilkin
Member

pwilkin commented May 8, 2026

Another reason why it's probably not worth adding the extra logic is that image tokens usually constitute a very small part of the model input, unless someone specifically processes only images (like OCR), but then they're probably better off using a dedicated OCR model anyway.

@ggerganov
Member Author

I think this PR is ready to merge. I'll start working on one more refactor of the speculative code to enable the parallel drafting - it will be nice to have this prepared before we proceed with introducing the speculative decoding methods.

Comment thread common/speculative.cpp
Comment thread tools/server/server-context.cpp Outdated
@ggerganov ggerganov requested a review from ngxson as a code owner May 8, 2026 08:55
@am17an
Contributor

am17an commented May 8, 2026

A few other points:

  • do we need to retire --spec-draft-ctx-size? looks like it was already removed
  • the new draft model capability of seeing the vision tokens is not exercised in server tests. Not sure if it's important to cover in this PR, just want to point it out.

Comment thread examples/speculative-simple/speculative-simple.cpp
@ggerganov
Member Author

the new draft model capability of seeing the vision tokens is not exercised in server tests. Not sure if it's important to cover in this PR, just want to point it out.

Yeah, we need to improve spec-decoding tests at some point.

Comment on lines +2908 to +2950
// TODO: avoid restoring the draft context and re-evaluating the drafted tokens when not needed [TAG_SPEC_AVOID_DRAFT_REEVAL]
// for now, always re-evaluate for simplicity
// ref: https://github.com/ggml-org/llama.cpp/pull/22728#issuecomment-4400925384
//
// | spec type   | need re-eval |
// | ---         | ---          |
// | draft model | no           | because the draft model does not use embeddings from the target
// | MTP (std)   | yes          |
// | MTP Gemma4  | no           | because the KV cache is shared
// | Eagle3      | yes          |
// | DFlash      | yes?         |
//
if (ctx_dft) {
    // TODO: update as needed for MTP, Eagle3, etc.
    const bool need_tgt_embd = false;

    if (need_tgt_embd) {
        llama_synchronize(ctx_tgt);
    }

    // the logic here varies depending on the speculative decoding method
    // - some draft contexts require embeddings from the target context, others don't
    // - some draft contexts involve an encoder step to transform the target embeddings to draft embeddings
    // TODO: extract this in a function ?
    {
        // TODO: hook the embeddings from the last target batch here
        if (llama_model_has_encoder(model_dft.get())) {
            //llama_encode(ctx_dft, ...);

            GGML_ABORT("not implemented yet\n");
        }

        const int ret = llama_decode(ctx_dft.get(), batch_view);

        if (ret != 0) {
            SRV_ERR("failed to decode draft batch, ret = %d\n", ret);

            // TODO: handle error
            break;
        }
    }
}

Member Author

@ggerganov ggerganov May 8, 2026


@am17an @ruixiang63 This will be the most important point - here we will have to hook the target embeddings. Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.

Let me know if you see any potential problems at this stage.

This will eventually become a function of the common_speculative API, for example:

bool common_speculative_process(
    common_speculative * spec,
    llama_context * ctx_tgt,
    llama_batch batch,
    ...);

So we will be able to specialize the logic per spec decoding method in common/speculative.cpp. But for now, keep it here until we get something running.
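
For example, a rough sketch of the shape this could take (everything here is hypothetical: the spec->type and spec->ctx_dft fields and the enum values are just illustrations, not the actual common_speculative internals):

// sketch: hypothetical per-method dispatch in common/speculative.cpp
bool common_speculative_process(
        common_speculative * spec,
        llama_context      * ctx_tgt,
        llama_batch          batch) {
    switch (spec->type) { // hypothetical field selecting the spec decoding method
        case COMMON_SPEC_DRAFT_MODEL:
            // the draft model does not need target embeddings - just decode the batch
            return llama_decode(spec->ctx_dft, batch) == 0;
        case COMMON_SPEC_MTP:
        case COMMON_SPEC_EAGLE3:
            // hook the target embeddings (from ctx_tgt) into the draft batch before decoding
            // ...
            return false; // not sketched here
        default:
            return false;
    }
}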


// | DFlash | yes? |
Yes, it does. The KV cache of the DFlash draft model also depends on the compressed target-model features for the verified tokens. See: #22728 (comment)


Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.

I would propose transferring through host memory first. Although this is not fully optimal, it is simple and does not require additional GPU buffers, so we can save GPU VRAM.

Regarding performance, speculative decoding already achieves a significant speedup, and transferring embeddings through host memory is not the bottleneck. Even if we avoided the host-memory round trip, the benefit would likely be negligible. For example, the current Eagle3 and DFlash PRs both transfer via host memory and still show significant speedups.

Especially for the multi-sequence use case, keeping the embeddings directly on the GPU could consume a substantial amount of VRAM.

We need to make sure we transfer ubatch embeddings rather than the full batch embeddings, and consume each ubatch immediately once it arrives in the draft context. This way, we can avoid OOM issues.
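
To make this concrete, a minimal sketch of the host-memory path being proposed (illustrative only; n_ub stands for the number of tokens in the current ubatch, and it assumes the target context was configured to output embeddings for those tokens):

// sketch: transfer the target embeddings of one ubatch to the draft context via host memory
llama_synchronize(ctx_tgt); // wait for the target ubatch to finish

const int     n_embd = llama_model_n_embd(llama_get_model(ctx_tgt));
const float * embd   = llama_get_embeddings(ctx_tgt); // host copy of the requested embeddings

llama_batch batch_dft = llama_batch_init(n_ub, n_embd, 1); // embeddings batch (no token ids)
batch_dft.n_tokens = n_ub;

memcpy(batch_dft.embd, embd, sizeof(float)*n_embd*n_ub);
// ... fill batch_dft.pos / seq_id / logits the same way as for a token batch ...

llama_decode(ctx_dft, batch_dft); // consume immediately, then reuse the buffers for the next ubatch
llama_batch_free(batch_dft);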

Contributor

I think for PP it matters a lot


Right. But with PP, layers are split across different GPUs. For example, suppose Eagle3 needs features from layer 1 and layer 2, where layer 1 is on GPU0 and layer 2 is on GPU1. I assume the draft model will also be placed on each GPU, since it is small and does not make much sense to split it across GPUs.

In this case, how do we obtain the required features? I guess we would also need to go through host memory, right?

Contributor

Yes host memory is fine for now, we can look at optimising it later


I feel option 2 is a more elegant way to handle this. We can also keep the host-copy or device-copy logic behind llama_set_input_embeddings.

llama_decode(ctx_tgt);

ggml_tensor * embd_src = llama_get_pre_norm_embeddings(ctx_tgt);
llama_set_input_embeddings(ctx_dft, embd_src);

llama_decode(ctx_dft);

Contributor

@am17an am17an May 9, 2026

BTW @ggerganov there is a subtlety: we need to store the previous batch's last token embeddings and pair them up with the current batch shifted by 1, because MTP is conditioned via $$h_{t-1}, x_{t}$$. Not sure if this exists in other speculative methods as well, but I think it does. It makes quite a large difference to acceptance at larger contexts, so I think it is important to have this.

Member Author

Hm, yes it's the same. What embeddings do we use for the very first token x_0?

Contributor

I just treat it as 0 for MTP, I think that's a reasonable choice
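
Written out (just restating the convention above, not new behavior), the draft at step $t$ would be conditioned on

$$\big(h^{\text{tgt}}_{t-1},\; x_t\big), \qquad h^{\text{tgt}}_{-1} := 0$$

i.e. the first token is paired with a zero embedding, and the last target embedding of each batch is carried over to condition the first token of the next batch.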

@ruixiang63

ruixiang63 commented May 8, 2026

A question came to mind about the draft enc-dec context: how do we handle the draft model encoder → decoder embedding transfer in the multi-slot case?
Currently the only mechanism I see in the existing API is to use llama_cross, but v_embd is a single host buffer and isn't multi-slot safe - concurrent llama_encode calls across slots would overwrite each other. And llama_cross supports cross_attn, not a side input embedding for the decoder (i.e. the encoder's output as a side input for the decoder).

Should we refactor llama_cross to be per-seq and add side input embedding support, introduce a separate dedicated struct for the draft enc→dec handoff, or implement it independently per method (EAGLE3, DFlash, etc.)? I feel we need to have a unified struct to handle this.

Maybe there’s already a better solution that I’m not aware of? @ggerganov
Maybe we can handle this similarly to extracted target embeddings --> draft context transfer?

Besides this, llama_encode's call scheduling under a multi-seq batched draft is a related open question. Not sure how this handles multi-seq in the draft context.

@ggerganov
Member Author

Currently the only mechanism I see in the existing API is to use llama_cross, but v_embd is a single host buffer and isn't multi-slot safe - concurrent llama_encode calls across slots would overwrite each other. And llama_cross is supported for cross_attn not for side input embedding for decoder (i.e. encoder's output as side input for decoder).

With parallel decoding, we don't make actual parallel calls to llama_encode or llama_decode. Instead, we construct a llama_batch that contains the tokens and embeddings of the parallel sequences, so there is no issue with concurrency. The v_embd buffer can contain data from all sequences without problems, as defined by the token order in the batch.
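
A sketch of what such a combined batch could look like (illustrative; embd_slot, n_tok and pos0 are hypothetical per-slot arrays holding the extracted embeddings, token counts and starting positions):

// sketch: one embedding batch that covers all parallel sequences - a single llama_decode call
llama_batch batch = llama_batch_init(n_tokens_total, n_embd, n_slots);

int i = 0;
for (int s = 0; s < n_slots; ++s) {
    for (int t = 0; t < n_tok[s]; ++t, ++i) {
        memcpy(batch.embd + (size_t) i*n_embd, embd_slot[s] + (size_t) t*n_embd, sizeof(float)*n_embd);

        batch.pos     [i]    = pos0[s] + t;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = s;                  // each slot maps to its own sequence id
        batch.logits  [i]    = t == n_tok[s] - 1;  // sample only from the last token of each slot
    }
}
batch.n_tokens = i;

llama_decode(ctx_dft, batch);
llama_batch_free(batch);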

Comment thread common/speculative.cpp

     if (!vocab_cmpt) {
-        LOG_WRN("the target and draft vocabs are not compatible - tokens will be translated between the two\n");
+        LOG_ERR("%s: the target and draft vocabs are not compatible\n", __func__);

This seems to break EAGLE3, since EAGLE3 uses a reduced draft vocab that is much smaller than the target vocab size. https://github.com/ruixiang63/llama.cpp/blob/4567954ab06b792253582872673326df31e2ac7d/src/llama-arch.h#L564
We could add a branch in common_speculative_are_compatible, e.g.:

if (has_d2t) {
    // skip the size-diff and per-id token-text checks for d2t-equipped drafts
} else {
    // existing checks
}
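
For reference, a hedged sketch of how that branch might sit inside the existing check (has_d2t and the way it is queried are assumptions; the real function also compares vocab types and per-id token text):

// sketch: relax the vocab checks for drafts that ship a d2t (draft-to-target) id map
bool common_speculative_are_compatible(const llama_context * ctx_tgt, const llama_context * ctx_dft) {
    const llama_vocab * vocab_tgt = llama_model_get_vocab(llama_get_model(ctx_tgt));
    const llama_vocab * vocab_dft = llama_model_get_vocab(llama_get_model(ctx_dft));

    const bool has_d2t = false; // TODO: query the draft model for a d2t tensor (hypothetical)

    if (has_d2t) {
        // reduced draft vocab remapped to target ids - skip the size/token-text checks
        return true;
    }

    // existing checks (vocab type, size and per-id token text must match)
    return llama_vocab_n_tokens(vocab_tgt) == llama_vocab_n_tokens(vocab_dft);
}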

Member Author

Yes, we can adjust if needed. The speculative implementations could use their custom implementations of common_speculative_are_compatible().

@am17an am17an mentioned this pull request May 9, 2026
