
spec : parallel drafting support #22838

Merged
ggerganov merged 36 commits into master from gg/spec-refactor-parallel on May 11, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented May 8, 2026

Overview

cont #22787 (rebased on top of it)

With these changes:

  • The draft context can generate speculative drafts for multiple sequences in parallel
  • The draft "sees" the multimodal data
  • Multiple speculative types can be chained, e.g. --spec-type ngram-mod,mtp

server_context

  • Single common_speculative context for all slots, capable of handling multiple sequence ids
  • Extract the drafting logic from server_slot::update_batch() and parallelize it across the active slots
  • Disable adjusting speculative decoding params (such as n_min, n_max, p_min) via the HTTP API

common/speculative

  • Rework the API to support drafting for multiple sequences at once (see the sketch after this list)
  • Add struct common_speculative_draft_params
  • Update draft-model based implementation to decode multiple draft sequences in parallel
  • Add common_speculative_process(spec, batch) for feeding the prompt tokens through the speculative context
  • Remove common_speculative_n_max
  • Remove common_speculative_n_min
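
For orientation, the reworked surface could look roughly like the sketch below. Only common_speculative_draft_params and common_speculative_process() are named in this PR; the params fields and the draft() signature are illustrative assumptions:

    // common/speculative.h (sketch - the fields and the draft() signature are assumptions)
    struct common_speculative_draft_params {
        int   n_draft; // max number of tokens to draft per sequence
        float p_min;   // min probability required to keep drafting
    };

    // feed the same tokens that the target context processes
    bool common_speculative_process(struct common_speculative * spec, const struct llama_batch & batch);

    // generate a draft for the given sequence, continuing from id_last
    llama_tokens common_speculative_draft(
            struct common_speculative * spec,
            const common_speculative_draft_params & params,
            llama_seq_id seq_id,
            llama_token id_last);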

llama_context

  • Avoid extra backend buffer allocations during llama_state_seq_get_data_ext() with the LLAMA_STATE_SEQ_FLAGS_ON_DEVICE flag (this can be merged directly to master)

common/args

TODOs

  • Clean-up

Requirements

@ggerganov ggerganov changed the title spec : refactor for multi-sequence speculative context spec : parallel drafting support May 8, 2026
@am17an am17an mentioned this pull request May 9, 2026
@ggerganov
Member Author

ggerganov commented May 9, 2026

@am17an @ruixiang63 This refactor is ready. I also went ahead and introduced the common_speculative_process() API. We can now write specialized logic for the different speculative decoding methods in common/speculative.cpp by implementing the process() method of each implementation. The most basic one is the draft-model based one - it simply processes the same batch as the target context, without any extra data from it:

    bool process(const llama_batch & batch) override {
        auto * ctx_dft = params.ctx_dft;

        const int ret = llama_decode(ctx_dft, batch);
        if (ret != 0) {
            LOG_ERR("%s: failed to decode draft batch, ret = %d\n", __func__, ret);
            return false;
        }

        return true;
    }
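
A hypothetical call site, to show how this slots into the server loop. Only common_speculative_process() comes from this PR; the surrounding names are illustrative:

    // keep the draft side in sync by feeding it the same batch as the target
    common_speculative_process(spec, batch);

    const int ret_tgt = llama_decode(ctx_tgt, batch);
    if (ret_tgt != 0) {
        // handle target decode failure
    }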

Next I'll start looking into hooking the target embeddings. I'll probably start from the MTP PR #22673 as it seems to be the easiest speculative approach. Eagle3 has some extra complexities:

  • More target features to extract
  • Encoder step
  • Different vocabs

So I think after we fit the MTP implementation into the new speculative architecture, it will be easier to build the Eagle3 from there.

I'm already having some ideas how to efficiently move the target embeddings between the contexts. Inside the process() method, we can analyze the contents of the batch and decide which embeddings to store for later (i.e. the embeddings for the last token of each sequence in the batch) and which to copy directly to the draft context for processing with the current batch. Anyway, I think I am starting to wrap my head around it, but it is still not 100% clear.
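
To make the idea concrete, a rough sketch of that batch analysis inside process(). The embd_* helpers and is_last_token_of_seq() are hypothetical names, not part of this PR:

    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        const llama_seq_id seq_id = batch.seq_id[i][0];

        if (is_last_token_of_seq(batch, i)) {
            // store for later: this embedding seeds the next drafting round for seq_id
            embd_store(seq_id, embd_tgt(i));
        } else {
            // copy directly to the draft context for processing with the current batch
            embd_copy_to_draft(i, embd_tgt(i));
        }
    }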

If you are prototyping stuff, it would be great to base it on top of this branch. There are comments in the code to clarify the logic, but if something is unclear - let me know.

@ggerganov
Member Author

I'm already having some ideas how to efficiently move the target embeddings between the contexts.

For now, we should definitely focus on an initial implementation that uses host roundtrips. The device-to-device version is possible, but it first requires a refactoring that introduces the concept of "on-device input/output batch embeddings" in the llama_context. I now have quite a good idea how to do this - will try to spec it out soon. But IMO we should do this in a later iteration as it's a relatively big change.

For the initial implementation, my current suggestion is to mirror the existing llama_get_embeddings_* API in order to at least utilize the pinned host memory. This should perform better than the current direct calls to ggml_backend_tensor_get() used in #22673. Basically, add corresponding "pre-norm output embedding" calls in llama-ext for the following functions:

    // llama-ext.h
    // (not in llama.h since this is a stop-gap solution to get something mergeable)

    // mirrors:
    // LLAMA_API void llama_set_embeddings(struct llama_context * ctx, bool embeddings);
    LLAMA_API void llama_set_embeddings_pre_norm(struct llama_context * ctx, bool value);

    // mirrors:
    // LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);
    LLAMA_API float * llama_get_embeddings_pre_norm(struct llama_context * ctx);

    // mirrors:
    // LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);
    LLAMA_API float * llama_get_embeddings_pre_norm_ith(struct llama_context * ctx, int32_t i);

The llama_context should reserve extra pinned-host buffer space for the pre-norm embeddings (see llama_context::output_reserve()) and asynchronously extract them after every ubatch (see llama_context::decode()); i.e., the entire logic mirrors the existing "output embeddings" logic.
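
As a reference point, the mirrored extraction in llama_context::decode() could look roughly like this (a sketch only; the member names around t_embd_pre_norm are assumptions):

    if (embeddings_pre_norm && t_embd_pre_norm != nullptr) {
        // async copy from the device tensor into the pinned-host output buffer,
        // exactly like the existing output-embeddings extraction
        ggml_backend_t backend_embd = ggml_backend_sched_get_tensor_backend(sched, t_embd_pre_norm);

        float * dst = embd_pre_norm + n_outputs_prev*n_embd;

        ggml_backend_tensor_get_async(backend_embd, t_embd_pre_norm, dst, 0, n_outputs_new*n_embd*sizeof(float));
    }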

With this API, utilize it to extract both the hidden target embeddings and the MTP-generated embeddings. There is no need to distinguish explicitly between the two types of embeddings. I.e. this code:

    // MTP related inputs/outputs
    ggml_tensor * t_h_pre_norm  = nullptr; // [n_embd, n_outputs] hidden state required for MTP
    ggml_tensor * t_mtp_out     = nullptr; // [n_embd, n_tokens]

Should be just:

    ggml_tensor * t_embd_pre_norm  = nullptr; // [n_embd, n_outputs] embeddings before the output norm

This new llama-ext API should be enough to get the MTP working. There is no need to access the hidden state tensors from the user code, so there should be no changes to llama.h at all.

The rest of the changes from #22673 should be moved into a new common/speculative MTP implementation in the respective process()/draft() methods.

@am17an
Contributor

am17an commented May 10, 2026

@ggerganov okay let me work on this, will send you my working branch once I have something.

@ggerganov
Member Author

Ok great. I think the only slightly complicated part would be adding the logic for analyzing the contents of the batch in process() and draft() to determine which sequences participate and to map the embedding offsets correctly. Basically, the "pending" logic that you already have should be extended to work per sequence id.

An initial simple version could be to assert that there is a single sequence id inside the batch, and when this is working, we can extend it to multi-sequence.
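
That initial version could be as small as this (a sketch, not final logic):

    // reject batches that mix sequence ids so the embedding offsets map trivially
    llama_seq_id seq_id_batch = -1;

    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        GGML_ASSERT(batch.n_seq_id[i] == 1);

        if (seq_id_batch == -1) {
            seq_id_batch = batch.seq_id[i][0];
        }

        GGML_ASSERT(batch.seq_id[i][0] == seq_id_batch && "multi-sequence batches not supported yet");
    }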

@am17an
Contributor

am17an commented May 11, 2026

@ggerganov can you review am17an#6? It seems to work with these changes. cc @ngxson as well

@ruixiang63

Thanks very much for the work! @ggerganov
I will also start an early prototype based on this refactoring, though it will remain preliminary since unified feature extraction from the target model is still not finalized.

So I think after we fit the MTP implementation into the new speculative architecture, it will be easier to build the Eagle3 from there.

Makes sense to me!

Inside the process() method, we can analyze the contents of the batch and decide which embeddings to store for later, i.e. the embeddings for the last token of each sequence in the batch, and which to copy directly to the draft context for processing with the current batch.

Regarding “the embeddings for the last token of each sequence in the batch”: Eagle3 needs embeddings for every prompt token, rather than only the last token. Will this be supported in the current design?

With this API, utilize it to extract both the hidden target embeddings and the MTP-generated embeddings. There is no need to distinguish explicitly between the two types of embeddings.

Will this apply to Eagle3 as well? By the way, Eagle3 also uses the pre_norm embedding from the draft model as input for the next draft-token generation.

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length
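
In code, the selection rule from the last two bullets might look like this (a sketch; the per-speculator stats fields are assumptions based on the commit message):

    // pick the speculator with the highest expected number of accepted tokens:
    // acceptance rate so far (n_acc_tokens / n_gen_drafts) times the draft length
    size_t i_best = 0;
    float  score_best = 0.0f;

    for (size_t i = 0; i < impls.size(); ++i) {
        const float p_acc = impls[i].n_gen_drafts > 0 ?
            (float) impls[i].n_acc_tokens / impls[i].n_gen_drafts : 1.0f;

        const float score = p_acc * impls[i].draft.size();

        if (score > score_best) {
            score_best = score;
            i_best = i;
        }
    }

    // the tokens drafted by impls[i_best] are the ones we verify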
@ggerganov
Member Author

Regarding “the embeddings for the last token of each sequence in the batch”: Eagle3 needs embeddings for every prompt token, rather than only the last token. Will this be supported in the current design?

Yes. Take a look at how it is done in am17an#6.

Will this apply to Eagle3 as well? By the way, Eagle3 also uses the pre_norm embedding from the draft model as input for the next draft-token generation.

Yes, for the Eagle3 pre_norm embeddings, do the same as we do in the MTP code.

@ggerganov
Member Author

Merging after CI is green.

@petersid2022
Contributor

thanks for including some of my changes to this PR! should i amend/reword the last commit to remove the out-of-date description that i originally left?

@ggerganov
Member Author

No need, it's not a problem if it is a bit outdated.

@ggerganov ggerganov changed the base branch from gg/spec-refactor-ctx to master May 11, 2026 16:08
@ggerganov ggerganov dismissed ServeurpersoCom’s stale review May 11, 2026 16:08

The base branch was changed.

@ggerganov ggerganov requested a review from ngxson as a code owner May 11, 2026 16:08
@ggerganov ggerganov merged commit 68e7ea3 into master May 11, 2026
45 checks passed
@ggerganov ggerganov deleted the gg/spec-refactor-parallel branch May 11, 2026 16:09
@ggerganov ggerganov mentioned this pull request May 11, 2026
mudler added a commit to mudler/LocalAI that referenced this pull request May 11, 2026
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit to mudler/LocalAI that referenced this pull request May 12, 2026
…ec-decoding options (#9765)

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images through the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
