
spec : parallel drafting support #22838

Merged
ggerganov merged 36 commits into master from gg/spec-refactor-parallel on May 11, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented May 8, 2026

Overview

cont #22787 (rebased on top of it)

With these changes:

  • The draft context can generate speculative drafts for multiple sequences in parallel
  • The draft "sees" the multimodal data
  • Multiple speculative types can be chained, e.g. --spec-type ngram-mod,mtp

server_context

  • Single common_speculative context for all slots, capable of handling multiple sequence ids
  • Extract the drafting logic from server_slot::update_batch() and parallelize it across the active slots
  • Disable adjusting speculative decoding params (such as n_min, n_max, p_min) via the HTTP API

common/speculative

  • Rework the API to support drafting for multiple sequences at once (see the sketch after this list)
  • Add struct common_speculative_draft_params
  • Update draft-model based implementation to decode multiple draft sequences in parallel
  • Add common_speculative_process(spec, batch) for feeding the prompt tokens through the speculative context
  • Remove common_speculative_n_max
  • Remove common_speculative_n_min
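
For orientation, the reworked surface could look roughly like the sketch below. Only common_speculative_draft_params and common_speculative_process() are named in this PR; the params fields and the draft() signature are illustrative assumptions:

    // common/speculative.h (sketch - the fields and the draft() signature are assumptions)
    struct common_speculative_draft_params {
        int   n_draft; // max number of tokens to draft per sequence
        float p_min;   // min probability required to keep drafting
    };

    // feed the same tokens that the target context processes
    bool common_speculative_process(struct common_speculative * spec, const struct llama_batch & batch);

    // generate a draft for the given sequence, continuing from id_last
    llama_tokens common_speculative_draft(
            struct common_speculative * spec,
            const common_speculative_draft_params & params,
            llama_seq_id seq_id,
            llama_token id_last);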

llama_context

  • Avoid extra backend buffer allocations during llama_state_seq_get_data_ext() with the LLAMA_STATE_SEQ_FLAGS_ON_DEVICE flag (this can be merged directly to master)

common/args

TODOs

  • Clean-up

Requirements

@ggerganov ggerganov changed the title spec : refactor for multi-sequence speculative context spec : parallel drafting support May 8, 2026
@am17an am17an mentioned this pull request May 9, 2026
@ggerganov
Member Author

ggerganov commented May 9, 2026

@am17an @ruixiang63 This refactor is ready. I also went ahead and introduced the common_speculative_process() API. We can now write specialized logic for the different speculative decoding methods in common/speculative.cpp by implementing the process() method of each implementation. The most basic one is the draft-model based one - it simply processes the same batch as the target context, without any extra data from it:

    bool process(const llama_batch & batch) override {
        auto * ctx_dft = params.ctx_dft;

        const int ret = llama_decode(ctx_dft, batch);
        if (ret != 0) {
            LOG_ERR("%s: failed to decode draft batch, ret = %d\n", __func__, ret);
            return false;
        }

        return true;
    }
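
A hypothetical call site, to show how this slots into the server loop. Only common_speculative_process() comes from this PR; the surrounding names are illustrative:

    // keep the draft side in sync by feeding it the same batch as the target
    common_speculative_process(spec, batch);

    const int ret_tgt = llama_decode(ctx_tgt, batch);
    if (ret_tgt != 0) {
        // handle target decode failure
    }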

Next I'll start looking into hooking the target embeddings. I'll probably start from the MTP PR #22673 as it seems to be the easiest speculative approach. Eagle3 has some extra complexities:

  • More target features to extract
  • Encoder step
  • Different vocabs

So I think after we fit the MTP implementation into the new speculative architecture, it will be easier to build the Eagle3 from there.

I'm already having some ideas how to efficiently move the target embeddings between the contexts. Inside the process() method, we can analyze the contents of the batch and decide which embeddings to store for later (i.e. the embeddings for the last token of each sequence in the batch) and which to copy directly to the draft context for processing with the current batch. Anyway, I think I am starting to wrap my head around it, but it is still not 100% clear.
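
To make the idea concrete, a rough sketch of that batch analysis inside process(). The embd_* helpers and is_last_token_of_seq() are hypothetical names, not part of this PR:

    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        const llama_seq_id seq_id = batch.seq_id[i][0];

        if (is_last_token_of_seq(batch, i)) {
            // store for later: this embedding seeds the next drafting round for seq_id
            embd_store(seq_id, embd_tgt(i));
        } else {
            // copy directly to the draft context for processing with the current batch
            embd_copy_to_draft(i, embd_tgt(i));
        }
    }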

If you are prototyping stuff, it would be great to base it on top of this branch. There are comments in the code to clarify the logic, but if something is unclear - let me know.

@ggerganov
Member Author

I'm already having some ideas how to efficiently move the target embeddings between the contexts.

For now, we should definitely focus on an initial implementation that uses host roundtrips. The device-to-device version is possible, but it first requires a refactoring that introduces the concept of "on-device input/output batch embeddings" in the llama_context. I now have quite a good idea how to do this - will try to spec it out soon. But IMO we should do this in a later iteration as it's a relatively big change.

For the initial implementation, my current suggestion is to mirror the existing llama_get_embeddings_* API in order to at least utilize the pinned host memory. This should perform better than the current direct calls to ggml_backend_tensor_get() used in #22673. Basically, add corresponding "pre-norm output embedding" calls in llama-ext for the following functions:

    // llama-ext.h
    // (not in llama.h since this is a stop-gap solution to get something mergeable)

    // mirrors:
    // LLAMA_API void llama_set_embeddings(struct llama_context * ctx, bool embeddings);
    LLAMA_API void llama_set_embeddings_pre_norm(struct llama_context * ctx, bool value);

    // mirrors:
    // LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);
    LLAMA_API float * llama_get_embeddings_pre_norm(struct llama_context * ctx);

    // mirrors:
    // LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);
    LLAMA_API float * llama_get_embeddings_pre_norm_ith(struct llama_context * ctx, int32_t i);

The llama_context should reserve extra pinned-host buffer space for the pre-norm embeddings (see llama_context::output_reserve()) and asynchronously extract them after every ubatch (see llama_context::decode()); i.e., the entire logic mirrors the existing "output embeddings" logic.
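
As a reference point, the mirrored extraction in llama_context::decode() could look roughly like this (a sketch only; the member names around t_embd_pre_norm are assumptions):

    if (embeddings_pre_norm && t_embd_pre_norm != nullptr) {
        // async copy from the device tensor into the pinned-host output buffer,
        // exactly like the existing output-embeddings extraction
        ggml_backend_t backend_embd = ggml_backend_sched_get_tensor_backend(sched, t_embd_pre_norm);

        float * dst = embd_pre_norm + n_outputs_prev*n_embd;

        ggml_backend_tensor_get_async(backend_embd, t_embd_pre_norm, dst, 0, n_outputs_new*n_embd*sizeof(float));
    }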

With this API, utilize it to extract both the hidden target embeddings and the MTP-generated embeddings. There is no need to distinguish explicitly between the two types of embeddings. I.e. this code:

    // MTP related inputs/outputs
    ggml_tensor * t_h_pre_norm  = nullptr; // [n_embd, n_outputs] hidden state required for MTP
    ggml_tensor * t_mtp_out     = nullptr; // [n_embd, n_tokens]

Should be just:

    ggml_tensor * t_embd_pre_norm  = nullptr; // [n_embd, n_outputs] embeddings before the output norm

This new llama-ext API should be enough to get the MTP working. There is no need to access the hidden state tensors from the user code, so there should be no changes to llama.h at all.

The rest of the changes from #22673 should be moved into a new common/speculative MTP implementation in the respective process()/draft() methods.

@am17an
Contributor

am17an commented May 10, 2026

@ggerganov okay let me work on this, will send you my working branch once I have something.

@ggerganov
Member Author

Ok great. I think the only slightly complicated part would be adding the logic for analyzing the contents of the batch in process() and draft() to determine which sequences participate and to map the embedding offsets correctly. Basically, the "pending" logic that you already have should be extended to work per sequence id.

An initial simple version could be to assert that there is a single sequence id inside the batch, and when this is working, we can extend it to multi-sequence.
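
That initial version could be as small as this (a sketch, not final logic):

    // reject batches that mix sequence ids so the embedding offsets map trivially
    llama_seq_id seq_id_batch = -1;

    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        GGML_ASSERT(batch.n_seq_id[i] == 1);

        if (seq_id_batch == -1) {
            seq_id_batch = batch.seq_id[i][0];
        }

        GGML_ASSERT(batch.seq_id[i][0] == seq_id_batch && "multi-sequence batches not supported yet");
    }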

@am17an
Contributor

am17an commented May 11, 2026

@ggerganov can you review am17an#6? It seems to work with these changes. cc @ngxson as well

@ruixiang63

Thanks very much for the work! @ggerganov
I will also start an early prototype based on this refactoring, though it will remain preliminary since unified feature extraction from the target model is still not finalized.

So I think after we fit the MTP implementation into the new speculative architecture, it will be easier to build the Eagle3 from there.

Makes sense to me!

Inside the process() method, we can analyze the contents of the batch and decide which embeddings to store for later, i.e. the embeddings for the last token of each sequence in the batch, and which to copy directly to the draft context for processing with the current batch.

Regarding “the embeddings for the last token of each sequence in the batch”: Eagle3 needs embeddings for every prompt token, rather than only the last token. Will this be supported in the current design?

With this API, utilize it to extract both the hidden target embeddings and the MTP-generated embeddings. There is no need to distinguish explicitly between the two types of embeddings.

Will this apply to Eagle3 as well? By the way, Eagle3 also uses the pre_norm embedding from the draft model as input for the next draft-token generation.

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length
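
In code, the selection rule from the last two bullets might look like this (a sketch; the per-speculator stats fields are assumptions based on the commit message):

    // pick the speculator with the highest expected number of accepted tokens:
    // acceptance rate so far (n_acc_tokens / n_gen_drafts) times the draft length
    size_t i_best = 0;
    float  score_best = 0.0f;

    for (size_t i = 0; i < impls.size(); ++i) {
        const float p_acc = impls[i].n_gen_drafts > 0 ?
            (float) impls[i].n_acc_tokens / impls[i].n_gen_drafts : 1.0f;

        const float score = p_acc * impls[i].draft.size();

        if (score > score_best) {
            score_best = score;
            i_best = i;
        }
    }

    // the tokens drafted by impls[i_best] are the ones we verify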
@ggerganov
Member Author

Regarding “the embeddings for the last token of each sequence in the batch”: Eagle3 needs embeddings for every prompt token, rather than only the last token. Will this be supported in the current design?

Yes. Take a look at how it is done in am17an#6.

Will this apply to Eagle3 as well? By the way, Eagle3 also uses the pre_norm embedding from the draft model as input for the next draft-token generation.

Yes, for the Eagle3 pre_norm embeddings, do the same as we do in the MTP code.

@ggerganov
Member Author

Merging after CI is green.

@petersid2022
Contributor

thanks for including some of my changes to this PR! should i amend/reword the last commit to remove the out-of-date description that i originally left?

@ggerganov
Member Author

No need, it's not a problem if it is a bit outdated.

@ggerganov ggerganov changed the base branch from gg/spec-refactor-ctx to master May 11, 2026 16:08
@ggerganov ggerganov dismissed ServeurpersoCom’s stale review May 11, 2026 16:08

The base branch was changed.

@ggerganov ggerganov requested a review from ngxson as a code owner May 11, 2026 16:08
@ggerganov ggerganov merged commit 68e7ea3 into master May 11, 2026
45 checks passed
@ggerganov ggerganov deleted the gg/spec-refactor-parallel branch May 11, 2026 16:09
@ggerganov ggerganov mentioned this pull request May 11, 2026
mudler added a commit to mudler/LocalAI that referenced this pull request May 11, 2026
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit to mudler/LocalAI that referenced this pull request May 12, 2026
…ec-decoding options (#9765)

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images through the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
