Conversation
Currently the draft model actually doesn't see the vision tokens and can never attend to them, so it's just a silent degradation which lowers the acceptance rate in master, right? Not a correctness issue though.
Yes, on master. With this PR, we now feed the image tokens to the draft context too, so the acceptance rate increases for these use cases - as long as the draft model is trained to understand the image embeddings. Note that we make an important assumption: that the draft model can consume the same image embeddings as the target model. For draft-model based speculative decoding, this is rarely correct. This can be improved by having a separate ...
Another reason why it's probably not worth adding the extra logic is that the image tokens usually constitute a very small part of the model input, unless someone specifically processes only images (like OCR), but then they're probably better off using a dedicated OCR model anyway.
I think this PR is ready to merge. I'll start working on one more refactor of the speculative code to enable parallel drafting - it will be nice to have this prepared before we proceed with introducing the speculative decoding methods.

A few other points:

Yeah, we need to improve spec-decoding tests at some point.
```cpp
// TODO: avoid restoring the draft context and re-evaluating the drafted tokens when not needed [TAG_SPEC_AVOID_DRAFT_REEVAL]
// for now, always re-evaluate for simplicity
// ref: https://github.com/ggml-org/llama.cpp/pull/22728#issuecomment-4400925384
//
// | spec type   | need re-eval |
// | ---         | ---          |
// | draft model | no           | because the draft model does not use embeddings from the target
// | MTP (std)   | yes          |
// | MTP Gemma4  | no           | because the KV cache is shared
// | Eagle3      | yes          |
// | DFlash      | yes?         |
//
if (ctx_dft) {
    // TODO: update as needed for MTP, Eagle3, etc.
    const bool need_tgt_embd = false;

    if (need_tgt_embd) {
        llama_synchronize(ctx_tgt);
    }

    // the logic here varies depending on the speculative decoding method
    // - some draft contexts require embeddings from the target context, others don't
    // - some draft contexts involve an encoder step to transform the target embeddings to draft embeddings
    // TODO: extract this in a function ?
    {
        // TODO: hook the embeddings from the last target batch here
        if (llama_model_has_encoder(model_dft.get())) {
            //llama_encode(ctx_dft, ...);

            GGML_ABORT("not implemented yet\n");
        }

        const int ret = llama_decode(ctx_dft.get(), batch_view);

        if (ret != 0) {
            SRV_ERR("failed to decode draft batch, ret = %d\n", ret);

            // TODO: handle error
            break;
        }
    }
}
```
|
|
@am17an @ruixiang63 This will be the most important point - here we will have to hook the target embeddings. Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.
Let me know if you see any potential problems at this stage.
This will eventually become a function of the common_speculative API, for example:

```cpp
bool common_speculative_process(
        common_speculative * spec,
        llama_context      * ctx_tgt,
        llama_batch          batch,
        ...);
```

So we will be able to specialize the logic per spec-decoding method in common/speculative.cpp. But for now, keep it here until we get something running.
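For illustration, here is a minimal sketch of how such a per-method dispatch could look. The common_spec_type enum, the struct layout, and the embedding hook are hypothetical; only llama_decode and llama_synchronize are existing llama.cpp API:

```cpp
#include "llama.h"

// hypothetical enum - mirrors the "need re-eval" table above
enum common_spec_type {
    COMMON_SPEC_TYPE_DRAFT_MODEL, // standalone draft model - no target embeddings needed
    COMMON_SPEC_TYPE_MTP,         // conditions on target embeddings
    COMMON_SPEC_TYPE_EAGLE3,      // conditions on target embeddings + encoder step
};

struct common_speculative {
    common_spec_type type;
    llama_context  * ctx_dft;
};

bool common_speculative_process(common_speculative * spec, llama_context * ctx_tgt, llama_batch batch) {
    switch (spec->type) {
        case COMMON_SPEC_TYPE_DRAFT_MODEL:
            // the draft model conditions only on tokens - nothing to hook
            break;
        case COMMON_SPEC_TYPE_MTP:
        case COMMON_SPEC_TYPE_EAGLE3:
            // these methods condition on target embeddings - wait for the target
            // context to finish, then hook its embeddings into the draft input
            llama_synchronize(ctx_tgt);
            // ... transfer the target embeddings to the draft context here ...
            break;
    }

    return llama_decode(spec->ctx_dft, batch) == 0;
}
```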
> // | DFlash | yes? |
Yes, it does. The KV cache of the DFlash draft model also depends on the compressed target-model features for the verified tokens. See: #22728 (comment)
> Either by getting/setting the tensors from the target context to the draft context, or by transferring through host memory. Still not clear to me at this point.
I would propose transferring through host memory first. Although this is not fully optimal, it is simple and does not require additional GPU buffers, so we can save GPU VRAM.
Regarding performance, speculative decoding already achieves a significant speedup, and transferring embeddings through host memory is not the bottleneck. Even if we avoided the host-memory round trip, the benefit would likely be negligible. For example, the current Eagle3 and DFlash PRs both transfer via host memory and still show significant speedups.
Especially for multi-sequence use cases, keeping the embeddings directly on the GPU could consume a substantial amount of VRAM.
We need to make sure we transfer ubatch embeddings rather than the full batch embeddings, and consume each ubatch immediately once it arrives in the draft context. This way, we can avoid OOM issues.
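To make the per-ubatch transfer concrete, here is a minimal sketch of the host round trip. It assumes the target context was created with embeddings output enabled and that every token of the ubatch has an output; the helper itself is hypothetical, while llama_synchronize, llama_get_embeddings_ith, and the embd field of llama_batch are existing API:

```cpp
#include "llama.h"

#include <cstring>
#include <vector>

// hypothetical helper: copy the target embeddings of one ubatch to host memory
// and feed them to the draft context as an embeddings-input batch
static bool transfer_ubatch_embd(llama_context * ctx_tgt, llama_context * ctx_dft,
                                 const llama_batch & ubatch, int n_embd) {
    llama_synchronize(ctx_tgt); // make sure the target outputs are ready

    // host staging buffer - sized per ubatch, not per batch, to bound memory use
    std::vector<float> embd_host((size_t) ubatch.n_tokens * n_embd);
    for (int i = 0; i < ubatch.n_tokens; ++i) {
        const float * src = llama_get_embeddings_ith(ctx_tgt, i);
        memcpy(embd_host.data() + (size_t) i*n_embd, src, n_embd*sizeof(float));
    }

    // reuse the ubatch metadata (pos, seq_id, ...), but switch the input from
    // tokens to embeddings
    llama_batch batch_dft = ubatch;
    batch_dft.token = nullptr;
    batch_dft.embd  = embd_host.data();

    // consume immediately so the staging buffer can be reused for the next ubatch
    return llama_decode(ctx_dft, batch_dft) == 0;
}
```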
I think for PP it matters a lot
Right. But with PP, layers are split across different GPUs. For example, suppose Eagle3 needs features from layer 1 and layer 2, where layer 1 is on GPU0 and layer 2 is on GPU1. I assume the draft model will also be placed on each GPU, since it is small and does not make much sense to split it across GPUs.
In this case, how do we obtain the required features? I guess we would also need to go through host memory, right?
Yes, host memory is fine for now - we can look at optimising it later.
I feel option 2 is a more elegant way to handle this. We can also keep the host-copy or device-copy logic behind llama_set_input_embeddings:

```cpp
llama_decode(ctx_tgt, batch);

ggml_tensor * embd_src = llama_get_pre_norm_embeddings(ctx_tgt);
llama_set_input_embeddings(ctx_dft, embd_src);

llama_decode(ctx_dft, batch_dft);
```
BTW @ggerganov there is a subtlety: we need to store the previous batch's last-token embeddings and pair them up with the current batch shifted by 1, because MTP is conditioned on the embedding of the previous token.
Hm, yes it's the same. What embeddings do we use for the very first token x_0?
I just treat it as 0 for MTP - I think that's a reasonable choice.
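A small self-contained sketch of this bookkeeping (all names here are hypothetical): the previous batch's last embedding is paired with the current batch shifted by 1, and zeros are used for the very first token x_0:

```cpp
#include <vector>

// hypothetical state carried between batches
struct mtp_state {
    std::vector<float> embd_prev_last; // last-token embedding of the previous batch
};

// embd_cur[i]  is the target embedding of token i of the current batch
// embd_cond[i] is the embedding the draft of token i is conditioned on
static void pair_shifted(mtp_state & st, const std::vector<std::vector<float>> & embd_cur,
                         size_t n_embd, std::vector<std::vector<float>> & embd_cond) {
    if (st.embd_prev_last.empty()) {
        st.embd_prev_last.assign(n_embd, 0.0f); // x_0 has no predecessor - use zeros
    }

    embd_cond.clear();
    embd_cond.push_back(st.embd_prev_last);  // token 0 pairs with the previous batch's last token
    for (size_t i = 0; i + 1 < embd_cur.size(); ++i) {
        embd_cond.push_back(embd_cur[i]);    // token i+1 pairs with token i
    }

    if (!embd_cur.empty()) {
        st.embd_prev_last = embd_cur.back(); // carry over for the next batch
    }
}
```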
A question came to mind about the draft enc-dec context: how do we handle the draft model encoder → decoder embedding transfer in the multi-slot case? Should we refactor this? Maybe there's already a better solution that I'm not aware of? @ggerganov
With parallel decoding, we don't do actual parallel calls to llama_decode - the tokens from all sequences are batched and processed together in a single call.
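For reference, a sketch of that pattern (common_batch_add and llama_batch_init are the existing helpers; the slot layout is simplified and the function name is hypothetical):

```cpp
#include "llama.h"
#include "common.h"

#include <vector>

// decode the next token of every active slot with a single llama_decode call
static void decode_all_slots(llama_context * ctx, const std::vector<llama_token> & slot_tokens,
                             const std::vector<llama_pos> & slot_pos) {
    llama_batch batch = llama_batch_init((int32_t) slot_tokens.size(), 0, 1);

    for (size_t s = 0; s < slot_tokens.size(); ++s) {
        // each slot maps to its own sequence id within the shared batch
        common_batch_add(batch, slot_tokens[s], slot_pos[s], { (llama_seq_id) s }, true);
    }

    llama_decode(ctx, batch); // one call serves all sequences

    llama_batch_free(batch);
}
```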
```diff
 if (!vocab_cmpt) {
-    LOG_WRN("the target and draft vocabs are not compatible - tokens will be translated between the two\n");
+    LOG_ERR("%s: the target and draft vocabs are not compatible\n", __func__);
```
This seems to break EAGLE3, since EAGLE3 uses a reduced draft vocab that is much smaller than the target vocab size. https://github.com/ruixiang63/llama.cpp/blob/4567954ab06b792253582872673326df31e2ac7d/src/llama-arch.h#L564

We could add a branch in common_speculative_are_compatible, e.g.:

```cpp
if (has_d2t) {
    // skip the size-diff and per-id token-text checks for d2t-equipped drafts
} else {
    // existing checks
}
```
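Fleshing that out a bit, a sketch of what the adjusted check could look like. How has_d2t is detected is left open, and the function shape is simplified relative to the real common_speculative_are_compatible; llama_vocab_n_tokens and llama_vocab_get_text are existing API:

```cpp
#include "llama.h"

#include <cstring>

// hypothetical simplified check - the real function also compares special tokens, etc.
static bool spec_vocabs_compatible(const llama_vocab * vocab_tgt, const llama_vocab * vocab_dft, bool has_d2t) {
    if (has_d2t) {
        // drafts that ship a draft-to-target (d2t) token mapping are allowed a
        // smaller vocab - skip the size-diff and per-id token-text checks
        return true;
    }

    const int32_t n_tgt = llama_vocab_n_tokens(vocab_tgt);
    if (n_tgt != llama_vocab_n_tokens(vocab_dft)) {
        return false;
    }

    // per-id token-text checks
    for (llama_token i = 0; i < n_tgt; ++i) {
        if (strcmp(llama_vocab_get_text(vocab_tgt, i), llama_vocab_get_text(vocab_dft, i)) != 0) {
            return false;
        }
    }

    return true;
}
```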
Yes, we can adjust if needed. Each speculative method could provide its own implementation of common_speculative_are_compatible().
Overview

Refactor the speculative code to use a single unified "draft" `llama_context` for all sequences. The draft context is synchronized with the main context. This has several advantages:

- `common/speculative.cpp` is significantly simplified
- ...

Backwards incompatible changes:

- ...

Test commands:

- ...

TODO:

- `speculative-simple.cpp`