
Write a readme on Multi-GPU usage in llama.cpp #22729

Merged
JohannesGaessler merged 4 commits into ggml-org:master from gaugarg-nv:multi-gpu-read-me on May 7, 2026

Conversation

@gaugarg-nv
Contributor

Document known issues and provide troubleshooting for multi-GPU usage in llama.cpp

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. The initial document was written by AI. I have made some correctness fixes and edits.

gaugarg-nv requested a review from ggerganov as a code owner on May 5, 2026 at 18:03
@gaugarg-nv
Contributor Author

@JohannesGaessler could you please take a look at this?

github-actions bot added the documentation (Improvements or additions to documentation) label on May 5, 2026
@JohannesGaessler
Contributor

Sorry, I had looked at this from my phone and thought this was an issue asking me to do it rather than a PR that already did it. I'll review it later today.

@rankaiyx
Contributor

rankaiyx commented May 7, 2026

It seems like some information about the environment variable GGML_CUDA_P2P is missing?
#21910

Contributor

@JohannesGaessler left a comment


This is not so much an issue with this particular PR but rather a question of how to handle legacy code going forward: --split-mode row is definitely deprecated and I intend to remove it. But I would say that eventually we should also deprecate and remove --split-mode none and instead handle this via -dev, which I would say is the recommended way to select GPUs.
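For illustration, a minimal sketch of selecting GPUs with -dev instead of a split mode (the model path is a placeholder; the device names follow the CUDA0,CUDA1 convention used later in this thread):

# Run entirely on the first CUDA device (roughly the use case --split-mode none covers today).
./llama-server -m model.gguf -ngl 99 -dev CUDA0
# Split across both devices with the default layer split.
./llama-server -m model.gguf -ngl 99 -dev CUDA0,CUDA1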

gaugarg-nv and others added 2 commits May 7, 2026 18:32
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@gaugarg-nv
Copy link
Copy Markdown
Contributor Author

gaugarg-nv commented May 7, 2026

Thanks for the review and edits @JohannesGaessler.

I have addressed the remaining comments and also added a small section on GGML_CUDA_P2P as suggested by @rankaiyx.
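A minimal sketch of how the variable might be set at launch (the 0/1 value semantics here are an assumption, not confirmed in this thread; see #21910 and the new section in docs/multi-gpu.md for the authoritative description):

# Assumption: GGML_CUDA_P2P is read from the environment at startup and accepts 0/1.
GGML_CUDA_P2P=0 ./llama-server -m model.gguf -ngl 99 -dev CUDA0,CUDA1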

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@JohannesGaessler
Contributor

@ggerganov can you re-approve since the last commit has me as a co-author?

@adcape

adcape commented May 7, 2026

@JohannesGaessler Sorry for chiming in, as I'm just a common user. But can you please reconsider removing '-sm row', at least until '-sm tensor' becomes fully functional? It gives a decent speed-up for generation with dense models compared to the default (-sm layer).
For example, with my setup (2 x 5060Ti-16), when running the following
./llama-server -dev CUDA0,CUDA1 -ngl 99 -ts 14,15 -c 128000 --host 0.0.0.0 -hf bartowski/Qwen_Qwen3.6-27B-GGUF:Q6_K_L -fa on --temp 1.0 --top-p 0.95 --min-p 0.0 --top-k 20 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --no-mmap -sm row
I'm getting prompt processing at 438.2 t/s and generation at 20.8 t/s with an actual context length of ~5k.
Without -sm row, pp is a bit faster (484.7 t/s), but generation is much slower (16.8 t/s) on the same task. This behavior is consistent across different dense models.
I couldn't test -sm tensor with the same model, not only because it doesn't support a quantised cache yet, but also because Qwen3.6 crashes with '-sm tensor' regardless of context length and other settings.
So, if it is possible and doesn't create much extra work for you and the other developers, can you please consider keeping '-sm row' at least until '-sm tensor' works fine with all models, properly supports backends other than CUDA (at least Vulkan), and works with a quantised KV cache? It would be a pity to lose generation speed or have to compromise on context length when -sm row works just fine for now.
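For anyone wanting to reproduce such comparisons quickly, a sketch using llama-bench (the model path is a placeholder; the flags mirror the split modes discussed above):

# Compare the default layer split against the row split on the same model.
./llama-bench -m model.gguf -ngl 99 -sm layer
./llama-bench -m model.gguf -ngl 99 -sm row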

JohannesGaessler merged commit b9afc19 into ggml-org:master on May 7, 2026
2 checks passed
@JohannesGaessler
Contributor

I intend to remove --split-mode row after --split-mode tensor gains support for quantized KV caches, at which point it should be fully obsolete.
