
Write a readme on Multi-GPU usage in llama.cpp #22729

Merged
JohannesGaessler merged 4 commits into ggml-org:master from gaugarg-nv:multi-gpu-read-me on May 7, 2026

Conversation

@gaugarg-nv
Contributor

Document known issues and provide troubleshooting for multi-GPU usage in llama.cpp

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. The initial document was written by AI. I have made some correctness fixes and edits.

gaugarg-nv requested a review from ggerganov as a code owner on May 5, 2026 at 18:03
@gaugarg-nv
Contributor Author

@JohannesGaessler could you please take a look at this?

github-actions bot added the documentation (Improvements or additions to documentation) label on May 5, 2026
@JohannesGaessler
Contributor

Sorry, I had looked at this from my phone and thought this was an issue asking me to do it rather than a PR that already did it. I'll review it later today.

@rankaiyx
Contributor

rankaiyx commented May 7, 2026

It seems like some information about the environment variable GGML_CUDA_P2P is missing?
#21910

Contributor

@JohannesGaessler left a comment


This is not so much an issue with this particular PR but rather a question of how to handle legacy code going forward: --split-mode row is definitely deprecated and I intend to remove it. But I would say that eventually we should also deprecate and remove --split-mode none and instead handle this via -dev, which I would say is the recommended way to select GPUs.
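For illustration, a minimal sketch of selecting GPUs with -dev instead of a split mode (the model path is a placeholder; the device names follow the CUDA0,CUDA1 convention used later in this thread):

# Run entirely on the first CUDA device (roughly the use case --split-mode none covers today).
./llama-server -m model.gguf -ngl 99 -dev CUDA0
# Split across both devices with the default layer split.
./llama-server -m model.gguf -ngl 99 -dev CUDA0,CUDA1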

gaugarg-nv and others added 2 commits May 7, 2026 18:32
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@gaugarg-nv
Copy link
Copy Markdown
Contributor Author

gaugarg-nv commented May 7, 2026

Thanks for the review and edits @JohannesGaessler.

I have addressed the remaining comments and also added a small section on GGML_CUDA_P2P as suggested by @rankaiyx.
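A minimal sketch of how the variable might be set at launch (the 0/1 value semantics here are an assumption, not confirmed in this thread; see #21910 and the new section in docs/multi-gpu.md for the authoritative description):

# Assumption: GGML_CUDA_P2P is read from the environment at startup and accepts 0/1.
GGML_CUDA_P2P=0 ./llama-server -m model.gguf -ngl 99 -dev CUDA0,CUDA1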

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@JohannesGaessler
Contributor

@ggerganov can you re-approve since the last commit has me as a co-author?

@adcape

adcape commented May 7, 2026

@JohannesGaessler Sorry for chiming in, as I'm just a common user. But can you please reconsider removing '-sm row', at least until '-sm tensor' becomes fully functional? It gives a decent speed-up for generation with dense models compared to the default (-sm layer).
For example, with my setup (2 x 5060Ti-16), when running the following
./llama-server -dev CUDA0,CUDA1 -ngl 99 -ts 14,15 -c 128000 --host 0.0.0.0 -hf bartowski/Qwen_Qwen3.6-27B-GGUF:Q6_K_L -fa on --temp 1.0 --top-p 0.95 --min-p 0.0 --top-k 20 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --no-mmap -sm row
I'm getting prompt processing at 438.2 t/s and generation at 20.8 t/s with an actual context length of ~5k.
Without -sm row, pp is a bit faster (484.7 t/s), but generation is much slower (16.8 t/s) on the same task. This behavior is consistent across different dense models.
I couldn't test -sm tensor with the same model, not only because it doesn't support a quantised cache yet, but also because Qwen3.6 crashes with '-sm tensor' regardless of context length and other settings.
So, if it is possible and doesn't create much extra work for you and the other developers, can you please consider keeping '-sm row' at least until '-sm tensor' works fine with all models, properly supports backends other than CUDA (at least Vulkan), and works with a quantised KV cache? It would be a pity to lose generation speed or have to compromise on context length when -sm row works just fine for now.
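For anyone wanting to reproduce such comparisons quickly, a sketch using llama-bench (the model path is a placeholder; the flags mirror the split modes discussed above):

# Compare the default layer split against the row split on the same model.
./llama-bench -m model.gguf -ngl 99 -sm layer
./llama-bench -m model.gguf -ngl 99 -sm row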

JohannesGaessler merged commit b9afc19 into ggml-org:master on May 7, 2026
2 checks passed
@JohannesGaessler
Contributor

I intend to remove --split-mode row after --split-mode tensor gains support for quantized KV caches, at which point it should be fully obsolete.
