1 change: 1 addition & 0 deletions README.md
@@ -529,6 +529,7 @@ To learn more about model quantization, [read this documentation](tools/quantize
- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Multi-GPU usage](docs/multi-gpu.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks)

163 changes: 163 additions & 0 deletions docs/multi-gpu.md
@@ -0,0 +1,163 @@
# Using multiple GPUs with llama.cpp

This guide explains how to run [llama.cpp](https://github.com/ggml-org/llama.cpp) across more than one GPU. It covers the split modes, the command-line flags that control them, the limitations you need to know about, and ready-to-use recipes for `llama-cli` and `llama-server`.

The same flags work in both tools - they share `common/arg.cpp`.

---

## When you need multi-GPU

Reach for multi-GPU when one of these is true:

- **The model doesn't fit in a single GPU's VRAM.** Spread the weights across two or more GPUs so the whole model can stay on accelerators.
- **You want more throughput.** Distributing computation across GPUs can reduce per-token latency or lift batch throughput, depending on the split mode.

---

## The split modes

Set with `--split-mode` / `-sm`.

| Mode | What it does | When to use |
|---|---|---|
| `none` | Use a single GPU only. Pick which one with `--main-gpu`. | You explicitly want to confine the model to one GPU even though more are visible. |
| `layer` (**default**) | Pipeline parallelism. Each GPU holds a contiguous slice of layers. The KV cache for layer *l* lives on the GPU that owns layer *l*. | Default and most compatible multi-GPU choice. |
| `row` | **Deprecated.** Older row-split tensor-parallel path. Superseded by `tensor`. Avoid in new deployments. | Not recommended. |
| `tensor` | **EXPERIMENTAL.** Tensor parallelism that splits both weights *and* KV across the participating GPUs via a "meta device" abstraction. | If you want true TP including KV distribution and your model architecture is supported (see limitations below). Treat as experimental - verify correctness on your model before relying on it. |

> Pipeline parallel (`layer`) vs. tensor parallel (`tensor`): pipeline-parallel runs different layers on different GPUs and processes tokens sequentially through the pipeline. Tensor-parallel splits each layer across GPUs and does cross-GPU reductions inside every layer. Pipeline-parallel maximizes batch throughput; tensor-parallel can reduce single-token latency but adds communication on every layer.
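
For a quick side-by-side, here is a minimal sketch of the two invocations (details in the recipes below; `model.gguf` is a placeholder for your model file):

```bash
# Pipeline parallel (default): each GPU holds a contiguous slice of layers.
llama-cli -m model.gguf -ngl all -sm layer

# Tensor parallel (experimental): weights and KV are split inside every layer.
llama-cli -m model.gguf -ngl all -sm tensor -fa on -fit off
```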

---

## Command-line flags reference

| Short | Long | Value | Default | Notes |
|---|---|---|---|---|
| `-sm` | `--split-mode` | `none` \| `layer` \| `row` (deprecated) \| `tensor` | `layer` | See modes above. |
| `-ts` | `--tensor-split` | comma-separated proportions, e.g. `3,1` | mode-dependent | How much of the model goes to each GPU. If omitted, `layer`/`row` use automatic memory-based splitting, while `tensor` splits tensor segments evenly. With `3,1` on two GPUs, GPU 0 gets 75 %, GPU 1 gets 25 %. The values follow the order in `--device`. |
| `-mg` | `--main-gpu` | integer device index | `0` | The single GPU used in `--split-mode none`. |
| `-ngl` | `--n-gpu-layers` / `--gpu-layers` | integer \| `auto` \| `all` | `auto` | Maximum number of layers to keep in VRAM. Use `999` or `all` to push everything possible to the GPUs. |
| `-dev` | `--device` | comma-separated device names, or `none` | auto | Restrict which devices llama.cpp may use. See `--list-devices` for names. |
| | `--list-devices` | - | - | Print the available devices and their memory. Run this first to learn the names you'd pass to `--device`. |
| `-fa` | `--flash-attn` | `on` \| `off` \| `auto` | `auto` | Required by `--split-mode tensor` and by a quantized V cache (`-ctv`). |
| `-ctk` | `--cache-type-k` | `f32` \| `f16` \| `bf16` \| `q8_0` \| `q4_0` \| ... | `f16` | KV cache type for K. |
| `-ctv` | `--cache-type-v` | same as `-ctk` | `f16` | KV cache type for V. Quantized V requires `-fa on`. |
| `-fit` | `--fit` | `on` \| `off` | `on` | Auto-fit unset args to device memory. **Not supported with `tensor`** - set `-fit off` for that mode. |

`CUDA_VISIBLE_DEVICES` is honored implicitly through CUDA itself: if you set it, llama.cpp only sees the GPUs CUDA exposes. Use `--device` for finer control across multiple backends.
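
As an illustration, the two commands below restrict the run to a single GPU in two different ways. This is a sketch only, and the device name `CUDA1` is an assumption - check `--list-devices` on your machine:

```bash
# Hide every GPU except the second one at the CUDA driver level.
CUDA_VISIBLE_DEVICES=1 llama-cli -m model.gguf -ngl all

# Let llama.cpp do the filtering instead; this form works across backends.
llama-cli -m model.gguf -ngl all -dev CUDA1
```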

---

## Recipes

### 1. Default - pipeline parallel across all visible GPUs

```bash
llama-cli -m model.gguf -ngl all
llama-server -m model.gguf -ngl all --host 0.0.0.0 --port 8080
```

Easiest configuration. KV cache spreads across the GPUs along with the layers. `--fit` (on by default) sizes things automatically.

### 2. Pipeline parallel with a custom split ratio

```bash
llama-cli -m model.gguf -ngl all -sm layer -ts 3,1
```

Useful when GPUs have different amounts of VRAM: GPU 0 gets three parts (75 %) and GPU 1 gets one part (25 %). Proportions are normalized.
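
Because the proportions are normalized, you can also pass the VRAM sizes directly; a sketch for a hypothetical 48 GiB + 24 GiB pair:

```bash
# 48,24 normalizes to the same 2:1 split as -ts 2,1.
llama-cli -m model.gguf -ngl all -sm layer -ts 48,24
```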

### 3. Single-GPU mode, picking a specific GPU

```bash
llama-cli -m model.gguf -ngl all -sm none -mg 1
```

Use only device index 1, even if more GPUs are visible.

### 4. Tensor parallelism - TENSOR mode (experimental)

```bash
llama-cli -m model.gguf -ngl all -sm tensor -fa on -ctk f16 -ctv f16 -fit off
```

- `--flash-attn on` is **mandatory**. Setting it to `auto` is fine (it gets upgraded automatically); setting it to `off` is a hard error.
- KV cache types must be non-quantized: `f32`, `f16`, or `bf16`. Quantized KV cache will refuse to start with this mode.
- `-fit off` because auto-fit isn't implemented for `tensor`.
- The model architecture must be on the allow-list - see limitations below.
- Treat this configuration as experimental: validate output quality before you rely on it in production.
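
Because `llama-cli` and `llama-server` share the same argument parser, the server form should simply carry the flags over; a sketch, with host and port chosen for illustration:

```bash
llama-server -m model.gguf -ngl all -sm tensor -fa on -ctk f16 -ctv f16 -fit off \
  --host 0.0.0.0 --port 8080
```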

### 5. Restricting which GPUs participate

```bash
llama-cli --list-devices
# CUDA0: NVIDIA RTX A6000 (48 GiB)
# CUDA1: NVIDIA RTX 4090 (24 GiB)
# CUDA2: NVIDIA RTX 4090 (24 GiB)

llama-cli -m model.gguf -ngl all -sm tensor -dev CUDA0,CUDA2 -fa on -fit off
```

You can also use `CUDA_VISIBLE_DEVICES=0,2 llama-cli ...` to do this at the CUDA layer.

### 6. With NCCL

There's no runtime flag for NCCL - it's selected at build time (`-DGGML_CUDA_NCCL=ON`). When NCCL is compiled in and the system has it installed, llama.cpp uses it automatically for cross-GPU reductions in `tensor` mode. When NCCL is missing on a multi-GPU build, you'll see this one-time warning at startup and performance will be lower:

```
NVIDIA Collective Communications Library (NCCL) is unavailable, multi GPU performance will be suboptimal
```
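
If you see the warning and want NCCL, rebuild with the flag enabled. A hedged sketch of a CUDA build, assuming CMake and the NCCL development package are already installed:

```bash
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NCCL=ON
cmake --build build --config Release -j
```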
---

## Limitations and gotchas

### `--split-mode tensor` (TENSOR, experimental)

This mode is gated by several hard checks at startup. Hitting any of them is a fatal error.

1. **Flash attention is required.** `--flash-attn off` will fail with *"SPLIT_MODE_TENSOR requires flash_attn to be enabled"*. Use `-fa on` (or leave it `auto`, which auto-enables for this mode).
2. **KV cache must not be quantized.** `--cache-type-k` and `--cache-type-v` must be `f32`, `f16`, or `bf16`. Anything else fails with *"simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented"*.
3. **`--fit` is not supported.** Set `-fit off` and size things yourself. The auxiliary `--fit-target` / `--fit-ctx` flags do not make fitting work in this mode either.
4. **At least one non-CPU device required.** `--device none` is rejected with `--split-mode tensor`.
5. **Architecture must be supported.** TENSOR mode is implemented for most architectures, but a number are explicitly excluded today and will fail with *"LLAMA_SPLIT_MODE_TENSOR not implemented for architecture '...'"*. The current exclusion list includes:

- **MoE / hybrid:** Grok, MPT, OLMoE, DeepSeek2, GLM-DSA, Nemotron-H, Nemotron-H-MoE, Granite-Hybrid, LFM2-MoE, Minimax-M2, Mistral4, Kimi-Linear, Jamba, Falcon-H1
- **State-space / RWKV-style:** Mamba, Mamba2 (and the hybrid Mamba-attention models above)
- **Other:** PLAMO2, MiniCPM3, Gemma-3n, OLMo2, BitNet, T5

The exact list changes over time - if you hit the error, fall back to `--split-mode layer`.
6. **CPU devices are skipped.** TENSOR mode wraps the selected GPUs into a single "meta device"; CPU-typed devices are excluded automatically. The selected device list and `--tensor-split` list must also stay within the `GGML_CUDA_MAX_DEVICES` cap of 16.

### Backends

- **CUDA** - fullest support: `none`, `layer`, and `tensor` all work.
- **HIP / ROCm** - same modes as CUDA, but cross-GPU reduction performance benefits from RCCL (the AMD analog of NCCL); RCCL is opt-in at build time.
- **SYCL** - supports `none` and `layer`.
- **Metal, Vulkan, CANN, etc.** - each backend exposes its own set of devices; multi-device support varies. `layer` is the most portable choice across backends.

---

## Handling OOM with `--fit off`

`--fit off` (required for `--split-mode tensor`) disables the auto-fitter, so llama.cpp will not shrink anything to make the model fit. If you OOM at startup or during inference, you have to reduce memory pressure yourself. The knobs below are listed roughly from least to most disruptive - try them in order.

1. **Lower `--ctx-size`.** The KV cache is roughly proportional to `n_ctx`. Halving `-c` halves the KV cache. Pick a context size you actually need rather than the model maximum.

2. **For `llama-server`: lower `--parallel`.** `-np N` allocates a slot KV cache for every concurrent sequence; total KV memory scales with `n_ctx × n_parallel`. The default is often higher than you need.

3. **Reduce `--n-gpu-layers`.** Last performance-preserving knob: offload fewer layers (e.g. `-ngl 30` instead of `all`). The remaining layers run on CPU and inference will be **much slower** - only do this when no other knob suffices.
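
Putting the knobs together, a hedged example of a deliberately conservative `llama-server` setup; the numbers are illustrative, not recommendations:

```bash
# 8k context, 2 parallel slots, all layers on GPU; lower -ngl only as a last resort.
llama-server -m model.gguf -ngl all -sm tensor -fa on -fit off -c 8192 -np 2
```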

---

## Diagnosing problems

| Symptom | Likely cause |
|---|---|
| Startup error *"SPLIT_MODE_TENSOR requires flash_attn to be enabled"* | Add `-fa on` or remove `-fa off`. |
| Startup error *"simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented"* | Use `-ctk f16 -ctv f16` (or `bf16`/`f32`) with `--split-mode tensor`. |
| Startup error *"LLAMA_SPLIT_MODE_TENSOR not implemented for architecture 'X'"* | Architecture not on the TENSOR allow-list. Use `--split-mode layer`. |
| Warning *"NCCL is unavailable, multi GPU performance will be suboptimal"* | llama.cpp wasn't built with NCCL. Either accept the lower performance or rebuild with `-DGGML_CUDA_NCCL=ON`. |
| CUDA OOM at startup or during prefill in `--split-mode tensor` | Auto-fit is disabled in this mode. See "Handling OOM with `--fit off`" above for the order in which to lower `-c`, `-np`, `-ngl`, etc. |
| Performance is worse with multi-GPU than single-GPU | Inter-GPU bandwidth is the bottleneck. Try `--split-mode layer` (less communication than `tensor`), and verify that NCCL is being used. |
| GPU not used at all | `--n-gpu-layers` is `0` or too low - set `-ngl all`. Or your build doesn't include the relevant backend. |