ggml-opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill by happyyzy · Pull Request #22755 · ggml-org/llama.cpp

happyyzy · 2026-05-06T11:42:01Z

Summary

This PR adds an opt-in Adreno xmem GEMM path for OpenCL prefill matmul.

Scope:

- build-time gated by GGML_OPENCL_USE_ADRENO_KERNELS
- runtime opt-in via GGML_OPENCL_ADRENO_XMEM_GEMM=1
- limited to Adreno A8X
- limited to F16 x F32 -> F32 GGML_OP_MUL_MAT
- limited to contiguous, single-batch GEMM shapes
- requires N > 1, so token-generation / GEMV decode is not routed through this path
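The scope conditions above can be read as a single eligibility predicate. The following is a minimal sketch under stated assumptions: `DType`, `MatMulDesc`, `DeviceDesc`, and `xmem_gemm_eligible` are hypothetical names invented for illustration and are not the structs or functions used in the PR or in ggml.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for ggml types; not the real ggml structs.
enum class DType { F16, F32, Other };

struct MatMulDesc {
    DType   weight_type;      // src0 (weights)
    DType   activation_type;  // src1 (activations)
    DType   output_type;      // dst
    int64_t n_tokens;         // N: number of activation columns (tokens)
    int64_t n_batch;          // product of the batch dimensions
    bool    contiguous;       // all operands are contiguous in memory
};

struct DeviceDesc {
    bool is_adreno_a8x;       // assumed to be detected elsewhere
};

// Returns true when a matmul would qualify for the xmem path under the
// scope listed in the PR description. The runtime opt-in is passed as a flag.
bool xmem_gemm_eligible(const DeviceDesc & dev, const MatMulDesc & op, bool opt_in) {
    return opt_in
        && dev.is_adreno_a8x
        && op.weight_type     == DType::F16
        && op.activation_type == DType::F32
        && op.output_type     == DType::F32
        && op.contiguous
        && op.n_batch  == 1
        && op.n_tokens >  1;  // N == 1 is GEMV decode and stays on the existing path
}
```

With this sketch, a pp512 prefill matmul (N = 512) qualifies, while a token-generation step (N = 1) returns false and would fall through to the generic OpenCL path.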

The implementation keeps the existing ggml tensor layout externally and uses a small bridge around the xmem GEMM:

1. pack F32 activations into a half image
2. pack F16 weights into the xmem kernel layout
3. run the Adreno xmem OS8 GEMM
4. store the half image result back to F32 output
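The four bridge stages can be mirrored by a small CPU reference, which may help when comparing the xmem output against the generic path. This is only a sketch under assumptions: half precision is emulated by rounding float mantissas to 10 bits, the xmem weight layout is not described in this PR text and is replaced by an identity copy, and accumulation in F32 is assumed rather than confirmed. `round_to_half_precision` and `bridge_gemm_reference` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Emulates storing a float in half precision by rounding the mantissa to
// 10 bits (round to nearest, ties to even). Assumption: the value lies in the
// normal half range; overflow, subnormals, and NaN are not handled here.
float round_to_half_precision(float v) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    bits += 0x0FFFu + ((bits >> 13) & 1u);  // rounding bias, ties to even
    bits &= ~0x1FFFu;                       // drop the 13 low mantissa bits
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// CPU reference of the bridge. weights is M x K (row-major, values assumed to
// be half-representable already), activations is N x K, the result is N x M.
std::vector<float> bridge_gemm_reference(const std::vector<float> & weights,
                                         const std::vector<float> & activations,
                                         int M, int N, int K) {
    // Stage 1: pack F32 activations into a half buffer (emulated).
    std::vector<float> act_half(activations.size());
    for (std::size_t i = 0; i < activations.size(); ++i) {
        act_half[i] = round_to_half_precision(activations[i]);
    }

    // Stage 2: pack weights into the kernel layout. The real layout is not
    // given in the PR text, so this sketch uses an identity copy.
    std::vector<float> w_packed = weights;

    // Stage 3: GEMM. F32 accumulation is an assumption of this sketch.
    std::vector<float> out_half(static_cast<std::size_t>(N) * M);
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += w_packed[static_cast<std::size_t>(m) * K + k] *
                       act_half[static_cast<std::size_t>(n) * K + k];
            }
            // The result is written to a half image, so round it here.
            out_half[static_cast<std::size_t>(n) * M + m] = round_to_half_precision(acc);
        }
    }

    // Stage 4: store the half result back to the F32 output tensor.
    return std::vector<float>(out_half.begin(), out_half.end());
}
```

Because the result passes through a half image in this model, outputs carry at most half precision even though the destination tensor is F32. That is worth keeping in mind when choosing a tolerance for comparisons against the generic path.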

The generic OpenCL matmul path remains unchanged unless the new runtime opt-in is set.

Results

Tested on Adreno 830 with OpenCL:

OpenCL driver: OpenCL 3.0 QUALCOMM build: 0800.71 Compiler E031.47.18.49
build: 09294365a (468)

Qwen2.5 1.5B F16

Before, baseline OpenCL:

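| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        204.98 ± 0.61 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.84 ± 0.07 |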

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

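| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        356.19 ± 0.31 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.98 ± 0.06 |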

Prefill improved from 204.98 tok/s to 356.19 tok/s, about 1.74x.

Qwen2.5 3B F16

Before, baseline OpenCL:

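| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        101.26 ± 0.04 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.53 ± 0.04 |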

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

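| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        163.90 ± 2.84 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.51 ± 0.11 |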

Prefill improved from 101.26 tok/s to 163.90 tok/s, about 1.62x.

Decode is intentionally unchanged. Decode-only profiling confirmed that token generation stays on the existing OpenCL path (adreno_xmem count = 0).

Correctness

Checked end-to-end generation with the xmem path enabled on Qwen2.5 1.5B F16 and Qwen2.5 3B F16. Both models produced normal decode output.

Notes

This path depends on Qualcomm Adreno OpenCL subgroup constant-load extensions and is therefore guarded behind the existing Adreno kernel build option plus an explicit runtime environment variable.
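The two-level guard described here (build option plus environment variable) might look roughly like the sketch below. The function names are hypothetical. Treating only the exact value "1" as enabled is an assumption based on the `GGML_OPENCL_ADRENO_XMEM_GEMM=1` usage shown in this PR, not a statement about how the PR parses the variable.

```cpp
#include <cstdlib>
#include <cstring>

// Interprets the environment variable value. Assumption: only "1" enables
// the path; unset, empty, or any other value leaves it disabled.
bool parse_xmem_opt_in(const char * value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Returns true only when both guards are satisfied: the build-time option
// and the explicit runtime opt-in.
bool adreno_xmem_gemm_enabled() {
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
    return parse_xmem_opt_in(std::getenv("GGML_OPENCL_ADRENO_XMEM_GEMM"));
#else
    return false;  // built without Adreno kernels: the path is compiled out
#endif
}
```

Keeping the parse step separate from the `#ifdef` makes the opt-in logic testable on hosts that are not built with the Adreno kernels.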

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI was used as an assistant for code review, testing coordination, and drafting text from user-provided results. The contributor is responsible for the submitted changes.

ggml-gh-bot · 2026-05-06T11:46:16Z

Hi @happyyzy, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

happyyzy · 2026-05-06T13:28:35Z

Thanks for the note. I closed the other open PR (#22117) and will focus on this smaller xmem GEMM PR first.

lhez · 2026-05-07T17:55:53Z

Thank you - this is much easier. Will take a closer look in the next few days.

ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill

c5e0577

happyyzy requested a review from a team as a code owner May 6, 2026 11:42

github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) on May 6, 2026

happyyzy mentioned this pull request on May 6, 2026, in "ggml-opencl: add Adreno xmem attention path" (#22117, closed)

happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:adreno-xmem-gemm-prefill
