ggml-opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill #22755

Open
happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:adreno-xmem-gemm-prefill

Conversation

happyyzy commented May 6, 2026

Summary

This PR adds an opt-in Adreno xmem GEMM path for OpenCL prefill matmul.

Scope:

  • build-time gated by GGML_OPENCL_USE_ADRENO_KERNELS
  • runtime opt-in via GGML_OPENCL_ADRENO_XMEM_GEMM=1
  • limited to Adreno A8X
  • limited to F16 x F32 -> F32 GGML_OP_MUL_MAT
  • limited to contiguous, single-batch GEMM shapes
  • requires N > 1, so token-generation / GEMV decode is not routed through this path
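The scope conditions above amount to a single routing predicate. The following is a minimal sketch of that check, not the code in this PR: `MatMulDesc`, `DType`, `shape_is_eligible`, and `use_xmem_gemm` are hypothetical names, and treating only the exact value `1` as "enabled" is an assumption about how the environment variable is parsed.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical, simplified description of one matmul (not the ggml tensor struct).
enum class DType { F16, F32, Other };

struct MatMulDesc {
    DType   weight_type;    // src0 (weights)
    DType   act_type;       // src1 (activations)
    DType   out_type;       // dst
    bool    contiguous;     // all operands are contiguous
    int64_t batch;          // product of the batch dimensions
    int64_t n;              // number of activation columns (tokens)
    bool    is_adreno_a8x;  // device family check, resolved elsewhere
};

// Runtime opt-in. Assumption: only the exact string "1" enables the path.
inline bool xmem_gemm_opted_in() {
    const char * v = std::getenv("GGML_OPENCL_ADRENO_XMEM_GEMM");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Shape and type conditions listed in the PR summary.
inline bool shape_is_eligible(const MatMulDesc & d) {
    return d.is_adreno_a8x
        && d.weight_type == DType::F16
        && d.act_type    == DType::F32
        && d.out_type    == DType::F32
        && d.contiguous
        && d.batch == 1
        && d.n > 1;  // N == 1 is GEMV decode and stays on the existing path
}

// Both the opt-in and the eligibility check must pass.
inline bool use_xmem_gemm(const MatMulDesc & d) {
    return xmem_gemm_opted_in() && shape_is_eligible(d);
}
```

With this shape of check, a pp512 prefill matmul (`n = 512`) is eligible, while a tg128 decode step (`n = 1`) falls through to the generic path regardless of the environment variable.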

The implementation leaves the external ggml tensor layout unchanged and wraps the xmem GEMM in a small bridge:

  • pack F32 activations into a half image
  • pack F16 weights into the xmem kernel layout
  • run the Adreno xmem OS8 GEMM
  • store the half image result back to F32 output
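The numerical effect of the bridge can be illustrated with a CPU reference. This is only a sketch under stated assumptions: the helper names (`f32_to_f16`, `f16_to_f32`, `bridge_reference`) are hypothetical, the weight repacking step is omitted because the xmem kernel layout is not described here, accumulation is assumed to be in F32, and the row-major layouts are chosen for readability rather than taken from the PR.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Convert F32 to IEEE binary16 bits with round-to-nearest-even.
inline uint16_t f32_to_f16(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;
    if (exp == 0xFFu) return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u)); // inf / NaN
    const int e = (int)exp - 127 + 15;                 // re-biased exponent
    if (e >= 31) return (uint16_t)(sign | 0x7C00u);    // overflow -> inf
    if (e <= 0) {                                      // half subnormal or zero
        if (e < -10) return (uint16_t)sign;            // too small -> signed zero
        mant |= 0x800000u;                             // restore the implicit 1
        const int shift = 14 - e;                      // 14..24
        uint32_t h = mant >> shift;
        const uint32_t rem  = mant & ((1u << shift) - 1u);
        const uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (h & 1u))) h++;
        return (uint16_t)(sign | h);
    }
    uint32_t h = ((uint32_t)e << 10) | (mant >> 13);
    const uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++; // a carry into the exponent is correct
    return (uint16_t)(sign | h);
}

// Widen IEEE binary16 bits back to F32 (exact).
inline float f16_to_f32(uint16_t h) {
    const int exp  = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp == 0)       v = std::ldexp((float)mant, -24);
    else if (exp == 31) v = mant ? std::numeric_limits<float>::quiet_NaN()
                                 : std::numeric_limits<float>::infinity();
    else                v = std::ldexp((float)(mant + 1024), exp - 25);
    return (h & 0x8000) ? -v : v;
}

// W: M x K row-major F16 bits. X: N rows of K F32 values. Returns N rows of M F32 values.
inline std::vector<float> bridge_reference(const std::vector<uint16_t> & w_f16,
                                           const std::vector<float> & x_f32,
                                           int M, int K, int N) {
    // Step 1: pack F32 activations into half.
    std::vector<uint16_t> x_f16(x_f32.size());
    for (size_t i = 0; i < x_f32.size(); ++i) x_f16[i] = f32_to_f16(x_f32[i]);
    // Step 2 (weight repacking) is omitted in this sketch.
    std::vector<float> out((size_t)N * M);
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f; // Step 3: GEMM, F32 accumulation assumed
            for (int k = 0; k < K; ++k)
                acc += f16_to_f32(w_f16[(size_t)m * K + k]) * f16_to_f32(x_f16[(size_t)n * K + k]);
            // Step 4: the result passes through half before being stored as F32.
            out[(size_t)n * M + m] = f16_to_f32(f32_to_f16(acc));
        }
    }
    return out;
}
```

The point of the sketch is that both the activations and the output are rounded to half precision along this path, so results can differ slightly from a path that keeps activations in F32. Outputs with magnitude above 65504 would overflow to infinity in the half result.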

The generic OpenCL matmul path remains unchanged unless the new runtime opt-in is set.

Results

Tested on Adreno 830 with OpenCL:

OpenCL driver: OpenCL 3.0 QUALCOMM build: 0800.71 Compiler E031.47.18.49
build: 09294365a (468)

Qwen2.5 1.5B F16

Before, baseline OpenCL:

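| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |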
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        204.98 ± 0.61 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.84 ± 0.07 |

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

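| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |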
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        356.19 ± 0.31 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.98 ± 0.06 |

Prefill improved from 204.98 tok/s to 356.19 tok/s, about 1.74x.

Qwen2.5 3B F16

Before, baseline OpenCL:

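| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |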
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        101.26 ± 0.04 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.53 ± 0.04 |

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

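| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |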
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        163.90 ± 2.84 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.51 ± 0.11 |

Prefill improved from 101.26 tok/s to 163.90 tok/s, about 1.62x.

Decode is intentionally unchanged. Decode-only profiling confirmed that token generation stays on the existing OpenCL path (adreno_xmem count = 0).

Correctness

Checked end-to-end generation with the xmem path enabled on Qwen2.5 1.5B F16 and Qwen2.5 3B F16. Both models produced normal decode output.

Notes

This path depends on Qualcomm Adreno OpenCL subgroup constant-load extensions and is therefore guarded behind the existing Adreno kernel build option plus an explicit runtime environment variable.
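A guard of this kind typically combines a compile-time check with a runtime check of the device's extension list. The sketch below shows one way to do that and is not taken from the PR: `has_extension` and `adreno_kernels_compiled_in` are hypothetical helpers, the extension string is assumed to come from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`, and the exact Qualcomm extension names required by the kernel are not listed here, so the caller passes them in.

```cpp
#include <sstream>
#include <string>

// True if `name` appears as a whole space-separated token in `extensions`.
// A plain substring search would wrongly match prefixes of longer names.
inline bool has_extension(const std::string & extensions, const std::string & name) {
    std::istringstream iss(extensions);
    std::string tok;
    while (iss >> tok) {
        if (tok == name) return true;
    }
    return false;
}

// Compile-time gate: reports whether the Adreno kernel build option was enabled.
inline bool adreno_kernels_compiled_in() {
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
    return true;
#else
    return false;
#endif
}
```

Matching whole tokens matters because OpenCL reports extensions as one space-separated string, and some extension names are prefixes of others.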

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI was used as an assistant for code review, testing coordination, and drafting text from user-provided results. The contributor is responsible for the submitted changes.

happyyzy requested a review from a team as a code owner May 6, 2026 11:42

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels May 6, 2026

ggml-gh-bot Bot commented May 6, 2026

Hi @happyyzy, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

happyyzy (Author) commented May 6, 2026

Thanks for the note. I closed the other open PR (#22117) and will focus on this smaller xmem GEMM PR first.

lhez (Contributor) commented May 7, 2026

Thank you - this is much easier. Will take a closer look in the next few days.
