
ggml-opencl: add Adreno xmem attention path #22117

Closed

happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:happyyzy/adreno-xmem-attn

Conversation

@happyyzy

Summary

This PR adds an Adreno-specific xmem attention path to ggml-opencl, gated behind the GGML_OPENCL_ADRENO_XMEM_ATTN environment variable.

The new path is intended for high-throughput attention workloads on Adreno, especially:

  • large prefill
  • large noncausal attention
  • shapes where the current fa or nofuse routes are too slow, run out of memory, or fail to run

This PR does not claim a decode speedup. In current q=1 decode tests, the existing fa / nofuse routes are still faster.

Implementation

  • adds Adreno xmem attention kernels for:
    • Q/K/V staging
    • K/V packing
    • QK GEMM
    • softmax
    • PV GEMM
  • wires the path into the ggml-opencl runtime behind an explicit env gate (see the sketch after this list)
  • adds test-opencl-adreno-attn to reproduce the correctness and route-to-route performance results
  • compiles the dedicated xmem QK/PV translation units before the rest of the OpenCL kernel set
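
A minimal sketch of the env gate, with placeholder names (ggml_cl_xmem_attn_enabled is illustrative, not the PR's actual symbol in ggml-opencl.cpp):

// Illustrative sketch only: the helper name is a placeholder, not the PR's code.
#include <cstdlib>
#include <cstring>

static bool ggml_cl_xmem_attn_enabled() {
    // Opt-in: the xmem route is only considered when the environment
    // variable is set to something other than "0".
    const char * v = std::getenv("GGML_OPENCL_ADRENO_XMEM_ATTN");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

Keeping the gate explicit means the default attention routing is unchanged unless the user opts in.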

Why the compile order change exists

On current Adreno drivers, compiling the xmem QK/PV kernels later in the overall OpenCL kernel load sequence can produce a materially slower device binary even with identical source and build options.

This PR keeps a deterministic early compile order for the xmem QK/PV units because that is currently the most stable way to preserve the fast binary on device.

A future follow-up may be able to remove this compiler-order sensitivity by expressing the hot loop more directly, for example via the already-observed QCOM inline-asm path, but that is not part of this PR.
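
As a sketch of the mechanism, assuming placeholder helper and source-symbol names (the PR's loader code differs), the idea is simply to build the two sensitive translation units before the rest of the kernel set:

// Illustrative sketch only. Error handling and build-log reporting omitted;
// all names below are placeholders, not the PR's actual symbols.
#include <CL/cl.h>

extern const char * src_adreno_xmem_exact_qk;     // embedded kernel sources
extern const char * src_adreno_xmem_exact_pv;
extern const char * remaining_kernel_sources[];
extern const int    n_remaining_kernel_sources;

static cl_program build_src(cl_context ctx, cl_device_id dev, const char * src) {
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    return prog;
}

static void load_all_kernels(cl_context ctx, cl_device_id dev) {
    // 1. Deterministic early compile of the order-sensitive xmem QK/PV units.
    build_src(ctx, dev, src_adreno_xmem_exact_qk);
    build_src(ctx, dev, src_adreno_xmem_exact_pv);
    // 2. The remaining OpenCL kernel set in its usual order.
    for (int i = 0; i < n_remaining_kernel_sources; ++i) {
        build_src(ctx, dev, remaining_kernel_sources[i]);
    }
}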

Correctness coverage

The new path was rechecked for:

  • noncausal + GQA
  • causal + GQA
  • decode

Representative correctness results:

  • noncausal + GQA:
    • dq=128 dv=128 nq=256 nkv=512 n_head=8 n_head_kv=2
    • mae=6.7302e-05
    • max_abs=8.9369e-04
    • cos=0.999993535
  • causal + GQA:
    • dq=128 dv=128 nq=256 nkv=256 n_head=8 n_head_kv=2
    • mae=7.8839e-05
    • max_abs=9.92954e-04
    • cos=0.999999313
  • decode:
    • dq=128 dv=128 nq=1 nkv=512 n_head=8 n_head_kv=2
    • mae=7.4717e-05
    • max_abs=4.90941e-04
    • cos=0.999992445
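
For reference, a sketch of how error metrics of this form are conventionally computed, comparing a route's output against a reference over n float elements (illustrative; the PR's harness may differ):

// Illustrative sketch: mean absolute error, max absolute error, and cosine
// similarity between a route's output and a reference result.
#include <cmath>
#include <cstddef>

struct attn_err { double mae, max_abs, cos; };

static attn_err compare(const float * out, const float * ref, size_t n) {
    double sum_abs = 0.0, max_abs = 0.0, dot = 0.0, nrm_o = 0.0, nrm_r = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double o = out[i], r = ref[i], d = std::fabs(o - r);
        sum_abs += d;
        if (d > max_abs) max_abs = d;
        dot   += o * r;
        nrm_o += o * o;
        nrm_r += r * r;
    }
    return { sum_abs / (double) n, max_abs,
             dot / (std::sqrt(nrm_o) * std::sqrt(nrm_r)) };
}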

Device

  • Qualcomm Adreno 830
  • OpenCL 3.0 QUALCOMM build 0800.56.1

All throughput numbers below are full-route gpu_ms / effective TOPS, not qk+pv core-only accounting.
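
For reference, the effective-TOPS figures are consistent with counting both attention GEMMs (QK and PV) at full density over the measured full-route time: effective TOPS = 4 · H · L² · D / (gpu_ms × 10⁹). For example, with H=30, L=2048, D=128 that is about 6.44e10 FLOPs, and 6.44e10 / (49.852 ms) ≈ 1.292 TOPS, matching the xmem result in test 2 below.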

Performance tests

1. Large noncausal Z-Image-like shape

Shape:

  • H=30, L=4224, D=128
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 188.606 ms / 1.453 TOPS
  • fa: process killed by system
  • nofuse: OOM

This is one of the main motivations for the PR: the xmem path runs a shape that the current alternatives do not handle robustly.

2. Large noncausal throughput comparison

Shape:

  • H=30, L=2048, D=128
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 49.852 ms / 1.292 TOPS
  • fa: 2015.568 ms / 0.03196 TOPS
  • nofuse: 148.097 ms / 0.4350 TOPS

Relative speedup:

  • xmem vs fa: about 40.4x
  • xmem vs nofuse: about 3.0x

3. Large causal prefill throughput comparison

Shape:

  • H=30, L=2048, D=128
  • causal prefill
  • --mask 0 --causal 1

Results:

  • xmem: 46.069 ms / 1.398 TOPS
  • fa: 2150.036 ms / 0.02996 TOPS
  • nofuse: 2155.344 ms / 0.02989 TOPS

Relative speedup:

  • xmem vs fa: about 46.7x
  • xmem vs nofuse: about 46.8x

This is the strongest high-throughput prefill case in the current set.

4. Long-context, larger-head-dim noncausal shape

Shape:

  • H=1, L=16384, D=512
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 272.191 ms / 2.020 TOPS
  • nofuse: 1084.935 ms / 0.5067 TOPS
  • fa: crashes with an uncaught map::at exception (no valid kernel entry for this shape)

Relative speedup:

  • xmem vs nofuse: about 4.0x

This is another important memory/coverage point: the xmem path handles a longer-context / larger-head-dim case where the current fa path does not run.

Memory behavior / routing value

Compared with the current routes, the xmem path has two practical advantages in the large-workload regime:

  • it is much faster in throughput-oriented prefill / noncausal attention
  • it handles larger shapes more robustly

In the current tests, that means:

  • shapes where fa is killed
  • shapes where nofuse OOMs
  • shapes where fa does not have a valid kernel entry and crashes

Decode results

This PR is not presenting xmem as a decode optimization.

Two representative decode tests:

  • short decode context:
    • nq=1 nkv=512 dq=dv=128 n_head=8 n_head_kv=2
    • xmem: 6.562 ms
    • fa: 0.491 ms
    • nofuse: 0.282 ms
  • long decode context:
    • nq=1 nkv=16384 dq=dv=128 n_head=8 n_head_kv=2
    • xmem: 12.350 ms
    • fa: 8.276 ms
    • nofuse: 3.535 ms

So for current q=1 decode:

  • xmem is slower than both existing routes
  • this PR should be read as a large-workload / throughput-path addition, not as a universal attention replacement

Notes

  • the current xmem path is best suited to large-q workloads
  • decode-specific routing can be refined in follow-up work
  • the compile-order workaround is deliberate and currently necessary for deterministic fast binaries on the tested Adreno driver

@happyyzy requested review from a team and ggerganov as code owners on April 19, 2026
@ggml-gh-bot (Bot) commented Apr 19, 2026

Hi @happyyzy, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@happyyzy (Author)

Thanks. The diff is large in line count, but the functional scope is actually quite narrow:

  • it only touches the OpenCL backend
  • it only adds one Adreno-specific attention path
  • it does not change ggml tensor semantics
  • it does not modify non-OpenCL backends
  • it does not change the existing FA / nofuse math paths

Concretely, the PR is just:

  • 1 runtime integration file: ggml-opencl.cpp
  • 3 kernel source files: adreno_xmem_attn.cl, adreno_xmem_exact_qk.cl, adreno_xmem_exact_pv.cl
  • 1 test/bench reproducer: test-opencl-adreno-attn.cpp
  • 2 small CMake file updates

I kept these together because this feature is hard to review meaningfully in partial pieces:

  • runtime wiring without the kernels is not testable
  • kernels without the route-selection / compile-order handling do not reproduce the fast binary on device
  • performance claims without the dedicated reproducer are not verifiable

So although the PR is large, it is still one backend-scoped feature with one test harness.

If maintainers would still prefer a split, I can do that, for example as:

  1. kernel files + minimal loader plumbing
  2. runtime routing / scratch / schedule integration
  3. test and benchmark reproducer

I wanted to first submit the complete, reproducible version so the correctness/performance story is reviewable end-to-end.

The github-actions bot added the labels testing (Everything test related), ggml (changes relating to the ggml tensor library for machine learning), and OpenCL (Issues specific to the OpenCL backend) on Apr 19, 2026.
@lhez (Contributor) commented Apr 21, 2026

@happyyzy Thank you for the PR. Could you provide the models and commands that you used? Could you also provide a citation for xmem attention?

> On current Adreno drivers, compiling the xmem QK/PV kernels later in the overall OpenCL kernel load sequence can produce a materially slower device binary even with identical source and build options.

Does this mean that if you compile the xmem kernel sources at the end of the kernel loading process, the resulting kernels will have worse performance?

> A future follow-up may be able to remove this compiler-order sensitivity by expressing the hot loop more directly, for example via the already-observed QCOM inline-asm path, but that is not part of this PR.

Just curious about this - do you mean inline asm for Adreno? Are you aware of any public inline asm usage for Adreno (e.g., academic paper or open source projects)?

@happyyzy (Author) commented May 2, 2026

Thanks for taking a look.

For the model/commands: the PR reproducer does not require model weights. It generates random Q/K/V tensors and runs the target ggml_flash_attn_ext shapes directly, so the attention path can be tested without distributing a model.
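
As a sketch of what such a weight-free reproducer does, assuming the current ggml_flash_attn_ext(ctx, q, k, v, mask, scale, max_bias, logit_softcap) signature (the PR's test file may differ; check ggml.h for the exact API):

// Illustrative sketch, not the PR's test code: build a ggml_flash_attn_ext
// node over randomly initialized Q/K/V with the benchmark's shape parameters.
#include "ggml.h"
#include <cmath>

static struct ggml_tensor * make_attn(struct ggml_context * ctx,
                                      int dq, int dv, int nq, int nkv,
                                      int n_head, int n_head_kv) {
    // Q is F32 [dq, nq, n_head, 1]; K/V are F16 with n_head_kv heads (GQA).
    struct ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, dq, nq,  n_head,    1);
    struct ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, dq, nkv, n_head_kv, 1);
    struct ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, dv, nkv, n_head_kv, 1);
    // --mask 0 --causal 0: no mask tensor, no causal masking, no softcap.
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL,
                               1.0f / sqrtf((float) dq),
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}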

Build:

cmake -S . -B build-adreno-xmem \
  -DGGML_OPENCL=ON \
  -DGGML_OPENCL_EMBED_KERNELS=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build-adreno-xmem --target ggml-opencl test-opencl-adreno-attn -j

Example large non-causal case:

export GGML_OPENCL_ADRENO_XMEM_ATTN=1

./test-opencl-adreno-attn \
  --route xmem \
  --dq 128 --dv 128 \
  --nq 2048 --nkv 2048 \
  --n-head 30 --n-head-kv 30 \
  --n-batch 1 \
  --mask 0 --causal 0 \
  --warmup 1 --iters 5

Comparison routes:

./test-opencl-adreno-attn --route fa ...
./test-opencl-adreno-attn --route nofuse ...

The motivating workload was Z-Image-style diffusion attention. We tested this path successfully there, but that setup depends on a larger model/runtime/export pipeline, so it is not a good PR-level reproducer. The standalone benchmark is intended to isolate the backend behavior and make the shape-level result reproducible.

For citation: I do not know of public documentation for this Adreno xmem path. The name xmem in this PR is our descriptive name for a hidden Adreno on-chip memory path that we identified by reverse engineering emitted OpenCL kernels and device binaries.

The relevant OpenCL source-level pieces are the Qualcomm subgroup load extensions, especially:

  • cl_qcom_subgroup_uniform_load
  • cl_qcom_subgroup_constant_load
  • qcom_sub_group_constant_load*

The important part is that the compiled ISA is not just ordinary global/local/image memory traffic. The disassembly shows distinct instructions and data movement patterns corresponding to this hidden on-chip memory route. Conceptually, it is closer to a vendor-specific on-chip staging memory path, somewhat analogous in role to Blackwell TMEM, although the programming interface and hardware details are Qualcomm-specific and not publicly specified.

About compile order: yes. On the Adreno 830 driver I tested, compiling the xmem QK/PV kernels later in the overall OpenCL kernel loading sequence can produce a materially slower device binary, even with the same source and build options. That is why the PR compiles the split QK/PV translation units early.

That said, this is not required for the path to be useful. Even with the normal compile order, the large-shape xmem attention path is still much faster than the current FA/nofuse routes, often by tens of times on the large prefill/non-causal cases. The early split compilation is only to preserve the best observed binary and avoid leaving a large amount of Adreno performance on the table.

About inline asm: yes, I meant Adreno OpenCL C inline asm. I have confirmed locally that the compiler accepts a minimal asm probe; for example, this FP16 MAD-style probe in OpenCL C form was accepted:

half a = (half) 1.0h;
half b = (half) 2.0h;
half c = (half) 3.0h;
half out;

__asm__ volatile(
    "mad.f16 %0, %1, %2, %3;"
    : "=h"(out)
    : "h"(a), "h"(b), "h"(c)
);

However, embedding inline asm into the real QK/PV hot loop caused backend compile failure. The PR therefore does not depend on inline asm. I am not aware of a public paper or open-source project that documents this inline-asm route well enough to cite.

@happyyzy (Author) commented May 3, 2026

One more note about the compile-order workaround.

If this part looks too unusual for an initial upstream PR, I can remove it and update the PR to a cleaner version that compiles the xmem attention kernels in the normal OpenCL kernel loading order.

That version would remove:

  • adreno_xmem_exact_qk.cl
  • adreno_xmem_exact_pv.cl
  • the early QK/PV split-program build path
  • the compile-order-specific comment

The PR would then be simpler: one xmem attention kernel source, one runtime integration path, and one test/benchmark reproducer.

The tradeoff is performance. On the Adreno 830 driver I tested, the normal compile order still gives a large speedup over the current FA/nofuse routes on large prefill/non-causal shapes, but it leaves some performance on the table compared with the early split-compile version.

So I am fine with either direction:

  • keep the current version if maintainers are comfortable with the driver workaround and want the best measured Adreno performance
  • or I can replace it with the cleaner normal-compile-order version first, and leave the split/early-compile optimization for a later follow-up after the base xmem path is reviewed

@happyyzy (Author) commented May 6, 2026

I opened a smaller and more focused Adreno xmem GEMM PR here: #22755

It follows the existing OpenCL matmul path more closely and is opt-in; on Adreno 830 it improves Qwen2.5 1.5B/3B F16 prefill by about 1.74x/1.62x while leaving decode unchanged.

This may be easier to review and merge first as the minimal xmem backend entry point.

@happyyzy (Author) commented May 6, 2026

Closing this for now to comply with the one-open-PR limit for new contributors. I will focus on the smaller xmem GEMM PR first: #22755

The attention path can be revisited later as a smaller follow-up after the base xmem GEMM path is reviewed.

