
ggml-opencl: add Adreno xmem attention path #22117

Closed

happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:happyyzy/adreno-xmem-attn

Conversation

@happyyzy

Summary

This PR adds an Adreno-specific xmem attention path to ggml-opencl, gated behind the GGML_OPENCL_ADRENO_XMEM_ATTN environment variable.

The new path is intended for high-throughput attention workloads on Adreno, especially:

  • large prefill
  • large noncausal attention
  • shapes where the current fa or nofuse routes are too slow, run out of memory, or fail to run

This PR does not claim a decode speedup. In current q=1 decode tests, the existing fa / nofuse routes are still faster.

Implementation

  • adds Adreno xmem attention kernels for:
    • Q/K/V staging
    • K/V packing
    • QK GEMM
    • softmax
    • PV GEMM
  • wires the path into the ggml-opencl runtime behind an explicit env gate (see the sketch after this list)
  • adds test-opencl-adreno-attn to reproduce the correctness and route-to-route performance results
  • compiles the dedicated xmem QK/PV translation units before the rest of the OpenCL kernel set
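
A minimal sketch of the env gate, with placeholder names (ggml_cl_xmem_attn_enabled is illustrative, not the PR's actual symbol in ggml-opencl.cpp):

// Illustrative sketch only: the helper name is a placeholder, not the PR's code.
#include <cstdlib>
#include <cstring>

static bool ggml_cl_xmem_attn_enabled() {
    // Opt-in: the xmem route is only considered when the environment
    // variable is set to something other than "0".
    const char * v = std::getenv("GGML_OPENCL_ADRENO_XMEM_ATTN");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

Keeping the gate explicit means the default attention routing is unchanged unless the user opts in.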

Why the compile order change exists

On current Adreno drivers, compiling the xmem QK/PV kernels later in the overall OpenCL kernel load sequence can produce a materially slower device binary even with identical source and build options.

This PR keeps a deterministic early compile order for the xmem QK/PV units because that is currently the most stable way to preserve the fast binary on device.

A future follow-up may be able to remove this compiler-order sensitivity by expressing the hot loop more directly, for example via the already-observed QCOM inline-asm path, but that is not part of this PR.
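
As a sketch of the mechanism, assuming placeholder helper and source-symbol names (the PR's loader code differs), the idea is simply to build the two sensitive translation units before the rest of the kernel set:

// Illustrative sketch only. Error handling and build-log reporting omitted;
// all names below are placeholders, not the PR's actual symbols.
#include <CL/cl.h>

extern const char * src_adreno_xmem_exact_qk;     // embedded kernel sources
extern const char * src_adreno_xmem_exact_pv;
extern const char * remaining_kernel_sources[];
extern const int    n_remaining_kernel_sources;

static cl_program build_src(cl_context ctx, cl_device_id dev, const char * src) {
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    return prog;
}

static void load_all_kernels(cl_context ctx, cl_device_id dev) {
    // 1. Deterministic early compile of the order-sensitive xmem QK/PV units.
    build_src(ctx, dev, src_adreno_xmem_exact_qk);
    build_src(ctx, dev, src_adreno_xmem_exact_pv);
    // 2. The remaining OpenCL kernel set in its usual order.
    for (int i = 0; i < n_remaining_kernel_sources; ++i) {
        build_src(ctx, dev, remaining_kernel_sources[i]);
    }
}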

Correctness coverage

The new path was rechecked for:

  • noncausal + GQA
  • causal + GQA
  • decode

Representative correctness results:

  • noncausal + GQA:
    • dq=128 dv=128 nq=256 nkv=512 n_head=8 n_head_kv=2
    • mae=6.7302e-05
    • max_abs=8.9369e-04
    • cos=0.999993535
  • causal + GQA:
    • dq=128 dv=128 nq=256 nkv=256 n_head=8 n_head_kv=2
    • mae=7.8839e-05
    • max_abs=9.92954e-04
    • cos=0.999999313
  • decode:
    • dq=128 dv=128 nq=1 nkv=512 n_head=8 n_head_kv=2
    • mae=7.4717e-05
    • max_abs=4.90941e-04
    • cos=0.999992445
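
For reference, a sketch of how error metrics of this form are conventionally computed, comparing a route's output against a reference over n float elements (illustrative; the PR's harness may differ):

// Illustrative sketch: mean absolute error, max absolute error, and cosine
// similarity between a route's output and a reference result.
#include <cmath>
#include <cstddef>

struct attn_err { double mae, max_abs, cos; };

static attn_err compare(const float * out, const float * ref, size_t n) {
    double sum_abs = 0.0, max_abs = 0.0, dot = 0.0, nrm_o = 0.0, nrm_r = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double o = out[i], r = ref[i], d = std::fabs(o - r);
        sum_abs += d;
        if (d > max_abs) max_abs = d;
        dot   += o * r;
        nrm_o += o * o;
        nrm_r += r * r;
    }
    return { sum_abs / (double) n, max_abs,
             dot / (std::sqrt(nrm_o) * std::sqrt(nrm_r)) };
}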

Device

  • Qualcomm Adreno 830
  • OpenCL 3.0 QUALCOMM build 0800.56.1

All throughput numbers below are full-route gpu_ms / effective TOPS, not qk+pv core-only accounting.
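
For reference, the effective-TOPS figures are consistent with counting both attention GEMMs (QK and PV) at full density over the measured full-route time: effective TOPS = 4 · H · L² · D / (gpu_ms × 10⁹). For example, with H=30, L=2048, D=128 that is about 6.44e10 FLOPs, and 6.44e10 / (49.852 ms) ≈ 1.292 TOPS, matching the xmem result in test 2 below.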

Performance tests

1. Large noncausal Z-Image-like shape

Shape:

  • H=30, L=4224, D=128
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 188.606 ms / 1.453 TOPS
  • fa: process killed by system
  • nofuse: OOM

This is one of the main motivations for the PR: the xmem path runs a shape that the current alternatives do not handle robustly.

2. Large noncausal throughput comparison

Shape:

  • H=30, L=2048, D=128
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 49.852 ms / 1.292 TOPS
  • fa: 2015.568 ms / 0.03196 TOPS
  • nofuse: 148.097 ms / 0.4350 TOPS

Relative speedup:

  • xmem vs fa: about 40.4x
  • xmem vs nofuse: about 3.0x

3. Large causal prefill throughput comparison

Shape:

  • H=30, L=2048, D=128
  • causal prefill
  • --mask 0 --causal 1

Results:

  • xmem: 46.069 ms / 1.398 TOPS
  • fa: 2150.036 ms / 0.02996 TOPS
  • nofuse: 2155.344 ms / 0.02989 TOPS

Relative speedup:

  • xmem vs fa: about 46.7x
  • xmem vs nofuse: about 46.8x

This is the strongest high-throughput prefill case in the current set.

4. Long-context, larger-head-dim noncausal shape

Shape:

  • H=1, L=16384, D=512
  • noncausal
  • --mask 0 --causal 0

Results:

  • xmem: 272.191 ms / 2.020 TOPS
  • nofuse: 1084.935 ms / 0.5067 TOPS
  • fa: crashes with an uncaught map::at exception (no valid kernel entry for this shape)

Relative speedup:

  • xmem vs nofuse: about 4.0x

This is another important memory/coverage point: the xmem path handles a longer-context / larger-head-dim case where the current fa path does not run.

Memory behavior / routing value

Compared with the current routes, the xmem path has two practical advantages in the large-workload regime:

  • it is much faster in throughput-oriented prefill / noncausal attention
  • it handles larger shapes more robustly

In the current tests, that means:

  • shapes where fa is killed
  • shapes where nofuse OOMs
  • shapes where fa does not have a valid kernel entry and crashes

Decode results

This PR is not presenting xmem as a decode optimization.

Two representative decode tests:

  • short decode context:
    • nq=1 nkv=512 dq=dv=128 n_head=8 n_head_kv=2
    • xmem: 6.562 ms
    • fa: 0.491 ms
    • nofuse: 0.282 ms
  • long decode context:
    • nq=1 nkv=16384 dq=dv=128 n_head=8 n_head_kv=2
    • xmem: 12.350 ms
    • fa: 8.276 ms
    • nofuse: 3.535 ms

So for current q=1 decode:

  • xmem is slower than both existing routes
  • this PR should be read as a large-workload / throughput-path addition, not as a universal attention replacement

Notes

  • the current xmem path is best suited to large-q workloads
  • decode-specific routing can be refined in follow-up work
  • the compile-order workaround is deliberate and currently necessary for deterministic fast binaries on the tested Adreno driver

@happyyzy requested review from a team and ggerganov as code owners on April 19, 2026
@ggml-gh-bot (Bot) commented Apr 19, 2026

Hi @happyyzy, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@happyyzy (Author)

Thanks. The diff is large in line count, but the functional scope is actually quite narrow:

  • it only touches the OpenCL backend
  • it only adds one Adreno-specific attention path
  • it does not change ggml tensor semantics
  • it does not modify non-OpenCL backends
  • it does not change the existing FA / nofuse math paths

Concretely, the PR is just:

  • 1 runtime integration file: ggml-opencl.cpp
  • 3 kernel source files: adreno_xmem_attn.cl, adreno_xmem_exact_qk.cl, adreno_xmem_exact_pv.cl
  • 1 test/bench reproducer: test-opencl-adreno-attn.cpp
  • 2 small CMake file updates

I kept these together because this feature is hard to review meaningfully in partial pieces:

  • runtime wiring without the kernels is not testable
  • kernels without the route-selection / compile-order handling do not reproduce the fast binary on device
  • performance claims without the dedicated reproducer are not verifiable

So although the PR is large, it is still one backend-scoped feature with one test harness.

If maintainers would still prefer a split, I can do that, for example as:

  1. kernel files + minimal loader plumbing
  2. runtime routing / scratch / schedule integration
  3. test and benchmark reproducer

I wanted to first submit the complete, reproducible version so the correctness/performance story is reviewable end-to-end.

The github-actions bot added the labels testing (Everything test related), ggml (changes relating to the ggml tensor library for machine learning), and OpenCL (Issues specific to the OpenCL backend) on Apr 19, 2026.
@lhez (Contributor) commented Apr 21, 2026

@happyyzy Thank you for the PR. Could you provide the models and commands that you used? Could you also provide a citation for xmem attention?

> On current Adreno drivers, compiling the xmem QK/PV kernels later in the overall OpenCL kernel load sequence can produce a materially slower device binary even with identical source and build options.

Does this mean that if you compile the xmem kernel sources at the end of the kernel loading process, the resulting kernels will have worse performance?

> A future follow-up may be able to remove this compiler-order sensitivity by expressing the hot loop more directly, for example via the already-observed QCOM inline-asm path, but that is not part of this PR.

Just curious about this - do you mean inline asm for Adreno? Are you aware of any public inline asm usage for Adreno (e.g., academic paper or open source projects)?

@happyyzy (Author) commented May 2, 2026

Thanks for taking a look.

For the model/commands: the PR reproducer does not require model weights. It generates random Q/K/V tensors and runs the target ggml_flash_attn_ext shapes directly, so the attention path can be tested without distributing a model.
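
As a sketch of what such a weight-free reproducer does, assuming the current ggml_flash_attn_ext(ctx, q, k, v, mask, scale, max_bias, logit_softcap) signature (the PR's test file may differ; check ggml.h for the exact API):

// Illustrative sketch, not the PR's test code: build a ggml_flash_attn_ext
// node over randomly initialized Q/K/V with the benchmark's shape parameters.
#include "ggml.h"
#include <cmath>

static struct ggml_tensor * make_attn(struct ggml_context * ctx,
                                      int dq, int dv, int nq, int nkv,
                                      int n_head, int n_head_kv) {
    // Q is F32 [dq, nq, n_head, 1]; K/V are F16 with n_head_kv heads (GQA).
    struct ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, dq, nq,  n_head,    1);
    struct ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, dq, nkv, n_head_kv, 1);
    struct ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, dv, nkv, n_head_kv, 1);
    // --mask 0 --causal 0: no mask tensor, no causal masking, no softcap.
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL,
                               1.0f / sqrtf((float) dq),
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}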

Build:

cmake -S . -B build-adreno-xmem \
  -DGGML_OPENCL=ON \
  -DGGML_OPENCL_EMBED_KERNELS=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build-adreno-xmem --target ggml-opencl test-opencl-adreno-attn -j

Example large non-causal case:

export GGML_OPENCL_ADRENO_XMEM_ATTN=1

./test-opencl-adreno-attn \
  --route xmem \
  --dq 128 --dv 128 \
  --nq 2048 --nkv 2048 \
  --n-head 30 --n-head-kv 30 \
  --n-batch 1 \
  --mask 0 --causal 0 \
  --warmup 1 --iters 5

Comparison routes:

./test-opencl-adreno-attn --route fa ...
./test-opencl-adreno-attn --route nofuse ...

The motivating workload was Z-Image-style diffusion attention. We tested this path successfully there, but that setup depends on a larger model/runtime/export pipeline, so it is not a good PR-level reproducer. The standalone benchmark is intended to isolate the backend behavior and make the shape-level result reproducible.

For citation: I do not know of public documentation for this Adreno xmem path. The name xmem in this PR is our descriptive name for a hidden Adreno on-chip memory path that we identified by reverse engineering emitted OpenCL kernels and device binaries.

The relevant OpenCL source-level pieces are the Qualcomm subgroup load extensions, especially:

  • cl_qcom_subgroup_uniform_load
  • cl_qcom_subgroup_constant_load
  • qcom_sub_group_constant_load*

The important part is that the compiled ISA is not just ordinary global/local/image memory traffic. The disassembly shows distinct instructions and data movement patterns corresponding to this hidden on-chip memory route. Conceptually, it is closer to a vendor-specific on-chip staging memory path, somewhat analogous in role to Blackwell TMEM, although the programming interface and hardware details are Qualcomm-specific and not publicly specified.

About compile order: yes. On the Adreno 830 driver I tested, compiling the xmem QK/PV kernels later in the overall OpenCL kernel loading sequence can produce a materially slower device binary, even with the same source and build options. That is why the PR compiles the split QK/PV translation units early.

That said, this is not required for the path to be useful. Even with the normal compile order, the large-shape xmem attention path is still much faster than the current FA/nofuse routes, often by tens of times on the large prefill/non-causal cases. The early split compilation is only to preserve the best observed binary and avoid leaving a large amount of Adreno performance on the table.

About inline asm: yes, I meant Adreno OpenCL C inline asm. I have confirmed locally that the compiler accepts a minimal asm probe; for example, this FP16 MAD-style probe in OpenCL C form was accepted:

half a = (half) 1.0h;
half b = (half) 2.0h;
half c = (half) 3.0h;
half out;

__asm__ volatile(
    "mad.f16 %0, %1, %2, %3;"
    : "=h"(out)
    : "h"(a), "h"(b), "h"(c)
);

However, embedding inline asm into the real QK/PV hot loop caused backend compile failure. The PR therefore does not depend on inline asm. I am not aware of a public paper or open-source project that documents this inline-asm route well enough to cite.

@happyyzy (Author) commented May 3, 2026

One more note about the compile-order workaround.

If this part looks too unusual for an initial upstream PR, I can remove it and update the PR to a cleaner version that compiles the xmem attention kernels in the normal OpenCL kernel loading order.

That version would remove:

  • adreno_xmem_exact_qk.cl
  • adreno_xmem_exact_pv.cl
  • the early QK/PV split-program build path
  • the compile-order-specific comment

The PR would then be simpler: one xmem attention kernel source, one runtime integration path, and one test/benchmark reproducer.

The tradeoff is performance. On the Adreno 830 driver I tested, the normal compile order still gives a large speedup over the current FA/nofuse routes on large prefill/non-causal shapes, but it leaves some performance on the table compared with the early split-compile version.

So I am fine with either direction:

  • keep the current version if maintainers are comfortable with the driver workaround and want the best measured Adreno performance
  • or I can replace it with the cleaner normal-compile-order version first, and leave the split/early-compile optimization for a later follow-up after the base xmem path is reviewed

@happyyzy (Author) commented May 6, 2026

I opened a smaller and more focused Adreno xmem GEMM PR here: #22755

It follows the existing OpenCL matmul path more closely and is opt-in; on Adreno 830 it improves Qwen2.5 1.5B/3B F16 prefill by about 1.74x/1.62x while leaving decode unchanged.

This may be easier to review and merge first as the minimal xmem backend entry point.

@happyyzy (Author) commented May 6, 2026

Closing this for now to comply with the one-open-PR limit for new contributors. I will focus on the smaller xmem GEMM PR first: #22755

The attention path can be revisited later as a smaller follow-up after the base xmem GEMM path is reviewed.

