ggml-opencl: add Adreno xmem attention path #22117
Conversation
---
Hi @happyyzy, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
---
Thanks. The diff is large in line count, but the functional scope is actually quite narrow:
Concretely, the PR is just:
I kept these together because this feature is hard to review meaningfully in partial pieces:
So although the PR is large, it is still one backend-scoped feature with one test harness. If maintainers would still prefer a split, I can do that, for example as:
I wanted to submit the complete, reproducible version first so that the correctness/performance story is reviewable end-to-end.
---
@happyyzy Thank you for the PR. Could you provide the models and commands that you use? Could you also provide a citation for xmem attention?
Does this mean that if you compile xmem kernel sources at the end of kernel loading process, the resulting kernels will have worse performance?
Just curious about this - do you mean inline asm for Adreno? Are you aware of any public inline asm usage for Adreno (e.g., an academic paper or an open-source project)?
---
Thanks for taking a look. For the model/commands: the PR reproducer does not require model weights. It generates random Q/K/V tensors and runs the target routes.

Build:

```shell
cmake -S . -B build-adreno-xmem \
    -DGGML_OPENCL=ON \
    -DGGML_OPENCL_EMBED_KERNELS=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build-adreno-xmem --target ggml-opencl test-opencl-adreno-attn -j
```

Example large non-causal case:

```shell
export GGML_OPENCL_ADRENO_XMEM_ATTN=1
./test-opencl-adreno-attn \
    --route xmem \
    --dq 128 --dv 128 \
    --nq 2048 --nkv 2048 \
    --n-head 30 --n-head-kv 30 \
    --n-batch 1 \
    --mask 0 --causal 0 \
    --warmup 1 --iters 5
```

Comparison routes:

```shell
./test-opencl-adreno-attn --route fa ...
./test-opencl-adreno-attn --route nofuse ...
```

The motivating workload was Z-Image-style diffusion attention. We tested this path successfully there, but that setup depends on a larger model/runtime/export pipeline, so it is not a good PR-level reproducer. The standalone benchmark is intended to isolate the backend behavior and make the shape-level result reproducible.

For citation: I do not know of public documentation for this Adreno xmem path. The name xmem in this PR is our descriptive name for a hidden Adreno on-chip memory path that we identified by reverse engineering emitted OpenCL kernels and device binaries. The relevant OpenCL source-level pieces are the Qualcomm subgroup load extensions, especially:

The important part is that the compiled ISA is not just ordinary global/local/image memory traffic. The disassembly shows distinct instructions and data-movement patterns corresponding to this hidden on-chip memory route. Conceptually, it is closer to a vendor-specific on-chip staging memory path, somewhat analogous in role to Blackwell TMEM, although the programming interface and hardware details are Qualcomm-specific and not publicly specified.

About compile order: yes. On the Adreno 830 driver I tested, compiling the xmem QK/PV kernels later in the overall OpenCL kernel loading sequence can produce a materially slower device binary, even with the same source and build options. That is why the PR compiles the split QK/PV translation units early. That said, this is not required for the path to be useful. Even with the normal compile order, the large-shape xmem attention path is still much faster than the current FA/nofuse routes, often by tens of times on the large prefill/non-causal cases. The early split compilation is only to preserve the best observed binary and avoid leaving a large amount of Adreno performance on the table.

About inline asm: yes, I meant Adreno OpenCL C inline asm. I have confirmed locally that a minimal asm probe can be accepted by the compiler. For example, a minimal FP16 MAD-style probe was accepted in OpenCL C form like this:

```c
half a = (half) 1.0h;
half b = (half) 2.0h;
half c = (half) 3.0h;
half out;
__asm__ volatile(
    "mad.f16 %0, %1, %2, %3;"
    : "=h"(out)
    : "h"(a), "h"(b), "h"(c)
);
```

However, embedding inline asm into the real QK/PV hot loop caused backend compile failure. The PR therefore does not depend on inline asm. I am not aware of a public paper or open-source project that documents this inline-asm route well enough to cite.
---
One more note about the compile-order workaround. If this part looks too unusual for an initial upstream PR, I can remove it and update the PR to a cleaner version that compiles the xmem attention kernels in the normal OpenCL kernel loading order. That version would remove:
The PR would then be simpler: one xmem attention kernel source, one runtime integration path, and one test/benchmark reproducer. The tradeoff is performance. On the Adreno 830 driver I tested, the normal compile order still gives a large speedup over the current FA/nofuse routes on large prefill/non-causal shapes, but it leaves some performance on the table compared with the early split-compile version. So I am fine with either direction:
---
I opened a smaller and more focused Adreno xmem GEMM PR here: #22755. It follows the existing OpenCL matmul path more closely and is opt-in; on Adreno 830 it improves Qwen2.5 1.5B/3B F16 prefill by about 1.74x/1.62x while leaving decode unchanged. This may be easier to review and merge first as the minimal xmem backend entry point.
---
Closing this for now to comply with the one-open-PR limit for new contributors. I will focus on the smaller xmem GEMM PR first: #22755. The attention path can be revisited later as a smaller follow-up after the base xmem GEMM path is reviewed.
Summary
This PR adds an Adreno-specific xmem attention path to `ggml-opencl` behind `GGML_OPENCL_ADRENO_XMEM_ATTN`.

The new path is intended for high-throughput attention workloads on Adreno, especially:
- cases where the `fa` or `nofuse` routes are too slow, run out of memory, or fail to run

This PR does not claim a decode speedup. In current `q=1` decode tests, the existing `fa`/`nofuse` routes are still faster.

Implementation
- `ggml-opencl` runtime integration behind an explicit env gate
- `test-opencl-adreno-attn` to reproduce correctness and route-to-route performance

Why the compile order change exists
On current Adreno drivers, compiling the xmem QK/PV kernels later in the overall OpenCL kernel load sequence can produce a materially slower device binary even with identical source and build options.
This PR keeps a deterministic early compile order for the xmem QK/PV units because that is currently the most stable way to preserve the fast binary on device.
A future follow-up may be able to remove this compiler-order sensitivity by expressing the hot loop more directly, for example via the already-observed QCOM inline-asm path, but that is not part of this PR.
Correctness coverage
The new path was rechecked for:
Representative correctness results:
- `dq=128 dv=128 nq=256 nkv=512 n_head=8 n_head_kv=2`: `mae=6.7302e-05`, `max_abs=8.9369e-04`, `cos=0.999993535`
- `dq=128 dv=128 nq=256 nkv=256 n_head=8 n_head_kv=2`: `mae=7.8839e-05`, `max_abs=9.92954e-04`, `cos=0.999999313`
- `dq=128 dv=128 nq=1 nkv=512 n_head=8 n_head_kv=2`: `mae=7.4717e-05`, `max_abs=4.90941e-04`, `cos=0.999992445`

Device

`0800.56.1`

All throughput numbers below are full-route `gpu_ms` / effective TOPS, not `qk+pv` core-only accounting.

Performance tests
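For reference on how the numbers in this summary are scored: the correctness results above report `mae`, `max_abs`, and `cos`. These can be sketched in pure Python as follows (the helper names are illustrative; the PR's actual harness is the C++ test binary):

```python
import math

def mae(ref, out):
    # mean absolute error between reference and device output
    return sum(abs(r - o) for r, o in zip(ref, out)) / len(ref)

def max_abs(ref, out):
    # largest absolute element-wise difference
    return max(abs(r - o) for r, o in zip(ref, out))

def cos(ref, out):
    # cosine similarity of the flattened output tensors
    dot = sum(r * o for r, o in zip(ref, out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    return dot / (norm_ref * norm_out)
```

A `cos` value near 1.0 together with a small `mae`/`max_abs` indicates the xmem route matches the reference at FP16-level tolerance.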
1. Large noncausal Z-Image-like shape
Shape:
`H=30, L=4224, D=128`, `--mask 0 --causal 0`

Results:
- xmem: `188.606 ms / 1.453 TOPS`

This is one of the main motivations for the PR: the xmem path runs a shape that the current alternatives do not handle robustly.
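The effective TOPS figures in these tests can be reproduced from the shape and the measured `gpu_ms`. A small sketch, assuming the conventional full-attention count of `4*H*L*L*D` FLOPs (QK^T plus PV, a multiply-add counted as two ops); the reported causal numbers also appear to use this full, unmasked count:

```python
def attention_tops(n_head, seq_len, head_dim, gpu_ms):
    # QK^T and PV each cost ~2 * H * L * L * D FLOPs (mul + add),
    # so a full non-causal pass is ~4 * H * L * L * D.
    flops = 4 * n_head * seq_len * seq_len * head_dim
    return flops / (gpu_ms / 1e3) / 1e12

# The Z-Image-like case above: H=30, L=4224, D=128 at 188.606 ms
print(round(attention_tops(30, 4224, 128, 188.606), 3))  # 1.453
```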
2. Large noncausal throughput comparison
Shape:
`H=30, L=2048, D=128`, `--mask 0 --causal 0`

Results:
- xmem: `49.852 ms / 1.292 TOPS`
- fa: `2015.568 ms / 0.03196 TOPS`
- nofuse: `148.097 ms / 0.4350 TOPS`

Relative speedup: `40.4x` over `fa`, `3.0x` over `nofuse`

3. Large causal prefill throughput comparison
Shape:
`H=30, L=2048, D=128`, `--mask 0 --causal 1`

Results:
- xmem: `46.069 ms / 1.398 TOPS`
- fa: `2150.036 ms / 0.02996 TOPS`
- nofuse: `2155.344 ms / 0.02989 TOPS`

Relative speedup: `46.7x` over `fa`, `46.8x` over `nofuse`

This is the strongest high-throughput prefill case in the current set.
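The relative speedups quoted in these comparisons are plain `gpu_ms` ratios against the xmem route. For this causal prefill case (route labels inferred from the xmem/fa/nofuse order used elsewhere in the thread):

```python
# gpu_ms values from the causal prefill comparison above
xmem_ms, fa_ms, nofuse_ms = 46.069, 2150.036, 2155.344

speedup_vs_fa = fa_ms / xmem_ms
speedup_vs_nofuse = nofuse_ms / xmem_ms
print(round(speedup_vs_fa, 1), round(speedup_vs_nofuse, 1))  # 46.7 46.8
```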
4. Long-context, larger-head-dim noncausal shape
Shape:
`H=1, L=16384, D=512`, `--mask 0 --causal 0`

Results:
- xmem: `272.191 ms / 2.020 TOPS`
- nofuse: `1084.935 ms / 0.5067 TOPS`
- fa: fails with a `map::at` exception

Relative speedup: `4.0x` over `nofuse`

This is another important memory/coverage point: the xmem path handles a longer-context / larger-head-dim case where the current `fa` path does not run.

Memory behavior / routing value
Compared with the current routes, the xmem path has two practical advantages in the large-workload regime:
In the current tests, that means:
- `fa` is killed
- `nofuse` OOMs
- `fa` does not have a valid kernel entry and crashes

Decode results
This PR is not presenting xmem as a decode optimization.
Two representative decode tests:
- `nq=1 nkv=512 dq=dv=128 n_head=8 n_head_kv=2`: xmem `6.562 ms`, fa `0.491 ms`, nofuse `0.282 ms`
- `nq=1 nkv=16384 dq=dv=128 n_head=8 n_head_kv=2`: xmem `12.350 ms`, fa `8.276 ms`, nofuse `3.535 ms`

So for current `q=1` decode, the existing `fa`/`nofuse` routes are still faster.

Notes

The xmem path is aimed at large-`q` workloads rather than `q=1` decode.