SYCL: reduce allocation overhead during flash attention#22732
sanmai wants to merge 2 commits into ggml-org:master from
Conversation
```cpp
sycl::half * alloc(size_t n_elems) {
    ptr = buf.ensure_half(n_elems);
    return ptr;
}
```
The calling code does not use the return value, but `ggml_sycl_pool_alloc::alloc` returns one, so keep doing that here too to reduce the surprise if someone later changes the calling code to rely on the return value.
```diff
-    ggml_sycl_pool_alloc<sycl::half> K_f16(pool);
-    ggml_sycl_pool_alloc<sycl::half> V_f16(pool);
+    ggml_sycl_fattn_alloc K_f16(fbuf.K);
+    ggml_sycl_fattn_alloc V_f16(fbuf.V);
```
I considered adding a no-op template to make it look more uniform:

```cpp
ggml_sycl_fattn_alloc<sycl::half> K_f16(fbuf.K);
ggml_sycl_fattn_alloc<sycl::half> V_f16(fbuf.V);
```
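A minimal sketch of how such a no-op template wrapper could look (the `half_buffer` type and member names here are illustrative stand-ins, not the PR's actual code): templating on the element type keeps call sites looking like the old `ggml_sycl_pool_alloc<T>`, while the wrapper simply forwards to a grow-on-demand buffer that owns the storage.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the PR's grow-on-demand buffer: grows to the largest
// request seen, never shrinks, and hands out its storage pointer.
struct half_buffer {
    std::vector<unsigned short> storage; // stand-in for sycl::half storage

    void * ensure(std::size_t n_elems) {
        if (n_elems > storage.size()) {
            storage.resize(n_elems); // grow only, never shrink
        }
        return storage.data();
    }
};

// Hypothetical no-op template: the type parameter exists only so call
// sites read like ggml_sycl_pool_alloc<T>.
template <typename T>
struct fattn_alloc {
    half_buffer & buf;
    T * ptr = nullptr;

    explicit fattn_alloc(half_buffer & b) : buf(b) {}

    T * alloc(std::size_t n_elems) {
        ptr = static_cast<T *>(buf.ensure(n_elems));
        return ptr; // keep returning the pointer, like pool_alloc::alloc
    }
};
```

With this shape, a call site such as `fattn_alloc<unsigned short> K_f16(fbuf_K);` followed by `K_f16.alloc(n);` mirrors the old pool-based code while reusing one persistent buffer.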
arthw left a comment
Good job!
I tested it and the memory usage is reduced. This is great for users.
Comments:
- `DEBUG_SYCL_POOL`: please update the description in SYCL.md.
- The feature is about flash attention, so please move the code into fattn-common.hpp or fattn-xxx.cpp/hpp. Flash attention is a big feature, so I suggest moving all of its code into the fattn-xxx files. common.hpp is for more common code, and ggml-sycl.cpp is for the base API of the SYCL backend.
Thank you!
Fixes #22585
Overview
I found that flash attention allocated quite a few K/V buffers with little reuse, and those buffers remained in the legacy pool until teardown. A better strategy is to allocate the FA buffers outside the common pool and grow them on demand, so that FA holds at most the largest buffers it needs.
Arguably, there are more optimal strategies, since we still leave the buffers occupied. That said, there are no evictions in the legacy pool, so at the very least we are in a better spot.
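The difference between the two strategies can be contrasted with a minimal sketch (the types and names below are hypothetical illustrations, not the PR's code): a pool without eviction retains every allocation until teardown, while a grow-only buffer retains at most the largest size ever requested.

```cpp
#include <cstddef>
#include <vector>

// Simplistic model of a pool with no eviction: every allocation is
// retained until teardown (real pools may reuse freed blocks, but as
// noted above the FA allocations see little reuse).
struct no_evict_pool {
    std::vector<std::vector<char>> blocks;
    std::size_t held = 0; // total bytes retained by the pool

    void * alloc(std::size_t n) {
        blocks.emplace_back(n);
        held += n;
        return blocks.back().data();
    }
};

// Model of the dedicated FA buffer: grows on demand, so it only ever
// holds the largest request seen.
struct grow_only_buffer {
    std::vector<char> storage;

    void * ensure(std::size_t n) {
        if (n > storage.size()) {
            storage.resize(n);
        }
        return storage.data();
    }
};
```

Requesting sizes 100, 200, and 300 leaves the pool model holding 600 bytes, while the grow-only buffer holds only 300, which is the shape of the savings reported below.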
`ggml_sycl_fattn_alloc::alloc` is called again, so no queue sync should be needed, but I added one anyway just to be extra safe.
Additional information
Memory benchmarks with B60 and Qwen3.6-35B-A3B UD-Q4_K_M, q4_0:
Baseline (without buffers)
Smaller prompt:
Larger prompt:
With buffers
Smaller prompt:
68.80 MiB savings compared to the baseline.
Larger prompt:
That's 272.87 MiB total for the pool plus the two FA buffers, or 897.37 MiB of savings compared to the baseline. In a memory-constrained environment that could be a deal breaker.
It is reflected in the common memory breakdown just as one would expect.
Discussion
From what I can tell, other backends, such as CUDA, do not have this problem; specifically, CUDA uses a VMM pool. I briefly considered adding `ggml_sycl_pool_vmm` as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.
Requirements