SYCL: reduce allocation overhead during flash attention#22732
sanmai wants to merge 2 commits into ggml-org:master from
Conversation
```cpp
sycl::half * alloc(size_t n_elems) {
    ptr = buf.ensure_half(n_elems);
    return ptr;
}
```
The calling code does not use the return value, but `ggml_sycl_pool_alloc::alloc` returns one, so keep doing that here too to reduce the surprise if someone later changes the calling code to rely on the return value.
```diff
-    ggml_sycl_pool_alloc<sycl::half> K_f16(pool);
-    ggml_sycl_pool_alloc<sycl::half> V_f16(pool);
+    ggml_sycl_fattn_alloc K_f16(fbuf.K);
+    ggml_sycl_fattn_alloc V_f16(fbuf.V);
```
I considered adding a no-op template to make it look more uniform:

```cpp
ggml_sycl_fattn_alloc<sycl::half> K_f16(fbuf.K);
ggml_sycl_fattn_alloc<sycl::half> V_f16(fbuf.V);
```
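A minimal sketch of how such a no-op template wrapper could look (the `half_buffer` type and member names here are illustrative stand-ins, not the PR's actual code): templating on the element type keeps call sites looking like the old `ggml_sycl_pool_alloc<T>`, while the wrapper simply forwards to a grow-on-demand buffer that owns the storage.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the PR's grow-on-demand buffer: grows to the largest
// request seen, never shrinks, and hands out its storage pointer.
struct half_buffer {
    std::vector<unsigned short> storage; // stand-in for sycl::half storage

    void * ensure(std::size_t n_elems) {
        if (n_elems > storage.size()) {
            storage.resize(n_elems); // grow only, never shrink
        }
        return storage.data();
    }
};

// Hypothetical no-op template: the type parameter exists only so call
// sites read like ggml_sycl_pool_alloc<T>.
template <typename T>
struct fattn_alloc {
    half_buffer & buf;
    T * ptr = nullptr;

    explicit fattn_alloc(half_buffer & b) : buf(b) {}

    T * alloc(std::size_t n_elems) {
        ptr = static_cast<T *>(buf.ensure(n_elems));
        return ptr; // keep returning the pointer, like pool_alloc::alloc
    }
};
```

With this shape, a call site such as `fattn_alloc<unsigned short> K_f16(fbuf_K);` followed by `K_f16.alloc(n);` mirrors the old pool-based code while reusing one persistent buffer.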
arthw left a comment
Good job!
I tested it and the memory usage is reduced. This is great for users.
Comments:
- `DEBUG_SYCL_POOL`: please update the description in SYCL.md.
- The feature is about flash attention, so please move the code into fattn-common.hpp or fattn-xxx.cpp/hpp. Flash attention is a big feature, so I suggest moving all of its code into the fattn-xxx files. common.hpp is for more common code, and ggml-sycl.cpp is for the base API of the SYCL backend.
Thank you!
Fixes #22585
Overview
I found that flash attention allocated quite a few K/V buffers with little reuse, and those buffers remained in the legacy pool until teardown. A better strategy is to allocate the FA buffers outside the common pool and grow them on demand, so that FA holds at most the largest buffers it needs.
Arguably, there are more optimal strategies, since we still leave the buffers occupied. That said, there are no evictions in the legacy pool, so at the very least we are in a better spot.
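The difference between the two strategies can be contrasted with a minimal sketch (the types and names below are hypothetical illustrations, not the PR's code): a pool without eviction retains every allocation until teardown, while a grow-only buffer retains at most the largest size ever requested.

```cpp
#include <cstddef>
#include <vector>

// Simplistic model of a pool with no eviction: every allocation is
// retained until teardown (real pools may reuse freed blocks, but as
// noted above the FA allocations see little reuse).
struct no_evict_pool {
    std::vector<std::vector<char>> blocks;
    std::size_t held = 0; // total bytes retained by the pool

    void * alloc(std::size_t n) {
        blocks.emplace_back(n);
        held += n;
        return blocks.back().data();
    }
};

// Model of the dedicated FA buffer: grows on demand, so it only ever
// holds the largest request seen.
struct grow_only_buffer {
    std::vector<char> storage;

    void * ensure(std::size_t n) {
        if (n > storage.size()) {
            storage.resize(n);
        }
        return storage.data();
    }
};
```

Requesting sizes 100, 200, and 300 leaves the pool model holding 600 bytes, while the grow-only buffer holds only 300, which is the shape of the savings reported below.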
`ggml_sycl_fattn_alloc::alloc` is called again, so no queue sync should be needed, but I added one anyway just to be extra safe.
Additional information
Memory benchmarks with B60 and Qwen3.6-35B-A3B UD-Q4_K_M, q4_0:
Baseline (without buffers)
Smaller prompt:
Larger prompt:
With buffers
Smaller prompt:
68.80 MiB savings compared to the baseline.
Larger prompt:
That's 272.87 MiB total for the pool plus the two FA buffers, or 897.37 MiB of savings compared to the baseline. In a memory-constrained environment that could be a deal breaker.
It is reflected in the common memory breakdown just as one would expect.
Discussion
From what I can tell, other backends, such as CUDA, do not have this problem; specifically, CUDA uses a VMM pool. I briefly considered adding `ggml_sycl_pool_vmm` as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.
Requirements