Draft: ggml-opencl: Early proof-of-concept implementation of plans via command buffers by jansol · Pull Request #22764 · ggml-org/llama.cpp

jansol · 2026-05-06T14:20:04Z

Overview

This is still very incomplete and far from ready for proper review, but in offline discussions there has been some interest towards this, so I'm opening a draft PR for easier discussion & collaboration.

This is a not-at-all baked implementation of the ggml plan API with OpenCL command buffers that I've been experimenting with, built on top of the shared execution plan code from #16548. It has only been tested with gemma-3-1b-it-f16.gguf, and only on the PoCL-CUDA and ROCm drivers.
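For readers unfamiliar with the idea, here is a minimal sketch of the record-once, replay-many pattern that plans and command buffers enable. It is not code from this PR and does not use the OpenCL or ggml APIs: `MockQueue`, `Plan`, `run_eager` and `run_plan_demo` are hypothetical names, and the "kernels" are plain callables. It only models how many times the host has to submit work.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device queue: it only counts host-side submissions.
struct MockQueue {
    std::size_t submissions = 0;
};

// Hypothetical plan: commands are recorded once, finalized, then replayed
// as a single submission per graph evaluation.
class Plan {
public:
    void record(std::function<void()> cmd) {
        assert(!finalized_ && "cannot record after finalize()");
        commands_.push_back(std::move(cmd));
    }

    void finalize() { finalized_ = true; }

    void replay(MockQueue & q) const {
        assert(finalized_ && "plan must be finalized before replay");
        ++q.submissions;                       // one submission for the whole graph
        for (const auto & cmd : commands_) {
            cmd();
        }
    }

private:
    std::vector<std::function<void()>> commands_;
    bool finalized_ = false;
};

// Eager path: every graph node is submitted individually.
inline void run_eager(MockQueue & q, const std::vector<std::function<void()>> & nodes) {
    for (const auto & node : nodes) {
        ++q.submissions;
        node();
    }
}

struct DemoResult {
    std::size_t eager_submissions;
    std::size_t plan_submissions;
    int         kernel_runs;
};

// Evaluates the same mock graph n_tokens times, first eagerly, then via a plan.
inline DemoResult run_plan_demo(std::size_t n_nodes, std::size_t n_tokens) {
    int kernel_runs = 0;
    std::vector<std::function<void()>> nodes(n_nodes, [&kernel_runs] { ++kernel_runs; });

    MockQueue eager_q;
    for (std::size_t t = 0; t < n_tokens; ++t) {
        run_eager(eager_q, nodes);
    }

    Plan plan;
    for (const auto & node : nodes) {
        plan.record(node);
    }
    plan.finalize();

    MockQueue plan_q;
    for (std::size_t t = 0; t < n_tokens; ++t) {
        plan.replay(plan_q);
    }

    return {eager_q.submissions, plan_q.submissions, kernel_runs};
}
```

With 4 nodes and 3 evaluations, the eager path performs 12 submissions while the plan path performs 3. That difference in host-side round trips is what should matter most on a slow host<->device link.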

I've also included some changes towards async support in ggml-opencl, as I saw a lot of time being spent on synchronous I/O when running on PoCL-Remote.

In my PoCL-Remote experiments I see some nice performance improvements from both the command buffers and the partial async-ification alike. PoCL-Remote is of course an extreme corner case of slow host<->device communication, but other setups should also see either slightly improved performance or no meaningful change.

Additional information

There are a bunch of changes to make various kernels build at all on ROCm and PoCL-CUDA. I would expect that most of those become unnecessary when #21310 and its follow-up PRs land.

The biggest open question is how to handle the temporary (sub-)buffers that several kernels create. This will need to be solved for both proper async support and command buffers. I've drafted an attempt at deferred freeing, but with command buffers it should also be possible to tie temporary buffers directly to the lifetime of the command buffer itself.
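To make the two options concrete, here is a small sketch of both lifetime strategies. It is not the PR's implementation and does not call OpenCL: `TempBuffer`, `DeferredFreeList`, `PlanWithTemporaries` and `run_lifetime_demo` are hypothetical names, and a live-object counter stands in for real device allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a temporary device (sub-)buffer.
// It only tracks how many temporaries are currently alive.
struct TempBuffer {
    explicit TempBuffer(int & live) : live_(live) { ++live_; }
    ~TempBuffer() { --live_; }
    TempBuffer(const TempBuffer &) = delete;
    TempBuffer & operator=(const TempBuffer &) = delete;

private:
    int & live_;
};

// Strategy A: deferred freeing. A kernel launcher hands its temporaries over
// instead of releasing them right after enqueueing asynchronous work.
class DeferredFreeList {
public:
    void defer(std::unique_ptr<TempBuffer> buf) { pending_.push_back(std::move(buf)); }

    // Call only once the queue is known to be idle (e.g. after a blocking sync).
    void on_queue_idle() { pending_.clear(); }

    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<std::unique_ptr<TempBuffer>> pending_;
};

// Strategy B: the plan owns its temporaries, so they live exactly as long as
// the recorded commands that reference them.
class PlanWithTemporaries {
public:
    TempBuffer & make_temp(int & live) {
        temps_.push_back(std::make_unique<TempBuffer>(live));
        return *temps_.back();
    }

private:
    std::vector<std::unique_ptr<TempBuffer>> temps_;
};

struct LifetimeDemo {
    int live_while_pending;
    int live_after_idle;
    int live_during_plan;
    int live_after_plan;
};

inline LifetimeDemo run_lifetime_demo() {
    int live = 0;
    LifetimeDemo r{};

    DeferredFreeList deferred;
    deferred.defer(std::make_unique<TempBuffer>(live));
    deferred.defer(std::make_unique<TempBuffer>(live));
    r.live_while_pending = live;   // temporaries survive until the sync point
    deferred.on_queue_idle();
    r.live_after_idle = live;

    {
        PlanWithTemporaries plan;
        plan.make_temp(live);
        plan.make_temp(live);
        plan.make_temp(live);
        r.live_during_plan = live; // temporaries stay valid for every replay
    }                              // plan destroyed: its temporaries go with it
    r.live_after_plan = live;

    return r;
}
```

Strategy A needs a well-defined point at which the queue is known to be idle. Strategy B needs no such point, but keeps memory allocated for as long as the plan exists.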

@wishstudio I took the liberty of cherry-picking only the relevant bits from your PR into a new commit and marking you as the author of that. Let me know if you'd prefer some other way of handling that.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

ggml-gh-bot · 2026-05-06T14:24:39Z

Hi @jansol, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

jansol and others added 5 commits May 6, 2026 16:47

hacky patches to make it work on pocl-cuda

138d41e

Implement and use cuda graph plans.

9abbfe0

opencl: use command buffers when available

0342381

opencl: async-ify tensor I/O a bit

6049b97

hacks to make the ROCm compiler happy(ier)

5b85b9b

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels on May 6, 2026

jansol wants to merge 5 commits into ggml-org:master from jansol:experimental
