
Fix OOM validation on vulkan #9359

Open
kristoff3r wants to merge 1 commit into gfx-rs:trunk from kristoff3r:ks/vulkan-oom

Conversation

@kristoff3r
Contributor

Connections
Fixes #8479

Description
I investigated the issue above and found a few problems with the Vulkan OOM validation:

  • It would eagerly fail if any matching heap was out of memory, but on some systems there are multiple usable heaps, so it should really check whether all matching heaps are out.
  • The checking logic was duplicated between check_if_oom and error_if_would_oom_on_resource_allocation, and both of them had the issue above. On top of that, error_if_would_oom_on_resource_allocation would ignore heaps with vk::MemoryPropertyFlags::LAZILY_ALLOCATED | vk::MemoryPropertyFlags::PROTECTED while check_if_oom would not, which seems wrong, though I'm not sure.

To fix this, I moved the shared logic into a helper function and fixed the issue there. I also simplified the code a bit.
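The corrected check described above can be sketched roughly as follows. This is an illustrative simplification, not wgpu-hal's actual code: `HeapInfo` and `would_oom` are hypothetical names, and real budget/usage numbers would come from the driver (e.g. via VK_EXT_memory_budget).

```rust
// Hypothetical simplification of the fix: declare OOM only when *all*
// matching heaps lack space, instead of failing as soon as one does.
// `HeapInfo` and `would_oom` are illustrative names, not wgpu-hal's API.

#[derive(Clone, Copy)]
struct HeapInfo {
    budget: u64, // bytes the driver allows us to use on this heap
    usage: u64,  // bytes currently in use on this heap
}

/// Returns true only if none of the candidate heaps can fit `size` bytes.
fn would_oom(matching_heaps: &[HeapInfo], size: u64) -> bool {
    matching_heaps
        .iter()
        .all(|heap| heap.usage.saturating_add(size) > heap.budget)
}

fn main() {
    // One nearly-full heap plus one with plenty of room: the old eager
    // check would have reported OOM here; the all-heaps check does not.
    let heaps = [
        HeapInfo { budget: 256, usage: 250 },
        HeapInfo { budget: 8192, usage: 100 },
    ];
    assert!(!would_oom(&heaps, 64));

    // Only once every matching heap is exhausted do we report OOM.
    assert!(would_oom(&[HeapInfo { budget: 256, usage: 256 }], 1));
    println!("ok");
}
```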

Testing
The CTS tests pass now, at least on my machine.

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@kristoff3r changed the title from "Fixes OOM validation on vulkan" to "Fix OOM validation on vulkan" Apr 2, 2026
ErichDonGubler previously approved these changes Apr 3, 2026
Member

@ErichDonGubler left a comment


I've reviewed everything here, and the code itself looks good. So long as the design is what we want, I think we're good to merge. I wanted to confirm with folks who know more (CC @teoxoy and @jimblandy): Is this design right? It makes sense to me that, if there are multiple heaps, and we multiplex between them to avoid OOMs anyway, then we should ensure that no heaps could accept an allocation before declaring an OOM.

@andyleiserson
Contributor

I'd also like @teoxoy to take a look. If checking for space on any heap is an empirical improvement, then maybe it's better, but it seems like what we really need here is something that is smarter about figuring out which heap is actually needed, since they may not be interchangeable. Or a different approach would be to try and guess which heap is the "primary" heap, and which allocations are "normal" allocations, and only apply the OOM-prevention to those rather than to everything.

#9206 I think could be duped to #8479, and https://bugzilla.mozilla.org/show_bug.cgi?id=2028252 is another issue that may be relevant.

@kristoff3r
Contributor Author

> I'd also like @teoxoy to take a look. If checking for space on any heap is an empirical improvement, then maybe it's better, but it seems like what we really need here is something that is smarter about figuring out which heap is actually needed, since they may not be interchangeable. Or a different approach would be to try and guess which heap is the "primary" heap, and which allocations are "normal" allocations, and only apply the OOM-prevention to those rather than to everything.
>
> #9206 I think could be duped to #8479, and https://bugzilla.mozilla.org/show_bug.cgi?id=2028252 is another issue that may be relevant.

That makes sense. I think the current behavior is wrong, but it also feels brittle to hard-code which heaps gpu-allocator can use, and I don't know what happens in more complicated setups like those with multiple GPUs. It feels like a check like this should be conservative; isn't it better to hit the OOM behavior in some cases than for wgpu to incorrectly deny the allocation?

If you want a different strategy I can try to rework my PR.

@ErichDonGubler
Member

One of the deleted comments indicates that the reason a definitive OOM diagnosis wasn't possible was because gpu-alloc didn't afford a way to check for OOM itself. Perhaps adding upstream support for this is the better solution?

@teoxoy
Member

teoxoy commented Apr 7, 2026

The problem with this approach is that an OOM situation might not be reported if multiple memory types with compatible flags exist because gpu-allocator always picks the first matching memory type (see its find_memorytype_index function).

Also, I think the check_if_oom implementation should be checking that all heaps still have space which is consistent with the implementation in the D3D12 backend.
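The first-match behavior described here can be illustrated with a small sketch. This is not gpu-allocator's real code; the struct, flag constant, and values below are made-up stand-ins used only to show why free space on a later matching heap never helps.

```rust
// Illustrative sketch of first-match memory-type selection, analogous to
// gpu-allocator's find_memorytype_index. The types and flags here are
// made-up stand-ins, not the real Vulkan or gpu-allocator definitions.

#[derive(Clone, Copy)]
struct MemoryType {
    property_flags: u32,
    heap_index: usize,
}

/// Picks the index of the *first* memory type satisfying `required`.
fn find_memorytype_index(types: &[MemoryType], required: u32) -> Option<usize> {
    types
        .iter()
        .position(|t| t.property_flags & required == required)
}

fn main() {
    const HOST_VISIBLE: u32 = 0b01;
    let types = [
        MemoryType { property_flags: HOST_VISIBLE, heap_index: 0 }, // small heap, full
        MemoryType { property_flags: HOST_VISIBLE, heap_index: 1 }, // large heap, empty
    ];
    // Both types match, but only the first is ever chosen: an OOM check
    // that accepts free space on heap 1 would pass, yet the allocation
    // still lands on (and fails in) heap 0.
    let chosen = find_memorytype_index(&types, HOST_VISIBLE).unwrap();
    assert_eq!(types[chosen].heap_index, 0);
    println!("ok");
}
```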

@teoxoy
Member

teoxoy commented Apr 7, 2026

I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.

@ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?

@teoxoy self-assigned this Apr 7, 2026
@ErichDonGubler dismissed their stale review April 7, 2026 22:21

Teo has feedback.

@ErichDonGubler
Member

> I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.
>
> @ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?

I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

@andyleiserson
Contributor

Responding to two items simultaneously:

> Also, I think the check_if_oom implementation should be checking that all heaps still have space which is consistent with the implementation in the D3D12 backend.

> I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

I can understand if the intention is to detect that the system has already reached an OOM condition by checking whether any heap's current usage exceeds an OOM threshold. But I don't see why every heap must be able to accommodate the new allocation.

For #9206 (which is not specifically about the writeBuffer test, but I saw numerous failing tests, so it may also be failing) the environment is Linux with Radeon Pro W7600, which has 8 GB VRAM. There is one larger heap comprising most of the VRAM, and one 256 MB heap that is host-visible. The failure was due to exhaustion of the host-visible heap. However, my recollection of the details doesn't make sense, so I'd need to go back and debug more carefully. (My recollection is the attempted allocation was 64 MB, which I connected with the gpu allocator memblock size, but why would that be checked by wgpu-hal's OOM logic?)

Another idea I just had is that possibly we want to set MemoryHints::MemoryUsage for the CTS.

@teoxoy
Member

teoxoy commented Apr 13, 2026

> > I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.
> >
> > @ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?
>
> I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

The test was marked as failing in #8454, but it's not marked as failing in Firefox CI. Can you try to repro it locally on Linux, or do you have a link to the failure in wgpu's CI? (I can't find a previous workflow run in #8454.)

@teoxoy
Member

teoxoy commented Apr 13, 2026

> I can understand if the intention is to detect the system has already reached an OOM condition by checking if the current condition of any heap exceeds an OOM threshold. But I don't see why every heap must be able to accommodate the new allocation?

check_if_oom is only called by lose_if_oom which will cause device loss if we are out of memory. I'm not opposed to making the OOM check more granular for resource creation specifically.
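The distinction drawn here could look roughly like the following sketch. The names `device_is_oom` and `allocation_would_oom` are hypothetical, not wgpu-hal's actual functions: the coarse check backing device loss requires every heap to be exhausted, while a per-allocation check restricts itself to the heaps that allocation can actually use.

```rust
// Hypothetical sketch of the two OOM scopes discussed above; the names
// `device_is_oom` and `allocation_would_oom` are illustrative only.

#[derive(Clone, Copy)]
struct Heap {
    budget: u64,
    usage: u64,
}

fn heap_has_space(h: Heap, size: u64) -> bool {
    h.usage.saturating_add(size) <= h.budget
}

/// Coarse check backing device loss: OOM only if every heap is exhausted.
fn device_is_oom(heaps: &[Heap], size: u64) -> bool {
    heaps.iter().all(|&h| !heap_has_space(h, size))
}

/// Finer per-allocation check: consider only heaps this allocation can use.
fn allocation_would_oom(heaps: &[Heap], usable: &[usize], size: u64) -> bool {
    usable.iter().all(|&i| !heap_has_space(heaps[i], size))
}

fn main() {
    // A full small "host-visible" heap next to an empty large heap.
    let heaps = [
        Heap { budget: 256, usage: 256 },
        Heap { budget: 8192, usage: 0 },
    ];
    // The device as a whole is not out of memory...
    assert!(!device_is_oom(&heaps, 64));
    // ...but an allocation restricted to heap 0 would be.
    assert!(allocation_would_oom(&heaps, &[0], 64));
    println!("ok");
}
```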



Development

Successfully merging this pull request may close these issues.

Vulkan fails validation in webgpu:api,validation,queue,buffer_mapped:writeBuffer:*

4 participants