
Fix OOM validation on vulkan #9359

Open
kristoff3r wants to merge 1 commit into gfx-rs:trunk from kristoff3r:ks/vulkan-oom

Conversation

@kristoff3r
Contributor

Connections
Fixes #8479

Description
I investigated the issue above and found a few problems with the Vulkan OOM validation:

  • It would eagerly fail if any matching heap was out of memory, but on some systems there are multiple usable heaps, so it should really check whether all matching heaps are out.
  • The checking logic was duplicated between check_if_oom and error_if_would_oom_on_resource_allocation, and both of them had the issue above. On top of that, error_if_would_oom_on_resource_allocation would ignore heaps with vk::MemoryPropertyFlags::LAZILY_ALLOCATED | vk::MemoryPropertyFlags::PROTECTED while check_if_oom would not, which seems wrong, though I'm not sure.

To fix this, I moved the shared logic into a helper function and fixed the issue there. I also simplified the code a bit.
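The corrected check described above can be sketched roughly as follows. This is an illustrative simplification, not wgpu-hal's actual code: `HeapInfo` and `would_oom` are hypothetical names, and real budget/usage numbers would come from the driver (e.g. via VK_EXT_memory_budget).

```rust
// Hypothetical simplification of the fix: declare OOM only when *all*
// matching heaps lack space, instead of failing as soon as one does.
// `HeapInfo` and `would_oom` are illustrative names, not wgpu-hal's API.

#[derive(Clone, Copy)]
struct HeapInfo {
    budget: u64, // bytes the driver allows us to use on this heap
    usage: u64,  // bytes currently in use on this heap
}

/// Returns true only if none of the candidate heaps can fit `size` bytes.
fn would_oom(matching_heaps: &[HeapInfo], size: u64) -> bool {
    matching_heaps
        .iter()
        .all(|heap| heap.usage.saturating_add(size) > heap.budget)
}

fn main() {
    // One nearly-full heap plus one with plenty of room: the old eager
    // check would have reported OOM here; the all-heaps check does not.
    let heaps = [
        HeapInfo { budget: 256, usage: 250 },
        HeapInfo { budget: 8192, usage: 100 },
    ];
    assert!(!would_oom(&heaps, 64));

    // Only once every matching heap is exhausted do we report OOM.
    assert!(would_oom(&[HeapInfo { budget: 256, usage: 256 }], 1));
    println!("ok");
}
```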

Testing
The CTS tests pass now, at least on my machine.

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@kristoff3r changed the title from "Fixes OOM validation on vulkan" to "Fix OOM validation on vulkan" Apr 2, 2026
ErichDonGubler previously approved these changes Apr 3, 2026
Member

@ErichDonGubler left a comment


I've reviewed everything here, and the code itself looks good. So long as the design is what we want, I think we're good to merge. I wanted to confirm with folks who know more (CC @teoxoy and @jimblandy): Is this design right? It makes sense to me that, if there are multiple heaps, and we multiplex between them to avoid OOMs anyway, then we should ensure that no heaps could accept an allocation before declaring an OOM.

@andyleiserson
Contributor

I'd also like @teoxoy to take a look. If checking for space on any heap is an empirical improvement, then maybe it's better, but it seems like what we really need here is something that is smarter about figuring out which heap is actually needed, since they may not be interchangeable. Or a different approach would be to try and guess which heap is the "primary" heap, and which allocations are "normal" allocations, and only apply the OOM-prevention to those rather than to everything.

#9206 I think could be duped to #8479, and https://bugzilla.mozilla.org/show_bug.cgi?id=2028252 is another issue that may be relevant.

@kristoff3r
Contributor Author

> I'd also like @teoxoy to take a look. If checking for space on any heap is an empirical improvement, then maybe it's better, but it seems like what we really need here is something that is smarter about figuring out which heap is actually needed, since they may not be interchangeable. Or a different approach would be to try and guess which heap is the "primary" heap, and which allocations are "normal" allocations, and only apply the OOM-prevention to those rather than to everything.
>
> #9206 I think could be duped to #8479, and https://bugzilla.mozilla.org/show_bug.cgi?id=2028252 is another issue that may be relevant.

That makes sense. I think the current behavior is wrong, but it also feels brittle to hard-code which heaps gpu-allocator can use, and I don't know what happens in more complicated setups like those with multiple GPUs. It feels like a check like this should be conservative; isn't it better to hit the OOM behavior in some cases than for wgpu to incorrectly deny the allocation?

If you want a different strategy I can try to rework my PR.

@ErichDonGubler
Member

One of the deleted comments indicates that the reason a definitive OOM diagnosis wasn't possible was because gpu-alloc didn't afford a way to check for OOM itself. Perhaps adding upstream support for this is the better solution?

@teoxoy
Member

teoxoy commented Apr 7, 2026

The problem with this approach is that an OOM situation might not be reported if multiple memory types with compatible flags exist because gpu-allocator always picks the first matching memory type (see its find_memorytype_index function).

Also, I think the check_if_oom implementation should be checking that all heaps still have space which is consistent with the implementation in the D3D12 backend.
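The first-match behavior described here can be illustrated with a small sketch. This is not gpu-allocator's real code; the struct, flag constant, and values below are made-up stand-ins used only to show why free space on a later matching heap never helps.

```rust
// Illustrative sketch of first-match memory-type selection, analogous to
// gpu-allocator's find_memorytype_index. The types and flags here are
// made-up stand-ins, not the real Vulkan or gpu-allocator definitions.

#[derive(Clone, Copy)]
struct MemoryType {
    property_flags: u32,
    heap_index: usize,
}

/// Picks the index of the *first* memory type satisfying `required`.
fn find_memorytype_index(types: &[MemoryType], required: u32) -> Option<usize> {
    types
        .iter()
        .position(|t| t.property_flags & required == required)
}

fn main() {
    const HOST_VISIBLE: u32 = 0b01;
    let types = [
        MemoryType { property_flags: HOST_VISIBLE, heap_index: 0 }, // small heap, full
        MemoryType { property_flags: HOST_VISIBLE, heap_index: 1 }, // large heap, empty
    ];
    // Both types match, but only the first is ever chosen: an OOM check
    // that accepts free space on heap 1 would pass, yet the allocation
    // still lands on (and fails in) heap 0.
    let chosen = find_memorytype_index(&types, HOST_VISIBLE).unwrap();
    assert_eq!(types[chosen].heap_index, 0);
    println!("ok");
}
```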

@teoxoy
Member

teoxoy commented Apr 7, 2026

I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.

@ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?

@teoxoy self-assigned this Apr 7, 2026
@ErichDonGubler dismissed their stale review April 7, 2026 22:21

Teo has feedback.

@ErichDonGubler
Member

> I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.
>
> @ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?

I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

@andyleiserson
Contributor

Responding to two items simultaneously:

> Also, I think the check_if_oom implementation should be checking that all heaps still have space which is consistent with the implementation in the D3D12 backend.

> I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

I can understand if the intention is to detect that the system has already reached an OOM condition by checking whether any heap's current usage exceeds an OOM threshold. But I don't see why every heap must be able to accommodate the new allocation.

For #9206 (which is not specifically about the writeBuffer test, but I saw numerous failing tests, so it may also be failing) the environment is Linux with Radeon Pro W7600, which has 8 GB VRAM. There is one larger heap comprising most of the VRAM, and one 256 MB heap that is host-visible. The failure was due to exhaustion of the host-visible heap. However, my recollection of the details doesn't make sense, so I'd need to go back and debug more carefully. (My recollection is the attempted allocation was 64 MB, which I connected with the gpu allocator memblock size, but why would that be checked by wgpu-hal's OOM logic?)

Another idea I just had is that possibly we want to set MemoryHints::MemoryUsage for the CTS.

@teoxoy
Member

teoxoy commented Apr 13, 2026

> > I also don't see how this PR would fix webgpu:api,validation,queue,buffer_mapped:writeBuffer:* (#8479). The buffer created by the CTS test is only 8 bytes. Locally the test passes for me even without this PR.
> >
> > @ErichDonGubler what is the error you encounter locally for that test? Could you leave a comment in #8479 with it?
>
> I don't encounter an error on my M1 MacBook Pro, before and after the patch. Should I try on another platform?

The test was marked as failing in #8454, but it's not marked as failing in Firefox CI. Can you try to repro it locally on Linux, or do you have a link to the failure in wgpu's CI? (I can't find a previous workflow run in #8454.)

@teoxoy
Member

teoxoy commented Apr 13, 2026

> I can understand if the intention is to detect the system has already reached an OOM condition by checking if the current condition of any heap exceeds an OOM threshold. But I don't see why every heap must be able to accommodate the new allocation?

check_if_oom is only called by lose_if_oom which will cause device loss if we are out of memory. I'm not opposed to making the OOM check more granular for resource creation specifically.
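The distinction drawn here could look roughly like the following sketch. The names `device_is_oom` and `allocation_would_oom` are hypothetical, not wgpu-hal's actual functions: the coarse check backing device loss requires every heap to be exhausted, while a per-allocation check restricts itself to the heaps that allocation can actually use.

```rust
// Hypothetical sketch of the two OOM scopes discussed above; the names
// `device_is_oom` and `allocation_would_oom` are illustrative only.

#[derive(Clone, Copy)]
struct Heap {
    budget: u64,
    usage: u64,
}

fn heap_has_space(h: Heap, size: u64) -> bool {
    h.usage.saturating_add(size) <= h.budget
}

/// Coarse check backing device loss: OOM only if every heap is exhausted.
fn device_is_oom(heaps: &[Heap], size: u64) -> bool {
    heaps.iter().all(|&h| !heap_has_space(h, size))
}

/// Finer per-allocation check: consider only heaps this allocation can use.
fn allocation_would_oom(heaps: &[Heap], usable: &[usize], size: u64) -> bool {
    usable.iter().all(|&i| !heap_has_space(heaps[i], size))
}

fn main() {
    // A full small "host-visible" heap next to an empty large heap.
    let heaps = [
        Heap { budget: 256, usage: 256 },
        Heap { budget: 8192, usage: 0 },
    ];
    // The device as a whole is not out of memory...
    assert!(!device_is_oom(&heaps, 64));
    // ...but an allocation restricted to heap 0 would be.
    assert!(allocation_would_oom(&heaps, &[0], 64));
    println!("ok");
}
```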



Development

Successfully merging this pull request may close these issues.

Vulkan fails validation in webgpu:api,validation,queue,buffer_mapped:writeBuffer:*

4 participants