Suballocate DX12 buffer creation #3163
Nice to see these changes @Elabajaba. If there are any features that wgpu would like to see in our allocator, please just file an issue on the repo.

Blocked on #3207 until Mozilla gets around to vendoring windows-rs.
Codecov Report
@@ Coverage Diff @@
## master #3163 +/- ##
==========================================
+ Coverage 64.30% 64.36% +0.05%
==========================================
Files 83 85 +2
Lines 42270 42397 +127
==========================================
+ Hits 27181 27287 +106
- Misses 15089 15110 +21
Just stumbling upon this PR, I'll see to accelerating Traverse-Research/gpu-allocator#138 so that you're unblocked in that regard!

As #3207 is basically perma-blocked until further notice, I think we should work around the situation by having a feature flag and falling back to the old behavior when it's disabled. This will let us continue to innovate, and also not force the issue with moz.
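A minimal sketch of the fallback idea, assuming a Cargo feature named `windows_rs` (the name used later in this PR). The `AllocationPath` enum and `create_buffer_path` function are hypothetical names for illustration only and are not wgpu's actual API:

```rust
/// Which strategy a buffer would be created with (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocationPath {
    /// New behavior: placed resource inside a shared heap.
    Suballocated,
    /// Old behavior: one committed resource per buffer.
    Committed,
}

/// With the feature enabled, take the suballocating path.
#[cfg(feature = "windows_rs")]
pub fn create_buffer_path() -> AllocationPath {
    AllocationPath::Suballocated
}

/// With the feature disabled, fall back to the old behavior.
#[cfg(not(feature = "windows_rs"))]
pub fn create_buffer_path() -> AllocationPath {
    AllocationPath::Committed
}
```

Callers use `create_buffer_path()` without caring which variant was compiled in, so the old path keeps working for builds that cannot enable the feature.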
…and which is the slow path
cwfitzgerald left a comment
Thank you so much for all this work! Looks great!
Checklist
[ ] ~~Blocked on Migrate to Windows-rs from winapi #3207~~ Worked around by feature gating it behind the windows_rs feature
cargo clippy.
presser Traverse-Research/gpu-allocator#138 lands, and see if migrating to presser would be needed
Connections
closes #2720
Description
DX12 is currently quite slow in wgpu. This PR uses gpu-allocator to batch allocations together into heaps, and creates buffers and textures with CreatePlacedResource instead of CreateCommittedResource. This leads to large performance gains: ~30-50% in "normal" scenarios, and significantly larger gains in write_buffer-heavy scenarios (~250x+ in an unrealistic scenario that calls write_buffer 1000x in a loop, going from ~1fps to ~250fps). In my testing there were no performance decreases.
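To illustrate why batching helps, here is a simplified sketch of suballocation: many small requests are placed at offsets inside a few large heaps instead of each getting its own allocation. This is not gpu-allocator's implementation. It is a bump allocator with no freeing, and `Suballocator`, `Heap`, and `Placement` are hypothetical names:

```rust
/// Where a resource would be placed: which heap, and at what byte offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Placement {
    pub heap_index: usize,
    pub offset: u64,
}

/// One large block of memory; `cursor` is the first unused byte.
struct Heap {
    cursor: u64,
}

/// Bump-style suballocator: fills heaps front to back and never frees.
pub struct Suballocator {
    heap_size: u64,
    heaps: Vec<Heap>,
}

impl Suballocator {
    pub fn new(heap_size: u64) -> Self {
        Self { heap_size, heaps: Vec::new() }
    }

    /// Number of heaps created so far (stands in for expensive allocations).
    pub fn heap_count(&self) -> usize {
        self.heaps.len()
    }

    /// Returns a placement, or `None` if the request cannot fit in a heap.
    /// (A real allocator would fall back to a dedicated allocation.)
    pub fn allocate(&mut self, size: u64, align: u64) -> Option<Placement> {
        let heap_size = self.heap_size;
        if size == 0 || size > heap_size || !align.is_power_of_two() {
            return None;
        }
        // Try to place the request in an existing heap first.
        for (heap_index, heap) in self.heaps.iter_mut().enumerate() {
            // Round the cursor up to the requested alignment.
            let offset = (heap.cursor + align - 1) & !(align - 1);
            if offset + size <= heap_size {
                heap.cursor = offset + size;
                return Some(Placement { heap_index, offset });
            }
        }
        // No room anywhere: create a new heap and place at its start.
        self.heaps.push(Heap { cursor: size });
        Some(Placement { heap_index: self.heaps.len() - 1, offset: 0 })
    }
}
```

With a 1 MiB heap, a thousand 256-byte requests land in a single heap, so only one large allocation is made rather than a thousand small ones.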
Testing
Tested the examples, ran cargo test, backported it to 0.14 and tested against bevy+bistro, and tested against a modified water example that loops the render write_buffer call 1000x on the main thread, 500x each on 2 scoped threads, or 100x each on 10 scoped threads, to make sure multithreading wouldn't panic.
It was quite a bit faster in all of these scenarios, except for bevy+bistro at 4k, where it was heavily GPU-limited and ran about the same.
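The shape of the multithreaded stress test can be sketched with `std::thread::scope`. This is not the modified water example itself: `MockQueue` is a hypothetical stand-in for the real queue that only counts calls, and `run_split_writes` is an assumed helper name:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical stand-in for a GPU queue; it only counts write calls.
pub struct MockQueue {
    writes: AtomicU64,
}

impl MockQueue {
    pub fn new() -> Self {
        Self { writes: AtomicU64::new(0) }
    }

    /// Pretend to upload `_data`; safe to call from several threads.
    pub fn write_buffer(&self, _data: &[u8]) {
        self.writes.fetch_add(1, Ordering::Relaxed);
    }

    pub fn write_count(&self) -> u64 {
        self.writes.load(Ordering::Relaxed)
    }
}

/// Issue `writes_per_thread` writes on each of `threads` scoped threads.
/// The scope joins every thread before returning and propagates panics.
pub fn run_split_writes(queue: &MockQueue, threads: usize, writes_per_thread: usize) {
    let data = [0u8; 64];
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..writes_per_thread {
                    queue.write_buffer(&data);
                }
            });
        }
    });
}
```

Calling it with (1, 1000), (2, 500), and (10, 100) reproduces the three splits described above, each issuing 1000 writes in total.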
Potential Future Improvements
ZwAllocateLocallyUniqueId taking almost 40% of the time in the 1000x looped write_buffer test