Track the dirty status of individual elements in `AtomicSparseBufferVec`. #24078

Open
pcwalton wants to merge 2 commits into bevyengine:main from pcwalton:per-element-sparse-buffers

@pcwalton
Contributor

@pcwalton pcwalton commented May 2, 2026

Today, `AtomicSparseBufferVec` tracks the dirty status of individual *pages* of elements and performs a sparse upload when fewer than 15% of the pages have been modified. (The default page size is 256 elements.) It tracks pages rather than individual elements because frequently changed elements were assumed to cluster together, leading to low fragmentation. Unfortunately, this assumption has turned out to be false in practice. We extract meshes from the main world in parallel, so mesh instances end up scattered throughout the `MeshInputUniform` buffer as the various extraction threads send the meshes they extract over a shared channel. Because of this, real-world workloads tend to dirty a disproportionately large number of pages, even if they modify only a few mesh instances. The end result is that we rarely perform sparse uploads unless no mesh instances have been updated at all, largely defeating the purpose of `AtomicSparseBufferVec`.
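To make the fragmentation problem concrete, here is a minimal sketch of the page-level heuristic described above. The function names and the exact form of the threshold check are assumptions for illustration, not the code in `AtomicSparseBufferVec`; only the 256-element page size and the 15% cutoff come from the description.

```rust
use std::collections::BTreeSet;

/// Page size quoted in the description above.
const PAGE_SIZE: usize = 256;

/// Counts the distinct pages touched by a set of modified element indices.
/// (Hypothetical helper, for illustration only.)
fn dirty_page_count(modified: &[usize]) -> usize {
    modified
        .iter()
        .map(|&index| index / PAGE_SIZE)
        .collect::<BTreeSet<_>>()
        .len()
}

/// Assumed form of the page-level heuristic: upload sparsely only when
/// fewer than 15% of the pages are dirty.
fn page_level_sparse_upload(modified: &[usize], element_count: usize) -> bool {
    let total_pages = element_count.div_ceil(PAGE_SIZE);
    // dirty / total < 0.15, kept in integer arithmetic.
    dirty_page_count(modified) * 100 < total_pages * 15
}

fn main() {
    let element_count = PAGE_SIZE * 100; // 100 pages, 25,600 elements

    // 100 modified elements packed together: they all land in page 0.
    let clustered: Vec<usize> = (0..100).collect();
    // 100 modified elements, one per page: every page becomes dirty.
    let scattered: Vec<usize> = (0..100).map(|page| page * PAGE_SIZE).collect();

    for (name, modified) in [("clustered", &clustered), ("scattered", &scattered)] {
        println!(
            "{name}: {} dirty pages, sparse upload = {}",
            dirty_page_count(modified),
            page_level_sparse_upload(modified, element_count)
        );
    }
}
```

In both cases fewer than 0.4% of the elements change, but the scattered case dirties every page, so a page-level check falls back to a full upload.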

This patch fixes the issue by tracking the dirty status of individual elements, not just of individual pages. For efficiency, we now use a two-level atomically-updated bit vector to track the dirty status of elements. The lower level, `dirty_bits`, is a flat list of bits, one per element, grouped into 64-bit words: 0 for "not modified" and 1 for "modified". The higher level, `summary`, contains one bit for each 64-bit word in the lower level: 0 if no elements in that word have been modified and 1 if at least one has. (In other words, each bit in `summary` is the logical *or* of every bit in the corresponding word of `dirty_bits`.) When searching for modified elements to upload sparsely, we use bit manipulation instructions on the summary words to skip up to 64 words in `dirty_bits` (i.e. 64² = 4096 elements) at a time.
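The two-level structure can be sketched roughly as follows. This is an illustrative reconstruction, not the code in this PR: the `DirtyBits` type and its methods are hypothetical, only the `dirty_bits` and `summary` field names come from the description, and `Ordering::Relaxed` is a simplification chosen for the sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative two-level dirty tracker (hypothetical type).
struct DirtyBits {
    /// One bit per element, grouped into 64-bit words.
    dirty_bits: Vec<AtomicU64>,
    /// One bit per word of `dirty_bits`.
    summary: Vec<AtomicU64>,
}

impl DirtyBits {
    fn new(element_count: usize) -> Self {
        let words = element_count.div_ceil(64);
        let summary_words = words.div_ceil(64);
        Self {
            dirty_bits: (0..words).map(|_| AtomicU64::new(0)).collect(),
            summary: (0..summary_words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Marks one element as modified. Takes `&self`, so it can be called
    /// from several threads at once.
    fn mark_dirty(&self, index: usize) {
        let word = index / 64;
        self.dirty_bits[word].fetch_or(1u64 << (index % 64), Ordering::Relaxed);
        self.summary[word / 64].fetch_or(1u64 << (word % 64), Ordering::Relaxed);
    }

    /// Collects the indices of all dirty elements in ascending order.
    fn dirty_indices(&self) -> Vec<usize> {
        let mut out = Vec::new();
        for (summary_index, summary_word) in self.summary.iter().enumerate() {
            let mut summary_bits = summary_word.load(Ordering::Relaxed);
            // A zero summary word skips 64 words (4096 elements) at once.
            while summary_bits != 0 {
                let word = summary_index * 64 + summary_bits.trailing_zeros() as usize;
                summary_bits &= summary_bits - 1; // clear the lowest set bit
                let mut bits = self.dirty_bits[word].load(Ordering::Relaxed);
                while bits != 0 {
                    out.push(word * 64 + bits.trailing_zeros() as usize);
                    bits &= bits - 1;
                }
            }
        }
        out
    }
}

fn main() {
    let tracker = DirtyBits::new(20_000);
    for index in [10_000, 5, 4096, 64, 4095] {
        tracker.mark_dirty(index);
    }
    println!("{:?}", tracker.dirty_indices()); // prints [5, 64, 4095, 4096, 10000]
}
```

The scan only loads a `dirty_bits` word when its summary bit is set, so the cost of finding modified elements scales with the number of dirty words rather than with the size of the buffer.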

Because the bit manipulation that this PR performs is tricky, it's factored out into separate functions that are individually tested via `proptest` randomized testing. This caught several bugs, some of which I believe are also present in the existing code. Testing also verified that sparse buffer uploads are memory-bound, as expected, and that the dirty bit tracking has little overhead in practice.
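The general shape of this kind of test is to compare a bit-trick helper against an obviously correct reference on many random inputs. The sketch below uses a hand-rolled xorshift generator as a dependency-free stand-in for `proptest`, and the helper names are hypothetical; it is not one of the tests in this PR.

```rust
/// Bit-trick version: repeatedly take and clear the lowest set bit.
fn set_bits_fast(mut word: u64) -> Vec<u32> {
    let mut out = Vec::new();
    while word != 0 {
        out.push(word.trailing_zeros());
        word &= word - 1;
    }
    out
}

/// Obviously correct reference: test each of the 64 bits in turn.
fn set_bits_naive(word: u64) -> Vec<u32> {
    (0..64).filter(|&bit| (word >> bit) & 1 == 1).collect()
}

/// Tiny xorshift64 generator, standing in for proptest's input generation.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    // Edge cases first, then a batch of pseudo-random words.
    for word in [0, 1, u64::MAX, 1 << 63] {
        assert_eq!(set_bits_fast(word), set_bits_naive(word));
    }
    let mut state = 0x9E37_79B9_7F4A_7C15_u64;
    for _ in 0..10_000 {
        let word = xorshift64(&mut state);
        assert_eq!(set_bits_fast(word), set_bits_naive(word));
    }
    println!("all cases agree");
}
```

A property-testing framework adds input shrinking and failure persistence on top of this pattern, which a hand-rolled loop does not provide.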

The motivation for this PR was the discovery that `bevy_city` wasn't performing sparse uploads. Unfortunately, even with this patch, `bevy_city` still doesn't perform sparse uploads, because moving cars make up approximately 18% of the total mesh instances, which exceeds the 15% threshold, and so sparse uploads aren't useful. I believe that `bevy_city` should be changed to increase the ratio of static buildings to cars in order to represent a more realistic workload. Once that's done, this patch should help `bevy_city` scale, especially once transforms receive their own buffer.

@pcwalton pcwalton added the A-Rendering (Drawing game state to the screen) label May 2, 2026
@github-project-automation github-project-automation Bot moved this to Needs SME Triage in Rendering May 2, 2026
@pcwalton pcwalton added the C-Performance (A change motivated by improving speed, memory usage or compile times), C-Bug (An unexpected or incorrect behavior), and S-Needs-Review (Needs reviewer attention (from anyone!) to move forward) labels May 2, 2026
@cart cart closed this May 5, 2026
@github-project-automation github-project-automation Bot moved this from Needs SME Triage to Done in Rendering May 5, 2026
@cart cart reopened this May 5, 2026
@github-project-automation github-project-automation Bot moved this from Done to Needs SME Triage in Rendering May 5, 2026
Labels

A-Rendering: Drawing game state to the screen
C-Bug: An unexpected or incorrect behavior
C-Performance: A change motivated by improving speed, memory usage or compile times
S-Needs-Review: Needs reviewer attention (from anyone!) to move forward

Projects

Status: Needs SME Triage

2 participants