prov/shm: two performance fixes for the inject path #1
Closed
yinliaws wants to merge 2 commits into zachdworkin:lock from
Conversation
smr_format_inject() calls smr_freestack_get_index() on every inject
send to populate cmd->data.inject_buf_index. That helper computes
index = (pointer - base - entry_base_offset) / fs->object_size
where fs->object_size is a runtime struct field, so the compiler emits
a divq instruction (~20-30 cycles on x86, ~50+ cycles on Arm).
However for the inject pool specifically, object_size is
sizeof(struct smr_inject_buf) == SMR_INJECT_SIZE == 4096, which is a
compile-time constant and a power of two. Inline the computation here
and replace the division with a 1-cycle shift.
perf annotate showed smr_freestack_get_index's divq accounting for
~45% of smr_do_inject samples at 4KB BW on AMD Zen 3. The shift
replacement preserves correctness; no other behavior changes.
Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main):
  Platform  Size  Before  After
  Graviton  2K    -27%    +15%  (+42 pp)
  Graviton  4K    -21%    +7%   (+28 pp)
  P5EN      2K    +7%     +20%  (+13 pp)
  P5EN      4K    +3%     +10%  (+7 pp)
Arm benefits the most because the ARMv8 integer divider is slower than
x86's.
The peer-owned inject pool design requires every inject send to acquire
peer_smr->fs_lock to pop a buffer from the peer's freestack. Under
high send rates (fi_rdm_tagged_bw with window=2000), this lock becomes
the dominant cost for small-to-medium messages (512B-4KB), producing
regressions of -17% to -49% versus libfabric main.
Amortize the lock by caching SMR_INJECT_CACHE_BATCH (32) buffers
per-peer in the sending endpoint. The fast path (cache hit) is a
single array load plus counter decrement with no atomic operations.
The slow path (cache empty) acquires the peer lock once and batch-pops
up to BATCH buffers in a single critical section.
Data structures:
- struct smr_inject_cache: fixed-size array of buffer pointers + count
- struct smr_ep::inject_cache[SMR_MAX_PEERS]: one cache per peer
(placed at end of smr_ep to preserve cacheline layout of hot fields)
On endpoint close, cached buffers are pushed back to their peers'
freestacks under the appropriate lock so nothing leaks.
Memory cost: SMR_INJECT_CACHE_BATCH(32) * SMR_MAX_PEERS(256) * 8
= 64KB per endpoint, plus 256*4 bytes for the count fields.
Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main,
PR ofiwg#12109 baseline shown for comparison):
  Platform  Size  PR ofiwg#12109  This patch
  Graviton  512B  -28%            +23%
  Graviton  1K    -32%            +20%
  Graviton  2K    -27%            +15%
  P4D       256B  -23%            +60%
  P4D       1K    -40%            +9%
  P5EN      256B  -34%            +52%
  P5EN      2K    -39%            +8%
Combined with the previous shift commit, regressions at 256B-2K are
fully recovered across Graviton, AMD, and Intel platforms tested.
Stacked on top of: ofiwg#12109
Summary
PR ofiwg#12109 moved the SHM inject buffer freestack from the sender's region
into the peer's region, protected by a new peer_smr->fs_lock. This was needed
for correctness but caused measurable throughput regressions of -17% to -49%
versus libfabric main at 512 B–4 KB messages on fi_rdm_tagged_bw.

This series adds two focused fixes that recover the regression:

1. prov/shm: replace inject index divq with a shift - replaces a 64-bit
   division in smr_format_inject() with a one-cycle shift. Independent of
   this series; applies to main-branch libfabric too. Largest impact on Arm.

2. prov/shm: add per-peer batched inject buffer cache - amortizes
   peer_smr->fs_lock over batches of 32 buffers. The fast path is lock-free
   (cache array + counter); the slow path takes the lock once and pops
   32 buffers.