prov/shm: two performance fixes for the inject path #1
Closed
yinliaws wants to merge 2 commits into zachdworkin:lock from
Conversation
smr_format_inject() calls smr_freestack_get_index() on every inject
send to populate cmd->data.inject_buf_index. That helper computes
index = (pointer - base - entry_base_offset) / fs->object_size
where fs->object_size is a runtime struct field, so the compiler emits
a divq instruction (~20-30 cycles on x86, ~50+ cycles on Arm).
However for the inject pool specifically, object_size is
sizeof(struct smr_inject_buf) == SMR_INJECT_SIZE == 4096, which is a
compile-time constant and a power of two. Inline the computation here
and replace the division with a 1-cycle shift.
perf annotate showed smr_freestack_get_index's divq accounting for
~45% of smr_do_inject samples at 4KB BW on AMD Zen 3. The shift
replacement preserves correctness; no other behavior changes.
Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main):
  Platform  Size  Before  After
  Graviton  2K    -27%    +15%  (+42 pp)
  Graviton  4K    -21%    +7%   (+28 pp)
  P5EN      2K    +7%     +20%  (+13 pp)
  P5EN      4K    +3%     +10%  (+7 pp)
Arm benefits the most because the ARMv8 integer divider is slower than
x86's.
The peer-owned inject pool design requires every inject send to acquire
peer_smr->fs_lock to pop a buffer from the peer's freestack. Under
high send rates (fi_rdm_tagged_bw with window=2000), this lock becomes
the dominant cost for small-to-medium messages (512B-4KB), producing
regressions of -17% to -49% versus libfabric main.
Amortize the lock by caching SMR_INJECT_CACHE_BATCH (32) buffers
per-peer in the sending endpoint. The fast path (cache hit) is a
single array load plus counter decrement with no atomic operations.
The slow path (cache empty) acquires the peer lock once and batch-pops
up to BATCH buffers in a single critical section.
Data structures:
- struct smr_inject_cache: fixed-size array of buffer pointers + count
- struct smr_ep::inject_cache[SMR_MAX_PEERS]: one cache per peer
(placed at end of smr_ep to preserve cacheline layout of hot fields)
On endpoint close, cached buffers are pushed back to their peers'
freestacks under the appropriate lock so nothing leaks.
Memory cost: SMR_INJECT_CACHE_BATCH(32) * SMR_MAX_PEERS(256) * 8
= 64KB per endpoint, plus 256*4 bytes for the count fields.
Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main,
PR ofiwg#12109 baseline shown for comparison):
  Platform  Size  PR ofiwg#12109  This patch
  Graviton  512B  -28%            +23%
  Graviton  1K    -32%            +20%
  Graviton  2K    -27%            +15%
  P4D       256B  -23%            +60%
  P4D       1K    -40%            +9%
  P5EN      256B  -34%            +52%
  P5EN      2K    -39%            +8%
Combined with the previous shift commit, regressions at 256B-2K are
fully recovered across Graviton, AMD, and Intel platforms tested.
Stacked on top of: ofiwg#12109
Summary
PR ofiwg#12109 moved the SHM inject buffer freestack from the sender's region
into the peer's region, protected by a new peer_smr->fs_lock. This was needed
for correctness but caused measurable throughput regressions of -17% to -49%
versus libfabric main at 512 B–4 KB messages on fi_rdm_tagged_bw.

This series adds two focused fixes that recover the regression:

1. prov/shm: replace inject index divq with a shift - replaces a 64-bit
   division in smr_format_inject() with a one-cycle shift. Independent of
   this series; applies to main-branch libfabric too. Largest impact on Arm.

2. prov/shm: add per-peer batched inject buffer cache - amortizes
   peer_smr->fs_lock over batches of 32 buffers. The fast path is lock-free
   (cache array + counter); the slow path takes the lock once and pops
   32 buffers.