Skip to content

prov/shm: Revert to a lock-unlock inject pool#12109

Open
zachdworkin wants to merge 7 commits intoofiwg:mainfrom
zachdworkin:lock
Open

prov/shm: Revert to a lock-unlock inject pool#12109
zachdworkin wants to merge 7 commits intoofiwg:mainfrom
zachdworkin:lock

Conversation

@zachdworkin
Copy link
Copy Markdown
Contributor

@zachdworkin zachdworkin commented Apr 3, 2026

Remove the parallel command-inject resources and revert to using a lock-unlock inject buffer pool.

Update the inject protocol to use the old method.

There is a performance regression when using the "new shm" command-inject parallel data structure. This is due to the sender not being able to complete its transmission until the receiver returns the sender's command to the sender's return queue. In the old lock-unlock method the sender would allocate receiver side resources, copy its data into the receiver inject buffer and then complete. The old method allows MPIs and applications to assume that their inject message transmissions will complete quickly and since the new method does not complete as quickly it is likely the reason for this regression.

In addition we need to move the inject stack above the cmd stack in the shm region. This is because most of the time the new inject protocol will not need to use the cmd stack. If we swap the ordering in the region then we have to do less jumps through memory to get the resources we need.

We can also use the ofi freestack for the cmd stack because we only need to pop and push on the sender side so we don't need the extra functionality of the smr_freestack.

@zachdworkin
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

1 similar comment
@j-xiong
Copy link
Copy Markdown
Contributor

j-xiong commented Apr 5, 2026

bot:aws:retest

j-xiong
j-xiong previously approved these changes Apr 5, 2026
@zachdworkin
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

@zachdworkin
Copy link
Copy Markdown
Contributor Author

@sunkuamzn what is the AWS failure with this PR?

@shijin-aws
Copy link
Copy Markdown
Contributor

shijin-aws commented Apr 15, 2026

@zachdworkin There is single node unexpected message test timeout


server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.49.36 'timeout 1800 /bin/bash --login -c '"'"' FI_HMEM=system FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr12109-undebug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -f efa -v -S 512 -p efa -E=9234'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.49.36 'timeout 1800 /bin/bash --login -c '"'"' FI_HMEM=system FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr12109-undebug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -f efa -v -S 512 -p efa -E=9234 172.31.49.36'"'"''
client_stdout:

client returncode: 124
server_stdout:

server returncode: 124

Place lighter protocol fields in the same cache-line grab
as the atomic-queue cmd_entry grab. This way we if we are
using a cpu without prefetcing algorithms (to grab the
adjacent cacheline for us) we are optimizing the access
of the fields we need for the lightweight/fast protocols.
The heavier/slower protocols which use the second cache
line fields will be unaffected by this change on older
cpus since they already need to access both cache-lines.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Remove the parallel command-inject resources and revert
to using a lock-unlock inject buffer pool.

Update the inject protocol to use the old method.

There is a performance regression when using the "new shm"
command-inject parallel data structure. This is due to the
sender not being able to complete its transmission until
the receiver returns the sender's command to the sender's
return queue. In the old lock-unlock method the sender
would allocate receiver side resources, copy its data into
the receiver inject buffer and then complete. The old
method allows MPIs and applications to assume that their
inject message transmissions will complete quickly and
since the new method does not complete `as` quickly it is
likely the reason for this regression.

This will also revert to the "old-shm" method of buffering
all unexpected inject messages

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Replace hdr.status with hdr.smr_flags to indicate any
error. This error will use the flag SMR_OP_ERROR for the
sender to process its errors on return cmd.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
@zachdworkin zachdworkin force-pushed the lock branch 2 times, most recently from bf4cfb0 to edd82eb Compare May 7, 2026 19:25
zachdworkin and others added 4 commits May 7, 2026 13:00
Command stack is less likely to be used in the inject
protocol when resources are on the receiver side. If
the inject pool is above it then we have to jump less,
and do not have to jump over the command stack, when
accessing it.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
SAR should never be handling 0 byte copies anymore
since the inject protocol can handle delivery complete.
Instead we will assert to make sure we aren't
accidentally doing a 0-byte copy in SAR.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants