Skip to content

prov/efa: Introduce new protocol interface and implement it for the TX path#12002

Open
sunkuamzn wants to merge 30 commits intoofiwg:mainfrom
sunkuamzn:proto-callback
Open

prov/efa: Introduce new protocol interface and implement it for the TX path#12002
sunkuamzn wants to merge 30 commits intoofiwg:mainfrom
sunkuamzn:proto-callback

Conversation

@sunkuamzn
Copy link
Copy Markdown
Contributor

@sunkuamzn sunkuamzn commented Mar 14, 2026

A series of commits that introduce the new protocol interface for the efa fabric. The TX path for all protocols is implemented. RX, RMA and atomic paths will be implemented in future commits.

@sunkuamzn sunkuamzn requested a review from a team March 14, 2026 01:31
Comment thread prov/efa/src/rdm/efa_rdm_cq.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto.h Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto.h Outdated
int (*send_pkes_posted)(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe);

/* TX utitlities */
int req_pkt_type;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need 4 ints here? Can we put this in 1 int?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not unless we rearrange the pkt_type order - which breaks backwards compatibility. I could remove the tagged ones but I kept them there for completeness.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not have enums fro these? also should they not at least be unsigned?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No and probably. But I can't change them to unsigned in this PR.

Comment thread prov/efa/src/rdm/efa_rdm_proto.h Outdated
* posting a Long CTS RTM pke or to update the number of in flight reads
* and read bytes
*/
int (*send_pkes_posted)(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is confusing to me b/c "Send pkes posted" reads like you are going to re-send (verb) the already posted packet entries. I like something like after_pkes_posted. I don't think send needs to be in the name b/c it is not in the construct_pkes name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change it to after_tx_pkes_posted

Also, construct_tx_pkes sounds better than construct_pkes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

posted_pkes_handle ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked handle_tx_pkes_posted. Let me know what you think

Comment thread prov/efa/src/rdm/efa_rdm_proto.h Outdated
Comment thread prov/efa/src/rdm/efa_rdm_msg.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_msg.c Outdated
}


EFA_DBG(FI_LOG_EP_DATA,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make these debug statements per protocol and have them more detailed about which protocol is used and how many pke's are being sent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logging

uint8_t gen;

/**@brief Callback function called in TX and RX paths */
void (*callback)(struct efa_rdm_pke *pkt_entry);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's please give it a more meaningful name

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not intentionally choosing a meaningless name.. If you have better ideas, please let me know.

Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
#include "efa.h"
#include "efa_rdm_proto.h"

int efa_rdm_proto_select_send_protocol(struct efa_rdm_proto **proto,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proto/protocol - can we be consistent? (I vote for the more verbose option)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

efa_rdm_proto is the prefix. We use prefixes like that through the code. Not sure if it the function names have to match that prefix everywhere

Comment thread prov/efa/src/rdm/efa_rdm_proto.h
Comment thread prov/efa/src/rdm/efa_rdm_proto.h

struct efa_rdm_proto efa_rdm_proto_eager = {
.construct_pkes = &efa_rdm_proto_eager_construct_pkes,
.req_pkt_type = EFA_RDM_EAGER_MSGRTM_PKT,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we create a request packet type struct?

Comment thread prov/efa/src/rdm/efa_rdm_proto_eager.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto_eager.h Outdated

extern struct efa_rdm_proto efa_rdm_proto_eager;

int efa_rdm_proto_eager_construct_pkes(struct efa_rdm_ep *ep,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if these are callbacks, they should be static and local to the source file

Comment thread prov/efa/src/rdm/efa_rdm_proto.c
Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto.c
return err;
use_p2p = err;

for (int i = 0; i < EFA_RDM_MAX_PROTO; ++i) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand this, let's discuss offline please

Comment thread prov/efa/src/rdm/efa_rdm_pke_cmd.c
@@ -268,7 +260,7 @@ void efa_rdm_pke_handle_sent(struct efa_rdm_pke *pkt_entry, int pkt_type, struct
break;
case EFA_RDM_EAGER_MSGRTM_PKT:
case EFA_RDM_EAGER_TAGRTM_PKT:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you not combine these cases now?

Comment thread prov/efa/src/rdm/efa_rdm_pke_rtm.h
Comment thread prov/efa/src/rdm/efa_rdm_proto_eager.c
Comment thread prov/efa/src/rdm/efa_rdm_proto_eager.c Outdated
Comment thread prov/efa/src/rdm/efa_rdm_pke_cmd.c

/* If a protocol is found, use it. Otherwise, fall back to the old code
* path */
if (proto) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move it to a function?

Comment thread prov/efa/test/efa_unit_test_ep.c
@sunkuamzn sunkuamzn changed the title prov/efa: Introduce new protocol interface and implement it for the TX path of the eager protocol prov/efa: Introduce new protocol interface and implement it for the TX path Mar 25, 2026
available_tx_pkts = ep->efa_max_outstanding_tx_ops -
ep->efa_outstanding_tx_ops - ep->efa_rnr_queued_pkt_cnt;

if (OFI_UNLIKELY(available_tx_pkts == 0)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of this check b/c it optimizes an error path, and adds more branching to the normal path. I think a better change would be to split the ope buffpool into a TX buffpool that has a hard limit and an RX buff pool which is unlimited for unexpected messages, and then we can rely on the efa_rdm_ep_alloc_txe() check to fail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the medium and runting protocols require more than 1 available tx pkt. We were previously doing the check twice anyway. After moving the check here, I'm skipping the check for protocols that need 1 tx pkt.

assert(0);
}

dlist_insert_tail(&txe->ep_entry, &ep->txe_list);
Copy link
Copy Markdown
Contributor

@a-szegel a-szegel Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to drop the lock here? If so, can you add it to your commit message?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is actually in #12010

I fixed it there

Add a callback function pointer to the efa_rdm_pke struct. When set, the
CQ send completion handler invokes the callback instead of dispatching
through the pkt_type-based switch statement in
efa_rdm_pke_handle_send_completion.

This allows each protocol to define its own send completion logic
directly on the packet entry, decoupling completion handling from the
centralized switch. The callback is checked conditionally so the old
code path continues to work for protocols that have not yet been
migrated.

The member is has a generic name "callback" because the same callback
will be used for two different purposes in the RX path in the future.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto.h defining the efa_rdm_proto struct which serves as
the interface for all EFA RDM protocols. The interface includes:

- can_use_protocol_for_send: determines if this protocol can handle a
  given send operation based on message size, peer capabilities, etc.
- construct_tx_pkes: builds the TX packet entries for the operation
- handle_tx_pkes_posted: post-send hook for protocol-specific
  bookkeeping
- Packet type fields for msg/tagged/DC variants

Add efa_rdm_proto.c with the protocol registry array and the
efa_rdm_proto_send_pkes_posted_no_op default implementation.

The RX path will be introduced in later commits.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_eager.c and efa_rdm_proto_eager.h defining the eager
protocol using the efa_rdm_proto interface. The eager protocol sends
messages that fit within a single MTU-sized packet.

This commit adds the protocol struct with packet type definitions and a
can_use_for_send function that checks if the message fits in a single
packet after accounting for header size. The construct_tx_pkes and send
completion callbacks are placeholders that will be implemented in
subsequent commits.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_select_send_protocol() which iterates through the
registered protocols in priority order and selects the first one whose
can_use_protocol_for_send() returns true. The function precomputes
header flags and tagged/DC packet type variants once before the loop to
avoid redundant computation across protocols.

The function also initializes TXE fields (iov, desc, mr) from the
fi_msg before protocol selection so that efa_rdm_ope_try_fill_desc can
attempt internal memory registration for non-eager protocols.

For now only the eager protocol is registered. Messages that do not fit
eager fall through and return proto=NULL, signaling the caller to use
the old code path.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM packet construction logic from efa_rdm_pke_rtm.c
into efa_rdm_proto_eager.c, implementing the construct_tx_pkes callback.

The function initializes the TXE via efa_rdm_proto_txe_fill, allocates a
single TX packet entry, initializes the RTM header with the message
payload, and sets the per-packet send completion callback. Both regular
and delivery-complete (DC) eager packets are supported, with DC packets
using a separate callback and header struct that includes the send_id.

Remove the old efa_rdm_pke_init_eager_msgrtm/tagrtm functions and their
DC variants from efa_rdm_pke_rtm.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM send completion handling from efa_rdm_pke_rtm.c into
efa_rdm_proto_eager.c. Two callbacks are implemented:

- Regular eager: verifies payload size matches total length, calls
  efa_rdm_ope_handle_send_completed to report the CQ completion and
  release the TXE, then releases the TX packet entry.
- DC eager: checks efa_rdm_txe_dc_ready_for_release which requires both
  efa_outstanding_tx_ops == 0 and receipt received. Only releases the
  TXE when both conditions are met.

Replace the old eager case in efa_rdm_pke_handle_send_completion with an
assertion to catch any packets that incorrectly bypass the new callback.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Wire the new protocol interface into efa_rdm_msg_generic_send. The
function now calls efa_rdm_proto_select_send_protocol first. If a
protocol is selected, it calls construct_tx_pkes to build the packets,
sends them via efa_rdm_pke_sendv, and calls handle_tx_pkes_posted.

If no protocol is selected (proto == NULL), the function falls back to
the existing code path using efa_rdm_txe_construct and
efa_rdm_msg_post_rtm. This allows incremental migration: as each
protocol is added to the registry, it will be picked up automatically
while unimplemented protocols continue to use the old path.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_txe_fill() which initializes a TXE for use by the new
protocol interface. The logic is derived from efa_rdm_txe_construct but
differs in that it does not set the mr, desc, or total_len fields.
These fields are already populated earlier in
efa_rdm_proto_select_send_protocol where they are needed for memory
registration via efa_rdm_ope_try_fill_desc.

This function is called by each protocol's construct_tx_pkes
implementation to set up the remaining TXE fields (op, peer, flags,
iov, cq_entry, etc.) before building the packet entries.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_medium.c and efa_rdm_proto_medium.h implementing the
medium message protocol. This protocol handles messages too large for a
single eager packet but within the medium threshold (64KB by default for
system memory).

The construct_tx_pkes function splits message data across multiple
packet entries, balancing data sizes for performance and respecting
memory alignment requirements. Each packet carries a segment offset and
the total message length in its header so the receiver can reassemble
the message.

For DC medium packets, the header layout differs from non-DC (send_id
and padding precede msg_length), so header fields are set using the
appropriate struct for each variant.

The old medium RTM init functions are removed from efa_rdm_pke_rtm.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add a human-readable name string to the efa_rdm_proto struct. The name
is set by each protocol (e.g. "EAGER", "MEDIUM", "LONGCTS") and is
used in log messages emitted during protocol selection to aid debugging.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
@sunkuamzn sunkuamzn force-pushed the proto-callback branch 2 times, most recently from c5449b8 to 109f4cb Compare April 11, 2026 02:13
Add efa_rdm_proto_longcts.c and efa_rdm_proto_longcts.h implementing
the long CTS (Credit-based Transfer with Send) protocol. This protocol
handles messages larger than the medium threshold when RDMA read is not
available or the sender's memory is not registered.

The sender transmits the first data chunk along with the RTM header,
then waits for a CTS (Clear To Send) packet from the receiver before
sending the remaining data in CTSDATA packets. The send completion
callback tracks bytes_acked and only reports completion when all data
has been acknowledged. For DC long CTS, the TXE is held until both all
send completions and the receipt packet have been received.

The old long CTS RTM init functions are removed from efa_rdm_pke_rtm.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add peer and use_p2p parameters to the can_use_protocol_for_send
callback. The long read and runt read protocols need to check peer
capabilities (RDMA read interop) and whether peer-to-peer memory access
is available when deciding if they can handle a send operation.

Update all existing protocol implementations (eager, medium, long CTS)
to accept the new parameters.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_longread.c and efa_rdm_proto_longread.h implementing
the long read message protocol. Instead of sending the message data, the
sender transmits a single RTM packet containing the memory keys and
addresses of its registered buffers, and the receiver performs RDMA
reads to fetch the data directly.

The can_use_for_send function requires p2p availability, registered
memory descriptors, the message meeting the minimum read size threshold,
and peer RDMA read interop support. The send completion callback simply
releases the TX packet entry since the data transfer is driven by the
receiver. The handle_tx_pkes_posted callback increments the domain's
read message in-flight counter.

The old long read RTM init functions are removed from efa_rdm_pke_rtm.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_runtread.c and efa_rdm_proto_runtread.h implementing
the runt read protocol. This protocol combines an initial runt data
transfer (sent with the RTM header, similar to medium) with RDMA
reads for the remaining data. It is selected when the message is large
enough for read-based transfer, p2p is available, and the peer supports
RDMA read.

The construct_tx_pkes function splits the runt portion across multiple
packets while including the RDMA read IOVs in the packet header
so the receiver can fetch the remaining data via RDMA read.

The old runt read RTM init functions and the
efa_rdm_txe_prepare_to_be_read helper are removed from
efa_rdm_pke_rtm.c and efa_rdm_ope.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
All RTM packet types (eager, medium, long CTS, long read, runt read)
have been migrated to the new protocol interface where each protocol
constructs its packets directly. The shared
efa_rdm_pke_init_rtm_with_payload helper is no longer called by any code
path and can be safely removed.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Now that all standard RTM protocols (eager, medium, long CTS, long read,
and runt read) have been migrated to the new protocol interface, the old
code path in efa_rdm_msg_generic_send is only needed for zero-copy sends
which bypass protocol selection entirely.

Remove the old RTM selection logic (efa_rdm_msg_select_rtm), the
efa_rdm_msg_post_rtm function, and the associated TXE construction and
handshake enforcement code from the fallback path. The remaining path
handles only the zero-copy case where data is sent directly to the
peer's user receive queue without protocol headers.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
The new protocol selection logic in efa_rdm_proto_select_send_protocol
attempts memory registration early and selects a non-read-based protocol
if registration fails. This eliminates the need for the runtime fallback
from read-based to CTS-based protocols that
efa_rdm_ope_post_send_fallback provided.

Remove the function and its declaration.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Some unit tests call efa_rdm_pke_handle_send_completion directly on
eager packet entries, which now triggers an assertion since eager
packets
have been migrated to the callback path.

Add efa_unit_test_pke_handle_send_completion utility in efa_unit_tests.h
that checks for a callback first and falls through to the old handler if
none is set. Update the RNR and DC OPE tests to use this utility. Add a
TODO to remove the utility once all protocols are migrated.

Also set the DC eager callback on manually constructed packet entries in
the DC TXE release tests.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Remove test_efa_rdm_ep_dc_send_queue_before_handshake and
test_efa_rdm_ep_dc_send_queue_limit_before_handshake. The new protocol
code path does not queue DC sends pending handshake completion because
the refactored path targets backwards compatibility only to v2.0, and
the delivery complete feature was introduced in v1.12. All peers at v2.0
or later are guaranteed to support DC.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Rewrite test_efa_rdm_pke_handle_longcts_rtm_send_completion to use the
new protocol interface. The test verifies that
efa_rdm_proto_longcts_handle_rtm_send_completion returns early for
zero-payload READ NACK packets without touching the TXE or writing a
CQ completion.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_unit_test_proto.c with tests verifying protocol selection:
- Eager for small/zero-length messages
- Medium for mid-size messages (16KB)
- Long CTS as fallback for large messages without p2p
- Long read over long CTS when p2p and RDMA read available
- Runt read over long read when runt conditions met
- Priority ordering (eager before medium)

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append eager protocol tests to efa_unit_test_proto.c:
- test_proto_eager_construct_pkes_single_pke: verifies 1 PKE with callback
- test_proto_eager_send_completion_releases_txe: verifies TXE release
- test_proto_eager_assigns_msg_id: verifies msg_id from peer->next_msg_id

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append medium protocol tests to efa_unit_test_proto.c:
- test_proto_medium_construct_pkes_multi_pke: verifies multi-PKE for 16KB
- test_proto_medium_send_completion_tracks_bytes_acked: verifies TXE
  release only after all PKEs complete

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append long CTS test to efa_unit_test_proto.c:
- test_proto_longcts_construct_pkes_single_rtm: verifies exactly 1 RTM
  packet with callback and TXE correctly set

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append long read test to efa_unit_test_proto.c:
- test_proto_longread_construct_pkes_has_read_iovs: verifies 1 packet
  with pkt_size including RDMA read IOV entries

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append runt read test to efa_unit_test_proto.c:
- test_proto_runtread_construct_pkes_has_runt_and_read_iovs: verifies
  multiple PKEs with callbacks, and bytes_runt set between 0 and total_len

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants