prov/efa: Introduce new protocol interface and implement it for the TX path#12002
prov/efa: Introduce new protocol interface and implement it for the TX path#12002sunkuamzn wants to merge 30 commits intoofiwg:mainfrom
Conversation
| int (*send_pkes_posted)(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe); | ||
|
|
||
| /* TX utitlities */ | ||
| int req_pkt_type; |
There was a problem hiding this comment.
Do we really need 4 ints here? Can we put this in 1 int?
There was a problem hiding this comment.
Not unless we rearrange the pkt_type order - which breaks backwards compatibility. I could remove the tagged ones but I kept them there for completeness.
There was a problem hiding this comment.
do we not have enums fro these? also should they not at least be unsigned?
There was a problem hiding this comment.
No and probably. But I can't change them to unsigned in this PR.
| * posting a Long CTS RTM pke or to update the number of in flight reads | ||
| * and read bytes | ||
| */ | ||
| int (*send_pkes_posted)(struct efa_rdm_ep *ep, struct efa_rdm_ope *txe); |
There was a problem hiding this comment.
This name is confusing to me b/c "Send pkes posted" reads like you are going to re-send (verb) the already posted packet entries. I like something like after_pkes_posted. I don't think send needs to be in the name b/c it is not in the construct_pkes name.
There was a problem hiding this comment.
I can change it to after_tx_pkes_posted
Also, construct_tx_pkes sounds better than construct_pkes
There was a problem hiding this comment.
I picked handle_tx_pkes_posted. Let me know what you think
| } | ||
|
|
||
|
|
||
| EFA_DBG(FI_LOG_EP_DATA, |
There was a problem hiding this comment.
Should we make these debug statements per protocol and have them more detailed about which protocol is used and how many pke's are being sent?
9a97526 to
ccdcb30
Compare
| uint8_t gen; | ||
|
|
||
| /**@brief Callback function called in TX and RX paths */ | ||
| void (*callback)(struct efa_rdm_pke *pkt_entry); |
There was a problem hiding this comment.
let's please give it a more meaningful name
There was a problem hiding this comment.
I'm not intentionally choosing a meaningless name.. If you have better ideas, please let me know.
| #include "efa.h" | ||
| #include "efa_rdm_proto.h" | ||
|
|
||
| int efa_rdm_proto_select_send_protocol(struct efa_rdm_proto **proto, |
There was a problem hiding this comment.
proto/protocol - can we be consistent? (I vote for the more verbose option)
There was a problem hiding this comment.
efa_rdm_proto is the prefix. We use prefixes like that through the code. Not sure if it the function names have to match that prefix everywhere
|
|
||
| struct efa_rdm_proto efa_rdm_proto_eager = { | ||
| .construct_pkes = &efa_rdm_proto_eager_construct_pkes, | ||
| .req_pkt_type = EFA_RDM_EAGER_MSGRTM_PKT, |
There was a problem hiding this comment.
should we create a request packet type struct?
|
|
||
| extern struct efa_rdm_proto efa_rdm_proto_eager; | ||
|
|
||
| int efa_rdm_proto_eager_construct_pkes(struct efa_rdm_ep *ep, |
There was a problem hiding this comment.
if these are callbacks, they should be static and local to the source file
| return err; | ||
| use_p2p = err; | ||
|
|
||
| for (int i = 0; i < EFA_RDM_MAX_PROTO; ++i) { |
There was a problem hiding this comment.
not sure I understand this, let's discuss offline please
| @@ -268,7 +260,7 @@ void efa_rdm_pke_handle_sent(struct efa_rdm_pke *pkt_entry, int pkt_type, struct | |||
| break; | |||
| case EFA_RDM_EAGER_MSGRTM_PKT: | |||
| case EFA_RDM_EAGER_TAGRTM_PKT: | |||
There was a problem hiding this comment.
can you not combine these cases now?
|
|
||
| /* If a protocol is found, use it. Otherwise, fall back to the old code | ||
| * path */ | ||
| if (proto) { |
ccdcb30 to
4111576
Compare
| available_tx_pkts = ep->efa_max_outstanding_tx_ops - | ||
| ep->efa_outstanding_tx_ops - ep->efa_rnr_queued_pkt_cnt; | ||
|
|
||
| if (OFI_UNLIKELY(available_tx_pkts == 0)) { |
There was a problem hiding this comment.
I am not a fan of this check b/c it optimizes an error path, and adds more branching to the normal path. I think a better change would be to split the ope buffpool into a TX buffpool that has a hard limit and an RX buff pool which is unlimited for unexpected messages, and then we can rely on the efa_rdm_ep_alloc_txe() check to fail.
There was a problem hiding this comment.
Only the medium and runting protocols require more than 1 available tx pkt. We were previously doing the check twice anyway. After moving the check here, I'm skipping the check for protocols that need 1 tx pkt.
| assert(0); | ||
| } | ||
|
|
||
| dlist_insert_tail(&txe->ep_entry, &ep->txe_list); |
There was a problem hiding this comment.
Did you mean to drop the lock here? If so, can you add it to your commit message?
There was a problem hiding this comment.
This commit is actually in #12010
I fixed it there
Add a callback function pointer to the efa_rdm_pke struct. When set, the CQ send completion handler invokes the callback instead of dispatching through the pkt_type-based switch statement in efa_rdm_pke_handle_send_completion. This allows each protocol to define its own send completion logic directly on the packet entry, decoupling completion handling from the centralized switch. The callback is checked conditionally so the old code path continues to work for protocols that have not yet been migrated. The member is has a generic name "callback" because the same callback will be used for two different purposes in the RX path in the future. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto.h defining the efa_rdm_proto struct which serves as the interface for all EFA RDM protocols. The interface includes: - can_use_protocol_for_send: determines if this protocol can handle a given send operation based on message size, peer capabilities, etc. - construct_tx_pkes: builds the TX packet entries for the operation - handle_tx_pkes_posted: post-send hook for protocol-specific bookkeeping - Packet type fields for msg/tagged/DC variants Add efa_rdm_proto.c with the protocol registry array and the efa_rdm_proto_send_pkes_posted_no_op default implementation. The RX path will be introduced in later commits. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_eager.c and efa_rdm_proto_eager.h defining the eager protocol using the efa_rdm_proto interface. The eager protocol sends messages that fit within a single MTU-sized packet. This commit adds the protocol struct with packet type definitions and a can_use_for_send function that checks if the message fits in a single packet after accounting for header size. The construct_tx_pkes and send completion callbacks are placeholders that will be implemented in subsequent commits. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_select_send_protocol() which iterates through the registered protocols in priority order and selects the first one whose can_use_protocol_for_send() returns true. The function precomputes header flags and tagged/DC packet type variants once before the loop to avoid redundant computation across protocols. The function also initializes TXE fields (iov, desc, mr) from the fi_msg before protocol selection so that efa_rdm_ope_try_fill_desc can attempt internal memory registration for non-eager protocols. For now only the eager protocol is registered. Messages that do not fit eager fall through and return proto=NULL, signaling the caller to use the old code path. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM packet construction logic from efa_rdm_pke_rtm.c into efa_rdm_proto_eager.c, implementing the construct_tx_pkes callback. The function initializes the TXE via efa_rdm_proto_txe_fill, allocates a single TX packet entry, initializes the RTM header with the message payload, and sets the per-packet send completion callback. Both regular and delivery-complete (DC) eager packets are supported, with DC packets using a separate callback and header struct that includes the send_id. Remove the old efa_rdm_pke_init_eager_msgrtm/tagrtm functions and their DC variants from efa_rdm_pke_rtm.c. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM send completion handling from efa_rdm_pke_rtm.c into efa_rdm_proto_eager.c. Two callbacks are implemented: - Regular eager: verifies payload size matches total length, calls efa_rdm_ope_handle_send_completed to report the CQ completion and release the TXE, then releases the TX packet entry. - DC eager: checks efa_rdm_txe_dc_ready_for_release which requires both efa_outstanding_tx_ops == 0 and receipt received. Only releases the TXE when both conditions are met. Replace the old eager case in efa_rdm_pke_handle_send_completion with an assertion to catch any packets that incorrectly bypass the new callback. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Wire the new protocol interface into efa_rdm_msg_generic_send. The function now calls efa_rdm_proto_select_send_protocol first. If a protocol is selected, it calls construct_tx_pkes to build the packets, sends them via efa_rdm_pke_sendv, and calls handle_tx_pkes_posted. If no protocol is selected (proto == NULL), the function falls back to the existing code path using efa_rdm_txe_construct and efa_rdm_msg_post_rtm. This allows incremental migration: as each protocol is added to the registry, it will be picked up automatically while unimplemented protocols continue to use the old path. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_txe_fill() which initializes a TXE for use by the new protocol interface. The logic is derived from efa_rdm_txe_construct but differs in that it does not set the mr, desc, or total_len fields. These fields are already populated earlier in efa_rdm_proto_select_send_protocol where they are needed for memory registration via efa_rdm_ope_try_fill_desc. This function is called by each protocol's construct_tx_pkes implementation to set up the remaining TXE fields (op, peer, flags, iov, cq_entry, etc.) before building the packet entries. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_medium.c and efa_rdm_proto_medium.h implementing the medium message protocol. This protocol handles messages too large for a single eager packet but within the medium threshold (64KB by default for system memory). The construct_tx_pkes function splits message data across multiple packet entries, balancing data sizes for performance and respecting memory alignment requirements. Each packet carries a segment offset and the total message length in its header so the receiver can reassemble the message. For DC medium packets, the header layout differs from non-DC (send_id and padding precede msg_length), so header fields are set using the appropriate struct for each variant. The old medium RTM init functions are removed from efa_rdm_pke_rtm.c. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add a human-readable name string to the efa_rdm_proto struct. The name is set by each protocol (e.g. "EAGER", "MEDIUM", "LONGCTS") and is used in log messages emitted during protocol selection to aid debugging. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
c5449b8 to
109f4cb
Compare
Add efa_rdm_proto_longcts.c and efa_rdm_proto_longcts.h implementing the long CTS (Credit-based Transfer with Send) protocol. This protocol handles messages larger than the medium threshold when RDMA read is not available or the sender's memory is not registered. The sender transmits the first data chunk along with the RTM header, then waits for a CTS (Clear To Send) packet from the receiver before sending the remaining data in CTSDATA packets. The send completion callback tracks bytes_acked and only reports completion when all data has been acknowledged. For DC long CTS, the TXE is held until both all send completions and the receipt packet have been received. The old long CTS RTM init functions are removed from efa_rdm_pke_rtm.c. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add peer and use_p2p parameters to the can_use_protocol_for_send callback. The long read and runt read protocols need to check peer capabilities (RDMA read interop) and whether peer-to-peer memory access is available when deciding if they can handle a send operation. Update all existing protocol implementations (eager, medium, long CTS) to accept the new parameters. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_longread.c and efa_rdm_proto_longread.h implementing the long read message protocol. Instead of sending the message data, the sender transmits a single RTM packet containing the memory keys and addresses of its registered buffers, and the receiver performs RDMA reads to fetch the data directly. The can_use_for_send function requires p2p availability, registered memory descriptors, the message meeting the minimum read size threshold, and peer RDMA read interop support. The send completion callback simply releases the TX packet entry since the data transfer is driven by the receiver. The handle_tx_pkes_posted callback increments the domain's read message in-flight counter. The old long read RTM init functions are removed from efa_rdm_pke_rtm.c. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_runtread.c and efa_rdm_proto_runtread.h implementing the runt read protocol. This protocol combines an initial runt data transfer (sent with the RTM header, similar to medium) with RDMA reads for the remaining data. It is selected when the message is large enough for read-based transfer, p2p is available, and the peer supports RDMA read. The construct_tx_pkes function splits the runt portion across multiple packets while including the RDMA read IOVs in the packet header so the receiver can fetch the remaining data via RDMA read. The old runt read RTM init functions and the efa_rdm_txe_prepare_to_be_read helper are removed from efa_rdm_pke_rtm.c and efa_rdm_ope.c. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
All RTM packet types (eager, medium, long CTS, long read, runt read) have been migrated to the new protocol interface where each protocol constructs its packets directly. The shared efa_rdm_pke_init_rtm_with_payload helper is no longer called by any code path and can be safely removed. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Now that all standard RTM protocols (eager, medium, long CTS, long read, and runt read) have been migrated to the new protocol interface, the old code path in efa_rdm_msg_generic_send is only needed for zero-copy sends which bypass protocol selection entirely. Remove the old RTM selection logic (efa_rdm_msg_select_rtm), the efa_rdm_msg_post_rtm function, and the associated TXE construction and handshake enforcement code from the fallback path. The remaining path handles only the zero-copy case where data is sent directly to the peer's user receive queue without protocol headers. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
The new protocol selection logic in efa_rdm_proto_select_send_protocol attempts memory registration early and selects a non-read-based protocol if registration fails. This eliminates the need for the runtime fallback from read-based to CTS-based protocols that efa_rdm_ope_post_send_fallback provided. Remove the function and its declaration. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Some unit tests call efa_rdm_pke_handle_send_completion directly on eager packet entries, which now triggers an assertion since eager packets have been migrated to the callback path. Add efa_unit_test_pke_handle_send_completion utility in efa_unit_tests.h that checks for a callback first and falls through to the old handler if none is set. Update the RNR and DC OPE tests to use this utility. Add a TODO to remove the utility once all protocols are migrated. Also set the DC eager callback on manually constructed packet entries in the DC TXE release tests. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Remove test_efa_rdm_ep_dc_send_queue_before_handshake and test_efa_rdm_ep_dc_send_queue_limit_before_handshake. The new protocol code path does not queue DC sends pending handshake completion because the refactored path targets backwards compatibility only to v2.0, and the delivery complete feature was introduced in v1.12. All peers at v2.0 or later are guaranteed to support DC. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Rewrite test_efa_rdm_pke_handle_longcts_rtm_send_completion to use the new protocol interface. The test verifies that efa_rdm_proto_longcts_handle_rtm_send_completion returns early for zero-payload READ NACK packets without touching the TXE or writing a CQ completion. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_unit_test_proto.c with tests verifying protocol selection: - Eager for small/zero-length messages - Medium for mid-size messages (16KB) - Long CTS as fallback for large messages without p2p - Long read over long CTS when p2p and RDMA read available - Runt read over long read when runt conditions met - Priority ordering (eager before medium) Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append eager protocol tests to efa_unit_test_proto.c: - test_proto_eager_construct_pkes_single_pke: verifies 1 PKE with callback - test_proto_eager_send_completion_releases_txe: verifies TXE release - test_proto_eager_assigns_msg_id: verifies msg_id from peer->next_msg_id Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append medium protocol tests to efa_unit_test_proto.c: - test_proto_medium_construct_pkes_multi_pke: verifies multi-PKE for 16KB - test_proto_medium_send_completion_tracks_bytes_acked: verifies TXE release only after all PKEs complete Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append long CTS test to efa_unit_test_proto.c: - test_proto_longcts_construct_pkes_single_rtm: verifies exactly 1 RTM packet with callback and TXE correctly set Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append long read test to efa_unit_test_proto.c: - test_proto_longread_construct_pkes_has_read_iovs: verifies 1 packet with pkt_size including RDMA read IOV entries Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Append runt read test to efa_unit_test_proto.c: - test_proto_runtread_construct_pkes_has_runt_and_read_iovs: verifies multiple PKEs with callbacks, and bytes_runt set between 0 and total_len Signed-off-by: Sai Sunku <sunkusa@amazon.com>
109f4cb to
30f6a47
Compare
A series of commits that introduce the new protocol interface for the efa fabric. The TX path for all protocols is implemented. RX, RMA and atomic paths will be implemented in future commits.