Skip to content

prov/efa: Introduce new protocol interface and implement it for the eager protocol in the TX path#12144

Open
sunkuamzn wants to merge 11 commits into
ofiwg:mainfrom
sunkuamzn:proto-callback-eager
Open

prov/efa: Introduce new protocol interface and implement it for the eager protocol in the TX path#12144
sunkuamzn wants to merge 11 commits into
ofiwg:mainfrom
sunkuamzn:proto-callback-eager

Conversation

@sunkuamzn
Copy link
Copy Markdown
Contributor

A series of commits that introduce the new protocol interface for the efa fabric. The TX path for the eager protocols is implemented. Other protocols, RX, RMA and atomic paths are implemented in #12002

@sunkuamzn sunkuamzn force-pushed the proto-callback-eager branch from 99faebb to 9e224cf Compare April 15, 2026 01:40
Comment thread prov/efa/src/rdm/efa_rdm_pke.h Outdated
Comment thread prov/efa/src/rdm/efa_rdm_cq.c Outdated
#endif
efa_rdm_pke_handle_send_completion(pkt_entry);
efa_rdm_cq_increment_pkt_entry_gen(pkt_entry);
if (pkt_entry->callback) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a function would be neat here, e.g. - packet_completion_handle.

Also, should we assert on certain packet types having the callbacks set or not?

Comment thread prov/efa/src/rdm/efa_rdm_proto.h
Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
dlist_insert_tail(&txe->peer_entry, &txe->peer->txe_list);
}

txe->internal_flags = 0;
Copy link
Copy Markdown
Contributor

@talavr-amazon talavr-amazon Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memset to 0 at the top?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the goal is to zero out only the fields we care about and save a few instructions

txe->cq_entry.buf =
OFI_LIKELY(txe->cq_entry.len > 0) ? txe->iov[0].iov_base : NULL;

/* set flags */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be neat to have txe_flags_set function

Copy link
Copy Markdown
Contributor

@talavr-amazon talavr-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor styling comments

*/

/*
* Description of the protocol
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should it be on top?

} else {
pkt_entry->callback(pkt_entry);
}
efa_rdm_cq_increment_pkt_entry_gen(pkt_entry);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it be better to make efa_rdm_cq_increment_pkt_entry_gen part of the callback .. ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that. The callback is part of the protocol. The gen counter is used by the progress engine to keep track of duplicated completions. So I felt that it's better to leave the gen counter increment here


switch (op) {
case ofi_op_tagged:
txe->cq_entry.flags = FI_TRANSMIT | FI_MSG | FI_TAGGED;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe we can define an array which index is ofi_op_code and value is the flag
, since the op code is an enum

enum {
	ofi_op_msg,
	ofi_op_tagged,
	ofi_op_read_req,
	ofi_op_read_rsp,
	ofi_op_write,
	ofi_op_write_async,
	ofi_op_atomic,
	ofi_op_atomic_fetch,
	ofi_op_atomic_compare,
	ofi_op_read_async,
	ofi_op_max,
};

so can get rid of the switch case by using the index, and keep an if for tag case to fill the tag fields

Comment thread prov/efa/Makefile.include Outdated
Comment thread prov/efa/src/rdm/efa_rdm_proto.c
Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
break;
case EFA_RDM_DC_EAGER_MSGRTM_PKT:
case EFA_RDM_DC_EAGER_TAGRTM_PKT:
assert(0 && "Eager protocol moved to refactored code path");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a packet construction ? seems it should be part of your next commit

Comment thread prov/efa/src/rdm/efa_rdm_proto_eager.c
else
delivery_complete_requested = flags & FI_DELIVERY_COMPLETE;

req_pkt_type = delivery_complete_requested ?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the protocol selection function already implement the same logic for delivery_complete_requested, req_pkt_type, etc. Why implementing them here again?

#include "efa_rdm_pke_utils.h"
#include "efa_rdm_pke_req.h"

#include "efa_mr.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only active the new code in this commit, should the earlier assertion in the switch cases added only in this commit?

Comment thread prov/efa/src/rdm/efa_rdm_msg.c Outdated
return txe->total_len <= max_rtm_data_capacity;
}

struct efa_rdm_proto efa_rdm_proto_eager = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should go at the bottom


/* TX path callbacks - one callback for each packet type that this protocol uses
*/
void efa_rdm_proto_eager_handle_rtm_send_completion(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: documentation?

Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
struct efa_rdm_proto **proto)
{
return 0;
/* TODO: Handle memory registration of user buffers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you're planning to complete as part of this refactor?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's already implemented for medium and read based protocols in the full PR #12002

Comment thread prov/efa/src/rdm/efa_rdm_proto.c Outdated
assert(data_offset == 0 && data_size == -1);
ret = efa_rdm_pke_init_eager_msgrtm(pkt_entry, ope);
break;
case EFA_RDM_EAGER_TAGRTM_PKT:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not an opportunity to united these cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire function will be removed when the refactor is complete, so I'm not sure it matters


/* First try to use the refactored code path */
err = efa_rdm_proto_select_send_protocol(ep, peer, msg, op, flags, txe,
&proto);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split the new and old paths to two different functions

talavr-amazon
talavr-amazon previously approved these changes Apr 17, 2026
sunkuamzn added 4 commits May 15, 2026 22:47
Add a callback function pointer to the efa_rdm_pke struct. When set, the
CQ send completion handler invokes the callback instead of dispatching
through the pkt_type-based switch statement in
efa_rdm_pke_handle_send_completion.

This allows each protocol to define its own send completion logic
directly on the packet entry, decoupling completion handling from the
centralized switch. The callback is checked conditionally so the old
code path continues to work for protocols that have not yet been
migrated.

The member is has a generic name "callback" because the same callback
will be used for two different purposes in the RX path in the future.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto.h defining the efa_rdm_proto struct which serves as
the interface for all EFA RDM protocols. The interface includes:

- can_use_protocol_for_send: determines if this protocol can handle a
  given send operation based on message size, peer capabilities, etc.
- construct_tx_pkes: builds the TX packet entries for the operation
- handle_tx_pkes_posted: post-send hook for protocol-specific
  bookkeeping
- Packet type fields for msg/tagged/DC variants

Add efa_rdm_proto.c with the protocol registry array and the
efa_rdm_proto_send_pkes_posted_no_op default implementation.

The RX path will be introduced in later commits.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_txe_fill() which initializes a TXE for use by the new
protocol interface. The logic is derived from efa_rdm_txe_construct but
differs in that it does not set the mr, desc, or total_len fields.
These fields will be populated earlier in
efa_rdm_proto_select_send_protocol where they are needed for memory
registration via efa_rdm_ope_try_fill_desc.

This function is called by each protocol's construct_tx_pkes
implementation to set up the remaining TXE fields (op, peer, flags,
iov, cq_entry, etc.) before building the packet entries.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_rdm_proto_eager.c and efa_rdm_proto_eager.h defining the eager
protocol using the efa_rdm_proto interface. The eager protocol sends
messages that fit within a single MTU-sized packet.

This commit adds the protocol struct with packet type definitions and a
can_use_for_send function that checks if the message fits in a single
packet after accounting for header size. The construct_tx_pkes and send
completion callbacks are placeholders that will be implemented in
subsequent commits.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
sunkuamzn added 7 commits May 15, 2026 23:00
Add efa_rdm_proto_select_send_protocol() which iterates through the
registered protocols in priority order and selects the first one whose
can_use_protocol_for_send() returns true. The function precomputes
header flags and tagged/DC packet type variants once before the loop to
avoid redundant computation across protocols.

The function also initializes TXE fields (iov, desc, mr) from the
fi_msg before protocol selection so that efa_rdm_ope_try_fill_desc can
attempt internal memory registration for non-eager protocols.

For now only the eager protocol is registered. Messages that do not fit
eager fall through and return proto=NULL, signaling the caller to use
the old code path.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM packet construction logic from efa_rdm_pke_rtm.c
into efa_rdm_proto_eager.c, implementing the construct_tx_pkes callback.

The function initializes the TXE via efa_rdm_proto_txe_fill, allocates a
single TX packet entry, initializes the RTM header with the message
payload, and sets the per-packet send completion callback. Both regular
and delivery-complete (DC) eager packets are supported, with DC packets
using a separate callback and header struct that includes the send_id.

Remove the old efa_rdm_pke_init_eager_msgrtm/tagrtm functions and their
DC variants from efa_rdm_pke_rtm.c.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Move the eager RTM send completion handling from efa_rdm_pke_rtm.c into
efa_rdm_proto_eager.c. Two callbacks are implemented:

- Regular eager: verifies payload size matches total length, calls
  efa_rdm_ope_handle_send_completed to report the CQ completion and
  release the TXE, then releases the TX packet entry.
- DC eager: checks efa_rdm_txe_dc_ready_for_release which requires both
  efa_outstanding_tx_ops == 0 and receipt received. Only releases the
  TXE when both conditions are met.

Replace the old eager case in efa_rdm_pke_handle_send_completion with an
assertion to catch any packets that incorrectly bypass the new callback.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Wire the new protocol interface into efa_rdm_msg_generic_send. The
function now calls efa_rdm_proto_select_send_protocol first. If a
protocol is selected, it calls construct_tx_pkes to build the packets,
sends them via efa_rdm_pke_sendv, and calls handle_tx_pkes_posted.

If no protocol is selected (proto == NULL), the function falls back to
the existing code path using efa_rdm_txe_construct and
efa_rdm_msg_post_rtm. This allows incremental migration: as each
protocol is added to the registry, it will be picked up automatically
while unimplemented protocols continue to use the old path.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add efa_unit_test_proto.c with tests verifying protocol selection:
- Eager for small/zero-length messages

Add eager protocol tests to efa_unit_test_proto.c:
- test_proto_eager_construct_pkes_single_pke: verifies 1 PKE with callback
- test_proto_eager_send_completion_releases_txe: verifies TXE release
- test_proto_eager_assigns_msg_id: verifies msg_id from peer->next_msg_id

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Some unit tests call efa_rdm_pke_handle_send_completion directly on
eager packet entries, which now triggers an assertion since eager
packets
have been migrated to the callback path.

Add efa_unit_test_pke_handle_send_completion utility in efa_unit_tests.h
that checks for a callback first and falls through to the old handler if
none is set. Update the RNR and DC OPE tests to use this utility. Add a
TODO to remove the utility once all protocols are migrated.

Also set the DC eager callback on manually constructed packet entries in
the DC TXE release tests.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Add test verifying that:
- A send is queued when peer_may_have_zcpy_rx is true and handshake
  has not been received yet
- The queued OPE is dequeued and the packet is posted once handshake
  arrives and progress is called

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
@sunkuamzn sunkuamzn force-pushed the proto-callback-eager branch from 9561f23 to ae68528 Compare May 15, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants