Skip to content

prov/efa: support hardware counter#12114

Open
jiaxiyan wants to merge 11 commits intoofiwg:mainfrom
jiaxiyan:hw-cntr-support
Open

prov/efa: support hardware counter#12114
jiaxiyan wants to merge 11 commits intoofiwg:mainfrom
jiaxiyan:hw-cntr-support

Conversation

@jiaxiyan
Copy link
Copy Markdown
Contributor

@jiaxiyan jiaxiyan commented Apr 6, 2026

Implement cntr_open_ext in fi_efa_ops_gda to create hardware completion
counters using ibv_create_comp_cntr from rdma-core.
Application can optionally provide its own memory for the completion and error
counts, enabling zero-copy observation of completion progress by
co-located processes or devices.
Implement fi_ops_cntr operations (read, readerr, add, adderr, set,
seterr) that delegate to the corresponding ibv_*_comp_cntr functions.

@jiaxiyan jiaxiyan marked this pull request as draft April 6, 2026 20:28
@jiaxiyan jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 63d3a4a to 4e231eb Compare April 7, 2026 19:21
Comment thread prov/efa/src/efa_cntr.h Outdated
Comment thread fabtests/prov/efa/src/efa_hw_cntr_test.c
Comment thread prov/efa/src/efa_cntr.c Outdated
Comment thread prov/efa/src/efa_hw_cntr.c Outdated
Comment thread man/fi_efa.7.md
Comment thread prov/efa/src/efa_base_ep.c
Comment thread fabtests/prov/efa/src/efa_hw_cntr_test.c Outdated
@jiaxiyan jiaxiyan force-pushed the hw-cntr-support branch 9 times, most recently from 0c7d981 to b0a5ead Compare May 1, 2026 18:47
@jiaxiyan jiaxiyan marked this pull request as ready for review May 1, 2026 18:47
@jiaxiyan jiaxiyan changed the title prov/efa: Extend GDA domain ops to support hardware counter prov/efa: support hardware counter May 1, 2026
@jiaxiyan jiaxiyan requested a review from a team May 1, 2026 18:47
@jiaxiyan
Copy link
Copy Markdown
Contributor Author

jiaxiyan commented May 1, 2026

@mrgolin Could you review this? Thanks!

Comment thread prov/efa/src/efa_prov_info.c
Comment thread prov/efa/src/efa_hw_cntr.c Outdated
efa_cq = container_of(poll_list_entry->cq, struct efa_cq, ibv_cq);
/* prevent another thread polling cq at the same time */
ofi_genlock_lock(&efa_cq->util_cq.ep_list_lock);
efa_cq_drain_ibv_cq(poll_list_entry->cq);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can just drain the cq without writing it to util cq. Even if application bind the cq as FI_SELECTIVE_COMPLETION, there can still be operations they want completions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we still need to efa_cq_poll_ibv_cq?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. we do not want to lose libfabric cqes in this case

Comment thread prov/efa/src/fi_ext_efa.h
Comment thread prov/efa/src/fi_ext_efa.h
@jiaxiyan jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 501a4be to a2dc8f3 Compare May 5, 2026 21:43
talavr-amazon
talavr-amazon previously approved these changes May 6, 2026
Comment thread prov/efa/src/efa_device.h
size_t qp_table_sz_m1;
struct ofi_genlock qp_table_lock;
int urandom_fd;
uint32_t max_comp_cntr;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a comment here explaining what this field is used for?

Comment thread prov/efa/src/efa_device.c Outdated
efa_device->device_caps = 0;
#endif
efa_device->max_comp_cntr = 0;
#if HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move to a function? max_comp_counter_set

Comment thread prov/efa/src/efa_device.c
{
int err;
size_t qp_table_size;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please expand the commit message with an explanation re the purpose of this field

jiaxiyan added 6 commits May 6, 2026 14:22
cntr_cnt in domain_attr is the optimal number of completion counters
supported by the domain.
According to man page, it may be a fixed value of the maximum number of counters
supported by the underlying hardware, or may be a dynamic value, based on
the default attributes of the domain.
Set it as the maximum number of counters supported by EFA device, or
leave it as 0 when hardware counter is not supported.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
For efa-direct, set max_cntr_value and max_err_cntr_value via fi_getinfo
based on the comp_count_max_value and err_count_max_value from EFA device
and user hints.
The protocol path cannot use hardware counter because it generates
multiple completion events per user operation.
For API version < 2.5, default to UINT64_MAX.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Implement hardware counter open/close and fi_ops_cntr operations (read,
readerr, add, adderr, set, seterr, wait) that delegate to the
corresponding ibv_*_comp_cntr functions from rdma-core.
Note that CQ still needs to be progressed during fi_cntr_read/readerr
to complete WQE in SQ and RQ.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
… memory

Add cntr_open_ext to fi_efa_ops_gda to create hardware completion
counters with optional application-provided external memory for the
completion and error counts, enabling zero-copy observation of
completion progress by co-located processes or devices.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Attach hardware completion counter to QP with ibv_qp_attach_comp_cntr after QP is created
in RESET state during ep enable. We cannot do this during ep bind because QP
is not created yet.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add efa_hw_cntr_wait() which polls the hardware completion counter
until it reaches the requested threshold or the timeout expires.
Uses exponential backoff starting at 1 microsecond, doubling each
iteration for up to 5 attempts, or repeat 1ms when user asked for
infinite timeout.
Also fixed efa_cntr_wait since it didn't handle infinite timeout
correctly.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
jiaxiyan added 5 commits May 6, 2026 14:22
Add fi_efa_hw_cntr fabtest that exercises hardware counters through
MSG pingpong operations. The test opens counters via cntr_open_ext
from the GDA domain ops, binds them as txcntr/rxcntr, and uses the
existing ft_get_cntr_comp path for completion tracking.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add RMA write support to fi_efa_hw_cntr via the -o write option.
This adds rma_write() and run_rma() functions, and the API_OPTS
parsing to select between MSG pingpong (default) and RMA write.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add --external-mem flag to fi_efa_hw_cntr that enables external
user-provided memory mode. When set, the test allocates buffers and
passes them via FI_EFA_MEMORY_LOCATION_VA with the
FI_EFA_COMP_CNTR_INIT_WITH_EXTERNAL_MEM flag to cntr_open_ext.

Add corresponding pytest cases for pingpong and RMA write with
external memory.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Hardware counter requires firmware support. Add environment variable
FI_EFA_USE_HW_CNTR that is not registered via fi_param_define so we
can control when to enable it without exposing the variable to applications.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Guard all the entry points to hardware counter with FI_EFA_USE_HW_CNTR,
which is default to false until we enable it.
Enable fabtests and unit tests with FI_EFA_USE_HW_CNTR=1.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);

/* Progress CQ to complete WQE in SQ and RQ */
Copy link
Copy Markdown
Contributor

@shijin-aws shijin-aws May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is for avoiding cq overrun and resource management (WQ), but I still think this is not desirable per the goal of hw cntr read: a cheaper way to get the completion numbers without involving heavy weighted CQ poll. If we want a way to avoid cq overrun, that can be a documented requirement for application, or a separate change to protect cq overrun elsewhere. Meanwhile, the efa-direct fabric support FI_PROGRESS_AUTO which doesn't require application use fi_cntr_read to progress the completions. So polling cq here is awkward to me.

cc @amitrad-aws @bwbarrett

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since efa-direct claims FI_RM_DISABLED, resource management is the application's responsibility. NCCL GIN will read the counter value directly from hardware without calling fi_cntr_read, so it needs to poll the CQ separately to reclaim queue resources. I want to remove this internal cq polling from the hardware counter path to make this consistent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

efa-direct claims FI_RM_ENABLED today:

/* EFA direct path retries indefinitely when Receiver Not Ready (RNR) */
prov_info->domain_attr->resource_mgmt = FI_RM_ENABLED;

Copy link
Copy Markdown
Contributor

@a-szegel a-szegel May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed that we do need to poll the CQ in the fi_cntr_read() path, even with HW counters?

Copy link
Copy Markdown
Contributor

@shijin-aws shijin-aws May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to call out this doesn't make much sense even if it is the safest approach. Also as I can tell we still bump util counters in efa_cq_poll_ibv_cq , because the PR still bind the cntr to util_ep in fi_ep_bind. Then why don't we read from util cntrs except for FI_REMOTE_WRITE (where there is no completions on the target side of fi_write) which even doesn't have any hardware limit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct thing to do in all cases, please make sure the team is aligned and switch the implementation to the decided on approach.


cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);

/* Progress CQ to complete WQE in SQ and RQ */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct thing to do in all cases, please make sure the team is aligned and switch the implementation to the decided on approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants