prov/efa: support hardware counter #12114
@mrgolin Could you review this? Thanks!
    efa_cq = container_of(poll_list_entry->cq, struct efa_cq, ibv_cq);
    /* prevent another thread polling cq at the same time */
    ofi_genlock_lock(&efa_cq->util_cq.ep_list_lock);
    efa_cq_drain_ibv_cq(poll_list_entry->cq);
I don't think we can just drain the CQ without writing the entries to the util CQ. Even if the application binds the CQ with FI_SELECTIVE_COMPLETION, there can still be operations for which it wants completions.
Does that mean we still need to call efa_cq_poll_ibv_cq?
Right. We do not want to lose libfabric CQEs in this case.
    size_t qp_table_sz_m1;
    struct ofi_genlock qp_table_lock;
    int urandom_fd;
    uint32_t max_comp_cntr;
nit: a comment here explaining what this field is used for?
    efa_device->device_caps = 0;
    #endif
    efa_device->max_comp_cntr = 0;
    #if HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR
nit: move to a function? max_comp_counter_set
    {
        int err;
        size_t qp_table_size;
nit: please expand the commit message with an explanation re the purpose of this field
cntr_cnt in domain_attr is the optimal number of completion counters supported by the domain. According to the man page, it may be a fixed value equal to the maximum number of counters supported by the underlying hardware, or a dynamic value based on the default attributes of the domain. Set it to the maximum number of counters supported by the EFA device, or leave it as 0 when hardware counters are not supported. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
For efa-direct, set max_cntr_value and max_err_cntr_value via fi_getinfo based on the comp_count_max_value and err_count_max_value from the EFA device and the user hints. The protocol path cannot use hardware counters because it generates multiple completion events per user operation. For API versions < 2.5, default to UINT64_MAX. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
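The selection described in this commit message can be sketched roughly as follows. This is only an illustration, not the PR's code: `SKETCH_VERSION`, `struct sketch_limits`, and `sketch_set_limits` are hypothetical names (libfabric itself provides `FI_VERSION` and `FI_VERSION_LT` for version checks), and the commit message does not say how hints and device limits are reconciled, so the sketch assumes that a nonzero hint below the device limit wins and anything else falls back to the device limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for libfabric's FI_VERSION(major, minor) packing. */
#define SKETCH_VERSION(major, minor) \
	(((uint32_t)(major) << 16) | (uint32_t)(minor))

/* Hypothetical container for the two limits named in the commit message. */
struct sketch_limits {
	uint64_t max_cntr_value;
	uint64_t max_err_cntr_value;
};

/* Assumption: a hint of 0 means "no preference"; larger hints are clamped. */
static uint64_t sketch_pick(uint64_t device_max, uint64_t hint_max)
{
	if (hint_max != 0 && hint_max < device_max)
		return hint_max;
	return device_max;
}

static void sketch_set_limits(uint32_t api_version,
			      const struct sketch_limits *device,
			      const struct sketch_limits *hints,
			      struct sketch_limits *out)
{
	assert(device != NULL && out != NULL);

	/* Older API versions do not know about the limits: report UINT64_MAX. */
	if (api_version < SKETCH_VERSION(2, 5)) {
		out->max_cntr_value = UINT64_MAX;
		out->max_err_cntr_value = UINT64_MAX;
		return;
	}

	out->max_cntr_value = sketch_pick(device->max_cntr_value,
					  hints ? hints->max_cntr_value : 0);
	out->max_err_cntr_value = sketch_pick(device->max_err_cntr_value,
					      hints ? hints->max_err_cntr_value : 0);
}
```

With a device limit of 1000 and a hint of 500, the sketch reports 500 for API 2.5 and UINT64_MAX for API 2.4.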
Implement hardware counter open/close and the fi_ops_cntr operations (read, readerr, add, adderr, set, seterr, wait) that delegate to the corresponding ibv_*_comp_cntr functions from rdma-core. Note that the CQ still needs to be progressed during fi_cntr_read/readerr to complete WQEs in the SQ and RQ. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
… memory
Add cntr_open_ext to fi_efa_ops_gda to create hardware completion counters with optional application-provided external memory for the completion and error counts, enabling zero-copy observation of completion progress by co-located processes or devices. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Attach the hardware completion counter to the QP with ibv_qp_attach_comp_cntr after the QP is created in the RESET state during ep enable. We cannot do this during ep bind because the QP has not been created yet. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add efa_hw_cntr_wait(), which polls the hardware completion counter until it reaches the requested threshold or the timeout expires. It uses exponential backoff starting at 1 microsecond and doubling each iteration for up to 5 attempts, or repeats a 1 ms wait when the user asks for an infinite timeout. Also fix efa_cntr_wait, which did not handle an infinite timeout correctly. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
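A minimal sketch of this kind of polling wait is shown below. It is one reading of the commit message, not the code in the PR: every `sketch_*` name is hypothetical, the counter is read through a caller-supplied callback instead of rdma-core, and the sketch assumes that a negative timeout means "wait forever" (as with fi_cntr_wait), that the delay doubles at most five times, and that an infinite wait then settles on 1 ms sleeps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback that returns the current counter value. */
typedef uint64_t (*sketch_read_fn)(void *ctx);

static uint64_t sketch_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void sketch_sleep_ns(uint64_t ns)
{
	struct timespec ts;

	ts.tv_sec = (time_t)(ns / 1000000000ull);
	ts.tv_nsec = (long)(ns % 1000000000ull);
	nanosleep(&ts, NULL);
}

/* Returns 0 once the counter reaches threshold, -ETIMEDOUT on timeout. */
static int sketch_cntr_wait(sketch_read_fn read_cntr, void *ctx,
			    uint64_t threshold, int timeout_ms)
{
	const int infinite = timeout_ms < 0;
	uint64_t deadline = 0;
	uint64_t delay_ns = 1000; /* start at 1 microsecond */
	int doublings = 0;

	assert(read_cntr != NULL);
	if (!infinite)
		deadline = sketch_now_ns() + (uint64_t)timeout_ms * 1000000ull;

	for (;;) {
		if (read_cntr(ctx) >= threshold)
			return 0;
		if (!infinite && sketch_now_ns() >= deadline)
			return -ETIMEDOUT;

		sketch_sleep_ns(delay_ns);
		if (doublings < 5) {
			delay_ns *= 2; /* exponential backoff phase */
			doublings++;
		} else if (infinite) {
			delay_ns = 1000000; /* assumed steady 1 ms poll */
		}
	}
}

/* Stand-in counters for illustration only. */
static uint64_t sketch_fake_read(void *ctx)
{
	return ++*(uint64_t *)ctx; /* advances by one on every read */
}

static uint64_t sketch_stuck_read(void *ctx)
{
	(void)ctx;
	return 0; /* never makes progress */
}
```

Calling `sketch_cntr_wait(sketch_fake_read, &value, 3, 100)` with `value` starting at 0 returns 0 after three reads, while the stuck counter with a 1 ms timeout returns `-ETIMEDOUT`.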
Add fi_efa_hw_cntr fabtest that exercises hardware counters through MSG pingpong operations. The test opens counters via cntr_open_ext from the GDA domain ops, binds them as txcntr/rxcntr, and uses the existing ft_get_cntr_comp path for completion tracking. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add RMA write support to fi_efa_hw_cntr via the -o write option. This adds rma_write() and run_rma() functions, and the API_OPTS parsing to select between MSG pingpong (default) and RMA write. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add --external-mem flag to fi_efa_hw_cntr that enables external user-provided memory mode. When set, the test allocates buffers and passes them via FI_EFA_MEMORY_LOCATION_VA with the FI_EFA_COMP_CNTR_INIT_WITH_EXTERNAL_MEM flag to cntr_open_ext. Add corresponding pytest cases for pingpong and RMA write with external memory. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Hardware counters require firmware support. Add an environment variable, FI_EFA_USE_HW_CNTR, that is not registered via fi_param_define, so we can control when to enable the feature without exposing the variable to applications. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Guard all entry points to the hardware counter with FI_EFA_USE_HW_CNTR, which defaults to false until we enable it. Enable fabtests and unit tests with FI_EFA_USE_HW_CNTR=1. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
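The gating described in the last two commits could look roughly like the sketch below. It is an assumption-laden illustration: only the variable name FI_EFA_USE_HW_CNTR and its default of false come from the commit messages. The helper names, the accepted values ("1" and "0"), and the `-ENOSYS` return code are all hypothetical. The variable is read directly with `getenv`, which is what keeps it out of the parameter list that fi_param_define would otherwise publish.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a boolean flag straight from the environment. */
static int sketch_env_flag(const char *name, int default_value)
{
	const char *value;

	assert(name != NULL);
	value = getenv(name);
	if (value == NULL || value[0] == '\0')
		return default_value;
	if (strcmp(value, "1") == 0)
		return 1;
	if (strcmp(value, "0") == 0)
		return 0;
	return default_value; /* unrecognized values keep the default */
}

/* Hypothetical entry point guarded by the flag (default: disabled). */
static int sketch_hw_cntr_open(void)
{
	if (!sketch_env_flag("FI_EFA_USE_HW_CNTR", 0))
		return -ENOSYS; /* assumed error code for "not enabled" */

	/* ... the real counter creation would go here ... */
	return 0;
}
```

With the variable unset, `sketch_hw_cntr_open()` fails with `-ENOSYS`; after `setenv("FI_EFA_USE_HW_CNTR", "1", 1)` it succeeds.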
    cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);

    /* Progress CQ to complete WQE in SQ and RQ */
I know this is for avoiding CQ overrun and for resource management (WQ), but I still think it is not desirable given the goal of a hardware counter read: a cheaper way to get completion counts without a heavyweight CQ poll. If we want a way to avoid CQ overrun, that can be a documented requirement for applications, or a separate change that protects against CQ overrun elsewhere. Meanwhile, the efa-direct fabric supports FI_PROGRESS_AUTO, which doesn't require the application to call fi_cntr_read to progress completions. So polling the CQ here seems awkward to me.
Since efa-direct claims FI_RM_DISABLED, resource management is the application's responsibility. NCCL GIN will read the counter value directly from hardware without calling fi_cntr_read, so it needs to poll the CQ separately to reclaim queue resources. I want to remove this internal cq polling from the hardware counter path to make this consistent.
efa-direct claims FI_RM_ENABLED today:
libfabric/prov/efa/src/efa_prov_info.c
Lines 103 to 104 in 0ede325
I thought we agreed that we do need to poll the CQ in the fi_cntr_read() path, even with HW counters?
I just want to call out that this doesn't make much sense, even if it is the safest approach. Also, as far as I can tell, we still bump the util counters in efa_cq_poll_ibv_cq, because the PR still binds the cntr to util_ep in fi_ep_bind. Then why don't we read from the util cntrs, except for FI_REMOTE_WRITE (where there are no completions on the target side of fi_write), which doesn't even have any hardware limit?
This is not the correct thing to do in all cases. Please make sure the team is aligned and switch the implementation to the decided-on approach.
Implement cntr_open_ext in fi_efa_ops_gda to create hardware completion
counters using ibv_create_comp_cntr from rdma-core.
The application can optionally provide its own memory for the completion and
error counts, enabling zero-copy observation of completion progress by
co-located processes or devices.
Implement fi_ops_cntr operations (read, readerr, add, adderr, set,
seterr) that delegate to the corresponding ibv_*_comp_cntr functions.