fabtests: Add new fi_mr_abort fabtest#12113
Conversation
29260c4 to
7b6aadf
Compare
|
bot:aws:retest ... testing to see if AWS PR CI is working. |
7b6aadf to
d89fc4b
Compare
|
bot:aws:retest |
| * Test aborting in-flight RMA operations by closing MRs. | ||
| * | ||
| * Test modes: | ||
| * - Initiator close: client posts RMA ops, client closes its local MRs |
There was a problem hiding this comment.
Would have been easier to review if the different tests are in different commits
|
|
||
| #include "shared.h" | ||
|
|
||
| enum cancel_mode { |
| */ | ||
| struct op_ctx { | ||
| struct fi_context2 context; | ||
| int mr_idx; /* which mr_slot this op belongs to */ |
There was a problem hiding this comment.
nit: unsigned here and for completed?
| struct fid_mr *mr; | ||
| void *desc; | ||
| uint64_t key; | ||
| int posted; /* number of ops posted using this MR */ |
| static struct mr_slot *slots; | ||
| static struct op_ctx *op_arr; | ||
| static int *cancel_order; | ||
| static int num_mrs = 8192; |
| /* Remote side MR info */ | ||
| static struct fi_rma_iov *remote_arr; | ||
|
|
||
| #define MR_ABORT_KEY_BASE 0x1000 |
| pytest/efa/conftest.py \ | ||
| pytest/efa/efa_common.py \ | ||
| pytest/efa/test_flood_peer.py \ | ||
| pytest/efa/test_mr_abort.py \ |
There was a problem hiding this comment.
should this be part of the next commit?
|
|
||
| srand(time(NULL)); | ||
|
|
||
| while ((op = getopt(argc, argv, |
There was a problem hiding this comment.
nit: a function for parsing options? or at least for the help section?
| case TEST_ABORT: | ||
| for (i = 0; i < opts.iterations; i++) { | ||
| ret = register_mrs(remote_access_for_op()); | ||
| if (ret) |
| import pytest | ||
| from common import ClientServerTest | ||
|
|
||
| failing_test_skip = pytest.mark.skip(reason="efa/efa-direct fabrics do not currently pass fi_mr_abort testing") |
There was a problem hiding this comment.
nit: failing_test_skip is generic but the message is specific to fi_mr_abort test
| #define MR_ABORT_KEY_BASE 0x1000 | ||
| #define CQ_TIMEOUT_MS 30000 | ||
|
|
||
| static uint64_t local_access_for_op(void) |
There was a problem hiding this comment.
Can you reuse ft_info_to_mr_access?
| /* | ||
| * Test 1: Fill-and-abort (RMA) | ||
| * | ||
| * Initiator mode: fill TX queue with RMA ops, then close local MRs. |
There was a problem hiding this comment.
There are two sets of terms: initiator, target and client, server. Test will be easier to understand if there's only one set.
| * In target-close mode, the server actively closes MRs when | ||
| * signaled by the client. | ||
| */ | ||
| static int run_server(void) |
There was a problem hiding this comment.
A lot of code is duplicated between run_client and run_server. Switching on the test_mode first and then calling the relevant function for the server/client is better.
| remote_fi_addr, &o->context); | ||
| } | ||
|
|
||
| static void shuffle(int *arr, int n) |
There was a problem hiding this comment.
consider making this a utility function in common code
Add a new functional test that verifies behavior when memory regions are closed while operations are in-flight. This tests the abort-via- MR-close approach for canceling RDMA operations. Test modes (-T): abort - Fill TX queue, close MRs, verify all ops get completions partial - 2 MRs on same buffer, close 1, verify the other survives send - Same as abort but with fi_send/fi_recv tagged - Same as abort but with fi_tsend/fi_trecv Supports both initiator-side close (-R initiator) where the sender closes its local MRs, and target-side close (-R target) where the receiver closes its remote MRs via write-with-imm signaling. Configurable cancel order (-C reverse|random), operations per MR (-N), MR slot count (-W), and transfer size (-S). Server teardown order (-D ep_first|mr_first) controls whether the endpoint is closed before or after MR deregistration. EFA curently requires ep_first for send/tagged tests since posted RX buffer MRs cannot be deregistered while the endpoint is still open. The reuse check (write + read round-trip after abort) only runs for RMA test modes since send/tagged do not request FI_RMA capability. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Add pytest/efa/test_mr_abort.py with functional tests covering
MR close during in-flight operations across multiple dimensions:
test_mr_abort: RMA write/read/writedata × reverse/random
cancel order × initiator/target close
× 1 or 4 ops per MR
test_mr_abort_partial: 2 MRs on same buffer, close 1
test_mr_abort_send: fi_send at 4KB/64KB/256KB × reverse/random
× initiator/target × 1 or 4 ops per MR
test_mr_abort_tagged: fi_tsend at 4KB/64KB/256KB × reverse/random
× initiator/target × 1 or 4 ops per MR
Skips efa-direct+tagged (unsupported), efa-direct sends >8KB
(max send size), and target-close for send/tagged (EFA does not
currently support canceling posted RX buffers).
All tests except test_mr_abort_partial are marked skip pending
EFA firmware/provider fixes for MR abort support.
Parametrized over fabric (efa, efa-direct) via existing conftest.
Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Add printf calls between posting, MR close, and CQ drain in the writedata abort path. These introduce a small delay that gives the EFA device time to begin processing writedata WQEs before MR deregistration, exposing a firmware bug where the flush of partially-processed writedata WQEs triggers a QP internal error (prov_errno=2). Without these prints the client races through post/close so fast that all ops are canceled before reaching the device, masking the bug. The prints will be removed once the firmware issue is resolved. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
d89fc4b to
79b59f9
Compare
The man pages currently say:
but the EFA provider would like to support closing MR's when data transfer operations are in-flight.
This fabtest has some efa-specific provider error codes in it... but it should still be usable by every provider. I can move it to prov/efa if it is not generic enough to be shared across providers.