prov/verbs: Automatic GPU-NIC affinity #12203
In general, looks good; see the nitpicks inline.
Please sign the commits (use git commit -s); otherwise the DCO check won't pass.
It would be nice to have a few sentences in the commit message to describe what the commit does. For the first commit, consider changing the title to prov/verbs: Add infrastructure for ..., and the commit message can be a little bit more detailed than the rest.
The AppVeyor CI failure is caused by setenv and unsetenv not being available on Windows. Wrappers (using _putenv_s) need to be implemented in include/windows/osd.h.
I would prefer the new test to be standalone instead of being part of the existing getinfo test. That way the new test can be configured to run only on machines where it is meaningful.
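For the Windows point above, a minimal sketch of what such wrappers could look like. The names osd_setenv and osd_unsetenv are hypothetical and chosen only for illustration; the actual wrappers in include/windows/osd.h may be named and structured differently. The non-Windows branch just forwards to the POSIX calls so the same call sites compile everywhere.

```c
#include <stdlib.h>

/* Hypothetical wrapper names; not taken from the PR. */
#ifdef _WIN32
static inline int osd_setenv(const char *name, const char *value,
			     int overwrite)
{
	/* POSIX setenv() leaves an existing value alone when overwrite == 0 */
	if (!overwrite && getenv(name))
		return 0;
	return _putenv_s(name, value) ? -1 : 0;
}

static inline int osd_unsetenv(const char *name)
{
	/* _putenv_s() with an empty value removes the variable */
	return _putenv_s(name, "") ? -1 : 0;
}
#else
static inline int osd_setenv(const char *name, const char *value,
			     int overwrite)
{
	return setenv(name, value, overwrite);
}

static inline int osd_unsetenv(const char *name)
{
	return unsetenv(name);
}
#endif
```

One caveat of this approach: on Windows an environment variable cannot hold an empty string this way, since setting it to "" deletes it.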
    .nic_affinity_policy = "none",
    .affinity_device = NULL,
Try to keep the = aligned like the code segment above.
I think we have different tab widths. I hope I fixed what you meant.
    if (!ancestor) {
        return VRB_PROXIMITY_UNKNOWN;
    }

    if (ancestor->type == HWLOC_OBJ_BRIDGE) {
        return VRB_PROXIMITY_BRIDGE;
    }

    if (ancestor->type == HWLOC_OBJ_PACKAGE) {
        return VRB_PROXIMITY_PACKAGE;
    }
The convention here is not to use { and } around a single-line body in conditional statements.
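As an illustration of the requested style, here is the same kind of check written without braces around single-statement bodies. This is only a sketch: the enum and struct below are stand-ins so the snippet compiles without hwloc, the ordering of the VRB_PROXIMITY_* values is assumed, and the final fallback return is an assumption about what the surrounding function does.

```c
#include <stddef.h>

/* Stand-ins for the hwloc object types used in the PR (hypothetical). */
enum obj_type { OBJ_BRIDGE, OBJ_PACKAGE, OBJ_MACHINE };

struct topo_obj {
	enum obj_type type;
};

/* Assumed ordering: a lower value means the NIC is closer to the GPU. */
enum vrb_proximity {
	VRB_PROXIMITY_BRIDGE,
	VRB_PROXIMITY_PACKAGE,
	VRB_PROXIMITY_UNKNOWN,
};

static enum vrb_proximity classify_ancestor(const struct topo_obj *ancestor)
{
	if (!ancestor)
		return VRB_PROXIMITY_UNKNOWN;

	if (ancestor->type == OBJ_BRIDGE)
		return VRB_PROXIMITY_BRIDGE;

	if (ancestor->type == OBJ_PACKAGE)
		return VRB_PROXIMITY_PACKAGE;

	/* Assumption: anything else is treated as unknown. */
	return VRB_PROXIMITY_UNKNOWN;
}
```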
    if (nic_a->proximity < nic_b->proximity) {
        return -1;
    }
    if (nic_a->proximity > nic_b->proximity) {
        return 1;
    }

    for (cur = *info; cur; cur = cur->next) {
        entries_count++;
    }
    if (entries_count == 0) {
        return FI_SUCCESS;
    }
d48ba04 to c66e17f
Introduce a NIC affinity framework to enable NIC ordering in fi_getinfo() results based on GPU-NIC proximity. The 'none' policy preserves current behavior (no reordering) and serves as the default for backward compatibility. It establishes the infrastructure for policy-based handlers that will be added in subsequent commits. Signed-off-by: Gad Arbel <gad.arbel@intel.com>
c3fb7b3 to f936ae3
Can we make a new test for the affinity settings and leave the original getinfo test unchanged?
f936ae3 to 887cb0d
Thanks for the Windows tip - that was helpful.
581dcc6 to 84fbcdb
Implement the 'manual' policy that allows users to specify explicit GPU-to-NIC mappings via a configuration file. This policy reads a text-based config file containing PCI address to NIC name mappings and reorders fi_getinfo() results to prioritize the mapped NIC. Signed-off-by: Gad Arbel <gad.arbel@intel.com>
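To make the 'manual' policy concrete, here is a rough sketch of parsing one line of such a config file. The file format is not shown in this conversation, so the layout assumed here (one "<GPU PCI address> <NIC name>" pair per line, with '#' starting a comment) is a guess, and the struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define PCI_ADDR_LEN 32
#define NIC_NAME_LEN 64

/* Hypothetical in-memory form of one GPU-to-NIC mapping. */
struct affinity_mapping {
	char pci_addr[PCI_ADDR_LEN];
	char nic_name[NIC_NAME_LEN];
};

/*
 * Assumed line format: "<pci address> <nic name>".
 * Returns 1 if a mapping was parsed, 0 for blank or comment lines,
 * and -1 for malformed lines (missing or extra fields).
 */
static int parse_mapping_line(const char *line, struct affinity_mapping *map)
{
	char extra;
	int n;

	line += strspn(line, " \t\r\n");
	if (*line == '\0' || *line == '#')
		return 0;

	/* Field widths leave room for the terminating NUL. */
	n = sscanf(line, "%31s %63s %c", map->pci_addr, map->nic_name,
		   &extra);
	return n == 2 ? 1 : -1;
}
```

Rejecting lines with trailing fields keeps typos in the config file from being silently accepted as valid mappings.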
Implement the 'auto' policy that automatically detects and prioritizes NICs based on PCIe topology proximity to the target GPU device using hwloc. This policy calculates PCIe proximity levels between the GPU and each NIC, then reorders the fi_getinfo() list to place closer NICs first. Signed-off-by: Gad Arbel <gad.arbel@intel.com>
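The reordering step described in this commit can be sketched roughly as follows: count the list entries, copy them into an array, sort by proximity, and relink. This is not the PR's code. struct info_node is a stand-in for struct fi_info, the proximity field and the "lower is closer" ordering are assumptions, and the index tie-breaker is added here because qsort() is not guaranteed to be stable.

```c
#include <stdlib.h>

/* Stand-in for struct fi_info; only the link and an assumed score matter. */
struct info_node {
	struct info_node *next;
	int proximity;	/* assumption: lower means closer to the GPU */
};

struct sort_entry {
	struct info_node *node;
	int proximity;
	size_t index;	/* original position, used as a tie-breaker */
};

static int cmp_entry(const void *a, const void *b)
{
	const struct sort_entry *ea = a, *eb = b;

	if (ea->proximity != eb->proximity)
		return ea->proximity < eb->proximity ? -1 : 1;
	if (ea->index != eb->index)
		return ea->index < eb->index ? -1 : 1;
	return 0;
}

/* Returns 0 on success, -1 on allocation failure (list left untouched). */
static int reorder_by_proximity(struct info_node **head)
{
	struct info_node *cur;
	struct sort_entry *entries;
	size_t count = 0, i;

	for (cur = *head; cur; cur = cur->next)
		count++;
	if (count == 0)
		return 0;

	entries = calloc(count, sizeof(*entries));
	if (!entries)
		return -1;

	for (cur = *head, i = 0; cur; cur = cur->next, i++) {
		entries[i].node = cur;
		entries[i].proximity = cur->proximity;
		entries[i].index = i;
	}

	qsort(entries, count, sizeof(*entries), cmp_entry);

	/* Relink the nodes in sorted order. */
	for (i = 0; i + 1 < count; i++)
		entries[i].node->next = entries[i + 1].node;
	entries[count - 1].node->next = NULL;
	*head = entries[0].node;

	free(entries);
	return 0;
}
```

Because the list is only relinked after the allocation succeeds, a failure leaves the caller with the original order, which matches the 'none' behavior.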
Add hwloc library detection and linking to the verbs provider build system. Hwloc is optional and used by the 'auto' policy to query PCIe topology for automatic GPU-NIC proximity detection. When hwloc is not available, the 'auto' policy falls back to 'none' behavior. Signed-off-by: Gad Arbel <gad.arbel@intel.com>
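The fallback described here could be expressed at compile time along these lines. This is a sketch under assumptions: HAVE_HWLOC is a guess at the macro the build system defines, the enum and function names are hypothetical, and treating unrecognized policy names as 'none' is also an assumption.

```c
#include <string.h>

/* Hypothetical policy identifiers. */
enum affinity_policy {
	AFFINITY_POLICY_NONE,
	AFFINITY_POLICY_MANUAL,
	AFFINITY_POLICY_AUTO,
};

static enum affinity_policy resolve_policy(const char *name)
{
	if (!name)
		return AFFINITY_POLICY_NONE;

	if (!strcmp(name, "manual"))
		return AFFINITY_POLICY_MANUAL;

	if (!strcmp(name, "auto")) {
#ifdef HAVE_HWLOC	/* assumed macro name set by configure */
		return AFFINITY_POLICY_AUTO;
#else
		/* No topology information available: behave like 'none'. */
		return AFFINITY_POLICY_NONE;
#endif
	}

	/* Assumption: unknown names fall back to the default. */
	return AFFINITY_POLICY_NONE;
}
```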
Add test suite for GPU-NIC affinity policies. Tests validate that the affinity feature does not break existing functionality by checking failure states and verifying that all original fi_info entries are preserved in the result list. Tests do not verify reordering correctness - the goal is to ensure that in the worst case, users receive a different permutation of results rather than missing or corrupted data. Signed-off-by: Gad Arbel <gad.arbel@intel.com>
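The "same entries, possibly in a different order" property that these tests target can be checked with a small helper like the one below. It is a sketch, not the test code from the PR: struct info_node stands in for struct fi_info, the helper name is hypothetical, and it assumes the test saved the original node pointers (all distinct) in an array before the affinity policy ran.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct fi_info; only the link field is needed here. */
struct info_node {
	struct info_node *next;
};

/*
 * Returns true if the list starting at 'result' contains exactly the
 * nodes saved in 'orig', in any order. Assumes the pointers in 'orig'
 * are distinct.
 */
static bool is_permutation(struct info_node *const *orig, size_t orig_count,
			   const struct info_node *result)
{
	const struct info_node *cur;
	size_t result_count = 0, i, matches;

	for (cur = result; cur; cur = cur->next) {
		/* Too many entries; this also stops on an accidental cycle. */
		if (++result_count > orig_count)
			return false;

		matches = 0;
		for (i = 0; i < orig_count; i++)
			if (orig[i] == cur)
				matches++;
		if (matches != 1)
			return false;
	}
	return result_count == orig_count;
}
```

The check is quadratic in the list length, which should be acceptable for the handful of entries fi_getinfo() typically returns in a test.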
84fbcdb to 9e2efa6