[v2.4.x] prov/efa: use dmabuf for CUDA MR registration#12216

Open
sara4dev wants to merge 1 commit into ofiwg:v2.4.x from sara4dev:fix-libfabric-bug-12019-in-v2.4.x-branch

@sara4dev sara4dev commented May 5, 2026

Summary

  • include CUDA in the EFA provider's implicit dmabuf MR registration path
  • use the existing ofi_hmem_get_dmabuf_fd() and ibv_reg_dmabuf_mr() flow for FI_HMEM_CUDA
  • remove the stale TODO that said CUDA still needed this fallback

Root Cause

CUDA HMEM registrations without FI_MR_DMABUF were not included in the EFA provider's dmabuf path, unlike Neuron and ROCr. That made CUDA memory fall through to plain ibv_reg_mr() with a GPU virtual address, which can fail with EFAULT.

References #12019.
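
The path selection described under Root Cause can be sketched as a small decision function. This is only an illustrative model, not the code in prov/efa/src/efa_mr.c: every `sketch_*` name is hypothetical, the `cuda_included` flag exists solely to contrast behavior before and after this change, and the sketch assumes that a registration with FI_MR_DMABUF set already carries a caller-supplied dmabuf fd. It also ignores any fallback the provider may perform when dmabuf export is unavailable.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the HMEM interfaces named in this PR. */
enum sketch_iface {
    SKETCH_IFACE_SYSTEM,
    SKETCH_IFACE_CUDA,
    SKETCH_IFACE_ROCR,
    SKETCH_IFACE_NEURON
};

/* Which registration flow the request ends up in. */
enum sketch_reg_path {
    SKETCH_PATH_REG_MR,        /* plain ibv_reg_mr() on a virtual address */
    SKETCH_PATH_REG_DMABUF_MR  /* ofi_hmem_get_dmabuf_fd() + ibv_reg_dmabuf_mr() */
};

/*
 * Returns true when the provider would export a dmabuf fd on its own
 * (the "implicit" dmabuf path). cuda_included models the state before
 * (false) and after (true) this change.
 */
static bool sketch_implicit_dmabuf(enum sketch_iface iface, bool cuda_included)
{
    switch (iface) {
    case SKETCH_IFACE_NEURON:
    case SKETCH_IFACE_ROCR:
        return true;
    case SKETCH_IFACE_CUDA:
        return cuda_included;
    default:
        return false;
    }
}

/* Picks the registration flow for one request. */
static enum sketch_reg_path sketch_pick_path(enum sketch_iface iface,
                                             bool fi_mr_dmabuf_set,
                                             bool cuda_included)
{
    /* Assumption: with FI_MR_DMABUF the caller already supplies a dmabuf fd. */
    if (fi_mr_dmabuf_set || sketch_implicit_dmabuf(iface, cuda_included))
        return SKETCH_PATH_REG_DMABUF_MR;
    return SKETCH_PATH_REG_MR;
}
```

In this model, `sketch_pick_path(SKETCH_IFACE_CUDA, false, false)` selects the plain `ibv_reg_mr()` flow, which is the case reported to fail with EFAULT, while the same call with `cuda_included` set to true selects the dmabuf flow already used for ROCr and Neuron.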

Validation

Tested on a GB200 cluster.

@sara4dev sara4dev marked this pull request as ready for review May 5, 2026 06:51
a-szegel (Contributor) commented May 5, 2026

Can you please sign your commit? git commit --amend -s

Comment thread prov/efa/src/efa_mr.c
shijin-aws (Contributor) commented May 7, 2026

@sara4dev These fixes are already in v2.5.x as part of c9e3c0c. Would you mind trying it? Or would you prefer to stay on v2.4.x?
