Add NIXL backend #6016

Open
x41lakazam wants to merge 45 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions Bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)

  • nixl.h: Define NixlBackend and NixlTransferHandle classes (+221/-0)
  • nixl.cpp: Implement NIXL backend with UCX for GPU transfers (+474/-0)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.h: Add nixl_available_ flag and backend check (+7/-1)
  • communicator.cpp: Initialize nixl_available_ and add NIXL case to output (+9/-1)

Configuration changes (2 files)

  • CMakeLists.txt: Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration (+39/-0)
  • utils.py: Add build_with_nixl config and cmake flag (+4/-0)

Documentation (1 file)

  • setup.py: Document NVFUSER_BUILD_WITH_NIXL build option (+3/-0)

Tests (1 file)

  • test_multidevice_nixl.cpp: Add tests for NIXL backend functionality (+289/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at line 385-399 uses a busy-wait spin loop to poll transfer status.
This can cause high CPU usage and may not be optimal for latency-sensitive workloads.
Consider if this should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
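The spin loop above can be softened without changing the API. Below is a minimal, self-contained sketch (the names `XferStatus` and `wait_with_backoff` are hypothetical stand-ins mirroring `NixlXferStatus`/`getTransferStatus`) of a poll loop that yields first, then backs off exponentially:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

enum class XferStatus { kDone, kInProgress, kError };

// Poll get_status until the transfer leaves kInProgress. Yield for the first
// few iterations (cheap for short transfers), then sleep with exponential
// backoff capped at max_sleep so long transfers do not burn a full core.
template <typename StatusFn>
XferStatus wait_with_backoff(
    StatusFn get_status,
    std::chrono::microseconds max_sleep = std::chrono::microseconds(100)) {
  int polls = 0;
  std::chrono::microseconds sleep{1};
  for (;;) {
    XferStatus s = get_status();
    if (s != XferStatus::kInProgress) {
      return s;
    }
    if (++polls < 64) {
      std::this_thread::yield();
    } else {
      std::this_thread::sleep_for(sleep);
      sleep = std::min(sleep * 2, max_sleep);
    }
  }
}
```

The yield-then-sleep shape keeps latency low for fast completions while capping CPU use on slow ones; the thresholds are illustrative, not tuned.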
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider if there's a more efficient approach for metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
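One possible follow-up to the scalability concern is to make the eager all-peers loop lazy: load a peer's metadata only on first contact, so ranks that never communicate pairwise never pay for each other's blobs. A minimal host-only sketch, where the `fetch_md` callback stands in for the `store->get` + `agent_->loadRemoteMD` pair (all names here are hypothetical):

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Cache which peers' metadata has already been loaded, fetching each peer's
// blob at most once, on first use, instead of eagerly for every rank.
class LazyPeerMetadata {
 public:
  explicit LazyPeerMetadata(std::function<void(int64_t)> fetch_md)
      : fetch_md_(std::move(fetch_md)) {}

  // Ensure metadata for `rank` is loaded; a no-op after the first call.
  void ensureLoaded(int64_t rank) {
    if (loaded_.insert(rank).second) {
      fetch_md_(rank);  // one store->get + loadRemoteMD per distinct peer
    }
  }

  int64_t numLoaded() const { return static_cast<int64_t>(loaded_.size()); }

 private:
  std::function<void(int64_t)> fetch_md_;
  std::set<int64_t> loaded_;
};
```

This keeps the TCPStore publish step as-is but trades the O(world_size) startup loop for a small per-peer cost on the first transfer; key deletion would then need a different lifetime policy than the barrier shown above.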
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}
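On the logging suggestion: a small helper could make the silent `return nullptr` paths diagnosable. A hypothetical sketch (the helper name and message text are illustrative; nvfuser's own warning macro would be the natural sink for the string):

```cpp
#include <string>

// Build a diagnostic for a failed VRAM probe so a UCX build without CUDA
// support shows up in logs instead of silently disabling the backend.
// `stage` names the failing call (e.g. "registerMem" or "prepXferDlist").
inline std::string probe_failure_message(const std::string& stage, int status) {
  return "[NIXL] VRAM probe failed at " + stage + " (status " +
         std::to_string(status) + "); marking NIXL backend unavailable";
}
```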

Collaborator

@samnordmann samnordmann left a comment


Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

Comment thread csrc/multidevice/communicator.h Outdated
Comment thread csrc/multidevice/nixl.h
Comment thread csrc/multidevice/nixl.h Outdated
Comment thread csrc/multidevice/nixl.h
Comment thread csrc/multidevice/nixl.h Outdated
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
Collaborator


This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make it non-blocking for the CPU -- and ideally CUDA-graph capturable?

Collaborator Author


Definitely important, I suggest we leave it for another PR, to keep this one simple and not too big
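For that follow-up PR, one host-side direction is to move the poll off the calling thread so the caller blocks in the kernel rather than spinning. A CUDA-graph-capturable variant would instead have to complete through a device-visible flag (e.g. `cuStreamWaitValue32`), which this host-only sketch deliberately does not attempt; all names below are hypothetical:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

enum class XferStatus { kDone, kInProgress, kError };

// Run the status poll on a background thread and return a std::future the
// caller can wait on; the waiting thread then sleeps in the kernel instead
// of busy-spinning on the equivalent of getXferStatus.
inline std::future<XferStatus> wait_async(
    std::function<XferStatus()> get_status) {
  return std::async(std::launch::async, [fn = std::move(get_status)] {
    XferStatus s;
    while ((s = fn()) == XferStatus::kInProgress) {
      std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return s;
  });
}
```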

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^
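One minimal fix for these `-Wunused-private-field` errors in the stub (`USE_NIXL`-off) build is to mark the members `[[maybe_unused]]`; another option is to guard them with `#ifdef USE_NIXL`. A sketch of the first approach (`Communicator` here is a stand-in struct, not the real class):

```cpp
struct Communicator {};  // stand-in for the real class

class Impl {
 public:
  explicit Impl(Communicator& c) : communicator_(c) {}

 private:
  // Kept in the stub build but only read when USE_NIXL is defined;
  // [[maybe_unused]] silences clang's -Wunused-private-field.
  [[maybe_unused]] Communicator& communicator_;
  [[maybe_unused]] bool metadata_exchanged_ = false;
};
```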

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Contributor

greptile-apps Bot commented Mar 2, 2026

Greptile Summary

This PR introduces NixlBackend, a new one-sided RDMA transfer backend for nvfuser's multidevice path using the NIXL library over UCX. It adds CMake detection, Python dependency tooling, a C++ singleton implementation with memory registration and transfer lifecycle management, and end-to-end tests.

  • Core implementation (nixl.cpp / nixl.h): singleton NixlBackend wraps a nixlAgent + UCX backend, with collective registerTensors/deregisterTensors that exchange agent metadata via TCPStore, and a prepare/post/wait transfer API with a pimpl pattern for #ifdef USE_NIXL gating.
  • CMake wiring (CMakeLists.txt): adds nixl.cpp to sources and the handle_nixl.cmake dependency script, but is missing target_link_libraries(codegen_internal PRIVATE __nvfuser_nixl) and target_compile_definitions(codegen_internal PRIVATE USE_NIXL) — so USE_NIXL is never defined and the entire feature compiles as stubs even when NVFUSER_BUILD_WITH_NIXL=ON.
  • Install tooling (tools/install-nixl.sh): supports both pip-wheel and from-source (UCX+CUDA) installation modes with auto-detection.

Confidence Score: 3/5

Not safe to merge — the CMake wiring to define USE_NIXL and link __nvfuser_nixl into codegen_internal is absent, making the entire feature compile as stubs regardless of the build flag.

The missing CMake link/compile-definition block means USE_NIXL is never defined, so all #ifdef USE_NIXL blocks compile as empty stubs. Additionally, unresolved error-handling gaps around collective barriers (loadRemoteMD failure deadlock, getLocalMD failure leaving peers in store->get) and the null-TCPStore crash path remain from earlier review rounds.

CMakeLists.txt needs the target_link_libraries and target_compile_definitions blocks; csrc/multidevice/nixl.cpp needs attention for the remaining error-handling and barrier-deadlock paths
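Under the names used in this PR (`codegen_internal`, `__nvfuser_nixl`, `NIXL_FOUND`, `NVFUSER_BUILD_WITH_NIXL`), the missing wiring the summary describes would look roughly like the sketch below; the exact guard condition is an assumption:

```cmake
if(NVFUSER_BUILD_WITH_NIXL AND NIXL_FOUND)
  target_link_libraries(codegen_internal PRIVATE __nvfuser_nixl)
  target_compile_definitions(codegen_internal PRIVATE USE_NIXL)
endif()
```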

Important Files Changed

Filename Overview
csrc/multidevice/nixl.cpp Core NIXL backend implementation: agent creation, memory registration, metadata exchange, and transfer lifecycle; several robustness issues remain open (null TCPStore, posted flag not reset on error, waitTransfer spin with no timeout)
csrc/multidevice/nixl.h Public API header with TensorDesc, serialization helpers, and NixlBackend/NixlTransferHandle declarations; design and documentation are clear
CMakeLists.txt Adds NIXL source and option but is missing target_link_libraries and target_compile_definitions(USE_NIXL) for codegen_internal, so the feature is dead even when NVFUSER_BUILD_WITH_NIXL=ON
cmake/deps/handle_nixl.cmake Correct find_path/find_library logic and interface target creation; the CUDA version check via nixl._pkg.name is fragile (private attribute, silently skipped)
csrc/multidevice/communicator.cpp Sets nixl_available_ based solely on USE_NIXL compile-time flag, not actual runtime availability from NixlBackend::isAvailable()
tests/cpp/test_multidevice_nixl.cpp Good test coverage for validation, end-to-end read/write transfers, and round-trip registration; all tests skip gracefully when NIXL is unavailable
tools/install-nixl.sh Install script supporting pip and from-source (UCX+CUDA) modes with auto-detection; logic is clear and functional

Sequence Diagram

sequenceDiagram
    participant App
    participant NixlBackend
    participant NixlImpl as NixlBackend::Impl
    participant Agent as nixlAgent (UCX)
    participant Store as TCPStore

    App->>NixlBackend: getInstance()
    NixlBackend->>NixlImpl: Impl::create(communicator)
    NixlImpl->>Agent: createBackend(UCX)
    NixlImpl->>Agent: registerMem(probe) + prepXferDlist (VRAM probe)
    NixlImpl-->>NixlBackend: impl_ set (or nullptr if probe fails)

    App->>NixlBackend: registerTensors
    NixlBackend->>NixlImpl: registerTensors
    NixlImpl->>Agent: registerMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata()
    NixlImpl->>Agent: getLocalMD()
    NixlImpl->>Store: set(nixl_agent_md_rank_N, local_md)
    loop for each peer rank
        NixlImpl->>Store: get(nixl_agent_md_rank_peer)
        NixlImpl->>Agent: loadRemoteMD(remote_md)
    end
    NixlImpl->>Store: barrier() then deleteKey

    App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>NixlImpl: prepareTransfer
    NixlImpl->>Agent: createXferReq(op, local_dlist, remote_dlist, agent_name)
    NixlImpl-->>App: NixlTransferHandle

    App->>NixlBackend: postTransfer(handle)
    NixlImpl->>Agent: postXferReq(handle)

    App->>NixlBackend: waitTransfer(handle)
    loop poll until done
        NixlImpl->>Agent: getXferStatus(handle)
    end
    NixlImpl-->>App: transfer complete

Reviews (20). Last reviewed commit: "Merge branch 'main' into dispatch_combin..."

Contributor

@greptile-apps greptile-apps Bot left a comment


9 files reviewed, 1 comment


Comment thread csrc/multidevice/nixl.cpp Outdated
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/communicator.h Outdated
@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

Collaborator

@samnordmann samnordmann left a comment


LGTM! Thank you

@samnordmann
Collaborator

!test

@samnordmann samnordmann requested a review from wujingyue March 23, 2026 13:55
Comment thread CMakeLists.txt
Comment on lines +1308 to +1312
message(STATUS " NIXL_FOUND : ${NIXL_FOUND}")
if(NIXL_FOUND)
message(STATUS " NIXL_INCLUDE_DIR: ${NIXL_INCLUDE_DIR}")
message(STATUS " NIXL_LIBRARY : ${NIXL_LIBRARY}")
endif()
Collaborator


Please report this in cmake/deps/handle_nixl.cmake

Collaborator

@wujingyue wujingyue left a comment


Thanks for adding the new backend! I'll do another round after my initial comments are addressed. I'll defer the build system changes to @mdavis36

Comment thread csrc/multidevice/nixl.h Outdated
struct TensorDesc {
uintptr_t addr;
size_t size;
uint32_t dev; // CUDA device index (tensor.device().index())
Collaborator


Suggested change
- uint32_t dev; // CUDA device index (tensor.device().index())
+ uint32_t local_rank; // CUDA device index (tensor.device().index())

Comment thread csrc/multidevice/nixl.h Outdated
// Helper functions for serializing and deserializing tensors descriptors for
// TCP store
struct TensorDesc {
uintptr_t addr;
Collaborator


Suggested change
- uintptr_t addr;
+ void* addr;

unless we do a lot of pointer arithmetics on this. I haven't seen that just yet in this PR.

Comment thread csrc/multidevice/nixl.h Outdated
// TCP store
struct TensorDesc {
uintptr_t addr;
size_t size;
Collaborator


Suggested change
- size_t size;
+ int64_t size;

https://google.github.io/styleguide/cppguide.html#Integer_Types => "On Unsigned Integers"

Comment thread csrc/multidevice/nixl.h Outdated
Comment on lines +199 to +200
const std::vector<TensorDesc>& local_descs,
const std::vector<TensorDesc>& remote_descs,
Collaborator


Can you code-comment what these arguments mean?

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test
