Add NIXL backend #6016

Open
x41lakazam wants to merge 45 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions Bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)

  • nixl.h: Define NixlBackend and NixlTransferHandle classes (+221/-0)
  • nixl.cpp: Implement NIXL backend with UCX for GPU transfers (+474/-0)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.h: Add nixl_available_ flag and backend check (+7/-1)
  • communicator.cpp: Initialize nixl_available_ and add NIXL case to output (+9/-1)

Configuration changes (2 files)

  • CMakeLists.txt: Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration (+39/-0)
  • utils.py: Add build_with_nixl config and cmake flag (+4/-0)

Documentation (1 file)

  • setup.py: Document NVFUSER_BUILD_WITH_NIXL build option (+3/-0)

Tests (1 file)

  • test_multidevice_nixl.cpp: Add tests for NIXL backend functionality (+289/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at line 385-399 uses a busy-wait spin loop to poll transfer status.
This can cause high CPU usage and may not be optimal for latency-sensitive workloads.
Consider if this should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
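The spin loop above can be softened without changing the API. Below is a minimal, self-contained sketch (the names `XferStatus` and `wait_with_backoff` are hypothetical stand-ins mirroring `NixlXferStatus`/`getTransferStatus`) of a poll loop that yields first, then backs off exponentially:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

enum class XferStatus { kDone, kInProgress, kError };

// Poll get_status until the transfer leaves kInProgress. Yield for the first
// few iterations (cheap for short transfers), then sleep with exponential
// backoff capped at max_sleep so long transfers do not burn a full core.
template <typename StatusFn>
XferStatus wait_with_backoff(
    StatusFn get_status,
    std::chrono::microseconds max_sleep = std::chrono::microseconds(100)) {
  int polls = 0;
  std::chrono::microseconds sleep{1};
  for (;;) {
    XferStatus s = get_status();
    if (s != XferStatus::kInProgress) {
      return s;
    }
    if (++polls < 64) {
      std::this_thread::yield();
    } else {
      std::this_thread::sleep_for(sleep);
      sleep = std::min(sleep * 2, max_sleep);
    }
  }
}
```

The yield-then-sleep shape keeps latency low for fast completions while capping CPU use on slow ones; the thresholds are illustrative, not tuned.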
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider if there's a more efficient approach for metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
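One possible follow-up to the scalability concern is to make the eager all-peers loop lazy: load a peer's metadata only on first contact, so ranks that never communicate pairwise never pay for each other's blobs. A minimal host-only sketch, where the `fetch_md` callback stands in for the `store->get` + `agent_->loadRemoteMD` pair (all names here are hypothetical):

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Cache which peers' metadata has already been loaded, fetching each peer's
// blob at most once, on first use, instead of eagerly for every rank.
class LazyPeerMetadata {
 public:
  explicit LazyPeerMetadata(std::function<void(int64_t)> fetch_md)
      : fetch_md_(std::move(fetch_md)) {}

  // Ensure metadata for `rank` is loaded; a no-op after the first call.
  void ensureLoaded(int64_t rank) {
    if (loaded_.insert(rank).second) {
      fetch_md_(rank);  // one store->get + loadRemoteMD per distinct peer
    }
  }

  int64_t numLoaded() const { return static_cast<int64_t>(loaded_.size()); }

 private:
  std::function<void(int64_t)> fetch_md_;
  std::set<int64_t> loaded_;
};
```

This keeps the TCPStore publish step as-is but trades the O(world_size) startup loop for a small per-peer cost on the first transfer; key deletion would then need a different lifetime policy than the barrier shown above.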
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}
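On the logging suggestion: a small helper could make the silent `return nullptr` paths diagnosable. A hypothetical sketch (the helper name and message text are illustrative; nvfuser's own warning macro would be the natural sink for the string):

```cpp
#include <string>

// Build a diagnostic for a failed VRAM probe so a UCX build without CUDA
// support shows up in logs instead of silently disabling the backend.
// `stage` names the failing call (e.g. "registerMem" or "prepXferDlist").
inline std::string probe_failure_message(const std::string& stage, int status) {
  return "[NIXL] VRAM probe failed at " + stage + " (status " +
         std::to_string(status) + "); marking NIXL backend unavailable";
}
```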

Collaborator

@samnordmann samnordmann left a comment


Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

Comment thread csrc/multidevice/communicator.h Outdated
Comment thread csrc/multidevice/nixl.h
Comment thread csrc/multidevice/nixl.h Outdated
Comment thread csrc/multidevice/nixl.h
Comment thread csrc/multidevice/nixl.h Outdated
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
Collaborator


This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make it non-blocking for the CPU -- and ideally CUDA-graph capturable?

Collaborator Author


Definitely important, I suggest we leave it for another PR, to keep this one simple and not too big
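For that follow-up PR, one host-side direction is to move the poll off the calling thread so the caller blocks in the kernel rather than spinning. A CUDA-graph-capturable variant would instead have to complete through a device-visible flag (e.g. `cuStreamWaitValue32`), which this host-only sketch deliberately does not attempt; all names below are hypothetical:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

enum class XferStatus { kDone, kInProgress, kError };

// Run the status poll on a background thread and return a std::future the
// caller can wait on; the waiting thread then sleeps in the kernel instead
// of busy-spinning on the equivalent of getXferStatus.
inline std::future<XferStatus> wait_async(
    std::function<XferStatus()> get_status) {
  return std::async(std::launch::async, [fn = std::move(get_status)] {
    XferStatus s;
    while ((s = fn()) == XferStatus::kInProgress) {
      std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return s;
  });
}
```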

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^
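One minimal fix for these `-Wunused-private-field` errors in the stub (`USE_NIXL`-off) build is to mark the members `[[maybe_unused]]`; another option is to guard them with `#ifdef USE_NIXL`. A sketch of the first approach (`Communicator` here is a stand-in struct, not the real class):

```cpp
struct Communicator {};  // stand-in for the real class

class Impl {
 public:
  explicit Impl(Communicator& c) : communicator_(c) {}

 private:
  // Kept in the stub build but only read when USE_NIXL is defined;
  // [[maybe_unused]] silences clang's -Wunused-private-field.
  [[maybe_unused]] Communicator& communicator_;
  [[maybe_unused]] bool metadata_exchanged_ = false;
};
```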

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Contributor

greptile-apps Bot commented Mar 2, 2026

Greptile Summary

This PR introduces NixlBackend, a new one-sided RDMA transfer backend for nvfuser's multidevice path using the NIXL library over UCX. It adds CMake detection, Python dependency tooling, a C++ singleton implementation with memory registration and transfer lifecycle management, and end-to-end tests.

  • Core implementation (nixl.cpp / nixl.h): singleton NixlBackend wraps a nixlAgent + UCX backend, with collective registerTensors/deregisterTensors that exchange agent metadata via TCPStore, and a prepare/post/wait transfer API with a pimpl pattern for #ifdef USE_NIXL gating.
  • CMake wiring (CMakeLists.txt): adds nixl.cpp to sources and the handle_nixl.cmake dependency script, but is missing target_link_libraries(codegen_internal PRIVATE __nvfuser_nixl) and target_compile_definitions(codegen_internal PRIVATE USE_NIXL) — so USE_NIXL is never defined and the entire feature compiles as stubs even when NVFUSER_BUILD_WITH_NIXL=ON.
  • Install tooling (tools/install-nixl.sh): supports both pip-wheel and from-source (UCX+CUDA) installation modes with auto-detection.

Confidence Score: 3/5

Not safe to merge — the CMake wiring to define USE_NIXL and link __nvfuser_nixl into codegen_internal is absent, making the entire feature compile as stubs regardless of the build flag.

The missing CMake link/compile-definition block means USE_NIXL is never defined, so all #ifdef USE_NIXL blocks compile as empty stubs. Additionally, unresolved error-handling gaps around collective barriers (loadRemoteMD failure deadlock, getLocalMD failure leaving peers in store->get) and the null-TCPStore crash path remain from earlier review rounds.

CMakeLists.txt needs the target_link_libraries and target_compile_definitions blocks; csrc/multidevice/nixl.cpp needs attention for the remaining error-handling and barrier-deadlock paths
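Under the names used in this PR (`codegen_internal`, `__nvfuser_nixl`, `NIXL_FOUND`, `NVFUSER_BUILD_WITH_NIXL`), the missing wiring the summary describes would look roughly like the sketch below; the exact guard condition is an assumption:

```cmake
if(NVFUSER_BUILD_WITH_NIXL AND NIXL_FOUND)
  target_link_libraries(codegen_internal PRIVATE __nvfuser_nixl)
  target_compile_definitions(codegen_internal PRIVATE USE_NIXL)
endif()
```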

Important Files Changed

Filename Overview
csrc/multidevice/nixl.cpp Core NIXL backend implementation: agent creation, memory registration, metadata exchange, and transfer lifecycle; several robustness issues remain open (null TCPStore, posted flag not reset on error, waitTransfer spin with no timeout)
csrc/multidevice/nixl.h Public API header with TensorDesc, serialization helpers, and NixlBackend/NixlTransferHandle declarations; design and documentation are clear
CMakeLists.txt Adds NIXL source and option but is missing target_link_libraries and target_compile_definitions(USE_NIXL) for codegen_internal, so the feature is dead even when NVFUSER_BUILD_WITH_NIXL=ON
cmake/deps/handle_nixl.cmake Correct find_path/find_library logic and interface target creation; the CUDA version check via nixl._pkg.name is fragile (private attribute, silently skipped)
csrc/multidevice/communicator.cpp Sets nixl_available_ based solely on USE_NIXL compile-time flag, not actual runtime availability from NixlBackend::isAvailable()
tests/cpp/test_multidevice_nixl.cpp Good test coverage for validation, end-to-end read/write transfers, and round-trip registration; all tests skip gracefully when NIXL is unavailable
tools/install-nixl.sh Install script supporting pip and from-source (UCX+CUDA) modes with auto-detection; logic is clear and functional

Sequence Diagram

sequenceDiagram
    participant App
    participant NixlBackend
    participant NixlImpl as NixlBackend::Impl
    participant Agent as nixlAgent (UCX)
    participant Store as TCPStore

    App->>NixlBackend: getInstance()
    NixlBackend->>NixlImpl: Impl::create(communicator)
    NixlImpl->>Agent: createBackend(UCX)
    NixlImpl->>Agent: registerMem(probe) + prepXferDlist (VRAM probe)
    NixlImpl-->>NixlBackend: impl_ set (or nullptr if probe fails)

    App->>NixlBackend: registerTensors
    NixlBackend->>NixlImpl: registerTensors
    NixlImpl->>Agent: registerMem(dlist)
    NixlImpl->>NixlImpl: exchangeMetadata()
    NixlImpl->>Agent: getLocalMD()
    NixlImpl->>Store: set(nixl_agent_md_rank_N, local_md)
    loop for each peer rank
        NixlImpl->>Store: get(nixl_agent_md_rank_peer)
        NixlImpl->>Agent: loadRemoteMD(remote_md)
    end
    NixlImpl->>Store: barrier() then deleteKey

    App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>NixlImpl: prepareTransfer
    NixlImpl->>Agent: createXferReq(op, local_dlist, remote_dlist, agent_name)
    NixlImpl-->>App: NixlTransferHandle

    App->>NixlBackend: postTransfer(handle)
    NixlImpl->>Agent: postXferReq(handle)

    App->>NixlBackend: waitTransfer(handle)
    loop poll until done
        NixlImpl->>Agent: getXferStatus(handle)
    end
    NixlImpl-->>App: transfer complete

Reviews (20). Last reviewed commit: "Merge branch 'main' into dispatch_combin..."

Contributor

@greptile-apps greptile-apps Bot left a comment


9 files reviewed, 1 comment


Comment thread csrc/multidevice/nixl.cpp Outdated
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/nixl.cpp
Comment thread csrc/multidevice/communicator.h Outdated
@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

Collaborator

@samnordmann samnordmann left a comment


LGTM! Thank you

@samnordmann
Collaborator

!test

@samnordmann samnordmann requested a review from wujingyue March 23, 2026 13:55
Comment thread CMakeLists.txt
Comment on lines +1308 to +1312
message(STATUS " NIXL_FOUND : ${NIXL_FOUND}")
if(NIXL_FOUND)
message(STATUS " NIXL_INCLUDE_DIR: ${NIXL_INCLUDE_DIR}")
message(STATUS " NIXL_LIBRARY : ${NIXL_LIBRARY}")
endif()
Collaborator


Please report this in cmake/deps/handle_nixl.cmake

Collaborator

@wujingyue wujingyue left a comment


Thanks for adding the new backend! I'll do another round after my initial comments are addressed. I'll defer the build system changes to @mdavis36

Comment thread csrc/multidevice/nixl.h Outdated
struct TensorDesc {
uintptr_t addr;
size_t size;
uint32_t dev; // CUDA device index (tensor.device().index())
Collaborator


Suggested change
- uint32_t dev; // CUDA device index (tensor.device().index())
+ uint32_t local_rank; // CUDA device index (tensor.device().index())

Comment thread csrc/multidevice/nixl.h Outdated
// Helper functions for serializing and deserializing tensors descriptors for
// TCP store
struct TensorDesc {
uintptr_t addr;
Collaborator


Suggested change
- uintptr_t addr;
+ void* addr;

unless we do a lot of pointer arithmetics on this. I haven't seen that just yet in this PR.

Comment thread csrc/multidevice/nixl.h Outdated
// TCP store
struct TensorDesc {
uintptr_t addr;
size_t size;
Collaborator


Suggested change
- size_t size;
+ int64_t size;

https://google.github.io/styleguide/cppguide.html#Integer_Types => "On Unsigned Integers"

Comment thread csrc/multidevice/nixl.h Outdated
Comment on lines +199 to +200
const std::vector<TensorDesc>& local_descs,
const std::vector<TensorDesc>& remote_descs,
Collaborator


Can you code-comment what these arguments mean?

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test
