Add NIXL backend #6016
Conversation
Review updated until commit 7283aa8

Description

| Relevant files | |
|---|---|
| Enhancement | 5 files |
| Configuration changes | |
| Documentation | 1 files |
| Tests | 1 files |
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| ⚡ Recommended focus areas for review |
Spin loop performance concern
The polling loop in waitTransfer spins on the CPU until the transfer completes. This can cause high CPU usage and may not be optimal for latency-sensitive workloads. Consider whether it should yield to other threads or use a more efficient waiting mechanism. |
samnordmann
left a comment
Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
This wait function blocks the CPU, so in practice it is more or less unusable in our context. Do you have an idea how to make it non-blocking for the CPU, and ideally CUDA-graph capturable?
Definitely important. I suggest we leave it for another PR, to keep this one simple and not too big.
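For reference, one low-effort mitigation for the CPU cost of the polling loop is to sleep between polls with a capped exponential backoff. The sketch below is only an illustration under stated assumptions: `XferStatus`, `waitWithBackoff`, and the `poll` callback are hypothetical names, not part of this PR or of the NIXL API, and the callback stands in for whatever status query the backend uses. It still blocks the calling thread and does nothing for CUDA-graph capture, so it does not replace the follow-up work discussed above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical status values; the real status type is not shown in this thread.
enum class XferStatus { kInProgress, kDone, kError };

// Calls `poll` until it reports something other than kInProgress.
// Between polls the thread sleeps, doubling the sleep time up to `max_sleep`,
// so a long transfer does not keep one CPU core spinning.
inline XferStatus waitWithBackoff(
    const std::function<XferStatus()>& poll,
    std::chrono::microseconds max_sleep = std::chrono::microseconds(1000)) {
  assert(static_cast<bool>(poll));  // an empty callback cannot be polled
  std::chrono::microseconds sleep(1);
  while (true) {
    const XferStatus status = poll();
    if (status != XferStatus::kInProgress) {
      return status;  // kDone or kError: hand the result to the caller
    }
    std::this_thread::sleep_for(sleep);
    sleep = std::min(sleep * 2, max_sleep);
  }
}
```

The trade-off is added completion latency of up to `max_sleep` per wait, which may or may not be acceptable for latency-sensitive transfers.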
|
@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package |
https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029 |
|
Note the build error in the CI |
Greptile Summary
This PR introduces
Confidence Score: 3/5
Not safe to merge: the CMake wiring to define USE_NIXL and link __nvfuser_nixl into codegen_internal is absent, making the entire feature compile as stubs regardless of the build flag.
The missing CMake link/compile-definition block means USE_NIXL is never defined, so all #ifdef USE_NIXL blocks compile as empty stubs. Additionally, unresolved error-handling gaps around collective barriers (loadRemoteMD failure deadlock, getLocalMD failure leaving peers in store->get) and the null-TCPStore crash path remain from earlier review rounds.
CMakeLists.txt needs the target_link_libraries and target_compile_definitions blocks; csrc/multidevice/nixl.cpp needs attention for the remaining error-handling and barrier-deadlock paths.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant App
participant NixlBackend
participant NixlImpl as NixlBackend::Impl
participant Agent as nixlAgent (UCX)
participant Store as TCPStore
App->>NixlBackend: getInstance()
NixlBackend->>NixlImpl: Impl::create(communicator)
NixlImpl->>Agent: createBackend(UCX)
NixlImpl->>Agent: registerMem(probe) + prepXferDlist (VRAM probe)
NixlImpl-->>NixlBackend: impl_ set (or nullptr if probe fails)
App->>NixlBackend: registerTensors
NixlBackend->>NixlImpl: registerTensors
NixlImpl->>Agent: registerMem(dlist)
NixlImpl->>NixlImpl: exchangeMetadata()
NixlImpl->>Agent: getLocalMD()
NixlImpl->>Store: set(nixl_agent_md_rank_N, local_md)
loop for each peer rank
NixlImpl->>Store: get(nixl_agent_md_rank_peer)
NixlImpl->>Agent: loadRemoteMD(remote_md)
end
NixlImpl->>Store: barrier() then deleteKey
App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
NixlBackend->>NixlImpl: prepareTransfer
NixlImpl->>Agent: createXferReq(op, local_dlist, remote_dlist, agent_name)
NixlImpl-->>App: NixlTransferHandle
App->>NixlBackend: postTransfer(handle)
NixlImpl->>Agent: postXferReq(handle)
App->>NixlBackend: waitTransfer(handle)
loop poll until done
NixlImpl->>Agent: getXferStatus(handle)
end
NixlImpl-->>App: transfer complete
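The metadata-exchange portion of the diagram (publish local metadata, then read every peer's metadata) can be sketched in isolation. This is a minimal single-process illustration, not the PR's code: `FakeStore`, `mdKey`, `publishLocalMD`, and `fetchPeerMD` are hypothetical names, the store is an in-memory map rather than a real TCPStore, and it assumes a rank skips its own entry. Only the key prefix `nixl_agent_md_rank_` is taken from the diagram.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a key-value store such as TCPStore.
// A real store's get() would block until the key is published by its owner.
class FakeStore {
 public:
  void set(const std::string& key, const std::string& value) {
    data_[key] = value;
  }
  const std::string& get(const std::string& key) const {
    return data_.at(key);  // throws std::out_of_range if the key is missing
  }
  bool deleteKey(const std::string& key) { return data_.erase(key) > 0; }
  std::size_t size() const { return data_.size(); }

 private:
  std::map<std::string, std::string> data_;
};

// Key naming follows the diagram: nixl_agent_md_rank_<rank>.
inline std::string mdKey(int rank) {
  return "nixl_agent_md_rank_" + std::to_string(rank);
}

// Step 1: each rank publishes its own serialized agent metadata.
inline void publishLocalMD(FakeStore& store, int rank,
                           const std::string& local_md) {
  store.set(mdKey(rank), local_md);
}

// Step 2: each rank reads the metadata of every other rank, in rank order.
inline std::vector<std::string> fetchPeerMD(const FakeStore& store, int rank,
                                            int world_size) {
  std::vector<std::string> peers;
  for (int peer = 0; peer < world_size; ++peer) {
    if (peer == rank) {
      continue;  // assumption: a rank does not load its own metadata
    }
    peers.push_back(store.get(mdKey(peer)));
  }
  return peers;
}
```

In the diagram the keys are deleted only after a barrier. That ordering matters: a rank that deletes its key before every peer has read it would leave those peers waiting in `get`.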
Reviews (20): Last reviewed commit: "Merge branch 'main' into dispatch_combin..."
|
!test |
@xwang233 could you please give @x41lakazam permission to launch CI? Or indicate how to do it? |
|
!test |
|
!test |
|
!build |
|
!test |
|
!test |
|
!test |
|
!test |
|
!test |
message(STATUS " NIXL_FOUND : ${NIXL_FOUND}")
if(NIXL_FOUND)
  message(STATUS " NIXL_INCLUDE_DIR: ${NIXL_INCLUDE_DIR}")
  message(STATUS " NIXL_LIBRARY : ${NIXL_LIBRARY}")
endif()
Please report this in cmake/deps/handle_nixl.cmake
struct TensorDesc {
  uintptr_t addr;
  size_t size;
  uint32_t dev; // CUDA device index (tensor.device().index())
Suggested change:
- uint32_t dev; // CUDA device index (tensor.device().index())
+ uint32_t local_rank; // CUDA device index (tensor.device().index())
// Helper functions for serializing and deserializing tensors descriptors for
// TCP store
struct TensorDesc {
  uintptr_t addr;
Suggested change:
- uintptr_t addr;
+ void* addr;
unless we do a lot of pointer arithmetic on this. I haven't seen that yet in this PR.
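For context on what the serialization helpers mentioned in the code comment might involve, here is a minimal byte-copy round trip for a vector of descriptors. It is a sketch, not the PR's implementation: `serializeDescs` and `deserializeDescs` are hypothetical names, the struct mirrors the fields shown in the snippet under review, and the approach assumes every rank uses the same ABI (field sizes, padding, endianness).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Mirrors the fields shown in the reviewed snippet; illustration only.
struct TensorDesc {
  std::uintptr_t addr;
  std::size_t size;
  std::uint32_t dev;
};

// A raw byte copy is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<TensorDesc>::value,
              "TensorDesc must be trivially copyable for memcpy");

// Packs the descriptors into a flat byte buffer suitable for a byte-oriented
// key-value store.
inline std::vector<std::uint8_t> serializeDescs(
    const std::vector<TensorDesc>& descs) {
  std::vector<std::uint8_t> bytes(descs.size() * sizeof(TensorDesc));
  if (!descs.empty()) {
    std::memcpy(bytes.data(), descs.data(), bytes.size());
  }
  return bytes;
}

// Inverse of serializeDescs. The buffer length must be a whole number of
// descriptors.
inline std::vector<TensorDesc> deserializeDescs(
    const std::vector<std::uint8_t>& bytes) {
  assert(bytes.size() % sizeof(TensorDesc) == 0);
  std::vector<TensorDesc> descs(bytes.size() / sizeof(TensorDesc));
  if (!descs.empty()) {
    std::memcpy(descs.data(), bytes.data(), bytes.size());
  }
  return descs;
}
```

With this kind of byte copy, the address travels as an opaque value either way, since a receiving rank cannot dereference it directly.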
// TCP store
struct TensorDesc {
  uintptr_t addr;
  size_t size;
Suggested change:
- size_t size;
+ int64_t size;
https://google.github.io/styleguide/cppguide.html#Integer_Types => "On Unsigned Integers"
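The concern behind the linked guideline can be shown with a small subtraction example: unsigned arithmetic wraps around silently, while the signed version yields a negative value that is easy to detect. The function names below are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Unsigned subtraction: if used > total, the result wraps around to a very
// large positive value instead of going negative.
inline std::size_t remainingUnsigned(std::size_t total, std::size_t used) {
  return total - used;
}

// Signed subtraction: the same inputs give a negative result, which a simple
// `< 0` check can catch.
inline std::int64_t remainingSigned(std::int64_t total, std::int64_t used) {
  return total - used;
}
```

For example, `remainingSigned(4, 8)` is `-4`, whereas `remainingUnsigned(4, 8)` is a huge positive number that would pass a naive bounds check.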
const std::vector<TensorDesc>& local_descs,
const std::vector<TensorDesc>& remote_descs,
Can you add code comments explaining what these arguments mean?
|
!build |
|
!test |