fix: distinguish TB4 from TB5 for Thunderbolt connections #1813

Open
dogukanveziroglu wants to merge 3 commits into exo-explore:main from dogukanveziroglu:fix/tb4-socket-connections

Conversation

@dogukanveziroglu

Summary

  • TB4 devices (≤40 Gb/s) don't support RDMA, but macOS Tahoe creates rdma_enX devices for all Thunderbolt ports. This caused all Thunderbolt connections to be registered as RDMAConnection, breaking placement for TB4 clusters.
  • Parse linkSpeed to distinguish TB4 from TB5: TB5 (>40 Gb/s) gets RDMAConnection, TB4 gets SocketConnection via Thunderbolt Bridge IP
  • Dashboard "RDMA NOT ENABLED" warning now only shows for actual TB5 hardware

Test plan

  • Tested on 2x Mac Mini M4 (TB4) cluster — placement now works correctly with MlxRing
  • TB5 behavior unchanged (RDMAConnection created as before)
  • Verify on TB5 hardware

Fixes #1636

@AlexCheema
Contributor

What do you mean when you say "breaking placement for TB4 clusters"?
Ring should not use RDMA, so I don't see why it would break anything.

@dogukanveziroglu
Author

> What do you mean when you say "breaking placement for TB4 clusters". Ring should not use RDMA, so I don't see why it would break anything?

You're right, Ring doesn't use RDMA, so the apply.py change was unnecessary. My bad: the RDMA warning on the dashboard confused me into thinking the connection type was the issue, but the real cause was ping discovery failing in my environment (macOS LNP blocking check_reachable in SSH sessions).

I'll remove the apply.py change and keep only the dashboard warning fix.

TB4 devices (≤40 Gb/s) don't support RDMA, but macOS Tahoe creates
rdma_enX devices for all Thunderbolt ports regardless. This caused Exo
to register all Thunderbolt connections as RDMAConnection, breaking
placement for TB4 clusters since the ring backend requires
SocketConnection edges.

Backend (apply.py):
- Parse link_speed from ThunderboltIdentifier to detect TB4 vs TB5
- TB5 (>40 Gb/s): create RDMAConnection as before
- TB4 (≤40 Gb/s): create SocketConnection via Thunderbolt Bridge IP
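
The backend rule in this commit can be sketched as follows. This is a minimal illustration, not the code from apply.py: the function names, the string labels `"rdma"` and `"socket"`, and the assumed `link_speed` text format (for example `"Up to 40 Gb/s"`) are all hypothetical. Falling back to a socket connection when the speed cannot be parsed is also an assumption.

```python
import re
from typing import Optional

# Threshold taken from the commit message: TB4 is <= 40 Gb/s, TB5 is above it.
TB4_MAX_GBPS = 40.0

# Assumed format: a number followed by "Gb/s", e.g. "Up to 40 Gb/s" or "80 Gb/s".
_SPEED_RE = re.compile(r"(\d+(?:\.\d+)?)\s*Gb/s", re.IGNORECASE)


def parse_link_speed_gbps(link_speed: str) -> Optional[float]:
    """Return the link speed in Gb/s, or None if the string has no speed."""
    match = _SPEED_RE.search(link_speed)
    return float(match.group(1)) if match else None


def connection_kind(link_speed: str) -> str:
    """Pick "rdma" for TB5-class links and "socket" for everything else."""
    gbps = parse_link_speed_gbps(link_speed)
    if gbps is not None and gbps > TB4_MAX_GBPS:
        return "rdma"
    # TB4 or unknown speed: assumed to fall back to a socket connection.
    return "socket"
```

Note that a link reporting exactly 40 Gb/s is treated as TB4, matching the `≤40 Gb/s` wording above.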

Dashboard (+page.svelte):
- Only show "RDMA NOT ENABLED" warning for actual TB5 hardware
- Filter by linkSpeed > 40 Gb/s instead of any TB interface presence
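
The dashboard itself is a Svelte page, so the following Python is only a sketch of the filter logic, written in the backend's language for illustration. The `linkSpeed` key comes from the description above; the dictionary shape, the `rdma_enabled` flag, and both function names are hypothetical.

```python
import re
from typing import Iterable, Mapping

# Assumed format: a number followed by "Gb/s", e.g. "Up to 80 Gb/s".
_SPEED_RE = re.compile(r"(\d+(?:\.\d+)?)\s*Gb/s", re.IGNORECASE)


def is_tb5(interface: Mapping[str, str]) -> bool:
    """True when the interface reports a link speed above 40 Gb/s."""
    match = _SPEED_RE.search(interface.get("linkSpeed", ""))
    return match is not None and float(match.group(1)) > 40.0


def should_warn_rdma_not_enabled(
    interfaces: Iterable[Mapping[str, str]], rdma_enabled: bool
) -> bool:
    """Warn only if RDMA is off and at least one TB5-class interface exists."""
    return (not rdma_enabled) and any(is_tb5(i) for i in interfaces)
```

With this rule a TB4-only machine never triggers the warning, because no interface passes the `> 40 Gb/s` check.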

Fixes exo-explore#1636

Remove unnecessary backend changes: Ring doesn't use RDMA connections.
Only the dashboard linkSpeed check is needed to distinguish TB4 from TB5.
@dogukanveziroglu force-pushed the fix/tb4-socket-connections branch from def062d to 533df0c on March 29, 2026 20:46

Development

Successfully merging this pull request may close these issues.

[BUG] EXO Suggest enabling RDMA on Thunderbolt 4

2 participants