feat: peer-to-peer model downloads over LAN#1699

Open
ecohash-co wants to merge 4 commits into exo-explore:main from ecohash-co:feat/peer-download

Conversation

@ecohash-co
Contributor

Summary

When multiple nodes need the same model, every node currently downloads the full model independently from HuggingFace. This PR adds peer-to-peer model transfer so only one node downloads from HF while others fetch it over the LAN — eliminating redundant internet downloads and cutting cluster startup time roughly in half.

  • PeerFileServer: Lightweight aiohttp HTTP server on each node (port 52416) that serves model files from the local cache with Range request support
  • PeerAwareShardDownloader: Wraps ResumableShardDownloader in a decorator pattern — checks if any peer already has the model before hitting HuggingFace, falls back to HF on failure
  • Streaming relay: Followers can download from a peer while it's still downloading from HF, via .partial.meta companion files that track flushed byte boundaries
  • Zero config: Enabled by default; disable with --no-peer-download (the listen port can be changed via the EXO_PEER_DOWNLOAD_PORT env var)

How it works

Current:   HuggingFace → Node A (full download)
           HuggingFace → Node B (full download)     ← 2× internet bandwidth

Proposed:  HuggingFace → Node A → serves over LAN → Node B
                              ↓                         ↓
                        saves locally              saves locally
  1. No leader election needed — first node to start downloading becomes the de facto seed
  2. No new gossipsub messages — reuses existing NodeDownloadProgress events and topology for peer discovery + IP resolution
  3. Graceful fallback — if peer transfer fails or is unreachable, the node falls back to HuggingFace with .partial resume support
  4. Backend-agnostic — works with MLX, tinygrad, PyTorch (disk-to-disk transfer, any engine can load)
  5. Network-agnostic — works over any LAN (Ethernet, WiFi, Thunderbolt)
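The decorator pattern behind point 3 can be sketched as follows. Names and signatures here are illustrative, not the PR's exact API; the peer-fetch callable is injected so the fallback logic stays visible:

```python
import asyncio

class PeerAwareShardDownloader:
    """Wraps an inner (HuggingFace) downloader: try each peer first,
    fall back to the wrapped downloader if every peer fails."""

    def __init__(self, inner, peers, fetch_from_peer):
        self._inner = inner          # e.g. a resumable HF downloader
        self._peers = peers          # candidate peer endpoints, best first
        self._fetch = fetch_from_peer

    async def download(self, shard):
        for peer in self._peers:
            try:
                return await self._fetch(peer, shard)
            except OSError:
                continue  # peer unreachable or died mid-transfer: try the next
        return await self._inner.download(shard)  # graceful fallback to HF
```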

Streaming relay detail

The key innovation: while Node A is downloading model.safetensors from HF, it writes a companion .partial.meta file after each 8MB chunk flush:

```json
{"safe_bytes": 83886080, "total": 4294967296, "etag": "abc123"}
```

Node A's PeerFileServer reads this metadata and serves only the safe byte range — never unflushed data. Node B polls with Range requests as Node A progresses, receiving chunks as they become available. If Node A dies mid-download, Node B's retry loop times out and falls back to HF, resuming from its own .partial file.

Relationship to PR #1463

PR #1463 implements MLX-specific memory-to-memory transfer via all_sum over Thunderbolt. This PR is complementary:

|  | This PR | PR #1463 |
| --- | --- | --- |
| Layer | Disk-to-disk (download time) | Memory-to-memory (load time) |
| Transport | HTTP over any LAN | MLX distributed over Thunderbolt |
| Backends | All (MLX, tinygrad, PyTorch) | MLX only |
| Networks | Any (Ethernet, WiFi, TB) | Thunderbolt only |
| Code overlap | None | None |

Files changed

New files (4 modules + 1 test)

  • src/exo/download/peer_file_server.py — HTTP file server with Range + streaming relay support
  • src/exo/download/peer_download.py — HTTP client for downloading from peers
  • src/exo/download/peer_state.py — Reads State.downloads + State.topology to find peers with models
  • src/exo/download/peer_shard_downloader.py — ShardDownloader decorator: try peer, fall back to HF
  • src/exo/download/tests/test_peer_download.py — 11 tests covering server, client, and edge cases

Modified files (4 files, minimal changes)

  • src/exo/download/download_utils.py — Write .partial.meta after each chunk flush (+25 lines)
  • src/exo/download/impl_shard_downloader.py — Factory accepts optional PeerStateProvider (+7 lines)
  • src/exo/main.py — Wire up PeerFileServer + PeerStateProvider, add --no-peer-download flag (+55 lines)
  • src/exo/shared/constants.py — Add EXO_PEER_DOWNLOAD_PORT constant (+3 lines)

Test plan

  • All 11 new peer download tests pass
  • All 20 existing download tests still pass (no regressions)
  • All 79 worker tests still pass
  • Manual test: 2-node cluster, verify second node downloads from first over LAN
  • Manual test: kill the seed node mid-download, verify follower falls back to HF
  • Manual test: --no-peer-download flag disables peer download

Addresses #1257, #721, #1606

🤖 Generated with Claude Code

When multiple nodes need the same model, only one downloads from
HuggingFace while others fetch it over the LAN — eliminating redundant
internet downloads and cutting cluster startup time roughly in half.

Architecture:
- PeerFileServer: lightweight aiohttp server on each node (port 52416)
  that serves model files from local cache with Range request support
- PeerAwareShardDownloader: wraps ResumableShardDownloader, checks if any
  peer already has the model before hitting HuggingFace
- Streaming relay: followers can download from a peer while it's still
  downloading from HF, via .partial.meta companion files that track
  flushed byte boundaries
- Graceful fallback: if peer transfer fails, falls back to HuggingFace
  with .partial resume support

Key design decisions:
- No new gossipsub messages — reuses existing NodeDownloadProgress events
  and topology for peer discovery and IP resolution
- No leader election — first node to start becomes de facto seed
- Backend-agnostic — works with MLX, tinygrad, PyTorch (any engine)
- Network-agnostic — works over any LAN (Ethernet, WiFi, Thunderbolt)
- Zero config — enabled by default, disable with --no-peer-download
- Complementary to PR exo-explore#1463 (MLX memory-to-memory transfer)

Addresses: exo-explore#1257, exo-explore#721, exo-explore#1606

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@rltakashige rltakashige left a comment


Quick pass, not a thorough review. I'm interested to see how this performs.

```python
def _resolve_peer_ip(self, peer_node_id: NodeId, state: State) -> str | None:
    """Resolve a peer's IP address from the topology graph."""
    try:
        for conn in state.topology.out_edges(self.node_id):
```
Collaborator


There likely needs to be some sort of link prioritisation, as we wouldn't want a super slow connection for this, etc.

Contributor Author


Agreed — addressed in 921a6bf. Peers are now sorted by:

  1. Connection type: RDMA/Thunderbolt connections rank first (lower latency, higher bandwidth)
  2. Download status: completed downloads before ongoing ones

The PeerEndpoint type now includes a connection_type field ("rdma" or "socket") populated from the topology graph.
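Under those two rules, the prioritisation reduces to a compound sort key. A minimal sketch (illustrative types, not the PR's code):

```python
from typing import NamedTuple

class PeerEndpoint(NamedTuple):
    node_id: str
    connection_type: str      # "rdma" or "socket", from the topology graph
    download_complete: bool

def rank_peers(peers: list[PeerEndpoint]) -> list[PeerEndpoint]:
    """Best first: RDMA/Thunderbolt before socket, completed before ongoing."""
    return sorted(peers, key=lambda p: (p.connection_type != "rdma",
                                        not p.download_complete))
```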

```python
    peer_state_provider: PeerStateProvider,
) -> None:
    self._inner = inner
    self._peer_state = peer_state_provider
```
Collaborator


I'm not sure how I feel about the worker's state being coupled with the download coordinator. Imo, since the download coordinator isn't owned by the worker, we could have the worker request a file from another node.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — fully refactored in 921a6bf. The architecture is now:

  1. Worker discovers peers (it owns the State) via a pure function discover_peers_for_model()
  2. Worker embeds peers in the StartDownload command as available_peers: list[PeerEndpoint]
  3. DownloadCoordinator receives peers in the command and passes them to the shard downloader
  4. No coupling between download coordinator and worker state — the command is self-contained

The PeerStateProvider class has been replaced with a stateless pure function. main.py no longer wires any state accessor lambda.
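As a rough sketch of the new shape (plain dicts standing in for State.downloads and the topology IP map; all names here are illustrative):

```python
def discover_peers_for_model(model_id, downloads, topology, self_id):
    """Pure function: list peers that have (or are fetching) the model,
    using only state the Worker already holds -- no new gossip messages."""
    peers = []
    for node_id, progress in downloads.items():
        if node_id == self_id or progress.get("model") != model_id:
            continue
        ip = topology.get(node_id)  # IP resolution via existing topology
        if ip is not None:
            peers.append({"node_id": node_id, "ip": ip,
                          "complete": progress.get("complete", False)})
    return peers
```

The Worker calls this at emit time and embeds the result in the StartDownload command, so the coordinator never reaches back into worker state.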

@rltakashige rltakashige requested a review from Evanev7 March 11, 2026 11:13
ecohash-co and others added 3 commits March 12, 2026 08:20
…ization

Addresses review feedback from @rltakashige:

1. **Decouple from worker state**: Peer discovery is now a pure function
   called by the Worker (which owns the state) when emitting StartDownload.
   Peer endpoints are embedded in the StartDownload command as
   `available_peers`, so the DownloadCoordinator stays self-contained
   and has no dependency on Worker state.

2. **Link prioritization**: Peers are sorted by connection quality —
   RDMA/Thunderbolt connections rank higher than socket connections,
   and completed downloads rank higher than ongoing ones.

Architecture change:
- PeerStateProvider class → discover_peers_for_model() pure function
- StartDownload command gains `available_peers: list[PeerEndpoint]` field
- PeerEndpoint includes `connection_type` for prioritization
- Worker computes peers at emit time → DownloadCoordinator receives them
- main.py no longer has worker-state coupling (PeerStateProvider removed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>