feat: peer-to-peer model downloads over LAN#1699

Open
ecohash-co wants to merge 4 commits into exo-explore:main from ecohash-co:feat/peer-download

Conversation

@ecohash-co
Contributor

Summary

When multiple nodes need the same model, every node currently downloads the full model independently from HuggingFace. This PR adds peer-to-peer model transfer so only one node downloads from HF while others fetch it over the LAN — eliminating redundant internet downloads and cutting cluster startup time roughly in half.

  • PeerFileServer: Lightweight aiohttp HTTP server on each node (port 52416) that serves model files from the local cache with Range request support
  • PeerAwareShardDownloader: Wraps ResumableShardDownloader in a decorator pattern — checks if any peer already has the model before hitting HuggingFace, falls back to HF on failure
  • Streaming relay: Followers can download from a peer while it's still downloading from HF, via .partial.meta companion files that track flushed byte boundaries
  • Zero config: Enabled by default; disable with --no-peer-download (the listen port can be changed via the EXO_PEER_DOWNLOAD_PORT env var)

How it works

Current:   HuggingFace → Node A (full download)
           HuggingFace → Node B (full download)     ← 2× internet bandwidth

Proposed:  HuggingFace → Node A → serves over LAN → Node B
                              ↓                         ↓
                        saves locally              saves locally
  1. No leader election needed — first node to start downloading becomes the de facto seed
  2. No new gossipsub messages — reuses existing NodeDownloadProgress events and topology for peer discovery + IP resolution
  3. Graceful fallback — if peer transfer fails or is unreachable, the node falls back to HuggingFace with .partial resume support
  4. Backend-agnostic — works with MLX, tinygrad, PyTorch (disk-to-disk transfer, any engine can load)
  5. Network-agnostic — works over any LAN (Ethernet, WiFi, Thunderbolt)
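The decorator pattern behind point 3 can be sketched as follows. Names and signatures here are illustrative, not the PR's exact API; the peer-fetch callable is injected so the fallback logic stays visible:

```python
import asyncio

class PeerAwareShardDownloader:
    """Wraps an inner (HuggingFace) downloader: try each peer first,
    fall back to the wrapped downloader if every peer fails."""

    def __init__(self, inner, peers, fetch_from_peer):
        self._inner = inner          # e.g. a resumable HF downloader
        self._peers = peers          # candidate peer endpoints, best first
        self._fetch = fetch_from_peer

    async def download(self, shard):
        for peer in self._peers:
            try:
                return await self._fetch(peer, shard)
            except OSError:
                continue  # peer unreachable or died mid-transfer: try the next
        return await self._inner.download(shard)  # graceful fallback to HF
```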

Streaming relay detail

The key innovation: while Node A is downloading model.safetensors from HF, it writes a companion .partial.meta file after each 8MB chunk flush:

```json
{"safe_bytes": 83886080, "total": 4294967296, "etag": "abc123"}
```

Node A's PeerFileServer reads this metadata and serves only the safe byte range — never unflushed data. Node B polls with Range requests as Node A progresses, receiving chunks as they become available. If Node A dies mid-download, Node B's retry loop times out and falls back to HF, resuming from its own .partial file.

Relationship to PR #1463

PR #1463 implements MLX-specific memory-to-memory transfer via all_sum over Thunderbolt. This PR is complementary:

|  | This PR | PR #1463 |
| --- | --- | --- |
| Layer | Disk-to-disk (download time) | Memory-to-memory (load time) |
| Transport | HTTP over any LAN | MLX distributed over Thunderbolt |
| Backends | All (MLX, tinygrad, PyTorch) | MLX only |
| Networks | Any (Ethernet, WiFi, TB) | Thunderbolt only |
| Code overlap | None | None |

Files changed

New files (4 modules + 1 test)

  • src/exo/download/peer_file_server.py — HTTP file server with Range + streaming relay support
  • src/exo/download/peer_download.py — HTTP client for downloading from peers
  • src/exo/download/peer_state.py — Reads State.downloads + State.topology to find peers with models
  • src/exo/download/peer_shard_downloader.py — ShardDownloader decorator: try peer, fall back to HF
  • src/exo/download/tests/test_peer_download.py — 11 tests covering server, client, and edge cases

Modified files (4 files, minimal changes)

  • src/exo/download/download_utils.py — Write .partial.meta after each chunk flush (+25 lines)
  • src/exo/download/impl_shard_downloader.py — Factory accepts optional PeerStateProvider (+7 lines)
  • src/exo/main.py — Wire up PeerFileServer + PeerStateProvider, add --no-peer-download flag (+55 lines)
  • src/exo/shared/constants.py — Add EXO_PEER_DOWNLOAD_PORT constant (+3 lines)

Test plan

  • All 11 new peer download tests pass
  • All 20 existing download tests still pass (no regressions)
  • All 79 worker tests still pass
  • Manual test: 2-node cluster, verify second node downloads from first over LAN
  • Manual test: kill the seed node mid-download, verify follower falls back to HF
  • Manual test: --no-peer-download flag disables peer download

Addresses #1257, #721, #1606

🤖 Generated with Claude Code

When multiple nodes need the same model, only one downloads from
HuggingFace while others fetch it over the LAN — eliminating redundant
internet downloads and cutting cluster startup time roughly in half.

Architecture:
- PeerFileServer: lightweight aiohttp server on each node (port 52416)
  that serves model files from local cache with Range request support
- PeerAwareShardDownloader: wraps ResumableShardDownloader, checks if any
  peer already has the model before hitting HuggingFace
- Streaming relay: followers can download from a peer while it's still
  downloading from HF, via .partial.meta companion files that track
  flushed byte boundaries
- Graceful fallback: if peer transfer fails, falls back to HuggingFace
  with .partial resume support

Key design decisions:
- No new gossipsub messages — reuses existing NodeDownloadProgress events
  and topology for peer discovery and IP resolution
- No leader election — first node to start becomes de facto seed
- Backend-agnostic — works with MLX, tinygrad, PyTorch (any engine)
- Network-agnostic — works over any LAN (Ethernet, WiFi, Thunderbolt)
- Zero config — enabled by default, disable with --no-peer-download
- Complementary to PR exo-explore#1463 (MLX memory-to-memory transfer)

Addresses: exo-explore#1257, exo-explore#721, exo-explore#1606

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@rltakashige rltakashige left a comment


Quick pass, not a thorough review. I'm interested to see how this performs.

```python
def _resolve_peer_ip(self, peer_node_id: NodeId, state: State) -> str | None:
    """Resolve a peer's IP address from the topology graph."""
    try:
        for conn in state.topology.out_edges(self.node_id):
```
Collaborator


There likely needs to be some sort of link prioritisation, as we wouldn't want a super slow connection for this, etc.

Contributor Author


Agreed — addressed in 921a6bf. Peers are now sorted by:

  1. Connection type: RDMA/Thunderbolt connections rank first (lower latency, higher bandwidth)
  2. Download status: completed downloads before ongoing ones

The PeerEndpoint type now includes a connection_type field ("rdma" or "socket") populated from the topology graph.
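Under those two rules, the prioritisation reduces to a compound sort key. A minimal sketch (illustrative types, not the PR's code):

```python
from typing import NamedTuple

class PeerEndpoint(NamedTuple):
    node_id: str
    connection_type: str      # "rdma" or "socket", from the topology graph
    download_complete: bool

def rank_peers(peers: list[PeerEndpoint]) -> list[PeerEndpoint]:
    """Best first: RDMA/Thunderbolt before socket, completed before ongoing."""
    return sorted(peers, key=lambda p: (p.connection_type != "rdma",
                                        not p.download_complete))
```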

```python
    peer_state_provider: PeerStateProvider,
) -> None:
    self._inner = inner
    self._peer_state = peer_state_provider
```
Collaborator


I'm not sure how I feel about the worker's state being coupled with the download coordinator. Imo, since the download coordinator isn't owned by the worker, we could have the worker request a file from another node.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — fully refactored in 921a6bf. The architecture is now:

  1. Worker discovers peers (it owns the State) via a pure function discover_peers_for_model()
  2. Worker embeds peers in the StartDownload command as available_peers: list[PeerEndpoint]
  3. DownloadCoordinator receives peers in the command and passes them to the shard downloader
  4. No coupling between download coordinator and worker state — the command is self-contained

The PeerStateProvider class has been replaced with a stateless pure function. main.py no longer wires any state accessor lambda.
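As a rough sketch of the new shape (plain dicts standing in for State.downloads and the topology IP map; all names here are illustrative):

```python
def discover_peers_for_model(model_id, downloads, topology, self_id):
    """Pure function: list peers that have (or are fetching) the model,
    using only state the Worker already holds -- no new gossip messages."""
    peers = []
    for node_id, progress in downloads.items():
        if node_id == self_id or progress.get("model") != model_id:
            continue
        ip = topology.get(node_id)  # IP resolution via existing topology
        if ip is not None:
            peers.append({"node_id": node_id, "ip": ip,
                          "complete": progress.get("complete", False)})
    return peers
```

The Worker calls this at emit time and embeds the result in the StartDownload command, so the coordinator never reaches back into worker state.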

@rltakashige rltakashige requested a review from Evanev7 March 11, 2026 11:13
ecohash-co and others added 3 commits March 12, 2026 08:20
…ization

Addresses review feedback from @rltakashige:

1. **Decouple from worker state**: Peer discovery is now a pure function
   called by the Worker (which owns the state) when emitting StartDownload.
   Peer endpoints are embedded in the StartDownload command as
   `available_peers`, so the DownloadCoordinator stays self-contained
   and has no dependency on Worker state.

2. **Link prioritization**: Peers are sorted by connection quality —
   RDMA/Thunderbolt connections rank higher than socket connections,
   and completed downloads rank higher than ongoing ones.

Architecture change:
- PeerStateProvider class → discover_peers_for_model() pure function
- StartDownload command gains `available_peers: list[PeerEndpoint]` field
- PeerEndpoint includes `connection_type` for prioritization
- Worker computes peers at emit time → DownloadCoordinator receives them
- main.py no longer has worker-state coupling (PeerStateProvider removed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>