feat: peer-to-peer model downloads over LAN #1699
ecohash-co wants to merge 4 commits into exo-explore:main from
Conversation
When multiple nodes need the same model, only one downloads from HuggingFace while others fetch it over the LAN — eliminating redundant internet downloads and cutting cluster startup time roughly in half.

Architecture:
- PeerFileServer: lightweight aiohttp server on each node (port 52416) that serves model files from local cache with Range request support
- PeerAwareShardDownloader: wraps ResumableShardDownloader, checks if any peer already has the model before hitting HuggingFace
- Streaming relay: followers can download from a peer while it's still downloading from HF, via `.partial.meta` companion files that track flushed byte boundaries
- Graceful fallback: if peer transfer fails, falls back to HuggingFace with `.partial` resume support

Key design decisions:
- No new gossipsub messages — reuses existing NodeDownloadProgress events and topology for peer discovery and IP resolution
- No leader election — first node to start becomes de facto seed
- Backend-agnostic — works with MLX, tinygrad, PyTorch (any engine)
- Network-agnostic — works over any LAN (Ethernet, WiFi, Thunderbolt)
- Zero config — enabled by default, disable with --no-peer-download
- Complementary to PR exo-explore#1463 (MLX memory-to-memory transfer)

Addresses: exo-explore#1257, exo-explore#721, exo-explore#1606

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rltakashige
left a comment
Quick pass, not a thorough review. I'm interested to see how this performs.
src/exo/download/peer_state.py
```python
def _resolve_peer_ip(self, peer_node_id: NodeId, state: State) -> str | None:
    """Resolve a peer's IP address from the topology graph."""
    try:
        for conn in state.topology.out_edges(self.node_id):
```
There likely needs to be some sort of link prioritisation, as we wouldn't want a super slow connection for this, etc.
Agreed — addressed in 921a6bf. Peers are now sorted by:
- Connection type: RDMA/Thunderbolt connections rank first (lower latency, higher bandwidth)
- Download status: completed downloads before ongoing ones
The PeerEndpoint type now includes a connection_type field ("rdma" or "socket") populated from the topology graph.
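To make the ranking concrete, here is a minimal sketch of the prioritisation described above. The `PeerEndpoint` name and its `connection_type` field ("rdma" or "socket") come from the PR; the `download_complete` field, the other fields, and the exact sort key are illustrative assumptions, not the PR's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PeerEndpoint:
    node_id: str
    ip: str
    port: int
    connection_type: str     # "rdma" or "socket", from the topology graph
    download_complete: bool  # hypothetical flag: has this peer finished its download?

def prioritise_peers(peers: list[PeerEndpoint]) -> list[PeerEndpoint]:
    """Rank RDMA/Thunderbolt links before plain sockets, and completed
    downloads before ongoing ones. sorted() is stable, so ties keep order."""
    return sorted(
        peers,
        key=lambda p: (
            0 if p.connection_type == "rdma" else 1,  # fast links first
            0 if p.download_complete else 1,          # finished seeds first
        ),
    )
```

A tuple sort key keeps the two criteria independent: link quality dominates, and download status only breaks ties within the same link class.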
```python
    peer_state_provider: PeerStateProvider,
) -> None:
    self._inner = inner
    self._peer_state = peer_state_provider
```
I'm not sure how I feel about the worker's state being coupled with the download coordinator. Imo, since the download coordinator isn't owned by the worker, we could have the worker request a file from another node.
Good call — fully refactored in 921a6bf. The architecture is now:
- Worker discovers peers (it owns the State) via a pure function `discover_peers_for_model()`
- Worker embeds peers in the `StartDownload` command as `available_peers: list[PeerEndpoint]`
- DownloadCoordinator receives peers in the command and passes them to the shard downloader
- No coupling between download coordinator and worker state — the command is self-contained

The PeerStateProvider class has been replaced with a stateless pure function. main.py no longer wires any state accessor lambda.
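A rough sketch of that decoupling, assuming simplified stand-ins for `State.downloads` and `State.topology` (plain dicts here; the real types, field layouts, and the default port are assumptions beyond what the thread states):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PeerEndpoint:
    node_id: str
    ip: str
    port: int
    connection_type: str  # "rdma" or "socket"

@dataclass
class StartDownload:
    model_id: str
    # Peers are embedded in the command itself, so the coordinator
    # never needs to reach back into Worker-owned state.
    available_peers: list[PeerEndpoint] = field(default_factory=list)

def discover_peers_for_model(
    model_id: str,
    downloads: dict[str, set[str]],   # node_id -> models that node holds
    topology: dict[str, dict],        # node_id -> {"ip": ..., "connection_type": ...}
) -> list[PeerEndpoint]:
    """Pure function: State fields in, endpoints out. No hidden provider object."""
    return [
        PeerEndpoint(
            node_id=nid,
            ip=topology[nid]["ip"],
            port=52416,
            connection_type=topology[nid].get("connection_type", "socket"),
        )
        for nid, models in downloads.items()
        if model_id in models and nid in topology
    ]

def emit_start_download(model_id, downloads, topology) -> StartDownload:
    """Worker computes peers at emit time; the command is self-contained."""
    return StartDownload(model_id, discover_peers_for_model(model_id, downloads, topology))
```

The design choice this illustrates: because `discover_peers_for_model()` takes its inputs explicitly, only the Worker (which owns State) ever calls it, and the DownloadCoordinator just consumes the peer list it is handed.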
…ization

Addresses review feedback from @rltakashige:

1. **Decouple from worker state**: Peer discovery is now a pure function called by the Worker (which owns the state) when emitting StartDownload. Peer endpoints are embedded in the StartDownload command as `available_peers`, so the DownloadCoordinator stays self-contained and has no dependency on Worker state.
2. **Link prioritization**: Peers are sorted by connection quality — RDMA/Thunderbolt connections rank higher than socket connections, and completed downloads rank higher than ongoing ones.

Architecture change:
- PeerStateProvider class → `discover_peers_for_model()` pure function
- StartDownload command gains `available_peers: list[PeerEndpoint]` field
- PeerEndpoint includes `connection_type` for prioritization
- Worker computes peers at emit time → DownloadCoordinator receives them
- main.py no longer has worker-state coupling (PeerStateProvider removed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
When multiple nodes need the same model, every node currently downloads the full model independently from HuggingFace. This PR adds peer-to-peer model transfer so only one node downloads from HF while others fetch it over the LAN — eliminating redundant internet downloads and cutting cluster startup time roughly in half.
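A minimal sketch of the try-peer-then-fall-back flow at the heart of the PR. The class names mirror the description (`PeerAwareShardDownloader` wrapping `ResumableShardDownloader`), but the method signatures and the injected `fetch_from_peer` callable are hypothetical, not the PR's actual `ShardDownloader` API.

```python
class PeerAwareShardDownloader:
    """Decorator: try each available peer in priority order, then
    fall back to the wrapped HuggingFace downloader."""

    def __init__(self, inner, fetch_from_peer):
        self._inner = inner                  # e.g. a ResumableShardDownloader
        self._fetch_from_peer = fetch_from_peer

    def download(self, shard, available_peers):
        for peer in available_peers:         # peers pre-sorted by link quality
            try:
                return self._fetch_from_peer(peer, shard)
            except OSError:
                continue                     # peer dead or unreachable: try the next
        return self._inner.download(shard)   # graceful fallback to HuggingFace
```

Because the wrapper only adds a peer-first loop in front of the inner downloader, the HF path (including `.partial` resume) stays untouched, which is what makes the feature safe to enable by default.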
- Wraps `ResumableShardDownloader` in a decorator pattern — checks if any peer already has the model before hitting HuggingFace, falls back to HF on failure
- Streaming relay via `.partial.meta` companion files that track flushed byte boundaries
- Disable with `--no-peer-download` or the `EXO_PEER_DOWNLOAD_PORT` env var

How it works
- Reuses existing `NodeDownloadProgress` events and topology for peer discovery + IP resolution
- Falls back to HF with `.partial` resume support

Streaming relay detail
The key innovation: while Node A is downloading `model.safetensors` from HF, it writes a companion `.partial.meta` file after each 8MB chunk flush:

```json
{"safe_bytes": 83886080, "total": 4294967296, "etag": "abc123"}
```
Node B's `PeerFileServer` reads this metadata and serves only the safe byte range — never unflushed data. Node B polls with Range requests as Node A progresses, receiving chunks as they become available. If Node A dies mid-download, Node B's retry loop times out and falls back to HF, resuming from its own `.partial` file.

Relationship to PR #1463
PR #1463 implements MLX-specific memory-to-memory transfer via `all_sum` over Thunderbolt. This PR is complementary.

Files changed
New files (4 modules + 1 test)
- `src/exo/download/peer_file_server.py` — HTTP file server with Range + streaming relay support
- `src/exo/download/peer_download.py` — HTTP client for downloading from peers
- `src/exo/download/peer_state.py` — Reads `State.downloads` + `State.topology` to find peers with models
- `src/exo/download/peer_shard_downloader.py` — `ShardDownloader` decorator: try peer, fall back to HF
- `src/exo/download/tests/test_peer_download.py` — 11 tests covering server, client, and edge cases

Modified files (4 files, minimal changes)
- `src/exo/download/download_utils.py` — Write `.partial.meta` after each chunk flush (+25 lines)
- `src/exo/download/impl_shard_downloader.py` — Factory accepts optional `PeerStateProvider` (+7 lines)
- `src/exo/main.py` — Wire up `PeerFileServer` + `PeerStateProvider`, add `--no-peer-download` flag (+55 lines)
- `src/exo/shared/constants.py` — Add `EXO_PEER_DOWNLOAD_PORT` constant (+3 lines)

Test plan
- `--no-peer-download` flag disables peer download

Addresses #1257, #721, #1606
🤖 Generated with Claude Code