feat: asymmetric tensor parallelism for heterogeneous clusters#1821
Gajesh2007 wants to merge 2 commits into exo-explore:main
Conversation
When nodes have different amounts of RAM, standard 50/50 tensor parallelism fails because the smaller node can't hold half the weights. This adds asymmetric TP, which splits each weight tensor proportionally to available memory (e.g. 75/25), so both GPUs compute every layer simultaneously.

Tested on Qwen3.5-122B-A10B (130GB 8-bit) across 128GB + 48GB Macs via TB5 RDMA, achieving 18.8 tok/s with both GPUs active on every token. This is ~4x faster than pipeline parallelism on the same setup.

Addresses exo-explore#953 (arbitrary tensor parallel splits).

Changes:
- Add `AsymmetricTensorShardMetadata` with per-node `ratio` field
- Add `Sharding.AsymmetricTensor` variant
- Add `asymmetric_parallel.py` with ratio solver and per-layer sharding for GatedDeltaNet, Attention, SparseMoeBlock, and dense MLP
- Integrate into `placement_utils` and `shard_and_load`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
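To make the proportional-split idea concrete, here is a minimal sketch of computing memory-proportional shard widths for a weight's sharded dimension. The helper name and rounding policy are illustrative, not the PR's actual API:

```python
# Illustrative sketch: split one weight dimension proportionally to each
# node's available memory. Hypothetical helper, not the PR's code.

def proportional_split(out_dim: int, mems_gb: list[float]) -> list[int]:
    total = sum(mems_gb)
    # Provisional per-node share of the dimension.
    sizes = [int(out_dim * m / total) for m in mems_gb]
    # Give any rounding remainder to the first (largest-memory) node.
    sizes[0] += out_dim - sum(sizes)
    return sizes

# 128GB + 48GB nodes sharding a 4096-wide projection -> roughly 73/27.
print(proportional_split(4096, [128, 48]))  # [2979, 1117]
```

In practice the split must also respect head counts and quantization alignment, which is what the PR's ratio solver handles on top of this raw proportion.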
Pushed a fix addressing four integration bugs caught in review:

- Store rank-0's ratio on all shards so both ranks agree on the split point (fixes misaligned weight slices on rank 1)
- Remove `num_attention_heads` access from `ModelCard` (the field doesn't exist)
- Gate `AsymmetricTensor` on the qwen3_5 family only (the worker raises on others)
- Sort the cycle by memory so rank 0 is always the larger-memory node

These bugs existed because our hardware test bypassed the exo placement/loading pipeline (it hardcoded ratios and called the sharding module directly). The sharding math itself is correct; verified at 41.8 tok/s on Qwen3.5-122B across 128GB + 48GB Macs.

Note on model support: this has only been tested on Qwen3.5-122B-A10B 8-bit. The worker currently raises a clear error for unsupported architectures. If anyone wants to help expand coverage, the main work is adding per-architecture sharding functions in `asymmetric_parallel.py`.
Summary
Adds asymmetric tensor parallelism for clusters where nodes have different amounts of RAM. Standard 50/50 tensor parallelism fails when the smaller node can't hold half the weights; this splits each weight tensor proportionally (e.g. 75/25) so both GPUs still compute every layer simultaneously.
Addresses #953 (arbitrary tensor parallel splits).
Problem
When connecting a 128GB Mac to a 48GB Mac, a 130GB model can't use tensor parallelism today — 65GB per side doesn't fit on the 48GB node. The only option is pipeline parallelism, where one GPU sits idle while the other computes.
Solution
Asymmetric TP splits weights proportionally to available memory. For 128GB + 48GB:
- `all_sum` still works correctly: `(x_a @ W_a^T) + (x_b @ W_b^T) = x @ W^T` regardless of the split ratio
- Includes a ratio solver that finds valid split points where all dimensions (attention heads, MoE intermediate size, etc.) divide cleanly and satisfy quantization alignment constraints
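The `all_sum` identity above can be checked numerically: split `W` (and the matching activation columns) along the input dimension at any ratio, and the partial matmuls sum to the full result. A small NumPy sketch (shapes are illustrative):

```python
import numpy as np

# Verify the identity behind asymmetric TP: an uneven column split of W
# changes nothing, because the partial products sum to the full matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4096))        # activations
W = rng.standard_normal((1024, 4096))     # weight, shape (out_dim, in_dim)

k = 3072                                  # asymmetric 75/25 split point
x_a, x_b = x[:, :k], x[:, k:]
W_a, W_b = W[:, :k], W[:, k:]

partial_sum = x_a @ W_a.T + x_b @ W_b.T   # what all_sum computes across ranks
full = x @ W.T
print(np.allclose(partial_sum, full))     # True
```

This is why correctness is independent of the ratio: only memory fit and load balance depend on where the split lands.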
Auto-upgrades
`Tensor` → `AsymmetricTensor` in placement when it detects the equal split won't fit on the smallest node but the combined memory is sufficient.

Testing
Hardware tested:
Model tested:
Results:
Asymmetric TP is ~8x faster than pipeline on the same hardware.
Automated tests: 7 new unit tests for the ratio solver covering Qwen3.5, Llama-70B, and Nemotron dimensions, plus edge cases (impossible dimensions, >2 nodes).
CI: 0 basedpyright errors, 0 ruff errors, 347/347 tests pass.
Architecture support
Currently supports Qwen3.5 (GatedDeltaNet + Attention + SparseMoeBlock). Raises a clear error for unsupported architectures. Follow-up PRs can add NemotronH, DeepSeek V3, Llama, etc. — each needs its own sharding strategy tested on real hardware.
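The "raises a clear error" behavior can be sketched as a simple gate on the model family. The set name, function name, and message wording here are illustrative, not the PR's code:

```python
# Illustrative gate: asymmetric TP is only wired up for families whose
# per-layer sharding has been implemented and hardware-tested.
SUPPORTED_ASYMMETRIC_TP = {"qwen3_5"}

def check_asymmetric_tp_support(model_family: str) -> None:
    """Raise a clear error before loading if the architecture lacks an
    asymmetric sharding strategy, instead of failing mid-load."""
    if model_family not in SUPPORTED_ASYMMETRIC_TP:
        raise ValueError(
            f"Asymmetric tensor parallelism is not supported for "
            f"'{model_family}'; supported families: "
            f"{sorted(SUPPORTED_ASYMMETRIC_TP)}"
        )

check_asymmetric_tp_support("qwen3_5")  # passes silently
```

Adding a new architecture then means implementing and hardware-testing its sharding functions and adding the family to the supported set.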
Changes
- `src/exo/shared/types/worker/shards.py` — new `Sharding.AsymmetricTensor` enum + `AsymmetricTensorShardMetadata` with per-node `ratio` field
- `src/exo/master/placement.py` — auto-upgrade logic when equal TP won't fit
- `src/exo/master/placement_utils.py` — `get_shard_assignments_for_asymmetric_tensor_parallel()`
- `src/exo/worker/engines/mlx/utils_mlx.py` — handle `AsymmetricTensorShardMetadata` in `shard_and_load`
- `src/exo/worker/engines/mlx/asymmetric_parallel.py` — ratio solver + per-layer sharding
- `src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py` — 7 unit tests

Prior art
The concept of asymmetric TP exists in research (HexGen, ICML 2024; HAPT; AutoHet) and has CUDA implementations (HexGen, veScale). This appears to be the first implementation for MLX on Apple Silicon.