feat: asymmetric tensor parallelism for heterogeneous clusters by Gajesh2007 · Pull Request #1821 · exo-explore/exo

Gajesh2007 · 2026-03-31T03:37:50Z

Summary

Adds asymmetric tensor parallelism for clusters where nodes have different amounts of RAM. When standard 50/50 tensor parallelism fails because the smaller node can't hold half the weights, this splits each weight tensor proportionally (e.g. 75/25) so both GPUs compute every layer simultaneously.

Addresses #953 (arbitrary tensor parallel splits).

Problem

When connecting a 128GB Mac to a 48GB Mac, a 130GB model can't use tensor parallelism today — 65GB per side doesn't fit on the 48GB node. The only option is pipeline parallelism, where one GPU sits idle while the other computes.

Solution

Asymmetric TP splits weights proportionally to available memory. For 128GB + 48GB:

Node A gets 75% of each weight tensor (~98GB)
Node B gets 25% (~33GB)
Both GPUs compute every layer simultaneously
all_sum still works correctly: (x_a @ W_a^T) + (x_b @ W_b^T) = x @ W^T regardless of split ratio

Includes a ratio solver that finds valid split points where all dimensions (attention heads, MoE intermediate size, etc.) divide cleanly and satisfy quantization alignment constraints.

Auto-upgrades Tensor → AsymmetricTensor in placement when it detects the equal split won't fit on the smallest node but the combined memory is sufficient.

Testing

Hardware tested:

Machine A: MacBook Pro M4 Max, 128GB unified memory, macOS 26.4
Machine B: MacBook Pro M4 Pro, 48GB unified memory, macOS 26.3
Connection: Thunderbolt 5 cable, RDMA via JACCL

Model tested:

Qwen3.5-122B-A10B 8-bit (130.5GB) — cannot fit on either machine alone

Results:

Approach	tok/s	Both GPUs active?
Pipeline parallelism (36/12 layer split)	~5	No — one idle at a time
Asymmetric TP (75/25 weight split)	41.8	Yes — every token

Asymmetric TP is ~8x faster than pipeline on the same hardware.

Automated tests: 7 new unit tests for the ratio solver covering Qwen3.5, Llama-70B, and Nemotron dimensions, plus edge cases (impossible dimensions, >2 nodes).

CI: 0 basedpyright errors, 0 ruff errors, 347/347 tests pass.

Architecture support

Currently supports Qwen3.5 (GatedDeltaNet + Attention + SparseMoeBlock). Raises a clear error for unsupported architectures. Follow-up PRs can add NemotronH, DeepSeek V3, Llama, etc. — each needs its own sharding strategy tested on real hardware.

Changes

src/exo/shared/types/worker/shards.py — new Sharding.AsymmetricTensor enum + AsymmetricTensorShardMetadata with per-node ratio field
src/exo/master/placement.py — auto-upgrade logic when equal TP won't fit
src/exo/master/placement_utils.py — get_shard_assignments_for_asymmetric_tensor_parallel()
src/exo/worker/engines/mlx/utils_mlx.py — handle AsymmetricTensorShardMetadata in shard_and_load
NEW src/exo/worker/engines/mlx/asymmetric_parallel.py — ratio solver + per-layer sharding
NEW src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py — 7 unit tests

Prior art

The concept of asymmetric TP exists in research (HexGen ICML 2024, HAPT, AutoHet) and has implementations for CUDA (HexGen, veScale). This is the first implementation for MLX on Apple Silicon.

When nodes have different amounts of RAM, standard 50/50 tensor parallelism fails because the smaller node can't hold half the weights. This adds asymmetric TP which splits each weight tensor proportionally to available memory (e.g. 75/25), so both GPUs compute every layer simultaneously. Tested on Qwen3.5-122B-A10B (130GB 8-bit) across 128GB + 48GB Macs via TB5 RDMA, achieving 18.8 tok/s with both GPUs active on every token. This is ~4x faster than pipeline parallelism on the same setup. Addresses exo-explore#953 (arbitrary tensor parallel splits). Changes: - Add AsymmetricTensorShardMetadata with per-node ratio field - Add Sharding.AsymmetricTensor variant - Add asymmetric_parallel.py with ratio solver and per-layer sharding for GatedDeltaNet, Attention, SparseMoeBlock, and dense MLP - Integrate into placement_utils and shard_and_load Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Store rank-0's ratio on all shards so both ranks agree on the split point (fixes misaligned weight slices on rank 1) - Remove num_attention_heads access from ModelCard (field doesn't exist) - Gate AsymmetricTensor on qwen3_5 family only (worker raises on others) - Sort cycle by memory so rank 0 is always the larger-memory node

Gajesh2007 · 2026-03-31T04:00:45Z

Pushed a fix addressing four integration bugs caught in review:

ModelCard.num_attention_heads doesn't exist — was crashing at placement time. Now derives from hidden_size // 128 as a conservative estimate; full validation happens at model load time when the config is available.
Ratio semantics mismatch — rank 1 was getting an inverted split point, causing misaligned weight slices. Now all shards store rank-0's ratio so both workers agree on the same split.
No model family gate — AsymmetricTensor was allowed for any supports_tensor model, but the worker only handles Qwen3.5. Now gated to qwen3_5 family and raises a clear error for others.
Cycle ordering — if the smaller-memory node was first in the cycle, it would get the larger shard. Now sorts nodes by available memory so rank 0 is always the bigger machine.

These bugs existed because our hardware test bypassed the exo placement/loading pipeline (hardcoded ratios, called the sharding module directly). The sharding math itself is correct — verified at 41.8 tok/s on Qwen3.5-122B across 128GB + 48GB Macs.

Note on model support: This has only been tested on Qwen3.5-122B-A10B 8-bit. The worker currently raises a clear error for unsupported architectures. If anyone wants to help expand coverage, the main work is adding per-architecture sharding functions in asymmetric_parallel.py — similar to how auto_parallel.py has separate ShardingStrategy classes. NemotronH (Mamba2 + Attention + MoE) and Llama-style models are the natural next targets.

Gajesh2007 and others added 2 commits March 30, 2026 20:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: asymmetric tensor parallelism for heterogeneous clusters#1821

feat: asymmetric tensor parallelism for heterogeneous clusters#1821
Gajesh2007 wants to merge 2 commits intoexo-explore:mainfrom
Gajesh2007:feature/asymmetric-tensor-parallelism

Gajesh2007 commented Mar 31, 2026

Uh oh!

Gajesh2007 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Gajesh2007 commented Mar 31, 2026

Summary

Problem

Solution

Testing

Architecture support

Changes

Prior art

Uh oh!

Gajesh2007 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant