Skip to content

feat: asymmetric tensor parallelism for heterogeneous clusters#1821

Open
Gajesh2007 wants to merge 2 commits intoexo-explore:mainfrom
Gajesh2007:feature/asymmetric-tensor-parallelism
Open

feat: asymmetric tensor parallelism for heterogeneous clusters#1821
Gajesh2007 wants to merge 2 commits intoexo-explore:mainfrom
Gajesh2007:feature/asymmetric-tensor-parallelism

Conversation

@Gajesh2007
Copy link
Copy Markdown

Summary

Adds asymmetric tensor parallelism for clusters where nodes have different amounts of RAM. When standard 50/50 tensor parallelism fails because the smaller node can't hold half the weights, this splits each weight tensor proportionally (e.g. 75/25) so both GPUs compute every layer simultaneously.

Addresses #953 (arbitrary tensor parallel splits).

Problem

When connecting a 128GB Mac to a 48GB Mac, a 130GB model can't use tensor parallelism today — 65GB per side doesn't fit on the 48GB node. The only option is pipeline parallelism, where one GPU sits idle while the other computes.

Solution

Asymmetric TP splits weights proportionally to available memory. For 128GB + 48GB:

  • Node A gets 75% of each weight tensor (~98GB)
  • Node B gets 25% (~33GB)
  • Both GPUs compute every layer simultaneously
  • all_sum still works correctly: (x_a @ W_a^T) + (x_b @ W_b^T) = x @ W^T regardless of split ratio

Includes a ratio solver that finds valid split points where all dimensions (attention heads, MoE intermediate size, etc.) divide cleanly and satisfy quantization alignment constraints.

Auto-upgrades TensorAsymmetricTensor in placement when it detects the equal split won't fit on the smallest node but the combined memory is sufficient.

Testing

Hardware tested:

  • Machine A: MacBook Pro M4 Max, 128GB unified memory, macOS 26.4
  • Machine B: MacBook Pro M4 Pro, 48GB unified memory, macOS 26.3
  • Connection: Thunderbolt 5 cable, RDMA via JACCL

Model tested:

  • Qwen3.5-122B-A10B 8-bit (130.5GB) — cannot fit on either machine alone

Results:

Approach tok/s Both GPUs active?
Pipeline parallelism (36/12 layer split) ~5 No — one idle at a time
Asymmetric TP (75/25 weight split) 41.8 Yes — every token

Asymmetric TP is ~8x faster than pipeline on the same hardware.

Automated tests: 7 new unit tests for the ratio solver covering Qwen3.5, Llama-70B, and Nemotron dimensions, plus edge cases (impossible dimensions, >2 nodes).

CI: 0 basedpyright errors, 0 ruff errors, 347/347 tests pass.

Architecture support

Currently supports Qwen3.5 (GatedDeltaNet + Attention + SparseMoeBlock). Raises a clear error for unsupported architectures. Follow-up PRs can add NemotronH, DeepSeek V3, Llama, etc. — each needs its own sharding strategy tested on real hardware.

Changes

  • src/exo/shared/types/worker/shards.py — new Sharding.AsymmetricTensor enum + AsymmetricTensorShardMetadata with per-node ratio field
  • src/exo/master/placement.py — auto-upgrade logic when equal TP won't fit
  • src/exo/master/placement_utils.pyget_shard_assignments_for_asymmetric_tensor_parallel()
  • src/exo/worker/engines/mlx/utils_mlx.py — handle AsymmetricTensorShardMetadata in shard_and_load
  • NEW src/exo/worker/engines/mlx/asymmetric_parallel.py — ratio solver + per-layer sharding
  • NEW src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py — 7 unit tests

Prior art

The concept of asymmetric TP exists in research (HexGen ICML 2024, HAPT, AutoHet) and has implementations for CUDA (HexGen, veScale). This is the first implementation for MLX on Apple Silicon.

Gajesh2007 and others added 2 commits March 30, 2026 20:29
When nodes have different amounts of RAM, standard 50/50 tensor
parallelism fails because the smaller node can't hold half the weights.
This adds asymmetric TP which splits each weight tensor proportionally
to available memory (e.g. 75/25), so both GPUs compute every layer
simultaneously.

Tested on Qwen3.5-122B-A10B (130GB 8-bit) across 128GB + 48GB Macs
via TB5 RDMA, achieving 18.8 tok/s with both GPUs active on every
token. This is ~4x faster than pipeline parallelism on the same setup.

Addresses exo-explore#953 (arbitrary tensor parallel splits).

Changes:
- Add AsymmetricTensorShardMetadata with per-node ratio field
- Add Sharding.AsymmetricTensor variant
- Add asymmetric_parallel.py with ratio solver and per-layer sharding
  for GatedDeltaNet, Attention, SparseMoeBlock, and dense MLP
- Integrate into placement_utils and shard_and_load

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Store rank-0's ratio on all shards so both ranks agree on the split
  point (fixes misaligned weight slices on rank 1)
- Remove num_attention_heads access from ModelCard (field doesn't exist)
- Gate AsymmetricTensor on qwen3_5 family only (worker raises on others)
- Sort cycle by memory so rank 0 is always the larger-memory node
@Gajesh2007
Copy link
Copy Markdown
Author

Pushed a fix addressing four integration bugs caught in review:

  1. ModelCard.num_attention_heads doesn't exist — was crashing at placement time. Now derives from hidden_size // 128 as a conservative estimate; full validation happens at model load time when the config is available.
  2. Ratio semantics mismatch — rank 1 was getting an inverted split point, causing misaligned weight slices. Now all shards store rank-0's ratio so both workers agree on the same split.
  3. No model family gateAsymmetricTensor was allowed for any supports_tensor model, but the worker only handles Qwen3.5. Now gated to qwen3_5 family and raises a clear error for others.
  4. Cycle ordering — if the smaller-memory node was first in the cycle, it would get the larger shard. Now sorts nodes by available memory so rank 0 is always the bigger machine.

These bugs existed because our hardware test bypassed the exo placement/loading pipeline (hardcoded ratios, called the sharding module directly). The sharding math itself is correct — verified at 41.8 tok/s on Qwen3.5-122B across 128GB + 48GB Macs.


Note on model support: This has only been tested on Qwen3.5-122B-A10B 8-bit. The worker currently raises a clear error for unsupported architectures. If anyone wants to help expand coverage, the main work is adding per-architecture sharding functions in asymmetric_parallel.py — similar to how auto_parallel.py has separate ShardingStrategy classes. NemotronH (Mamba2 + Attention + MoE) and Llama-style models are the natural next targets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant