
Fix master routing tasks to instances with dead runners#1804

Open
mlpy0 wants to merge 1 commit into exo-explore:main from mlpy0:fix/master-route-dead-instances

Conversation

Contributor

@mlpy0 commented Mar 26, 2026

Problem

I hit this running tensor-parallel inference (Qwen3.5-35B, JACCL across 2 Mac Minis) under heavy load: 24 concurrent long-prompt requests. One runner crashed with SIGABRT from Metal GPU synchronization (mlx::core::Event::wait() via IOSurfaceSharedEvent waitUntilSignaledValue:timeoutMS:). The EXO process stayed alive (HTTP API on port 52415 responding, /state returning data) but the runners were dead:

Runner 84fa9aa5: RunnerFailed — "Terminated (signal=6 (Abort trap: 6))"
Runner 0131b461: RunnerShuttingDown — (killed by plan._kill_runner after sibling failed)

After this, every /v1/chat/completions request hung indefinitely. No error, no timeout — just an open connection that never gets a response.

Root Cause

Three issues in the master compound to cause the hang:

1. Command processor routes to dead instances

The TextGeneration handler in _command_processor selects the instance with the fewest active tasks. After a crash, the dead instance has 0 tasks (all previous tasks got ErrorChunk from the supervisor), so it gets picked first. The worker receives the task but plan() short-circuits to _kill_runner — the TextGeneration task is never executed. The API's _token_chunk_stream blocks forever waiting for chunks.
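The healthier routing check can be sketched as follows. This is a minimal illustration, not exo's actual code: the status string RunnerFailed and the min-active-tasks selection come from this PR description, but the Runner/Instance shapes and the pick_instance helper are hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for exo's master state; the real
# classes live in src/exo/master/. Only "RunnerFailed" is taken from the PR.
@dataclass
class Runner:
    status: str  # e.g. "RunnerReady", "RunnerFailed"

@dataclass
class Instance:
    instance_id: str
    model_id: str
    active_tasks: int = 0

def pick_instance(instances, runners_by_instance, model_id):
    """Pick the healthy instance with the fewest active tasks.

    Negative health check: exclude an instance only if one of its runners
    is RunnerFailed. Instances whose runners have not registered yet
    (normal startup) still accept queued tasks.
    """
    candidates = [
        inst for inst in instances
        if inst.model_id == model_id
        and not any(r.status == "RunnerFailed"
                    for r in runners_by_instance.get(inst.instance_id, []))
    ]
    if not candidates:
        raise ValueError(f"No instance found for model {model_id}")
    return min(candidates, key=lambda inst: inst.active_tasks)
```

With this check, the crashed instance (0 tasks, RunnerFailed runner) is skipped even though it has the lowest task count, while a freshly planned instance with no registered runners remains routable.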

2. "No instance found" error is silently swallowed

The except ValueError at the end of _command_processor catches "No instance found for model X" and logs a warning, but doesn't send any event back to the API. If the model genuinely isn't loaded (or all instances have dead runners), the API creates a channel and waits for chunks that never arrive.
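The shape of the fix is to surface the routing failure to the API instead of only logging it. A hedged sketch, assuming stand-in types: ErrorChunk and ChunkGenerated are named in this PR, but their real constructors and the event-sending interface here are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins; the real event types live in exo's codebase.
@dataclass
class ErrorChunk:
    message: str

@dataclass
class ChunkGenerated:
    task_id: str
    chunk: ErrorChunk

@dataclass
class Command:
    kind: str       # e.g. "TextGeneration"
    task_id: str
    model_id: str

def process_command(cmd: Command,
                    route: Callable[[Command], None],
                    send_event: Callable[[ChunkGenerated], None]) -> None:
    """Route a command; on routing failure, report back instead of hanging."""
    try:
        route(cmd)  # may raise ValueError("No instance found for model ...")
    except ValueError as e:
        # Old behaviour: log a warning and return, so the API's channel
        # never receives a chunk and the HTTP request hangs forever.
        # New behaviour: emit an ErrorChunk so the stream can terminate.
        if cmd.kind == "TextGeneration":
            send_event(ChunkGenerated(cmd.task_id, ErrorChunk(str(e))))
```

The key point is that the except block now produces an observable event for TextGeneration commands rather than terminating silently.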

3. Zombie instances persist after runner crash

_plan() only deletes instances when nodes disconnect from the topology. After a runner crash, the node is still connected (EXO's main process is alive, HTTP API responds, libp2p peer is up). The instance stays in state.instances forever with dead runners — a zombie that matches model lookups but can never serve requests.
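The cleanup extension can be sketched like this, under assumed data shapes (instances as a dict of id to node lists, runner statuses as strings); the for/else pattern matches the one the fix below describes for avoiding a double delete.

```python
def prune_instances(instances, runners_by_instance, connected_nodes):
    """Delete instances on disconnected nodes OR with failed runners.

    Hypothetical shapes: instances maps id -> {"node_ids": [...]};
    runners_by_instance maps id -> list of runner-status strings.
    The for/else ensures the failed-runner check runs only when the
    node-disconnect check did not already delete the instance.
    """
    for instance_id in list(instances):
        for node_id in instances[instance_id]["node_ids"]:
            if node_id not in connected_nodes:
                del instances[instance_id]  # existing disconnect cleanup
                break
        else:
            # Node still connected: a crashed runner makes this a zombie
            # that matches model lookups but can never serve requests.
            if any(status == "RunnerFailed"
                   for status in runners_by_instance.get(instance_id, [])):
                del instances[instance_id]
    return instances
```

Iterating over list(instances) snapshots the keys so deletion during the loop is safe, and the else branch only fires when the inner loop completed without a break.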

Fix

All changes in src/exo/master/main.py:

  1. Skip instances with failed runners — Before counting an instance as available for TextGeneration routing, I check if any of its runners are in RunnerFailed state. I use the negative check (RunnerFailed) rather than a positive check (RunnerReady/RunnerRunning) because during normal startup, runners aren't in state.runners yet — a positive check would break task queuing while models are loading.

  2. Send ErrorChunk on "no instance found" — In the except ValueError block, I send a ChunkGenerated(ErrorChunk) event for TextGeneration commands so the API gets a proper error response instead of hanging. This follows the same pattern used in runner_supervisor.py for runner crash errors.

  3. Delete instances with failed runners in _plan() — Alongside the existing node-disconnect cleanup, I also delete instances where any runner is RunnerFailed. I use the for/else pattern to avoid double-deleting when both checks would trigger.

Testing

All existing tests pass (331 passed, 1 skipped).

When a runner crashes (e.g. SIGABRT from Metal GPU under heavy load),
the master keeps routing new tasks to the dead instance because it only
checks model ID and task count — not runner health. The dead instance
has 0 active tasks so it gets picked first. The worker can't execute
the task (plan._kill_runner short-circuits), and the API hangs forever
waiting for chunks that never arrive.

Three fixes:

1. Skip instances with any RunnerFailed runner in the TextGeneration
   command handler. Uses the negative check (RunnerFailed) rather than
   a positive check (RunnerReady) to avoid breaking task queuing during
   normal startup when runners haven't registered in state yet.

2. Send an ErrorChunk back to the API when no instance is found for
   a model, instead of silently swallowing the ValueError. Without this,
   the API's _token_chunk_stream blocks indefinitely.

3. Clean up zombie instances in _plan(). After a crash the node stays
   connected (EXO process alive, HTTP API responding) but runners are
   dead. The existing cleanup only triggers on node disconnect, so the
   instance persists forever. Now also deletes instances where any
   runner is in RunnerFailed state.