
Fix master routing tasks to instances with dead runners#1804

Open
mlpy0 wants to merge 1 commit into exo-explore:main from mlpy0:fix/master-route-dead-instances

Conversation

Contributor

@mlpy0 commented Mar 26, 2026

Problem

I hit this running tensor-parallel inference (Qwen3.5-35B, JACCL across 2 Mac Minis) under heavy load: 24 concurrent long-prompt requests. One runner crashed with SIGABRT from Metal GPU synchronization (mlx::core::Event::wait() via IOSurfaceSharedEvent waitUntilSignaledValue:timeoutMS:). The EXO process stayed alive (HTTP API on port 52415 responding, /state returning data) but the runners were dead:

Runner 84fa9aa5: RunnerFailed — "Terminated (signal=6 (Abort trap: 6))"
Runner 0131b461: RunnerShuttingDown — (killed by plan._kill_runner after sibling failed)

After this, every /v1/chat/completions request hung indefinitely. No error, no timeout — just an open connection that never gets a response.

Root Cause

Three issues in the master compound to cause the hang:

1. Command processor routes to dead instances

The TextGeneration handler in _command_processor selects the instance with the fewest active tasks. After a crash, the dead instance has 0 tasks (all previous tasks got ErrorChunk from the supervisor), so it gets picked first. The worker receives the task but plan() short-circuits to _kill_runner — the TextGeneration task is never executed. The API's _token_chunk_stream blocks forever waiting for chunks.
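The healthier routing check can be sketched as follows. This is a minimal illustration, not exo's actual code: the status string RunnerFailed and the min-active-tasks selection come from this PR description, but the Runner/Instance shapes and the pick_instance helper are hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for exo's master state; the real
# classes live in src/exo/master/. Only "RunnerFailed" is taken from the PR.
@dataclass
class Runner:
    status: str  # e.g. "RunnerReady", "RunnerFailed"

@dataclass
class Instance:
    instance_id: str
    model_id: str
    active_tasks: int = 0

def pick_instance(instances, runners_by_instance, model_id):
    """Pick the healthy instance with the fewest active tasks.

    Negative health check: exclude an instance only if one of its runners
    is RunnerFailed. Instances whose runners have not registered yet
    (normal startup) still accept queued tasks.
    """
    candidates = [
        inst for inst in instances
        if inst.model_id == model_id
        and not any(r.status == "RunnerFailed"
                    for r in runners_by_instance.get(inst.instance_id, []))
    ]
    if not candidates:
        raise ValueError(f"No instance found for model {model_id}")
    return min(candidates, key=lambda inst: inst.active_tasks)
```

With this check, the crashed instance (0 tasks, RunnerFailed runner) is skipped even though it has the lowest task count, while a freshly planned instance with no registered runners remains routable.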

2. "No instance found" error is silently swallowed

The except ValueError at the end of _command_processor catches "No instance found for model X" and logs a warning, but doesn't send any event back to the API. If the model genuinely isn't loaded (or all instances have dead runners), the API creates a channel and waits for chunks that never arrive.
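The shape of the fix is to surface the routing failure to the API instead of only logging it. A hedged sketch, assuming stand-in types: ErrorChunk and ChunkGenerated are named in this PR, but their real constructors and the event-sending interface here are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins; the real event types live in exo's codebase.
@dataclass
class ErrorChunk:
    message: str

@dataclass
class ChunkGenerated:
    task_id: str
    chunk: ErrorChunk

@dataclass
class Command:
    kind: str       # e.g. "TextGeneration"
    task_id: str
    model_id: str

def process_command(cmd: Command,
                    route: Callable[[Command], None],
                    send_event: Callable[[ChunkGenerated], None]) -> None:
    """Route a command; on routing failure, report back instead of hanging."""
    try:
        route(cmd)  # may raise ValueError("No instance found for model ...")
    except ValueError as e:
        # Old behaviour: log a warning and return, so the API's channel
        # never receives a chunk and the HTTP request hangs forever.
        # New behaviour: emit an ErrorChunk so the stream can terminate.
        if cmd.kind == "TextGeneration":
            send_event(ChunkGenerated(cmd.task_id, ErrorChunk(str(e))))
```

The key point is that the except block now produces an observable event for TextGeneration commands rather than terminating silently.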

3. Zombie instances persist after runner crash

_plan() only deletes instances when nodes disconnect from the topology. After a runner crash, the node is still connected (EXO's main process is alive, HTTP API responds, libp2p peer is up). The instance stays in state.instances forever with dead runners — a zombie that matches model lookups but can never serve requests.
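The cleanup extension can be sketched like this, under assumed data shapes (instances as a dict of id to node lists, runner statuses as strings); the for/else pattern matches the one the fix below describes for avoiding a double delete.

```python
def prune_instances(instances, runners_by_instance, connected_nodes):
    """Delete instances on disconnected nodes OR with failed runners.

    Hypothetical shapes: instances maps id -> {"node_ids": [...]};
    runners_by_instance maps id -> list of runner-status strings.
    The for/else ensures the failed-runner check runs only when the
    node-disconnect check did not already delete the instance.
    """
    for instance_id in list(instances):
        for node_id in instances[instance_id]["node_ids"]:
            if node_id not in connected_nodes:
                del instances[instance_id]  # existing disconnect cleanup
                break
        else:
            # Node still connected: a crashed runner makes this a zombie
            # that matches model lookups but can never serve requests.
            if any(status == "RunnerFailed"
                   for status in runners_by_instance.get(instance_id, [])):
                del instances[instance_id]
    return instances
```

Iterating over list(instances) snapshots the keys so deletion during the loop is safe, and the else branch only fires when the inner loop completed without a break.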

Fix

All changes in src/exo/master/main.py:

  1. Skip instances with failed runners — Before counting an instance as available for TextGeneration routing, I check if any of its runners are in RunnerFailed state. I use the negative check (RunnerFailed) rather than a positive check (RunnerReady/RunnerRunning) because during normal startup, runners aren't in state.runners yet — a positive check would break task queuing while models are loading.

  2. Send ErrorChunk on "no instance found" — In the except ValueError block, I send a ChunkGenerated(ErrorChunk) event for TextGeneration commands so the API gets a proper error response instead of hanging. This follows the same pattern used in runner_supervisor.py for runner crash errors.

  3. Delete instances with failed runners in _plan() — Alongside the existing node-disconnect cleanup, I also delete instances where any runner is RunnerFailed. I use the for/else pattern to avoid double-deleting when both checks would trigger.

Testing

All existing tests pass (331 passed, 1 skipped).

When a runner crashes (e.g. SIGABRT from Metal GPU under heavy load),
the master keeps routing new tasks to the dead instance because it only
checks model ID and task count — not runner health. The dead instance
has 0 active tasks so it gets picked first. The worker can't execute
the task (plan._kill_runner short-circuits), and the API hangs forever
waiting for chunks that never arrive.

Three fixes:

1. Skip instances with any RunnerFailed runner in the TextGeneration
   command handler. Uses the negative check (RunnerFailed) rather than
   a positive check (RunnerReady) to avoid breaking task queuing during
   normal startup when runners haven't registered in state yet.

2. Send an ErrorChunk back to the API when no instance is found for
   a model, instead of silently swallowing the ValueError. Without this,
   the API's _token_chunk_stream blocks indefinitely.

3. Clean up zombie instances in _plan(). After a crash the node stays
   connected (EXO process alive, HTTP API responding) but runners are
   dead. The existing cleanup only triggers on node disconnect, so the
   instance persists forever. Now also deletes instances where any
   runner is in RunnerFailed state.