Fix master routing tasks to instances with dead runners #1804
Open
mlpy0 wants to merge 1 commit into exo-explore:main from

Conversation
When a runner crashes (e.g. SIGABRT from Metal GPU under heavy load), the master keeps routing new tasks to the dead instance because it only checks model ID and task count, not runner health. The dead instance has 0 active tasks, so it gets picked first. The worker can't execute the task (`plan` short-circuits to `_kill_runner`), and the API hangs forever waiting for chunks that never arrive.

Three fixes:

1. Skip instances with any `RunnerFailed` runner in the TextGeneration command handler. This uses the negative check (`RunnerFailed`) rather than a positive check (`RunnerReady`) to avoid breaking task queuing during normal startup, when runners haven't registered in state yet.
2. Send an `ErrorChunk` back to the API when no instance is found for a model, instead of silently swallowing the `ValueError`. Without this, the API's `_token_chunk_stream` blocks indefinitely.
3. Clean up zombie instances in `_plan()`. After a crash the node stays connected (EXO process alive, HTTP API responding) but the runners are dead. The existing cleanup only triggers on node disconnect, so the instance persists forever. Now it also deletes instances where any runner is in the `RunnerFailed` state.
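The health-aware routing in fix 1 can be sketched as follows. All names here (`RunnerStatus`, `Instance`, `pick_instance`) are hypothetical stand-ins for exo's internals, not its actual API; the point is the negative check, where a runner that is simply absent from the status map still counts as healthy so startup routing keeps working.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RunnerStatus(Enum):
    READY = auto()    # stand-in for RunnerReady
    RUNNING = auto()  # stand-in for RunnerRunning
    FAILED = auto()   # stand-in for RunnerFailed

@dataclass
class Instance:
    model_id: str
    runner_ids: list[str]
    active_tasks: int = 0

def pick_instance(instances, runner_status, model_id):
    """Pick the least-loaded instance for model_id, skipping any instance
    that has a FAILED runner. Runners missing from runner_status are
    treated as healthy (the startup case, before runners register)."""
    healthy = [
        inst for inst in instances
        if inst.model_id == model_id
        and not any(runner_status.get(r) is RunnerStatus.FAILED
                    for r in inst.runner_ids)
    ]
    if not healthy:
        return None  # caller should surface an error, not hang
    return min(healthy, key=lambda i: i.active_tasks)
```

With a positive check (`is READY`) instead, the list comprehension would reject every instance whose runners haven't registered yet, which is exactly the startup breakage the PR avoids.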
Problem
I hit this running tensor-parallel inference (Qwen3.5-35B, JACCL across 2 Mac Minis) under heavy load: 24 concurrent long-prompt requests. One runner crashed with SIGABRT from Metal GPU synchronization (`mlx::core::Event::wait()` → `IOSurfaceSharedEvent waitUntilSignaledValue:timeoutMS:`). The EXO process stayed alive (HTTP API on port 52415 responding, `/state` returning data) but the runners were dead.

After this, every `/v1/chat/completions` request hung indefinitely. No error, no timeout, just an open connection that never gets a response.

Root Cause
Three issues in the master compound to cause the hang:
1. Command processor routes to dead instances

The TextGeneration handler in `_command_processor` selects the instance with the fewest active tasks. After a crash, the dead instance has 0 tasks (all its previous tasks got an `ErrorChunk` from the supervisor), so it gets picked first. The worker receives the task, but `plan()` short-circuits to `_kill_runner` and the TextGeneration task is never executed. The API's `_token_chunk_stream` blocks forever waiting for chunks.

2. "No instance found" error is silently swallowed

The `except ValueError` at the end of `_command_processor` catches "No instance found for model X" and logs a warning, but doesn't send any event back to the API. If the model genuinely isn't loaded (or all instances have dead runners), the API creates a channel and waits for chunks that never arrive.

3. Zombie instances persist after runner crash

`_plan()` only deletes instances when nodes disconnect from the topology. After a runner crash, the node is still connected (EXO's main process is alive, the HTTP API responds, the libp2p peer is up). The instance stays in `state.instances` forever with dead runners: a zombie that matches model lookups but can never serve requests.

Fix
All changes are in `src/exo/master/main.py`:

1. Skip instances with failed runners. Before counting an instance as available for TextGeneration routing, I check whether any of its runners are in the `RunnerFailed` state. I use the negative check (`RunnerFailed`) rather than a positive check (`RunnerReady`/`RunnerRunning`) because during normal startup, runners aren't in `state.runners` yet; a positive check would break task queuing while models are loading.
2. Send `ErrorChunk` on "no instance found". In the `except ValueError` block, I send a `ChunkGenerated(ErrorChunk)` event for TextGeneration commands, so the API gets a proper error response instead of hanging. This follows the same pattern used in `runner_supervisor.py` for runner crash errors.
3. Delete instances with failed runners in `_plan()`. Alongside the existing node-disconnect cleanup, I also delete instances where any runner is `RunnerFailed`. I use the `for`/`else` pattern to avoid double-deleting when both checks would trigger.

Testing
All existing tests pass (331 passed, 1 skipped).
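The `for`/`else` cleanup described in fix 3 can be sketched like this. The dict-shaped `instances`, `disconnected`, and `runner_status` structures are illustrative assumptions, not exo's real data model; what matters is that `break` in the inner loop skips the `else` branch, so an instance deleted for a disconnected node is never deleted a second time by the failed-runner check.

```python
def prune_instances(instances, disconnected, runner_status):
    """Delete instances whose node disconnected OR that have a failed
    runner, deleting each instance at most once via for/else."""
    # Iterate over a snapshot so deletion during the loop is safe.
    for instance_id, inst in list(instances.items()):
        for node_id in inst["nodes"]:
            if node_id in disconnected:
                del instances[instance_id]
                break  # break skips the else-branch below
        else:
            # Reached only if the node-disconnect loop didn't delete.
            if any(runner_status.get(r) == "RunnerFailed"
                   for r in inst["runners"]):
                del instances[instance_id]
```

For example, an instance on a disconnected node is removed by the first check and the `break` guarantees the failed-runner check never sees it, even if its runners also crashed.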