feat(api): add agent-friendly cluster management endpoints#1727

Open
ecohash-co wants to merge 1 commit into exo-explore:main from ecohash-co:feat/cluster-management-api

Conversation

@ecohash-co
Contributor

Motivation

EXO has a powerful internal state model, but the only way to query cluster status programmatically is GET /state, which returns the full raw state blob. For AI agents, CLI tools, and automation scripts managing an EXO cluster, this requires parsing topology graphs, nested runner status unions, and cross-referencing node IDs across multiple state mappings.

This PR adds a set of /v1/cluster/* endpoints that provide flat, pre-digested cluster information designed for programmatic consumption. The primary use case is enabling agents and scripts to manage model lifecycle — e.g., swapping a large model in overnight for batch deep-reasoning work, then switching back to a smaller/faster model during the day.

Changes

New file: src/exo/shared/types/cluster.py — 13 Pydantic response/request models

New endpoints in src/exo/master/api.py:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | `/v1/cluster` | Full cluster overview in one call (nodes, models, tasks, downloads, memory totals) |
| GET | `/v1/cluster/health` | Quick liveness check (`healthy?`, node count, master ID) |
| GET | `/v1/cluster/nodes` | All nodes with flat, agent-friendly summaries |
| GET | `/v1/cluster/nodes/{id}` | Single node detail |
| GET | `/v1/cluster/models` | Loaded models + active downloads |
| GET | `/v1/cluster/models/{id}/status` | Poll model readiness (for waiting on async load/swap) |
| POST | `/v1/cluster/models/load` | Load a model by name; the cluster handles sharding/placement |
| POST | `/v1/cluster/models/swap` | Atomic unload-then-load in one call |
| DELETE | `/v1/cluster/models/{id}` | Unload by model name or short name (not instance ID) |

New file: src/exo/master/tests/test_cluster_api.py — 13 tests with a realistic 2-node cluster fixture

Design principles

  • Flat fields with units: `ram_available_gb`, `speed_mb_s`, `temperature_c` (no nested objects to traverse)
  • Error messages suggest fixes: "Need 45GB, have 30GB. Currently loaded: Qwen3-30B (20GB). Unload a model to free memory."
  • 404s list what IS available — node not found? Response includes available node IDs
  • Actions by name — load/unload/swap by model name, not instance ID
  • Status polling: `GET /v1/cluster/models/{id}/status` returns `ready: true/false` with human-readable progress for async operations
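As a rough illustration of the first three principles, an error-message helper might be written like this. The function names and figures are hypothetical, not the PR's actual implementation:

```python
# Hypothetical sketch of "errors suggest fixes" and "404s list what IS
# available"; names and numbers are illustrative, not exo's real code.
def insufficient_memory_message(needed_gb, available_gb, loaded):
    """Build an error body that tells the caller how to recover."""
    models = ", ".join(f"{name} ({gb}GB)" for name, gb in loaded)
    return (
        f"Need {needed_gb}GB, have {available_gb}GB. "
        f"Currently loaded: {models}. Unload a model to free memory."
    )

def node_not_found_message(node_id, known_ids):
    """404 body that lists the valid node IDs instead of a bare error."""
    return f"Node '{node_id}' not found. Available nodes: {sorted(known_ids)}"
```

An agent receiving either message can act on it directly (unload the named model, retry with a listed node ID) without a second discovery call.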

Example: day/night model swap

```shell
# 11pm cron — swap to large model for overnight batch work
curl -X POST http://localhost:52415/v1/cluster/models/swap -H 'Content-Type: application/json' -d '{
  "unload_model_id": "Qwen3-30B-A3B-4bit",
  "load_model_id": "mlx-community/MiniMax-M1-80B-A45B-4bit",
  "min_nodes": 2
}'

# Poll until ready (jq -e exits nonzero while .ready is false/null)
while ! curl -s http://localhost:52415/v1/cluster/models/MiniMax-M1-80B-A45B-4bit/status | jq -e '.ready' >/dev/null; do
  sleep 10
done

# 6am cron — swap back to fast model
curl -X POST http://localhost:52415/v1/cluster/models/swap -H 'Content-Type: application/json' -d '{
  "unload_model_id": "MiniMax-M1-80B-A45B-4bit",
  "load_model_id": "mlx-community/Qwen3-30B-A3B-4bit"
}'
```

Why It Works

All endpoints are implemented as methods on the existing API class, reading from self.state (the same state object the dashboard uses). Write operations (load, unload, swap) send commands through the existing self._send() pipeline, so they go through the same master election, placement, and event-sourcing path as dashboard actions. No new infrastructure or state management was added.

Response types are plain pydantic.BaseModel (not CamelCaseModel) since these are new endpoints with no existing consumers expecting camelCase — field names read naturally as-is (ram_available_gb). This is a deliberate choice to maximize readability for the programmatic consumers these endpoints target.

Test Plan

Automated Testing

13 new tests in src/exo/master/tests/test_cluster_api.py:

  • TestClusterHealth — healthy cluster, empty cluster
  • TestClusterOverview — full structure validation, memory math (340+350=690 available)
  • TestClusterNodes — list, detail, 404 with helpful message
  • TestClusterModels — loaded models, empty cluster
  • TestClusterModelStatus — ready model (by full ID and short name), not-loaded model, loading-in-progress with layer count

All tests use a realistic 2-node fixture (two M3 Ultra 512GB nodes with a pipeline-sharded model). Full suite: 97 tests pass (84 existing + 13 new), zero regressions.
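The memory-math check can be illustrated with a self-contained stand-in for that fixture. The real tests live in `src/exo/master/tests/test_cluster_api.py` and use exo's state types; plain dicts are used here only to show the arithmetic being asserted:

```python
# Illustrative stand-in for the 2-node fixture (two M3 Ultra 512GB
# nodes); the real fixture builds exo state objects, not dicts.
def two_node_cluster():
    return [
        {"node_id": "m3-ultra-1", "ram_total_gb": 512, "ram_available_gb": 340},
        {"node_id": "m3-ultra-2", "ram_total_gb": 512, "ram_available_gb": 350},
    ]

def total_available_gb(nodes):
    """Cluster-wide available memory, as reported by GET /v1/cluster."""
    return sum(n["ram_available_gb"] for n in nodes)
```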

Manual Testing

Not yet tested against a live cluster (our cluster wasn't running during development). The endpoints read from the same self.state object used by the dashboard and existing API, so the data flow is well-established. Would appreciate help testing on a live multi-node cluster.

Add /v1/cluster/* endpoints designed for programmatic cluster management
by AI agents, CLI tools, and automation scripts.

New endpoints:
- GET  /v1/cluster           — full cluster overview in one call
- GET  /v1/cluster/health    — quick liveness check
- GET  /v1/cluster/nodes     — all nodes with flat summaries
- GET  /v1/cluster/nodes/{id} — single node detail
- GET  /v1/cluster/models    — loaded models + active downloads
- GET  /v1/cluster/models/{id}/status — poll model readiness
- POST /v1/cluster/models/load  — load model by name (auto-placement)
- POST /v1/cluster/models/swap  — atomic unload-then-load
- DELETE /v1/cluster/models/{id} — unload by model name

Design principles:
- Flat fields with units (ram_available_gb, speed_mb_s)
- Error messages suggest fixes ("Need 45GB, have 30GB. Unload X to free Y.")
- 404s list what IS available
- Status polling for async operations (load/swap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ecohash-co added a commit to ecohash-co/exo that referenced this pull request Mar 15, 2026
Adds `exo-cli` as a separate entrypoint for managing a running exo
cluster over HTTP, analogous to kubectl/kubelet or obsidian-cli/obsidian.

Commands:
  exo-cli status                          Cluster overview
  exo-cli health                          Quick liveness (exits 1 if down)
  exo-cli nodes [<id>]                    List or inspect nodes
  exo-cli models                          Loaded models + downloads
  exo-cli models status <name>            Poll readiness
  exo-cli models load [--wait] <name>     Load with auto-placement
  exo-cli models unload <name>            Unload by name
  exo-cli models swap [--wait] <old> <new>  Atomic model swap

Key features:
- --wait flag blocks until async ops complete (no polling loops in scripts)
- --json flag for machine-readable output
- --host/--port to target any node
- Human-friendly table output by default
- Zero new dependencies (stdlib urllib + argparse)

Closes exo-explore#1728
Depends on exo-explore#1727 (cluster management API endpoints)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
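Since the commit advertises zero new dependencies (stdlib `urllib` + `argparse`), the command surface above could be sketched as an argparse skeleton like the following. The structure is hypothetical and only mirrors the command names listed in the commit message:

```python
import argparse

def build_parser():
    """Hypothetical argparse skeleton for an exo-cli-style tool.

    Command names mirror the commit message above; flags and structure
    are a sketch, not the actual exo-cli implementation.
    """
    p = argparse.ArgumentParser(prog="exo-cli")
    p.add_argument("--host", default="localhost")
    p.add_argument("--port", type=int, default=52415)
    p.add_argument("--json", action="store_true", help="machine-readable output")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("status")
    sub.add_parser("health")
    nodes = sub.add_parser("nodes")
    nodes.add_argument("node_id", nargs="?")
    models = sub.add_parser("models")
    msub = models.add_subparsers(dest="action")
    for name in ("status", "load", "unload", "swap"):
        action = msub.add_parser(name)
        if name in ("load", "swap"):
            action.add_argument("--wait", action="store_true")
    return p
```

Global flags like `--host`/`--port` sit on the root parser so every subcommand can target any node.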
@ecohash-co
Contributor Author

Heads up — we're aware this has merge conflicts with main. Per @Evanev7's feedback on #1728 about waiting for daemonization before adding HTTP management endpoints, we'll hold off on rebasing to avoid unnecessary churn.

Happy to rebase and adapt to the new daemon architecture whenever the team signals readiness. Just ping us here or on the issue.

@Evanev7
Member

Evanev7 commented Mar 28, 2026

thanks for your patience! the dgx spark integrations have proven pretty involved on this front, which is the current priority.
