feat(api): add agent-friendly cluster management endpoints#1727

Open
ecohash-co wants to merge 1 commit into exo-explore:main from ecohash-co:feat/cluster-management-api

Conversation

@ecohash-co
Contributor

Motivation

EXO has a powerful internal state model, but the only way to query cluster status programmatically is GET /state, which returns the full raw state blob. For AI agents, CLI tools, and automation scripts managing an EXO cluster, this requires parsing topology graphs, nested runner status unions, and cross-referencing node IDs across multiple state mappings.

This PR adds a set of /v1/cluster/* endpoints that provide flat, pre-digested cluster information designed for programmatic consumption. The primary use case is enabling agents and scripts to manage model lifecycle — e.g., swapping a large model in overnight for batch deep-reasoning work, then switching back to a smaller/faster model during the day.

Changes

New file: src/exo/shared/types/cluster.py — 13 Pydantic response/request models

New endpoints in src/exo/master/api.py:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | `/v1/cluster` | Full cluster overview in one call (nodes, models, tasks, downloads, memory totals) |
| GET | `/v1/cluster/health` | Quick liveness check (`healthy?`, node count, master ID) |
| GET | `/v1/cluster/nodes` | All nodes with flat, agent-friendly summaries |
| GET | `/v1/cluster/nodes/{id}` | Single node detail |
| GET | `/v1/cluster/models` | Loaded models + active downloads |
| GET | `/v1/cluster/models/{id}/status` | Poll model readiness (for waiting on async load/swap) |
| POST | `/v1/cluster/models/load` | Load a model by name; the cluster handles sharding/placement |
| POST | `/v1/cluster/models/swap` | Atomic unload-then-load in one call |
| DELETE | `/v1/cluster/models/{id}` | Unload by model name or short name (not instance ID) |

New file: src/exo/master/tests/test_cluster_api.py — 13 tests with a realistic 2-node cluster fixture

Design principles

  • Flat fields with units: `ram_available_gb`, `speed_mb_s`, `temperature_c` (no nested objects to traverse)
  • Error messages suggest fixes: "Need 45GB, have 30GB. Currently loaded: Qwen3-30B (20GB). Unload a model to free memory."
  • 404s list what IS available — node not found? Response includes available node IDs
  • Actions by name — load/unload/swap by model name, not instance ID
  • Status polling: `GET /v1/cluster/models/{id}/status` returns `ready: true/false` with human-readable progress for async operations
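As a rough illustration of the first three principles, an error-message helper might be written like this. The function names and figures are hypothetical, not the PR's actual implementation:

```python
# Hypothetical sketch of "errors suggest fixes" and "404s list what IS
# available"; names and numbers are illustrative, not exo's real code.
def insufficient_memory_message(needed_gb, available_gb, loaded):
    """Build an error body that tells the caller how to recover."""
    models = ", ".join(f"{name} ({gb}GB)" for name, gb in loaded)
    return (
        f"Need {needed_gb}GB, have {available_gb}GB. "
        f"Currently loaded: {models}. Unload a model to free memory."
    )

def node_not_found_message(node_id, known_ids):
    """404 body that lists the valid node IDs instead of a bare error."""
    return f"Node '{node_id}' not found. Available nodes: {sorted(known_ids)}"
```

An agent receiving either message can act on it directly (unload the named model, retry with a listed node ID) without a second discovery call.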

Example: day/night model swap

```shell
# 11pm cron — swap to large model for overnight batch work
curl -X POST http://localhost:52415/v1/cluster/models/swap -H 'Content-Type: application/json' -d '{
  "unload_model_id": "Qwen3-30B-A3B-4bit",
  "load_model_id": "mlx-community/MiniMax-M1-80B-A45B-4bit",
  "min_nodes": 2
}'

# Poll until ready (jq -e exits nonzero while .ready is false/null)
while ! curl -s http://localhost:52415/v1/cluster/models/MiniMax-M1-80B-A45B-4bit/status | jq -e '.ready' >/dev/null; do
  sleep 10
done

# 6am cron — swap back to fast model
curl -X POST http://localhost:52415/v1/cluster/models/swap -H 'Content-Type: application/json' -d '{
  "unload_model_id": "MiniMax-M1-80B-A45B-4bit",
  "load_model_id": "mlx-community/Qwen3-30B-A3B-4bit"
}'
```

Why It Works

All endpoints are implemented as methods on the existing API class, reading from self.state (the same state object the dashboard uses). Write operations (load, unload, swap) send commands through the existing self._send() pipeline, so they go through the same master election, placement, and event-sourcing path as dashboard actions. No new infrastructure or state management was added.

Response types are plain pydantic.BaseModel (not CamelCaseModel) since these are new endpoints with no existing consumers expecting camelCase — field names read naturally as-is (ram_available_gb). This is a deliberate choice to maximize readability for the programmatic consumers these endpoints target.

Test Plan

Automated Testing

13 new tests in src/exo/master/tests/test_cluster_api.py:

  • TestClusterHealth — healthy cluster, empty cluster
  • TestClusterOverview — full structure validation, memory math (340+350=690 available)
  • TestClusterNodes — list, detail, 404 with helpful message
  • TestClusterModels — loaded models, empty cluster
  • TestClusterModelStatus — ready model (by full ID and short name), not-loaded model, loading-in-progress with layer count

All tests use a realistic 2-node fixture (two M3 Ultra 512GB nodes with a pipeline-sharded model). Full suite: 97 tests pass (84 existing + 13 new), zero regressions.
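The memory-math check can be illustrated with a self-contained stand-in for that fixture. The real tests live in `src/exo/master/tests/test_cluster_api.py` and use exo's state types; plain dicts are used here only to show the arithmetic being asserted:

```python
# Illustrative stand-in for the 2-node fixture (two M3 Ultra 512GB
# nodes); the real fixture builds exo state objects, not dicts.
def two_node_cluster():
    return [
        {"node_id": "m3-ultra-1", "ram_total_gb": 512, "ram_available_gb": 340},
        {"node_id": "m3-ultra-2", "ram_total_gb": 512, "ram_available_gb": 350},
    ]

def total_available_gb(nodes):
    """Cluster-wide available memory, as reported by GET /v1/cluster."""
    return sum(n["ram_available_gb"] for n in nodes)
```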

Manual Testing

Not yet tested against a live cluster (our cluster wasn't running during development). The endpoints read from the same self.state object used by the dashboard and existing API, so the data flow is well-established. Would appreciate help testing on a live multi-node cluster.

Add /v1/cluster/* endpoints designed for programmatic cluster management
by AI agents, CLI tools, and automation scripts.

New endpoints:
- GET  /v1/cluster           — full cluster overview in one call
- GET  /v1/cluster/health    — quick liveness check
- GET  /v1/cluster/nodes     — all nodes with flat summaries
- GET  /v1/cluster/nodes/{id} — single node detail
- GET  /v1/cluster/models    — loaded models + active downloads
- GET  /v1/cluster/models/{id}/status — poll model readiness
- POST /v1/cluster/models/load  — load model by name (auto-placement)
- POST /v1/cluster/models/swap  — atomic unload-then-load
- DELETE /v1/cluster/models/{id} — unload by model name

Design principles:
- Flat fields with units (ram_available_gb, speed_mb_s)
- Error messages suggest fixes ("Need 45GB, have 30GB. Unload X to free Y.")
- 404s list what IS available
- Status polling for async operations (load/swap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ecohash-co added a commit to ecohash-co/exo that referenced this pull request Mar 15, 2026
Adds `exo-cli` as a separate entrypoint for managing a running exo
cluster over HTTP, analogous to kubectl/kubelet or obsidian-cli/obsidian.

Commands:
  exo-cli status                          Cluster overview
  exo-cli health                          Quick liveness (exits 1 if down)
  exo-cli nodes [<id>]                    List or inspect nodes
  exo-cli models                          Loaded models + downloads
  exo-cli models status <name>            Poll readiness
  exo-cli models load [--wait] <name>     Load with auto-placement
  exo-cli models unload <name>            Unload by name
  exo-cli models swap [--wait] <old> <new>  Atomic model swap

Key features:
- --wait flag blocks until async ops complete (no polling loops in scripts)
- --json flag for machine-readable output
- --host/--port to target any node
- Human-friendly table output by default
- Zero new dependencies (stdlib urllib + argparse)

Closes exo-explore#1728
Depends on exo-explore#1727 (cluster management API endpoints)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
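Since the commit advertises zero new dependencies (stdlib `urllib` + `argparse`), the command surface above could be sketched as an argparse skeleton like the following. The structure is hypothetical and only mirrors the command names listed in the commit message:

```python
import argparse

def build_parser():
    """Hypothetical argparse skeleton for an exo-cli-style tool.

    Command names mirror the commit message above; flags and structure
    are a sketch, not the actual exo-cli implementation.
    """
    p = argparse.ArgumentParser(prog="exo-cli")
    p.add_argument("--host", default="localhost")
    p.add_argument("--port", type=int, default=52415)
    p.add_argument("--json", action="store_true", help="machine-readable output")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("status")
    sub.add_parser("health")
    nodes = sub.add_parser("nodes")
    nodes.add_argument("node_id", nargs="?")
    models = sub.add_parser("models")
    msub = models.add_subparsers(dest="action")
    for name in ("status", "load", "unload", "swap"):
        action = msub.add_parser(name)
        if name in ("load", "swap"):
            action.add_argument("--wait", action="store_true")
    return p
```

Global flags like `--host`/`--port` sit on the root parser so every subcommand can target any node.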
@ecohash-co
Contributor Author

Heads up — we're aware this has merge conflicts with main. Per @Evanev7's feedback on #1728 about waiting for daemonization before adding HTTP management endpoints, we'll hold off on rebasing to avoid unnecessary churn.

Happy to rebase and adapt to the new daemon architecture whenever the team signals readiness. Just ping us here or on the issue.

@Evanev7
Member

Evanev7 commented Mar 28, 2026

thanks for your patience! the dgx spark integrations have proven pretty involved on this front, which is the current priority.
