feat(api): add agent-friendly cluster management endpoints #1727
Open
ecohash-co wants to merge 1 commit into exo-explore:main
Add /v1/cluster/* endpoints designed for programmatic cluster management
by AI agents, CLI tools, and automation scripts.
New endpoints:
- GET /v1/cluster — full cluster overview in one call
- GET /v1/cluster/health — quick liveness check
- GET /v1/cluster/nodes — all nodes with flat summaries
- GET /v1/cluster/nodes/{id} — single node detail
- GET /v1/cluster/models — loaded models + active downloads
- GET /v1/cluster/models/{id}/status — poll model readiness
- POST /v1/cluster/models/load — load model by name (auto-placement)
- POST /v1/cluster/models/swap — atomic unload-then-load
- DELETE /v1/cluster/models/{id} — unload by model name
Design principles:
- Flat fields with units (ram_available_gb, speed_mb_s)
- Error messages suggest fixes ("Need 45GB, have 30GB. Unload X to free Y.")
- 404s list what IS available
- Status polling for async operations (load/swap)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ecohash-co added a commit to ecohash-co/exo that referenced this pull request on Mar 15, 2026:
Adds `exo-cli` as a separate entrypoint for managing a running exo cluster over HTTP, analogous to kubectl/kubelet or obsidian-cli/obsidian.
Commands:
- exo-cli status: Cluster overview
- exo-cli health: Quick liveness (exits 1 if down)
- exo-cli nodes [<id>]: List or inspect nodes
- exo-cli models: Loaded models + downloads
- exo-cli models status <name>: Poll readiness
- exo-cli models load [--wait] <name>: Load with auto-placement
- exo-cli models unload <name>: Unload by name
- exo-cli models swap [--wait] <old> <new>: Atomic model swap
Key features:
- --wait flag blocks until async ops complete (no polling loops in scripts)
- --json flag for machine-readable output
- --host/--port to target any node
- Human-friendly table output by default
- Zero new dependencies (stdlib urllib + argparse)
Closes exo-explore#1728
Depends on exo-explore#1727 (cluster management API endpoints)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author (Contributor):
Heads up: we're aware this has merge conflicts with main. Per @Evanev7's feedback on #1728 about waiting for daemonization before adding HTTP management endpoints, we'll hold off on rebasing to avoid unnecessary churn. Happy to rebase and adapt to the new daemon architecture whenever the team signals readiness. Just ping us here or on the issue.
Member:
thanks for your patience! the dgx spark integrations have proven pretty involved on this front which is current priority.
Motivation
EXO has a powerful internal state model, but the only way to query cluster status programmatically is `GET /state`, which returns the full raw state blob. For AI agents, CLI tools, and automation scripts managing an EXO cluster, this means parsing topology graphs and nested runner status unions, and cross-referencing node IDs across multiple state mappings.
This PR adds a set of `/v1/cluster/*` endpoints that provide flat, pre-digested cluster information designed for programmatic consumption. The primary use case is enabling agents and scripts to manage model lifecycle, e.g., swapping a large model in overnight for batch deep-reasoning work, then switching back to a smaller/faster model during the day.
Changes
New file: `src/exo/shared/types/cluster.py` (13 Pydantic response/request models)
New endpoints in `src/exo/master/api.py`:
- GET /v1/cluster
- GET /v1/cluster/health
- GET /v1/cluster/nodes
- GET /v1/cluster/nodes/{id}
- GET /v1/cluster/models
- GET /v1/cluster/models/{id}/status
- POST /v1/cluster/models/load
- POST /v1/cluster/models/swap
- DELETE /v1/cluster/models/{id}
New file: `src/exo/master/tests/test_cluster_api.py` (13 tests with a realistic 2-node cluster fixture)
Design principles
- Flat fields with units: `ram_available_gb`, `speed_mb_s`, `temperature_c` (no nested objects to traverse)
- Status polling: `GET /v1/cluster/models/{id}/status` returns `ready: true/false` with human-readable progress for async operations
Example: day/night model swap
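A minimal sketch of such a swap from a script, not taken from the PR itself: it posts to `/v1/cluster/models/swap`, then polls `GET /v1/cluster/models/{id}/status` until `ready` is true. The base URL, the request-body field names (`unload`, `load`), and the helper names are assumptions; the real request schema lives in `src/exo/shared/types/cluster.py`.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://localhost:52415"  # assumption: address of a node's API


def http_json(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request to the cluster API and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def swap_and_wait(
    old: str,
    new: str,
    request: Callable[..., dict] = http_json,
    poll_s: float = 5.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Ask the cluster to swap models, then poll until the new one is ready."""
    # Hypothetical body: the field names "unload" and "load" are assumptions.
    request("POST", "/v1/cluster/models/swap", {"unload": old, "load": new})
    status_path = f"/v1/cluster/models/{urllib.parse.quote(new, safe='')}/status"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = request("GET", status_path)
        if status.get("ready"):
            return status
        sleep(poll_s)
    raise TimeoutError(f"{new} was not ready after {timeout_s:.0f}s")
```

A scheduler (cron, for instance) could call `swap_and_wait("day-model", "night-model")` in the evening and the reverse in the morning; both model names are placeholders.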
Why It Works
All endpoints are implemented as methods on the existing `API` class, reading from `self.state` (the same state object the dashboard uses). Write operations (`load`, `unload`, `swap`) send commands through the existing `self._send()` pipeline, so they go through the same master election, placement, and event-sourcing path as dashboard actions. No new infrastructure or state management was added.
Response types are plain `pydantic.BaseModel` (not `CamelCaseModel`) since these are new endpoints with no existing consumers expecting camelCase; field names read naturally as-is (`ram_available_gb`). This is a deliberate choice to maximize readability for the programmatic consumers these endpoints target.
Test Plan
Automated Testing
13 new tests in `src/exo/master/tests/test_cluster_api.py`:
- `TestClusterHealth`: healthy cluster, empty cluster
- `TestClusterOverview`: full structure validation, memory math (340+350=690 available)
- `TestClusterNodes`: list, detail, 404 with helpful message
- `TestClusterModels`: loaded models, empty cluster
- `TestClusterModelStatus`: ready model (by full ID and short name), not-loaded model, loading-in-progress with layer count
All tests use a realistic 2-node fixture (two M3 Ultra 512GB nodes with a pipeline-sharded model). Full suite: 97 tests pass (84 existing + 13 new), zero regressions.
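To illustrate what a "404 with helpful message" check might look like, here is a small sketch. The helper `not_found_detail`, its message wording, and the node IDs are hypothetical and are not the PR's actual code; the point is only that the error names the missing item and lists the valid ones.

```python
from typing import List


def not_found_detail(kind: str, requested: str, available: List[str]) -> str:
    """Build a 404 message that names the missing item and lists valid choices."""
    if available:
        options = ", ".join(sorted(available))
        return f"{kind} '{requested}' not found. Available: {options}."
    return f"{kind} '{requested}' not found. None are currently available."


def test_unknown_node_lists_available_ids() -> None:
    # Hypothetical node IDs; a real test would take them from the cluster fixture.
    detail = not_found_detail("Node", "node-c", ["node-b", "node-a"])
    assert "node-c" in detail
    assert "node-a" in detail and "node-b" in detail
```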
Manual Testing
Not yet tested against a live cluster (our cluster wasn't running during development). The endpoints read from the same `self.state` object used by the dashboard and existing API, so the data flow is well-established. Would appreciate help testing on a live multi-node cluster.
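For anyone willing to help with live testing, a read-only smoke check could look roughly like the sketch below. It requests only the parameterless GET endpoints listed in this PR and reports the HTTP status of each. The helper names and the base URL in the usage note are assumptions, and a connection failure is deliberately left to raise `urllib.error.URLError`.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Parameterless, read-only endpoints from this PR.
READ_ONLY_PATHS = [
    "/v1/cluster",
    "/v1/cluster/health",
    "/v1/cluster/nodes",
    "/v1/cluster/models",
]


def fetch_status(url: str) -> int:
    """Return the HTTP status code for a GET request, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def smoke_test(base_url: str, fetch: Callable[[str], int] = fetch_status) -> Dict[str, int]:
    """Hit each read-only endpoint once and map its path to the status code."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) for path in READ_ONLY_PATHS}
```

Running `smoke_test("http://localhost:52415")` against a node (the address is a placeholder) and confirming that every value is 200 would be a reasonable first pass before trying the write operations.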