registry: real control-plane resolver (k8s Service lookup by resource-id label) #28

Open
hhuuggoo wants to merge 1 commit into release-2026.02.01 from hugo/k8s-label-resolver

Conversation

@hhuuggoo

Contributor

What

The cached/chain resolver strategies had their cache built, but their LookupFunc was a stub that fell back to a naming-template guess. That guess can't work for Saturn inference: the model's Service name is pd-{identity5}-{name}-{id}, which embeds the owning group and the endpoint name, and phoebe receives neither. So static/convention were walking-skeleton modes, and the real multi-model design (cached/chain) was never wired.

This implements the real LookupFunc: resolve a deployment id → its model Service by the saturncloud.io/resource-id label, read off the in-cluster Kubernetes API. That label's value is X-Saturn-Resource-Id (Saturn stamps it on every inference Service via basic_resource_labels), so it's an exact join key — no name reconstruction, no new Atlas API, no new headers, and self-correcting if Atlas changes the Service name template.
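To illustrate why the label works as a join key, here is a minimal sketch. It is not the code in this PR: `Service` is a simplified stand-in for the client-go type, `byResourceID` is a hypothetical helper, and the Service names in `main` are made up. The point is only that the match depends on the label value and never on the Service name.

```go
package main

import "fmt"

// resourceIDLabel is the label key named in this PR.
const resourceIDLabel = "saturncloud.io/resource-id"

// Service is a simplified stand-in for a Kubernetes Service
// (hypothetical type; the PR itself uses client-go).
type Service struct {
	Name   string
	Labels map[string]string
}

// byResourceID returns every Service whose resource-id label equals id.
// The Service name is never inspected.
func byResourceID(all []Service, id string) []Service {
	var out []Service
	for _, s := range all {
		if v, ok := s.Labels[resourceIDLabel]; ok && v == id {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Two made-up naming templates for the same deployment id.
	today := []Service{{Name: "pd-abcde-llama-1234", Labels: map[string]string{resourceIDLabel: "1234"}}}
	renamed := []Service{{Name: "inference-1234", Labels: map[string]string{resourceIDLabel: "1234"}}}

	for _, cluster := range [][]Service{today, renamed} {
		for _, s := range byResourceID(cluster, "1234") {
			fmt.Println("resolved:", s.Name)
		}
	}
}
```

Both clusters resolve, even though the name template differs between them.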

Details

  • Selector resource-id=<id>,service-type=internal — the service-type clause is load-bearing: a deployment's ssh Service shares the resource-id label, so without it the match is ambiguous.
  • Port: prefer the port named "8000" (vLLM serve port == Deployment.proxy_port); fall back to the only port on a single-port Service; on a multi-port Service with no port named "8000", return an error rather than silently pick the wrong port.
  • 0 matches → ErrNotFound (negative-cached, short TTL → new models reachable fast). API error / ambiguous / no-served-port → transient error (not cached, retried).
  • Wired into buildResolver for cached/chain; chain keeps convention only as a k8s-API-unreachable fallback.
  • New registry.k8sNamespace config (Saturn: main-namespace), required for cached/chain. Fixed the misleading placeholder convention comment in settings.example.yaml so nobody flips to strategy: convention and gets silent 404s.
  • 8 named unit tests (fake clientset, no cluster) covering invariants + negatives + the ssh-disambiguation attack. Full suite + vet + gofmt green.
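The selection and port rules above can be sketched as follows. This is an illustration under stated assumptions, not the contents of internal/registry/k8s.go: `Service`, `Port`, and `Target` are simplified stand-ins for the real types, `ErrAmbiguous` and `ErrNoServedPort` are hypothetical sentinel names (only ErrNotFound is named in this PR), the label keys are written as they appear in this description, and the `service-type: ssh` value on the ssh Service is assumed for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Port and Service model only the fields this sketch needs.
type Port struct {
	Name   string
	Number int32
}

type Service struct {
	Name   string
	Labels map[string]string
	Ports  []Port
}

// Target is a hypothetical result type: where to route a deployment id.
type Target struct {
	Service string
	Port    int32
}

// ErrNotFound is named in the PR; the other two sentinels are illustrative.
var (
	ErrNotFound     = errors.New("no Service for resource id")
	ErrAmbiguous    = errors.New("multiple Services match selector")
	ErrNoServedPort = errors.New("cannot determine served port")
)

// hasLabel reports whether the Service carries key with exactly this value.
func hasLabel(s Service, key, value string) bool {
	v, ok := s.Labels[key]
	return ok && v == value
}

// pickPort prefers the port named "8000", falls back to the only port on a
// single-port Service, and otherwise refuses to guess.
func pickPort(s Service) (int32, error) {
	for _, p := range s.Ports {
		if p.Name == "8000" {
			return p.Number, nil
		}
	}
	if len(s.Ports) == 1 {
		return s.Ports[0].Number, nil
	}
	return 0, ErrNoServedPort
}

// resolve applies resource-id=<id>,service-type=internal and then picks a port.
func resolve(all []Service, id string) (Target, error) {
	var matched []Service
	for _, s := range all {
		if hasLabel(s, "saturncloud.io/resource-id", id) && hasLabel(s, "service-type", "internal") {
			matched = append(matched, s)
		}
	}
	if len(matched) == 0 {
		return Target{}, ErrNotFound // safe to negative-cache
	}
	if len(matched) > 1 {
		return Target{}, ErrAmbiguous // transient: do not cache
	}
	port, err := pickPort(matched[0])
	if err != nil {
		return Target{}, err // transient: do not cache
	}
	return Target{Service: matched[0].Name, Port: port}, nil
}

func main() {
	labels := func(kind string) map[string]string {
		return map[string]string{"saturncloud.io/resource-id": "42", "service-type": kind}
	}
	model := Service{Name: "model", Labels: labels("internal"), Ports: []Port{{"8000", 8000}}}
	ssh := Service{Name: "model-ssh", Labels: labels("ssh"), Ports: []Port{{"ssh", 22}}}

	// The ssh Service shares the resource-id label but is filtered out.
	fmt.Println(resolve([]Service{ssh, model}, "42"))
}
```

Dropping the service-type check in `resolve` would make the example in `main` return `ErrAmbiguous`, which is the case the ssh-disambiguation test guards against.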

Contracts

phoebe now depends on two Atlas-owned k8s conventions for inference routing — (1) inference Services carry saturncloud.io/resource-id + service-type=internal; (2) the served port is 8000. Both hold today (verified in pdc); this promotes them to a documented routing contract so an Atlas refactor can't silently break routing. No Atlas change required to ship.

Follow-up (separate saturn-k8s PR): the interceptor pod needs an RBAC Role granting get/list on services in main-namespace.

…e-id label

The cached/chain resolver strategies had their cache machinery built but their
LookupFunc was a STUB (conventionLookup) that fell back to a naming-template
guess. That guess can't work for Saturn inference: the model's Service name is
pd-{identity5}-{name}-{id}, which embeds the owning group and endpoint name —
neither of which phoebe receives. So static/convention were walking-skeleton
modes; the real multi-model design (cached/chain) was never wired.

Implement the real LookupFunc (internal/registry/k8s.go): resolve a deployment
id to its model Service by the saturncloud.io/resource-id label, read off the
in-cluster Kubernetes API. That label's value IS X-Saturn-Resource-Id (Saturn
stamps it on every inference Service via basic_resource_labels), so it's an
exact join key — no name reconstruction, no new Atlas API, no new headers. It's
self-correcting: the Service NAME template can change Atlas-side without breaking
phoebe.

- Select `resource-id=<id>,service-type=internal` — the service-type clause is
  load-bearing: a deployment's ssh Service shares the resource-id label, so
  without it the match is ambiguous. (Tested.)
- Port: prefer the port named "8000" (Route.port_name == str(container_port), and
  the vLLM serve port is 8000 == Deployment.proxy_port); single-port fallback;
  error rather than silently pick a wrong port on a multi-port Service. (Tested.)
- 0 matches → ErrNotFound (CachedResolver negative-caches, short TTL, so a new
  model is reachable fast). API error / ambiguous / no-served-port → transient
  error (NOT cached, retried). (Tested.)
- Wire k8sLookup into buildResolver for cached/chain (replacing conventionLookup).
  chain keeps convention as a fallback ONLY for k8s-API-unreachable.
- New config registry.k8sNamespace (Saturn: "main-namespace"); required for
  cached/chain. Fixed the misleading placeholder convention comment in
  settings.example.yaml so nobody flips strategy:convention and gets silent 404s.
- Adds client-go (in-cluster client). Unit-tested with the fake clientset — no
  cluster needed; 8 named tests covering the invariants + negatives + attacks.
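
The caching behaviour described in the list above (cache ErrNotFound briefly, never cache transient errors) can be sketched like this. It is a toy stand-in for CachedResolver, not its real implementation: `cachedResolver`, `fakeClock`, `errAPIUnavailable`, and both TTL values are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrNotFound       = errors.New("no Service for resource id")
	errAPIUnavailable = errors.New("k8s API unavailable") // example transient error
)

// Illustrative TTLs only; the real values are not stated in this commit.
const (
	positiveTTL = 30 * time.Second
	negativeTTL = 2 * time.Second
)

type entry struct {
	target   string
	notFound bool
	expires  time.Time
}

type cachedResolver struct {
	lookup  func(id string) (string, error)
	now     func() time.Time
	entries map[string]entry
}

func newCachedResolver(lookup func(string) (string, error), now func() time.Time) *cachedResolver {
	return &cachedResolver{lookup: lookup, now: now, entries: map[string]entry{}}
}

// Resolve serves fresh cache entries, caches hits and ErrNotFound with
// different TTLs, and passes every other error through uncached.
func (c *cachedResolver) Resolve(id string) (string, error) {
	if e, ok := c.entries[id]; ok && c.now().Before(e.expires) {
		if e.notFound {
			return "", ErrNotFound
		}
		return e.target, nil
	}
	target, err := c.lookup(id)
	switch {
	case err == nil:
		c.entries[id] = entry{target: target, expires: c.now().Add(positiveTTL)}
		return target, nil
	case errors.Is(err, ErrNotFound):
		c.entries[id] = entry{notFound: true, expires: c.now().Add(negativeTTL)}
		return "", ErrNotFound
	default:
		return "", err // transient: the next call retries the lookup
	}
}

// fakeClock lets the example advance time without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time       { return f.t }
func (f *fakeClock) AdvanceSeconds(n int) { f.t = f.t.Add(time.Duration(n) * time.Second) }

func main() {
	clk := &fakeClock{}
	deployed := false
	r := newCachedResolver(func(id string) (string, error) {
		if !deployed {
			return "", ErrNotFound
		}
		return "model-svc:8000", nil
	}, clk.Now)

	_, err := r.Resolve("42")
	fmt.Println("before deploy:", err)

	deployed = true
	_, err = r.Resolve("42") // still served from the negative cache
	fmt.Println("within negative TTL:", err)

	clk.AdvanceSeconds(3) // past negativeTTL, so the lookup runs again
	target, err := r.Resolve("42")
	fmt.Println("after negative TTL:", target, err)
}
```

A short negative TTL bounds how long a freshly created model stays unreachable after its first failed lookup.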

ROUTING CONTRACT (one-way door, for Hugo's review): phoebe now depends on two
Atlas-owned k8s conventions for inference routing — (1) inference Services carry
saturncloud.io/resource-id + service-type=internal; (2) the served port is 8000.
Both hold today; this promotes them to a documented contract so an Atlas refactor
can't silently break routing. No Atlas change required to ship this.

Deploy note: the interceptor pod needs an RBAC Role granting get/list on services
in main-namespace (a saturn-k8s chart change, separate PR).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>