Land phoebe on release-2026.02.01#24
Merged
Merged
Conversation
Standard Go service layout (cmd/ + internal/) for the token factory's metering interceptor: the tenant-aware reverse proxy that sits behind Traefik and in front of the inference router/engine. Wired and tested (walking skeleton): - identity: trusted X-Saturn-* header extraction (no re-auth) - registry: model -> upstream resolution (Resolver iface + Static impl), clean 404 for torn-down models - proxy: identity -> registry -> upstream dispatch, server lifecycle with streaming-tuned timeouts and graceful shutdown - config: YAML settings (defaults -> unmarshal -> parse) - metering: immutable Event + Usage schema and Emitter contract - logging: leveled logger Idioms (logger, settings load-flow, X-Saturn-* headers, build files) borrowed from auth-server. Tooling: golangci-lint (v1.64.8) + pre-commit, make lint/vet/test, CI lint+test jobs. make lint/vet/test and gofmt all clean. Deliberately stubbed for the next milestone (marked TODO in internal/proxy and internal/metering): the SSE forward-then-inspect tee, forcing stream_options.include_usage=true, client-abort cancellation, and the durable metering emitter (Kafka/Redis Streams; LogEmitter stands in). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the load-bearing milestone: capture the engine's usage block
from the response as it streams to the client, never buffering.
- capture.Result: frozen handoff contract between the tee (M1) and the
metering emitter (M2). Carries usage, finish_reason, UsageFound (an
explicit "unknown" signal, not zero), Aborted, and Streamed.
- rewrite.go: force stream_options.include_usage=true on streaming
requests by parse->set->reserialize (not duplicate-key last-writer-win,
which vLLM's Pydantic parse doesn't honor). Overwrites client
include_usage:false, mirroring vLLM's --enable-force-include-usage.
Preserves other stream_options; leaves non-streaming/non-JSON untouched.
- tee.go: captureReader wraps the upstream body as an io.Reader that
forwards bytes verbatim while inspecting them. Streaming: scans SSE
"data: {...}" lines, pulls usage from the trailing chunk (choices: [])
and finish_reason from its own earlier chunk — crucially does NOT stop
at finish_reason (the LiteLLM trap). Non-streaming: parses the single
JSON body once.
- proxy.go: wires rewrite -> ReverseProxy (FlushInterval=-1 for per-chunk
flush) -> ModifyResponse tee -> emit. Client-disconnect (context.Canceled)
is handled as an abort, not a 502.
Verified vLLM usage shape against source (v0.6.4..main): cached tokens at
prompt_tokens_details.cached_tokens, gated by --enable-prompt-tokens-details
+ a cache hit, so absence means 0 (handled). No completion_tokens_details
in vLLM. Usage chunk precedes data: [DONE].
Caught and fixed a buffer-aliasing bug in SSE line scanning that dropped
usage under realistic chunked reads (the byte-at-a-time test passed while
the bulk-read test failed — the tell). Tests cover the LiteLLM ordering,
cache-miss omission, non-streaming, split-across-reads, no-usage, abort
finalize, and a full end-to-end streamed request through a fake vLLM
backend. 13 tests, race-clean, lint-clean.
Abort *correctness* (mark aborted + emit partial per policy) is M3; the
Aborted field and markAborted hook are in place for it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nResolver
Adds two resolver strategies to internal/registry/ so new models become
reachable without a redeploy and torn-down models fail cleanly as ErrNotFound:
- ConventionResolver: zero-lookup URL derivation via a configurable {id}
template (e.g. http://model-{id}.ns.svc.cluster.local:8000). Needs no
control-plane call; works by Kubernetes Service naming convention alone.
- CachedResolver: wraps an injected LookupFunc (the seam for a future Atlas
control-plane call) with an LRU cache and separate positive / negative TTLs.
Negative TTL is intentionally short so newly-created models become reachable
quickly. Single-flight deduplication (channel-based, race-free) prevents
thundering-herd on cache miss. Clock is injectable for deterministic tests.
- ChainResolver: tries each Resolver in order, falling through on ErrNotFound
OR any other error, so a control-plane outage degrades gracefully to the
naming-convention guess rather than failing all traffic.
- registry.Config struct for strategy selection and all tunable knobs.
26 tests covering: template substitution, cache hit/miss, positive and negative
TTL expiry, not-found propagation, transient-error retry, URL mutation safety,
concurrent Resolve with single-flight, composition (CachedResolver wrapping
ConventionResolver as LookupFunc, and ChainResolver{cached, convention}).
Full gate: go build ./..., go test -race ./..., go vet ./...,
golangci-lint run, gofmt -l . (empty) — all clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds internal/emit with a DurableEmitter that implements metering.Emitter via Valkey Streams (hot XADD path) → local-disk WAL (fsync'd JSONL fallback when Valkey is down or channel is full) → structured log floor (last resort if disk also fails). Emit is always non-blocking: events are handed to a buffered channel and dispatched by a background worker pool; a separate shipper goroutine drains the WAL back into Valkey on recovery. Clock is injectable for deterministic tests. 19 tests covering the WAL, all three fallback levels, concurrency, idempotency, timestamps, and graceful close. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # go.mod # go.sum
… policy
Wire client-disconnect detection into the proxy path and apply explicit
billing policy for aborted requests.
## captureReader (tee.go)
- Add sync.Mutex guarding result and done across all paths (Read goroutine,
markAborted watcher goroutine, Close/EOF path). All result writes and the
done flag are now under mu — go test -race clean.
- Add finishedCh (chan struct{}) closed by finish() so the abort-watcher
goroutine in handleProxy can exit without leaking when the stream completes
normally (two-arm select on r.Context().Done() vs cr.finishedCh).
- markAborted() now sets Aborted=true under mu, then calls finish() if not
already done — guaranteeing onDone sees Aborted=true when the watcher wins
the race. If finish() already ran (clean EOF before context cancel), the
done guard prevents a second fire and Aborted=false stands (correct).
- onDone fires exactly once regardless of whether abort-watcher, Read-EOF,
or Close reaches finish() first.
## handleProxy (proxy.go)
- Remove the NOTE about M3 being pending; implement it.
- Launch abort-watcher goroutine inside ModifyResponse (after cr is assigned)
that selects on r.Context().Done() → cr.markAborted() or cr.finishedCh →
exit. The goroutine never leaks: finishedCh fires on every normal completion.
## emit policy (proxy.go)
- Aborted + usage captured: always emit (real counts available).
- Aborted + no usage + BillPartialOnAbort=true: emit zero-count partial event
so downstream can reconcile.
- Aborted + no usage + BillPartialOnAbort=false: log only, no billable event.
- Not aborted + no usage: unchanged from M1 (log for reconciliation).
## Tests (tee_test.go, abort_test.go)
New tests: TestTeeOnDoneFiresExactlyOnce, TestTeeMarkAbortedAfterEOFIsNoop,
TestTeeAbortWithPartialUsage (uses blockAfterReader to simulate mid-stream
abort with usage present), TestAbortMidStreamEmitsAbortedEvent,
TestAbortBillPartialTrue_NoUsage, TestAbortBillPartialFalse_NoUsage,
TestAbortWithUsage (subtests for both billPartial values),
TestAbortOnDoneFiresExactlyOnceViaProxy, TestNormalCompletionNotAffectedByAbortWatcher,
TestAbortRaceStress (50 concurrent aborts; -race verifies no data races),
TestLongStreamNoDeadlineSever (slow backend, no deadline severing).
All existing M1 tee and proxy tests pass unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The binary now uses the real M2/M4 implementations instead of the LogEmitter/Static stubs: - config.Settings gains registry.* and emit.* sections (plain YAML fields, translated into the subsystem configs in main.go so config doesn't import emit/registry). Strategy validation + TTL parsing in parse(). - main.go builds the resolver per strategy (static|convention|cached|chain) and the durable emitter (Valkey + WAL fallback), with graceful shutdown. A failed durable-emitter init falls back to the log emitter rather than taking down serving. - settings.example.yaml documents the new registry/emit options. - config_test.go covers defaults, strategy validation, TTL parsing, emit. Known seam: the cached/chain resolver's LookupFunc currently degrades to the naming convention. The real Atlas control-plane lookup is still gated on confirming auth-server resolves model resources via X-Saturn-Resource-Id. Full gate green: build, test -race (73 tests), vet, golangci-lint, gofmt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the token / API-key identity (X-Saturn-Auth-Id, the JWT sub / IdentityAuth.id) plus every other X-Saturn-* header verbatim, and record them on the metering Event. AuthID is the primary attribution key; org/user/group are resolved downstream (out of band, at rating time) from the IdentityAuth record, so the hot path never resolves a multi-org user's active-org context. - identity: add HeaderAuthID + Identity.AuthID; FromRequest reads it. - metering.Event: add AuthID, ResourceID, ResourceType (all omitempty) so no identity the edge gave us is dropped; reorganize fields with clearer docs. - proxy.emit: populate the new fields. - tests: identity capture (all-headers + all-empty), e2e proxy asserts auth_id + resource_type flow into the event. Security: Phoebe blindly trusts these headers (it does not authenticate). That is safe only because Traefik's atlas-auth ForwardAuth allowlist overwrites any client-supplied X-Saturn-* copy. X-Saturn-Auth-Id is added to that allowlist in saturn-k8s#976, and emitted by auth-server#85 — both must be live before this header is relied on for billing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A billing product must not serve traffic it can't attribute. Add an early billing-identity gate in handleProxy that rejects with 400 if X-Saturn-Auth-Id (the token / API-key attribution key) or X-Saturn-Resource-Id (the model being billed) is absent. The error names every missing field so a header misconfiguration — auth-server not emitting it, or Traefik not allowlisting it — is immediately obvious. This replaces the prior lone resource-id check with missingBillingFields(), which reports all missing required fields at once. UserID/GroupID are NOT required here — they're resolved downstream from AuthID. A rejected request emits no billing event (verified). Tests cover each missing combination, the both-present happy path, and updated existing proxy/abort tests to supply auth-id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Records the why behind Phoebe's shape — the decisions not recoverable from code: the ForwardAuth trust model + non-spoofing invariant, token-id-as- billing-key with downstream org resolution (and why org can't be on the hot path for multi-org users), the tiered durability contract (Postgres = system of record, WAL loss acceptable → Phoebe stays stateless), replica safety and the resolver-returns-Services invariant, Postgres-first I/O logging with the documented OpenSearch tripwire, the Atlas Postgres conventions to follow, and the open verify-gates / defaulted decisions from autonomous build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The interceptor's emitter XADDs metering events to a Valkey Stream as the hot buffer. This adds the consumer side of that durability ladder: a standalone service that reads the same stream via a consumer group and writes each event durably to Postgres, the system of record for raw, pre-rating token counts. internal/drain: - Drainer.Run drives XREADGROUP in batches; XACK happens ONLY after the Postgres write commits (at-least-once delivery). XGROUP CREATE MKSTREAM at "0" so a fresh group drains existing backlog. XAUTOCLAIM reclaims entries stranded by a dead consumer. Malformed entries are ACK'd and dropped so a poison pill can't wedge the group. - Store interface (Upsert/Ping/Close) is the mockable seam. PostgresStore does a single parameterised multi-row INSERT ... ON CONFLICT (request_id) DO NOTHING per batch, in one transaction. At-least-once + idempotent upsert = effectively-once. Empty identity strings bind NULL so billing GROUP BY auth_id never gets a spurious "" bucket. - Config mirrors emit.Config's wire-at-call-site pattern (no shared config dependency). database/sql + pgx stdlib driver; pool 90/20, Ping on start. cmd/drainer: -f YAML settings (Valkey/group knobs) + DATABASE_URL env (Atlas convention) for Postgres. SIGTERM -> finish in-flight batch, exit clean; un-ACK'd work redelivers safely. migrations: the table lives in the shared Atlas Postgres, owned by the Alembic chain — the drainer does NOT migrate at startup. Ships the schema as reviewable DDL (0001_billing_event.sql) plus a ready-to-copy Alembic file (atlas/b1f0c2d3e4a5_add_billing_event.py, down_revision = current Atlas head c1d2e3f4a5b6) following Atlas pk_/ix naming. README explains the single-migration-system rationale. Tests (11, race-clean): batch consume -> store -> ACK; idempotent redelivery (same request_id -> one row); store failure -> no ACK (redelivers); graceful shutdown; XAUTOCLAIM reclaim; poison-pill drop; sqlmock SQL-shape + NULL-binding assertions. miniredis for Valkey, fake Store + sqlmock for Postgres — no real Postgres required. Deps (both go 1.23-compatible, go directive unchanged at 1.23.0): jackc/pgx/v5 v5.7.5 (latest v5 whose go directive is still 1.23.0; v5.10 needs go 1.25), DATA-DOG/go-sqlmock v1.5.2 (go 1.15). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds internal/iolog: a per-tenant, opt-in, sampled, short-retention capture of request+response BODIES, forked from the same upstream bytes as metering but written to a SEPARATE Postgres store (io_log) with its own retention/access. Mechanism (mirrors internal/emit's sink+config+worker style): - Record: request_id + verbatim identity + bodies + status/latency. - Sink interface: async, non-blocking Log. PostgresSink (worker pool + buffered channel, drops+counts on overflow — best-effort, the deliberate contrast with metering's durable no-drop ladder); NopSink (off-default); OpenSearch left as a documented typed hole with its sampled→log-everything tripwire (DESIGN §8). - Policy: ShouldLog(id, request_id) gate. StaticPolicy = interim opt-in (global enabled=FALSE default, sampleRate=0.0 default, optional auth/group allowlist) with DETERMINISTIC hash-of-request_id sampling (reproducible/testable). The Policy interface is the seam for a future control-plane ControlPlanePolicy (flagged loudly in-code). Proxy wiring (zero hot-path cost when off): - handleProxy computes shouldLog ONCE; only then captures the ORIGINAL client request body (pre-rewrite, for tenant-troubleshooting fidelity) and enables a bounded response-body tee (256KiB cap, truncate+flag) alongside the existing usage scan. When off, no extra read/alloc — captureReader.logBuf stays nil. - Fires sink.Log async from the same onDone completion; never blocks the client. - main.go wires NopSink + deny-all policy by default; PostgresSink only when ioLog.enabled. Config gains ioLog block (fail-closed validation). Schema: migrations/0002_io_log.sql + alembic stub + README, following the drainer's migration-ownership pattern (down_revision set at copy time). Deps: pgx v5.7.5, go-sqlmock v1.5.2 (both go<=1.23); go-internal pinned v1.14.1 so tidy doesn't bump the toolchain. go directive unchanged (go 1.23.0). Tests (30 new): policy off-by-default/allowlist/rate-boundaries/determinism; sink async-nonblocking/overflow-drops/NopSink/sqlmock SQL shape; proxy shouldLog=false no-buffering, shouldLog=true full Record, streaming verbatim, response cap truncation; config fail-closed validation. go test -race clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The revenue path. A batch job (cmd/rater) reads billing_event over a time window, joins an effective-dated price book, computes cost in integer micro-USD, and upserts per-(auth_id, model, hour) cost rollups into rated_usage. Money correctness is the product; the non-negotiable invariants, each tested by name: - INTEGER micro-USD (1e-6 USD) everywhere, never float. 1 Atlas hourly_usage_record unit (1e-4 USD) == 100 micro-USD; finer base avoids rounding tiny per-token prices to zero before the multiply. - Billable-prompt formula (the highest-risk line): vLLM's prompt_tokens is the TOTAL and cached_tokens is the cache-hit SUBSET, so cost = (prompt-cached)*prompt_rate + cached*cached_rate + completion*comp_rate. Cached tokens are charged ONCE (at the discounted rate), never double-counted. - Effective-dated prices: an event is rated with the price in effect at its event_ts (fallback created_at); rating now never retroactively reprices old traffic. - Idempotent re-runs: rollups upsert ON CONFLICT (auth_id, model, window_start) DO UPDATE, recomputing totals from scratch — re-rating a window reconciles, never doubles. - Fail-closed on missing price: a model with no price-book entry at the event's time is counted as unpriced and logged loudly, NEVER silently billed $0 (that is lost revenue). The rater exits 2 so a CronJob can alert. Schema (migrations + ready-to-copy Alembic, chained after billing_event's b1f0c2d3e4a5): model_price (effective-dated per-token price book) and rated_usage (the hourly rollup). Prices are DATA, not code — a clearly-labelled non-binding seed lives in seed_example_prices.sql; no prices ship in the schema. No new dependencies. go build/vet/test -race/golangci-lint/gofmt all clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ReadWindow previously filtered out billing_event rows with NULL auth_id or model via `AND auth_id IS NOT NULL AND model IS NOT NULL`, with a comment claiming they were "surfaced elsewhere." That elsewhere did not exist, so such a row vanished from rating with zero count and zero alert — a silent revenue-loss / data-loss path, exactly what we fail LOUD on for missing prices. Make the exclusion loud, mirroring UnpricedEvents: - store: drop the NULL filter from the WHERE; scan every in-window row with sql.NullString and, when auth_id/model is NULL, count it and skip it (atomic single-scan count). ReadWindow now returns (events, unattributable, error). - rater: add Result.UnattributableEvents, HasUnattributable(), and a combined HasAnomaly() = HasUnpriced() || HasUnattributable(); thread the count through Run(); log it loudly (ERROR) and fold it into the loud summary. - cmd/rater: exit 2 on HasAnomaly() (was HasUnpriced()); rename exitUnpriced -> exitAnomaly and update the exit-code doc. A nonzero unattributable count now alerts a CronJob the same as unpriced events. - replace the stale "surfaced elsewhere" comment with the truth. This should never be nonzero (the interceptor's fail-closed billing gate rejects requests missing auth_id before they are metered) — which is precisely why a nonzero count must be loudly surfaced: it means something upstream is broken and revenue is leaking. Tests: TestRater_UnattributableEventsCountedNotSilent (not rated, counted, triggers anomaly/exit-2) and TestPostgresStore_ReadWindowCountsUnattributable (NULL rows skipped+counted in the scan). Existing money-rule and ReadWindow tests updated for the new signature; all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Everything in the usage-capture path is unit-tested against SYNTHETIC vLLM payloads + source research — necessary but not sufficient for a billing product. If the real engine names a field differently (esp. prompt_tokens_details.cached_tokens, flagged in metering.go), every bill is wrong and the unit tests wouldn't catch it. cmd/engine-conformance fires real OpenAI-style requests at a live engine and asserts the bytes parse into the SAME metering.Usage struct Phoebe bills on: - non-streaming usage block present + sane counts - streaming trailing usage chunk (choices:[]) before data: [DONE] - (opt-in) prefix-cache: cached_tokens > 0 on a repeated long prefix — the field metering.go flags for live verification, doubling as the check for the --enable-prompt-tokens-details deployment requirement Exit 0 = safe to bill against this engine; 1 = a billed field mismatched. Needs NO GPU to develop against (targets any OpenAI-compatible URL); validated green against a synthetic vLLM-shaped stub including the cached path. The one-command real-vLLM run (the trust sign-off) is documented in validation/README.md and gated only on a running engine — no GPU is spent until someone deliberately stands one up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A flaky CI failure on TestAbortMidStreamEmitsAbortedEvent turned out to be a REAL billing bug, not test flakiness: on a client abort, the metering event was sometimes never emitted — lost revenue. Root cause: the abort signal and the finalisation raced. An abort-watcher goroutine called markAborted() to set result.Aborted=true, while the body's Close() (triggered by ReverseProxy on context cancel) independently called finish(). If Close()→finish() won the race, onDone fired with Aborted=false → emit() hit the "no usage, not aborted" early-return → NO event. markAborted() then ran too late (done already true). Reproduced ~1 in 20 runs. Fix: make abort determination independent of goroutine scheduling. The captureReader now holds the request context and finish() reads ctx.Err() as the single source of truth for "did the client disconnect" — whichever path reaches finish() first observes the same correct value. net/http cancels the request context on disconnect, and ReverseProxy cancels the upstream from it, so by the time finish() runs on an abort, ctx.Err() is non-nil. This removes the racy machinery entirely (the abort-watcher goroutine, markAborted, finishedCh) — net-simpler and correct. proxy.go just constructs the reader with r.Context(). Also hardens the tests against the flakiness that exposed this: a recordingEmitter .waitForEvents(n, timeout) replaces fixed time.Sleep-then-assert (the async emit could land after a fixed sleep under CI load). Tee abort tests now drive abort via a cancelled context, matching the real path. Verified: the abort/tee suite passes 50x under -race; the three previously-CI- failing tests pass 50x consecutively. Full gate green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fine-tune derivation
Replace the obsolete rating-v1 (integer micro-USD, Go-side per-event Rate loop,
price keyed on model name) with the locked v2 design. Money correctness is the
product; this is the deepest money path in phoebe.
Schema (migrations/0002_rating.sql + atlas/c2f1a3b4d5e6_add_rating.py):
- model_price: effective-dated price book keyed on a STABLE model_id, with a
self-referencing derived_from (one hop), NUMERIC(20,9) per-token rates
(nullable for derived rows), created_by audit column. A GiST EXCLUDE
(model_id WITH =, tsrange(effective_from, effective_to) WITH &&) makes
overlapping price windows IMPOSSIBLE, so the rating join cannot fan out and
over-bill. CHECK enforces rate-or-derived and all-or-none rate components.
- derivation_policy: single GLOBAL fine-tune rule (identity|multiplier|markup),
effective-dated, with its own GiST EXCLUDE (one policy per instant) and param
CHECKs. Per-base override is a deliberate v1 non-goal.
- rated_usage: per-(auth_id, model_id, hour) rollup, NUMERIC cost, idempotent
natural-key upsert. Adds a partial event_ts index for the window scan.
- Notes CREATE EXTENSION btree_gist.
Money model: all money is NUMERIC and ALL money math is in SQL. The rater is a
single INSERT ... SELECT that resolves the effective price (own rate wins; else
base via derived_from through the effective policy, ONE hop; else unpriced),
computes per-event cost via the billable-prompt formula, and SUM()s per rollup —
Go never holds a running money total; it carries NUMERIC as text. The
billable-prompt formula (cached ⊆ prompt, charged once, clamped >= 0) is the
highest-risk line and is implemented identically in SQL and in the Rate() oracle.
Code (internal/rating, cmd/rater): rate.go keeps a PURE Rate() reference + an
exact big.Rat decimal (Dec) as the spec; pricebook.go is the resolution oracle;
store.go is the SQL rater (resolvedEventsCTE shared by the rate insert and the
anomaly count, so "unpriced" means the same in both); rater.go is orchestration
only; cmd/rater exit codes 0/1/2 unchanged. Unpriced and unattributable rows are
counted, logged loudly, excluded from rollups (never $0-billed), and exit 2.
Deleted v1 surface: int64 micro-USD Price/cost, the Go-side per-event Rate loop
and rollup aggregation, PriceBook over model-name, Store.LoadPrices/ReadWindow/
UpsertRollups, the cost_micro_usd schema.
Tests (43 in internal/rating): the named money rules — cached-subset-no-double-
count, effective-dated-selection, derivation identity/multiplier-1.5x/markup,
derived_from-inheritance, own-rate-bypasses-policy, missing-price-fails-loud-not-
zero, unattributable-counted-not-silent, idempotent-rerun-no-doubling, numeric-
exactness-no-float, one-hop-only — plus a conformance test (in-Go SQL model ==
Rate() oracle) and a build-tagged live-Postgres conformance/idempotency test
(store_integration_test.go) running the REAL SQL. sqlmock pins the INSERT...SELECT
shape, the effective-dating predicates, the policy CASE, and ON CONFLICT.
Out of scope (noted in code/migrations): the price-write authz / deploy-time gate
(Atlas control-plane); per-base derivation override; derived_from chains > 1 hop
(treated as unpriced, never recursed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # internal/proxy/proxy.go # internal/proxy/tee.go # migrations/README.md
Build /app/phoebe, /app/phoebe-drainer, and /app/phoebe-rater in one image so a single tag ships the whole system; the k8s command selects which runs. The rater is the batch rating job (cmd/rater) that a CronJob invokes hourly. Supersedes the two-binary Dockerfile in #4 (this branch is stacked on rating-v1, which has cmd/rater). Static CGO-off builds on alpine + ca-certificates. Verified: docker build succeeds, all three binaries run, image ~46MB. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Go cost-formula code in this package — Rate(), the Dec exact-decimal type, the in-memory PriceBook resolver, ApplyPolicy — is an ORACLE: a reference reimplementation that exists only to pin the production SQL rater (a conformance test asserts the SQL matches it row-for-row). The production rater is pure SQL; NO production code path references Rate3, Dec, RatedEvent, PriceBook, Rate, or ApplyPolicy (verified: every caller is a _test.go file, and cmd/rater touches only New/Run/OpenPostgres/Config). But it lived in production files (rate.go, pricebook.go), so "which of this actually bills a customer?" required tracing call sites to answer — twice, in review. Go has a mechanical answer to that question: the _test.go suffix excludes a file from normal builds. Putting the oracle there lets the compiler guarantee it never ships, and makes the production surface self-evident. rate.go -> oracle_test.go (rename + comment edits only; no logic change) pricebook.go -> resolve_test.go (pure rename) doc.go (new) package doc + the production-vs-oracle split Behaviour is unchanged: production builds without the oracle (proving it was never needed there), and the full test suite — including the conformance test that pins the SQL to the oracle — still passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rater prices on billing_event.model, which store.go aliases to model_id —
"the stable model identity, NOT the ephemeral deployment id." But the proxy was
writing Event.Model = id.ResourceID (the X-Saturn-Resource-Id routing header,
i.e. exactly the deployment id). So every event's price key was a deployment id
that matches no model_price row → every event unpriced → the rater bills $0 and
exits fail-loud. The metering→rating seam was severed: producer and consumer
each documented the invariant as if the other upheld it; neither did.
The root cause was a missing field in the capture contract: capture.Result had
no model name, and the tee discarded the `model` field that every OpenAI/vLLM
chunk carries at the top level. Fix it at the seam:
- add Model to the tee's chunk struct + capture.Result, populated in
inspectChunk from the engine's own response (the authoritative source)
- plumb res.Model into Event.Model and the iolog Record (was id.ResourceID
in both); ResourceID stays the routing/attribution key it always was
- store.go's `model AS model_id` is now TRUE
Tests: the streaming end-to-end and iolog fixtures now carry the engine model
name ("llama-3-8b") distinct from the resource id ("model-abc"), and the
assertions require Event.Model == the engine name AND != ResourceID — the
regression guard that was missing (the old tests asserted Model == resource id,
encoding the bug as correct).
Also drop the stale package-doc reference to a markAborted() watcher goroutine
that the M3 abort-fix already deleted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
drainWAL did readAll (release lock) → ship over the network → Truncate(0) (zero the whole file). The lock was dropped across the multi-second ship, and append() runs concurrently on the hot path (Valkey down → channel full → append) — exactly when the WAL is hot. Any event appended during the ship window was destroyed by Truncate(0) without ever being shipped: a silently lost metering event on a LIVE, healthy pod. DESIGN.md §5 defends pod-death WAL loss (correct) in a way that laundered this steady-state loss (not correct). Fix: rotate, don't truncate. rotate() atomically renames the WAL aside to a `.draining` snapshot and opens a fresh WAL under the lock; the drain ships the snapshot and dropRotated() deletes it on success. An append concurrent with the ship lands in the fresh WAL, never in the snapshot being shipped — so dropping the snapshot cannot destroy an unshipped event. recoverRotated() re-ships a snapshot orphaned by a crash mid-ship on the next tick (at-least-once; the consumer dedups on request_id). Tests: TestWAL_RotateDoesNotLoseConcurrentAppend is the regression guard — an append after rotate/before drop survives (the old truncate would have eaten it); TestWAL_RecoverRotatedReshipsOrphan covers crash-mid-ship recovery. DESIGN.md §5 now distinguishes the tolerated loss (pod death) from the fixed bug (in-process drain race). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d iolog
Four review fixes where being wrong is expensive, plus one related guard:
④ Reject non-hour-aligned --since/--until in the rater (cmd/rater). The rollup
buckets by date_trunc('hour') and the upsert REPLACES a bucket, so a sub-hour
window would overwrite a complete hourly rollup with a partial sum → silent
under-bill. Fail loud (consistent with the inverted-window guard) rather than
snap, so an operator never re-rates hours they didn't name.
⑤ tsrange → tstzrange in the model_price/derivation_policy exclusion constraints.
The columns are TIMESTAMPTZ; tsrange coerces them to local time via the session
TimeZone, making the no-overlap guarantee session-TZ-dependent. The rater's
price lookups use LIMIT 1 with no ORDER BY, trusting that constraint for
at-most-one-match — that bedrock cannot be TZ-sensitive. tstzrange compares in
absolute time. (Also add CHECK(effective_from < effective_to): an equal-bound
row is an empty range the exclusion won't catch — a silently inert dead price.)
⑥ iolog allowlist is now FAIL-CLOSED. An empty allowlist opted in EVERY tenant —
so enabling sampling for one tenant but forgetting the allowlist silently
captured all tenants' (possibly-PII) bodies, contradicting the package's own
"explicitly opted in / fail closed" posture. Empty now opts in no one; a new
explicit AllowAllTenants flag is required for deliberate fleet-wide debug.
⑦ Already fixed in the model-capture commit: onDone emits on
context.WithoutCancel(r.Context()) so an aborted request (whose context is
cancelled — that's how abort is detected) can't have its durable bill dropped
if Emit ever honors ctx.
Tests: hour-aligned-window rejection (minutes/seconds rejected, whole hours
accepted); iolog empty-allowlist-means-NONE + AllowAllTenants-is-explicit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + rounding) The conformance story looked more rigorous than it was. Three gaps, all closed: 1. ROUNDING ORDER didn't match production. The SQL sums the EXACT per-event products and rounds the SUM once (NUMERIC(20,9)); the oracle rounded EACH event then summed. For a sub-nano per-token residue (e.g. a derived price 0.000000001 × 1.5 = 0.0000000015) these diverge by up to ~1 nano/event. The oracle now sums exact (rateExact) and rounds once per rollup, matching the SQL. 2. The in-Go "conformance" test checked the oracle against ITSELF — both sides pure Go, the production SQL never ran in the default lane. Its doc comment (and doc.go) claimed it "pins the SQL"; corrected to say what it actually does (internal consistency), and pointed at the integration tests as the real pin. 3. The integration conformance test now runs in CI. Added an integration-test job with a Postgres service container + `make integration-test`, so the REAL rateWindowSQL is exercised on every push — money correctness can't regress silently behind a build tag that never runs. New TestConformance_SubNanoRoundingMatchesSQL builds a multi-event rollup with a sub-nano residue where round-each-then-sum (0.000000006) and sum-then-round-once (0.000000005) genuinely differ, asserts the SQL equals round-once, and guards that the fixture actually exercises the divergence. Verified against a live Postgres: it PASSES reconciled and FAILS (5 vs 6 nano) if the oracle reverts to round-per-event. Also caught and fixed a stale `tsrange` (vs tstzrange) in the test's inline schema — Postgres rejects tsrange(timestamptz,...) outright, the same TZ-safety fix as the migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Runs `make integration-test` against a postgres:16 service so the rating SQL (rateWindowSQL + the GiST/tstzrange constraints + the sum-then-round behavior) is exercised on every push — money correctness can't regress silently behind a build tag that never runs in CI. Held out of the review-fixes push because the bot PAT lacks 'workflow' scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…el sentinel X-Request-Id is NOT on the Traefik auth-server allowlist, so it is client-controlled — yet it becomes billing_event's PRIMARY KEY (the billing idempotency key). Three holes, one gate: - Absent header: previously forwarded verbatim as "", which the drainer poison-drops — a client that simply omitted the header was served but never billed. Now the proxy GENERATES an id (16 bytes crypto/rand, hex, "phoebe-" prefix), uses it on the metering event, forwards it upstream, and echoes it on the response so correlation still works. - Invalid header: an oversize (>VARCHAR(255)) or non-printable id would fail the drainer's all-or-nothing batch INSERT and wedge innocent events. Now fail closed with 400: max 200 chars, printable ASCII ([\x21-\x7e]) only — same class as the identity gate above it. - Deliberate reuse of a valid id is dedup'd by the PK (free inference); that cannot be fixed here without breaking at-least-once redelivery dedup, so it is documented as a known limitation in proxy.go and DESIGN.md §1 (true fix: an edge-stamped, allowlisted id). Also two latent context bugs in the same handler: - ErrorHandler compared err == context.Canceled; the transport delivers the cancellation wrapped (*net.OpError), so aborts were logged as upstream errors. Use errors.Is. - onDone passed the raw (cancelled-on-abort) r.Context() to ioSink.Log; the first sink that honours ctx would drop every aborted request's record. Hoist one context.WithoutCancel for both emit and Log. Tests: TestProxyRequestID_GeneratedWhenAbsent, TestProxyRequestID_RejectsInvalid (upstream never sees a rejected id; no billing event emitted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…maly bucket
eventArgs stored e.Model verbatim; its "NOT NULL in practice" comment
was false — capture.Result.Model is "" whenever the upstream never
reports a model. The rater's unattributable predicate is `model_id IS
NULL`, so a stored '' dodged it and the event was misreported as
UNPRICED, sending operators down the wrong runbook ("backfill prices"
instead of "fix the capture gap"). Route Model through nullStr like
every other identity column.
migrations/README.md gains a one-line backfill for pre-fix rows:
UPDATE billing_event SET model = NULL WHERE model = '';
Test: TestPostgresStore_EmptyModelStoredAsNull (sqlmock asserts the
model arg binds nil).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…truncate-after-ship) The rotate-to-.draining snapshot scheme had four confirmed loss modes: 1. rotate() renamed onto the FIXED .draining path, clobbering an unshipped orphan snapshot during a multi-tick Valkey outage — destroying every tick's events but the last. 2. A partial read error shipped the prefix then dropRotated() the WHOLE snapshot, deleting the unread suffix. 3. rotate's rollback double-fault left w.f on an unlinked inode. 4. readEvents used a default bufio.Scanner, so any >64KB line (unbounded header/model strings) killed the whole scan. Replace the mechanism with github.com/tidwall/wal v1.2.1 (small, MIT, well-tested, go1.13): entries are written at strictly increasing indexes and only reclaimed by TruncateFront AFTER the shipper confirms delivery through their index. Concurrent appends land at higher indexes that truncation never touches, so clobber/partial-read loss is structurally impossible. The drain now truncates after EACH shipped batch, so partial-outage progress is retained (strictly better than the old all-or-nothing drop). tidwall cannot truncate a log to empty (TruncateFront(last+1) is out of range), so a fully-shipped log retains its tail entry; an in-memory shippedThrough watermark stops it from re-shipping every tick. On restart the watermark resets and that one entry re-ships once — harmless, at-least-once, the drainer dedups on request_id. Also handled, with tests: - Corrupt log dir on open → quarantined to <dir>.corrupt.<ts> for forensics, fresh log created; a corrupt billing buffer must not block serving and the loss is bounded + loudly logged. - Legacy single-file JSONL WAL at the same configured path → imported (unbounded line reader, torn final line tolerated) and renamed aside to <path>.imported; lost-on-upgrade events would be silent revenue loss. - Corrupt individual entries are skipped with a counted ERROR log and truncated past — never wedge the drain. Regression tests: multi-tick-outage-no-loss (the clobber bug), partial-batch-progress, crash-reopen, corrupt-entry-skipped, corrupt-dir-quarantine, legacy-import (unit + end-to-end ship), oversize-event (~100KB field). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The rater filters billing_event on COALESCE(event_ts, created_at), but the index was on bare (event_ts), partial on NOT NULL. Postgres matches index expressions structurally, so that index can never serve the COALESCE predicate — every rater run seq-scanned a table that only grows, and both migration files' comments wrongly claimed the index covered the scan. Replace it with an expression index on (COALESCE(event_ts, created_at)) in both the reference .sql and the Alembic artifact, and fix the comments. Verified against live Postgres 16 with EXPLAIN (enable_seqscan=off): the plan uses billing_event_rating_instant_ix for the rater's window predicate (the integration test pinning this lands with the rating-package commit). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rministic ids
Three money-correctness fixes in the SQL rater, one statement-shape change:
ONE STATEMENT, ONE SNAPSHOT (fail-loud could silently pass): Run() issued
CountAnomalies and RateWindow as two independent READ COMMITTED statements,
so a billing_event row the drainer committed between them was excluded from
the rollups yet uncounted — exit 0 with revenue silently missing. The
anomaly counts now ride rateWindowSQL itself: the upsert is a data-modifying
CTE and the final SELECT reads both it and the resolution CTEs, and every
CTE of one statement sees the same snapshot, so rated + unpriced +
unattributable provably partitions the in-window events. Run() makes ONE
store call; CountAnomalies stays as an ad-hoc/ops query only.
SESSION-TZ-INDEPENDENT HOUR BUCKETS (double-bill on re-rate):
date_trunc('hour', tstz) truncates in the session TimeZone, and nothing
pinned it — a fractional-offset session (+05:30) wrote off-boundary
window_start values whose ON CONFLICT keys disagree with a UTC session's,
i.e. duplicate/overlapping rollups. The bucket is now
date_trunc('hour', ev_ts AT TIME ZONE 'UTC') AT TIME ZONE 'UTC' (portable to
all supported PG, unlike 3-arg date_trunc) — same hazard class the schema
already closed with tstzrange in the exclusion constraint. Belt-and-braces:
OpenPostgres appends timezone=UTC to the DSN unless the operator set one.
DETERMINISTIC SURROGATE IDS: the id was md5(random()||clock_timestamp()),
with a comment claiming a collision would be a "benign PK retry" — there is
no retry; a PK collision aborts the statement. The id is now the md5 of the
natural key (auth_id|model_id|epoch(window_start)), so re-runs regenerate
the same id and cross-run collisions are impossible by construction (epoch,
not ::text, because timestamptz::text renders in the session TZ).
Also: the upsert's SELECT is ORDERed BY the natural key so concurrent raters
lock rows in deterministic order (no ABBA deadlock).
Tests: oracleStore mirrors the one-pass shape; conformance asserts the
partition invariant and that Run never calls CountAnomalies; new live-PG
integration tests pin UTC bucketing under SET TIME ZONE 'Asia/Kolkata'
(exact hour boundaries + same conflict keys from a UTC re-run), id stability
across re-runs, and EXPLAIN using the new expression index. All verified
against Postgres 16.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ains) The default cron window was only the last complete hour, so an event drained late — a Valkey outage recovered from the WAL — with an event_ts in an already-rated hour fell in no future window and was never rated: silent revenue loss, and exit 0 forever since the anomaly counts scan the same window. The default is now the TRAILING N complete hours, [floor(now)-N*1h, floor(now)), N = rateTrailingHours (default 24, must be >= 1; a pointer in the YAML so an explicit 0 fails loud). Re-rating closed hours is safe by construction: the upsert REPLACES each (auth_id, model_id, hour) bucket with a freshly recomputed total, so late events fold in and nothing doubles. Explicit --since/--until still wins, with the existing hour-alignment fence. Residual risk, stated honestly in the docs: an event arriving more than N hours late still slips the default window — the DESIGN.md reconciliation backstop is the eventual answer; until then widen rateTrailingHours or re-rate the hour with --since/--until. Tests: resolveWindow default is trailing-24 (and honors a custom N), N < 1 rejected; the rating package's late-arrival test (previous commit) pins that a later trailing-window run rates an event drained after its hour was already rated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…T, count insert losses Three holes in one path: io_log records could fail their INSERT (and vanish uncounted) because the bytes weren't storable in Postgres TEXT. - tee.appendLog truncated the body-log copy at a raw byte offset, which can split a multibyte rune at the cap and leave the buffered copy invalid UTF-8. Back up to the rune start (<= 3 bytes, utf8.UTFMax-1) before cutting, and stop appending permanently once truncated. - The postgres sink now sanitizes both bodies before INSERT: Postgres TEXT rejects invalid UTF-8 outright and NUL (0x00) even though it is a valid code point. Invalid runs become U+FFFD; NULs are removed. Bodies are tenant-controlled bytes — either would otherwise fail the insert and silently lose the record. - insert() failures now increment the dropped counter (extended to mean "records lost for any reason": overflow or failed INSERT), so loss is visible instead of buried in warn logs. Tests: TestTeeBodyLogTruncationKeepsRuneBoundary, TestPostgresSink_NulAndInvalidUTF8BodySanitized, TestPostgresSink_InsertFailureCountsDropped. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two pre-existing shutdown bugs, both of which could silently drop metered (billable) events: 1. DurableEmitter.Close: workers exited via `default: return` the instant the channel was momentarily empty, while producers were never stopped — an Emit racing Close could queue an event into the buffered channel after the last worker left, stranding it with no WAL entry and no log floor. Now Close marks the emitter closed and closes the channel under a write lock (paired with a read lock around Emit's send, so no send ever hits a closed channel); workers range the channel to completion; Emit after Close goes straight to the WAL (or the log floor once the WAL is closed). The final WAL drain moved from the shipper into Close, AFTER the workers finish, so events that fell to disk during shutdown still ship. 2. cmd/interceptor: a server error hit log.Error.Fatalf, whose os.Exit skips deferred cleanup — the emitter was never Closed (no flush, no final drain) and the iolog sink never flushed its batch. Cleanups now run unconditionally before a nonzero exit. Tests: close-vs-emit race (every event accounted for in WAL ∪ floor), emit-after-close lands on the log floor. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- internal/config: document that walPath is the WAL directory, that a pre-upgrade single-file JSONL WAL at the same path is auto-imported and renamed aside, and that the configured path must not change across the upgrade. - settings.example.yaml: same note on the emit block. - DESIGN.md §5: replace the rotation paragraph with the tidwall/wal design — sequential indexed entries, truncate-after-each-shipped- batch watermark, why clobber/partial-read loss is structurally impossible now, the retained-tail/at-least-once wrinkle, corrupt-dir quarantine, legacy import, and the library-choice rationale (small, MIT, tested, go1.13). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ts, expression index
…l NULL, iolog sanitize
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…> Postgres -> rater
The three worst review bugs were producer/consumer contract mismatches
between components whose unit tests each passed (Event.Model carrying the
deployment id, empty request_id poison-dropped by the drainer, empty model
misfiled as UNPRICED instead of UNATTRIBUTABLE) — because nothing tested
the actual pipe. internal/e2e wires the REAL components in one process:
- httptest fake vLLM streaming the realistic SSE fixture (engine model
name + trailing usage chunk with cached_tokens)
- real proxy.Server with full identity headers, resource id distinct
from the model name, and deliberately NO X-Request-Id (generated-id
path end to end)
- real emit.DurableEmitter against miniredis
- real drain.Drainer via its production Run loop into a real Postgres
billing_event, schema created from the migration .sql files on disk
- real rating.Rater over the event's hour window
Asserts the money: one rated_usage row keyed on the ENGINE name (never the
deployment id), token sums from the usage chunk, cost equal to the
hand-derived NUMERIC (compared in SQL, never as a Go float), zero
unpriced/unattributable, and a phoebe-prefixed request_id surviving to the
billing_event PK. A second scenario pins the nullStr(model) contract:
a model-less event drains to model = NULL and lands in UNATTRIBUTABLE,
not UNPRICED.
Gated behind the integration tag + PHOEBE_TEST_DATABASE_URL (picked up by
make integration-test); isolated DROP/CREATE schemas so it can run in
parallel with the rating integration tests; ~1.6s wall clock.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A failed tidwall Write left the wrapper and the library divergent in
three ways, all verified broken:
(a) fsync fails after the data hit the file: the library's lastIndex
advanced but the wrapper's nextIndex did not, so every subsequent
Write(nextIndex) returned ErrOutOfOrder FOREVER — the WAL wedged
permanently and every event for the rest of the process floored,
even after the disk recovered.
(b) partial write (ENOSPC — the exact scenario the WAL exists for):
torn bytes landed mid-segment, the next append "succeeded" after
them, and the NEXT restart hit ErrCorrupt at Open and quarantined
the whole buffer — maximal, delayed, silent loss when fullest.
(c) clean write failure left a ghost entry in the library's in-memory
segment buffer — index-shifted reads until restart.
Fix: the library's own prescription (documented on its truncate error
paths) — on ANY Write error, and on ErrCorrupt from reads/truncates,
close and reopen the log via the existing quarantine-capable open path,
resync nextIndex from disk, and conservatively reset the shippedThrough
watermark (re-shipping retained entries is safe: at-least-once, deduped
downstream on request_id). The failed event still floors, as before.
Under ENOSPC the honest trade is: the reopen finds the torn tail
immediately (verified empirically: torn bytes at the very tail fail
Open with ErrCorrupt, same as mid-file) and quarantines the pre-failure
buffer ONCE, loudly, at failure time — bounded and recoverable from the
.corrupt dir — instead of planting a silent time bomb for the next
restart. The quarantine log line now includes the count of prior
.corrupt siblings so a flapping pod's disk growth is visible.
If the reopen itself fails the wal is poisoned: appends error
immediately (the emitter floors them), pending/markShipped no-op
cleanly, and the reopen is retried on every append — it is already the
failure path, so the extra open attempt is cheap relative to the loss
it can stop.
Also fixes the legacy-import crash window: the old order renamed
wal.jsonl to .imported BEFORE appending its events to the new log, so a
crash in that window parked events invisibly (silent revenue loss). Now
the file is staged to .importing, the events are appended FIRST, and
only then is it parked as .imported — a crash re-imports on the next
startup, producing duplicates, which is the correct side of the trade
because the drainer dedups on request_id. A leftover .imported file
warns on every startup (parked forensic data), and a wholly unreadable
legacy file is left in place as evidence instead of being renamed
aside.
Tests (fault injection via a test-only failNextWrite hook that receives
the live log, so tests reproduce the real divergent states):
- TestWAL_WriteErrorRecovers_NoPermanentWedge (a)
- TestWAL_WriteErrorNoSilentCorruption (b)
- TestWAL_PoisonedFloorsAndRecovers
- TestWAL_LegacyImportCrashWindow (both crash points)
- TestWAL_ImportedLeftoverWarns
- TestWAL_LegacyUnreadable
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The ReverseProxy ErrorHandler fires for both RoundTrip errors and
ModifyResponse-returned errors. Go delivers a mid-flight client
cancellation WRAPPED (a *url.Error around context.Canceled on the common
RoundTrip path), so an identity compare misses it and the abort is
misclassified as an upstream fault: logged at error AND 502'd over a
connection that is already gone.
Introduce a single isClientAbort(err) predicate using errors.Is, covering
both context.Canceled and context.DeadlineExceeded, and extract the
ErrorHandler into s.errorHandler(upstream, id, requestID) — a method seam
that makes the classification directly testable and threads identity
through for the abort-emit that follows (Fix A). Mechanism only here; the
emit is wired next.
Test TestErrorHandlerClassifiesWrappedCancel drives the handler with
*url.Error{Err: context.Canceled} (and DeadlineExceeded) and asserts no
502; TestErrorHandlerUpstreamFaultStill502 pins that a genuine upstream
fault still 502s; TestIsClientAbort locks the predicate's truth table.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ModifyResponse installs the captureReader whose onDone emits the metering event. But a client that disconnects BEFORE the upstream writes response headers fails RoundTrip, so ModifyResponse never runs and onDone never fires — a request that passed the billing-identity gate would emit NOTHING, leaving served-or-attempted traffic invisible to billing and reconciliation. The ErrorHandler now emits exactly ONE zero-token, attributable event (Aborted=true, UsageFound=false) with the already-resolved identity, via the SAME s.emit path the completion path uses — so the pre-header abort obeys the SAME BillPartialOnAbort policy (no second, policy-bypassing emit). The emit is gated on isClientAbort, so a genuine upstream fault never writes a bogus zero-token billing row. The context is decoupled from the cancelled client ctx (WithoutCancel), mirroring onDone. No double-emit: ModifyResponse always returns nil in phoebe, so ErrorHandler fires ONLY on a RoundTrip error (pre-header) — mutually exclusive with the onDone (post-header) path. Establishes the invariant: every request past the billing-identity gate emits exactly one attributable event — usage on completion, or zero-token Aborted on disconnect (pre- or post-header). Tests: TestPreHeaderAbortEmitsAttributableEvent (exactly one event, Aborted=true, zero tokens, AuthID+ResourceID populated); TestPreHeaderAbortBillPartialFalseNoEvent (policy still honored); TestNormalCompletionEmitsExactlyOnce (no double-emit on the clean path). Updated TestAbortRaceStress's now-stale "no event is correct" comment to the new contract. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The captured request body flowed UNCAPPED into the io_log INSERT, whose
body_tsv column is populated by to_tsvector over the request+response
bodies at write time. Postgres rejects a tsvector input past ~1 MiB
("string is too long for tsvector ..., max 1048575 bytes"), so any
long-context prompt failed the whole INSERT and the sink silently
drop-and-counted the record. (Confirmed against live Postgres 16 — see the
integration test.) The response body was already capped + flagged; the
request body was not.
Cap the LOGGED request-body copy at the SAME MaxBodyBytes bound the
response uses, add Record.RequestTruncated mirroring ResponseTruncated,
and collapse the rune-safe truncation into ONE shared helper
(truncateAtRuneBoundary) consumed by both the response copy
(captureReader.appendLog) and the request copy (captureRequestBody) —
one bound, one truncation, driven by the one MaxBodyBytes config value.
The FORWARDED request to the upstream is always the full body; the cap
bounds only what is stored. Truncation is at a rune boundary so the
stored TEXT is valid UTF-8.
Wires request_truncated through the INSERT column list/args, the
reference SQL migration, and the Alembic artifact (request_body still
$8 / response_body now $10 in the tsvector). Also corrects the
captureRequestBody doc, which previously claimed the copy was "bounded by
inbound request size" — io.ReadAll enforced no such bound; now it is real.
Tests: TestRequestBodyCappedBeforeTsvector + TestRequestBodyUnderCapNotFlagged
(proxy path: capped + flagged, full body still forwarded);
TestTruncateAtRuneBoundary (rune-safe bound, valid UTF-8); and the
integration pair TestRequestBodyCappedBeforeTsvector_Integration (capped
body INSERTs cleanly with a populated tsvector) +
TestUncappedRequestBodyWouldFailTsvector_Integration (pins the >1 MiB
failure the cap prevents).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inline comment at the per-tenant opt-in gate read "An empty allowlist means all tenants are eligible" — the exact INVERSE of the fail-closed code it annotates (optedIn returns false for an empty allowlist; fleet-wide capture requires the explicit AllowAllTenants flag). For the most sensitive data in the system (prompts/completions, possibly PII) a comment that states the opposite of the privacy rule is a latent footgun for the next editor. Comment-only; matches the code and the struct docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
In process(), poison (decode-failed) entries were collected during the decode loop but ACK'd only AFTER the store work — and the store-outage fallback early-returns BEFORE that block. So on a store outage the poison entries were never ACK'd: they redelivered every pass until an outage-free run, re-logging the same malformed entries and wasting reclaim work, with no upside (they can never be stored). A malformed entry is unstorable BY DEFINITION, so dropping it is correct regardless of store health. Move the poison-ACK to immediately after the decode loop, before the store work, so every return path — including the outage early-return — has already dropped it. The good rows batched alongside still obey store-before-ACK and stay pending on an outage. Sentinel compares in this file already use errors.Is (redis.Nil); the BUSYGROUP check is a plain reply string, not a wrapped sentinel, so its string compare is correct and unchanged. Test TestDrain_PoisonAckedDuringStoreOutage: a malformed entry + a good entry, store DOWN — asserts the poison is ACK'd (pending drops to 1, not 2) while the good entry stays pending and lands after recovery. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…flow rateWindowSQL returned rollups_written, events_rated, and the two anomaly counts as COUNT(*)::int / SUM(event_count)::int — 32-bit. Over an arbitrary backfill window these counts can exceed 2^31 (a single re-rate of a long history, or a large unpriced/unattributable spill), which a 32-bit cast silently overflows into a negative/garbage value and defeats the rated-count reconciliation and the fail-loud anomaly thresholds. Widen the four COUNT/SUM casts to ::bigint and the Go side to int64 (RateResult, rater.Result, and Anomalies, kept consistent so they compose without lossy conversions). TotalCost is already NUMERIC text. Trivially correct, low blast radius; verified against live Postgres. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CountAnomalies (the Store interface method, the PostgresStore impl, the countAnomaliesSQL constant, and its tests) had NO non-test caller. The rater's fail-loud contract gets its anomaly counts from RateWindow's SINGLE snapshot (UnpricedEvents / UnattributableEvents on RateResult) — precisely because a separate counting statement could see a different snapshot than the upsert and disagree with what the rollups excluded. The doc itself flagged it as "ad-hoc / ops use" that "the rater's fail-loud contract does NOT use." That hypothetical ops query never acquired a caller, so it is unexercised surface that can silently rot out of sync with the resolution CTE. Delete it (interface + impl + SQL + the two SQL-shape/scan tests + the oracle's CountAnomalies method and the now-moot countCalls guard). The small Anomalies value type stays — it is the resolve-pass result the test oracle mirrors, and RateResult carries the same two counts. Resolution CTE and integration-test docs updated to drop the countAnomaliesSQL reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TestIOLog_NonBlockingSink wired a sink that never blocked, so it asserted nothing about non-blocking — a test name that lies is a defect. The proxy calls Sink.Log SYNCHRONOUSLY from onDone (which can fire inside the final body Read, before the last bytes are copied), so a genuinely-blocking sink DOES stall the response — the proxy cannot honestly promise "the client returns without waiting." The non-blocking property lives in the Sink implementation and is already pinned by iolog.TestPostgresSink_LogIsNonBlocking (buffered-channel send, drop on full). Rename to TestIOLog_SinkReceivesRecordViaOnDone and assert what the proxy-level test CAN verify: an opted-in request hands exactly one record to the sink from onDone, response forwarded verbatim. The doc points to where the real non-blocking guarantee is tested. (The companion overclaiming doc — captureRequestBody's "bounded by inbound request size" — was corrected in the request-body cap fix, where the bound became real.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fast-follow's errors.Is switch ALSO widened isClientAbort to match context.DeadlineExceeded — a regression the battery caught. In this proxy a DeadlineExceeded is ALWAYS an upstream/transport fault (no client-side request deadline exists; it's what a dial timeout / response-header timeout returns), so treating it as a client abort silently downgraded an upstream outage to a debug log, dropped the client's 502, and emitted a spurious zero-token Aborted billing row attributing the outage to the tenant. Revert to context.Canceled-only (the known-correct base behavior). Tests flipped to pin BOTH directions (they previously asserted the regression as correct — a bug-pinning test): TestIsClientAbort now requires DeadlineExceeded (bare + wrapped) to be NOT-an-abort; TestErrorHandlerUpstreamFaultStill502 now also drives a wrapped DeadlineExceeded and asserts 502 + zero emitted events. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Phoebe: complete token-metering + billing system (Rating v2)
Phoebe fast-follow: abort observability, iolog body cap, battery mechanical fixes
The complete token-rating system (PRs #14–#21, batteried to production-dry). phoebe's hourly rater turns billing_event token counts into rated_usage money: - Prices are a YAML config file (operator-authored, enterprise/air-gapped friendly), not a DB table — base per-token rates, a global fine-tune premium policy, and per-GPU floor rates. No effective-dating: the rater reads the current file, rates the last hour, and stores the APPLIED rate + cost on the row (self-auditing). - Money is NUMERIC, computed in SQL. quantize-then-multiply (the rate on the row reconstructs the cost). Fail-loud on any unpriced/unattributable event. - Model identity: base = HF id; fine-tune = ft:<checkpoint_artifact_id> with the base carried as a separate field on the event (X-Saturn-Base-Model) and priced base × premium. One hop only, enforced at load. ft: ids globally unique. - Reconcile-on-re-rate: a re-run deletes superseded rollups (latest run is what bills); routine reconcile-delete is loud (ERROR + exit nonzero), explicit backfill is quiet. Single-flight rater (Atlas CronJob concurrencyPolicy:Forbid). - iolog request body capped before to_tsvector + truncation logging. Reviewed via the machine-review battery to production-dry (zero production-code findings across the final rounds); the remaining iteration was test-quality only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (E2) The rater dropped resource_id; billing needs it to identify the customer. rated_usage now keys on (auth_id, resource_id, model_id, window_start) so a rollup maps to the deployment → org that gets charged (E2). resource_id was already on billing_event (metering → drain) — the rater just discards it. - Grain widens by resource_id (model_id stays: prices resolve per-model, one deployment can serve several). Two deployments of the same model by the same auth in one hour now bill as two rows — correct (different orgs). - Fail-closed attribution: a NULL resource_id is counted UNATTRIBUTABLE (exits nonzero), never $0-billed or billed to a NULL org. The partition invariant rated + unpriced + unattributable + ambiguous == total holds. - Migration edited in place (no prod data, unapplied). Speculative (resource_id, window_start) index deferred to the PR that adds the E2 reader. PRs #22–#23, batteried to production-clean (3 rounds zero prod findings). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Brings the phoebe codebase from
mainontorelease-2026.02.01so the release branch contains the actual system (it was an emptyInitial commit). Required before release-images can build a phoebe image fromRELEASE_BRANCH.release-2026.02.01is a direct ancestor ofmain, so this is a clean, conflict-free merge — 84 files / ~18.8k lines: the streaming-proxy interceptor, durable metering emit + WAL, the Postgres drainer, Rating v2 (YAML-priced token→money, fine-tune base×premium, reconcile-on-re-rate, resource_id customer attribution), and M5 I/O logging. All batteried to production-clean and already merged tomain.Contracts
main(this is a branch-landing, not new code). The customer-facing/billing contracts (rated_usage schema, model_id price key, the metering Event shape) are unchanged from main.🤖 Generated with Claude Code