Skip to content

Land phoebe on release-2026.02.01#24

Merged
hhuuggoo merged 60 commits into
release-2026.02.01from
main
Jun 17, 2026
Merged

Land phoebe on release-2026.02.01#24
hhuuggoo merged 60 commits into
release-2026.02.01from
main

Conversation

@hhuuggoo

Copy link
Copy Markdown
Contributor

Brings the phoebe codebase from main onto release-2026.02.01 so the release branch contains the actual system (it was an empty Initial commit). Required before release-images can build a phoebe image from RELEASE_BRANCH.

release-2026.02.01 is a direct ancestor of main, so this is a clean, conflict-free merge — 84 files / ~18.8k lines: the streaming-proxy interceptor, durable metering emit + WAL, the Postgres drainer, Rating v2 (YAML-priced token→money, fine-tune base×premium, reconcile-on-re-rate, resource_id customer attribution), and M5 I/O logging. All batteried to production-clean and already merged to main.

Contracts

  • No new external contract beyond what's already on main (this is a branch-landing, not new code). The customer-facing/billing contracts (rated_usage schema, model_id price key, the metering Event shape) are unchanged from main.
  • Enables the release-images phoebe image wiring (separate PR against release-images).

🤖 Generated with Claude Code

hhuuggoo and others added 30 commits June 4, 2026 21:12
Standard Go service layout (cmd/ + internal/) for the token factory's
metering interceptor: the tenant-aware reverse proxy that sits behind
Traefik and in front of the inference router/engine.

Wired and tested (walking skeleton):
- identity: trusted X-Saturn-* header extraction (no re-auth)
- registry: model -> upstream resolution (Resolver iface + Static impl),
  clean 404 for torn-down models
- proxy: identity -> registry -> upstream dispatch, server lifecycle with
  streaming-tuned timeouts and graceful shutdown
- config: YAML settings (defaults -> unmarshal -> parse)
- metering: immutable Event + Usage schema and Emitter contract
- logging: leveled logger

Idioms (logger, settings load-flow, X-Saturn-* headers, build files)
borrowed from auth-server.

Tooling: golangci-lint (v1.64.8) + pre-commit, make lint/vet/test, CI
lint+test jobs. make lint/vet/test and gofmt all clean.

Deliberately stubbed for the next milestone (marked TODO in
internal/proxy and internal/metering): the SSE forward-then-inspect tee,
forcing stream_options.include_usage=true, client-abort cancellation, and
the durable metering emitter (Kafka/Redis Streams; LogEmitter stands in).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the load-bearing milestone: capture the engine's usage block
from the response as it streams to the client, never buffering.

- capture.Result: frozen handoff contract between the tee (M1) and the
  metering emitter (M2). Carries usage, finish_reason, UsageFound (an
  explicit "unknown" signal, not zero), Aborted, and Streamed.
- rewrite.go: force stream_options.include_usage=true on streaming
  requests by parse->set->reserialize (not duplicate-key last-writer-win,
  which vLLM's Pydantic parse doesn't honor). Overwrites client
  include_usage:false, mirroring vLLM's --enable-force-include-usage.
  Preserves other stream_options; leaves non-streaming/non-JSON untouched.
- tee.go: captureReader wraps the upstream body as an io.Reader that
  forwards bytes verbatim while inspecting them. Streaming: scans SSE
  "data: {...}" lines, pulls usage from the trailing chunk (choices: [])
  and finish_reason from its own earlier chunk — crucially does NOT stop
  at finish_reason (the LiteLLM trap). Non-streaming: parses the single
  JSON body once.
- proxy.go: wires rewrite -> ReverseProxy (FlushInterval=-1 for per-chunk
  flush) -> ModifyResponse tee -> emit. Client-disconnect (context.Canceled)
  is handled as an abort, not a 502.

Verified vLLM usage shape against source (v0.6.4..main): cached tokens at
prompt_tokens_details.cached_tokens, gated by --enable-prompt-tokens-details
+ a cache hit, so absence means 0 (handled). No completion_tokens_details
in vLLM. Usage chunk precedes data: [DONE].

Caught and fixed a buffer-aliasing bug in SSE line scanning that dropped
usage under realistic chunked reads (the byte-at-a-time test passed while
the bulk-read test failed — the tell). Tests cover the LiteLLM ordering,
cache-miss omission, non-streaming, split-across-reads, no-usage, abort
finalize, and a full end-to-end streamed request through a fake vLLM
backend. 13 tests, race-clean, lint-clean.

Abort *correctness* (mark aborted + emit partial per policy) is M3; the
Aborted field and markAborted hook are in place for it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nResolver

Adds two resolver strategies to internal/registry/ so new models become
reachable without a redeploy and torn-down models fail cleanly as ErrNotFound:

- ConventionResolver: zero-lookup URL derivation via a configurable {id}
  template (e.g. http://model-{id}.ns.svc.cluster.local:8000).  Needs no
  control-plane call; works by Kubernetes Service naming convention alone.

- CachedResolver: wraps an injected LookupFunc (the seam for a future Atlas
  control-plane call) with an LRU cache and separate positive / negative TTLs.
  Negative TTL is intentionally short so newly-created models become reachable
  quickly.  Single-flight deduplication (channel-based, race-free) prevents
  thundering-herd on cache miss.  Clock is injectable for deterministic tests.

- ChainResolver: tries each Resolver in order, falling through on ErrNotFound
  OR any other error, so a control-plane outage degrades gracefully to the
  naming-convention guess rather than failing all traffic.

- registry.Config struct for strategy selection and all tunable knobs.

26 tests covering: template substitution, cache hit/miss, positive and negative
TTL expiry, not-found propagation, transient-error retry, URL mutation safety,
concurrent Resolve with single-flight, composition (CachedResolver wrapping
ConventionResolver as LookupFunc, and ChainResolver{cached, convention}).

Full gate: go build ./..., go test -race ./..., go vet ./...,
golangci-lint run, gofmt -l . (empty) — all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds internal/emit with a DurableEmitter that implements metering.Emitter
via Valkey Streams (hot XADD path) → local-disk WAL (fsync'd JSONL fallback
when Valkey is down or channel is full) → structured log floor (last resort
if disk also fails). Emit is always non-blocking: events are handed to a
buffered channel and dispatched by a background worker pool; a separate
shipper goroutine drains the WAL back into Valkey on recovery. Clock is
injectable for deterministic tests. 19 tests covering the WAL, all three
fallback levels, concurrency, idempotency, timestamps, and graceful close.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	go.mod
#	go.sum
… policy

Wire client-disconnect detection into the proxy path and apply explicit
billing policy for aborted requests.

## captureReader (tee.go)
- Add sync.Mutex guarding result and done across all paths (Read goroutine,
  markAborted watcher goroutine, Close/EOF path). All result writes and the
  done flag are now under mu — go test -race clean.
- Add finishedCh (chan struct{}) closed by finish() so the abort-watcher
  goroutine in handleProxy can exit without leaking when the stream completes
  normally (two-arm select on r.Context().Done() vs cr.finishedCh).
- markAborted() now sets Aborted=true under mu, then calls finish() if not
  already done — guaranteeing onDone sees Aborted=true when the watcher wins
  the race. If finish() already ran (clean EOF before context cancel), the
  done guard prevents a second fire and Aborted=false stands (correct).
- onDone fires exactly once regardless of whether abort-watcher, Read-EOF,
  or Close reaches finish() first.

## handleProxy (proxy.go)
- Remove the NOTE about M3 being pending; implement it.
- Launch abort-watcher goroutine inside ModifyResponse (after cr is assigned)
  that selects on r.Context().Done() → cr.markAborted() or cr.finishedCh →
  exit. The goroutine never leaks: finishedCh fires on every normal completion.

## emit policy (proxy.go)
- Aborted + usage captured: always emit (real counts available).
- Aborted + no usage + BillPartialOnAbort=true: emit zero-count partial event
  so downstream can reconcile.
- Aborted + no usage + BillPartialOnAbort=false: log only, no billable event.
- Not aborted + no usage: unchanged from M1 (log for reconciliation).

## Tests (tee_test.go, abort_test.go)
New tests: TestTeeOnDoneFiresExactlyOnce, TestTeeMarkAbortedAfterEOFIsNoop,
TestTeeAbortWithPartialUsage (uses blockAfterReader to simulate mid-stream
abort with usage present), TestAbortMidStreamEmitsAbortedEvent,
TestAbortBillPartialTrue_NoUsage, TestAbortBillPartialFalse_NoUsage,
TestAbortWithUsage (subtests for both billPartial values),
TestAbortOnDoneFiresExactlyOnceViaProxy, TestNormalCompletionNotAffectedByAbortWatcher,
TestAbortRaceStress (50 concurrent aborts; -race verifies no data races),
TestLongStreamNoDeadlineSever (slow backend, no deadline severing).
All existing M1 tee and proxy tests pass unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The binary now uses the real M2/M4 implementations instead of the
LogEmitter/Static stubs:

- config.Settings gains registry.* and emit.* sections (plain YAML fields,
  translated into the subsystem configs in main.go so config doesn't import
  emit/registry). Strategy validation + TTL parsing in parse().
- main.go builds the resolver per strategy (static|convention|cached|chain)
  and the durable emitter (Valkey + WAL fallback), with graceful shutdown.
  A failed durable-emitter init falls back to the log emitter rather than
  taking down serving.
- settings.example.yaml documents the new registry/emit options.
- config_test.go covers defaults, strategy validation, TTL parsing, emit.

Known seam: the cached/chain resolver's LookupFunc currently degrades to the
naming convention. The real Atlas control-plane lookup is still gated on
confirming auth-server resolves model resources via X-Saturn-Resource-Id.

Full gate green: build, test -race (73 tests), vet, golangci-lint, gofmt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the token / API-key identity (X-Saturn-Auth-Id, the JWT sub /
IdentityAuth.id) plus every other X-Saturn-* header verbatim, and record them
on the metering Event. AuthID is the primary attribution key; org/user/group
are resolved downstream (out of band, at rating time) from the IdentityAuth
record, so the hot path never resolves a multi-org user's active-org context.

- identity: add HeaderAuthID + Identity.AuthID; FromRequest reads it.
- metering.Event: add AuthID, ResourceID, ResourceType (all omitempty) so no
  identity the edge gave us is dropped; reorganize fields with clearer docs.
- proxy.emit: populate the new fields.
- tests: identity capture (all-headers + all-empty), e2e proxy asserts
  auth_id + resource_type flow into the event.

Security: Phoebe blindly trusts these headers (it does not authenticate). That
is safe only because Traefik's atlas-auth ForwardAuth allowlist overwrites any
client-supplied X-Saturn-* copy. X-Saturn-Auth-Id is added to that allowlist in
saturn-k8s#976, and emitted by auth-server#85 — both must be live before this
header is relied on for billing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A billing product must not serve traffic it can't attribute. Add an early
billing-identity gate in handleProxy that rejects with 400 if X-Saturn-Auth-Id
(the token / API-key attribution key) or X-Saturn-Resource-Id (the model being
billed) is absent. The error names every missing field so a header
misconfiguration — auth-server not emitting it, or Traefik not allowlisting it
— is immediately obvious.

This replaces the prior lone resource-id check with missingBillingFields(),
which reports all missing required fields at once. UserID/GroupID are NOT
required here — they're resolved downstream from AuthID.

A rejected request emits no billing event (verified). Tests cover each missing
combination, the both-present happy path, and updated existing proxy/abort
tests to supply auth-id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Records the why behind Phoebe's shape — the decisions not recoverable from
code: the ForwardAuth trust model + non-spoofing invariant, token-id-as-
billing-key with downstream org resolution (and why org can't be on the hot
path for multi-org users), the tiered durability contract (Postgres = system
of record, WAL loss acceptable → Phoebe stays stateless), replica safety and
the resolver-returns-Services invariant, Postgres-first I/O logging with the
documented OpenSearch tripwire, the Atlas Postgres conventions to follow, and
the open verify-gates / defaulted decisions from autonomous build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The interceptor's emitter XADDs metering events to a Valkey Stream as the
hot buffer. This adds the consumer side of that durability ladder: a
standalone service that reads the same stream via a consumer group and
writes each event durably to Postgres, the system of record for raw,
pre-rating token counts.

internal/drain:
- Drainer.Run drives XREADGROUP in batches; XACK happens ONLY after the
  Postgres write commits (at-least-once delivery). XGROUP CREATE MKSTREAM
  at "0" so a fresh group drains existing backlog. XAUTOCLAIM reclaims
  entries stranded by a dead consumer. Malformed entries are ACK'd and
  dropped so a poison pill can't wedge the group.
- Store interface (Upsert/Ping/Close) is the mockable seam. PostgresStore
  does a single parameterised multi-row INSERT ... ON CONFLICT
  (request_id) DO NOTHING per batch, in one transaction. At-least-once +
  idempotent upsert = effectively-once. Empty identity strings bind NULL
  so billing GROUP BY auth_id never gets a spurious "" bucket.
- Config mirrors emit.Config's wire-at-call-site pattern (no shared config
  dependency). database/sql + pgx stdlib driver; pool 90/20, Ping on start.

cmd/drainer: -f YAML settings (Valkey/group knobs) + DATABASE_URL env
(Atlas convention) for Postgres. SIGTERM -> finish in-flight batch, exit
clean; un-ACK'd work redelivers safely.

migrations: the table lives in the shared Atlas Postgres, owned by the
Alembic chain — the drainer does NOT migrate at startup. Ships the schema
as reviewable DDL (0001_billing_event.sql) plus a ready-to-copy Alembic
file (atlas/b1f0c2d3e4a5_add_billing_event.py, down_revision = current
Atlas head c1d2e3f4a5b6) following Atlas pk_/ix naming. README explains
the single-migration-system rationale.

Tests (11, race-clean): batch consume -> store -> ACK; idempotent
redelivery (same request_id -> one row); store failure -> no ACK
(redelivers); graceful shutdown; XAUTOCLAIM reclaim; poison-pill drop;
sqlmock SQL-shape + NULL-binding assertions. miniredis for Valkey, fake
Store + sqlmock for Postgres — no real Postgres required.

Deps (both go 1.23-compatible, go directive unchanged at 1.23.0):
jackc/pgx/v5 v5.7.5 (latest v5 whose go directive is still 1.23.0; v5.10
needs go 1.25), DATA-DOG/go-sqlmock v1.5.2 (go 1.15).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds internal/iolog: a per-tenant, opt-in, sampled, short-retention capture of
request+response BODIES, forked from the same upstream bytes as metering but
written to a SEPARATE Postgres store (io_log) with its own retention/access.

Mechanism (mirrors internal/emit's sink+config+worker style):
- Record: request_id + verbatim identity + bodies + status/latency.
- Sink interface: async, non-blocking Log. PostgresSink (worker pool + buffered
  channel, drops+counts on overflow — best-effort, the deliberate contrast with
  metering's durable no-drop ladder); NopSink (off-default); OpenSearch left as
  a documented typed hole with its sampled→log-everything tripwire (DESIGN §8).
- Policy: ShouldLog(id, request_id) gate. StaticPolicy = interim opt-in (global
  enabled=FALSE default, sampleRate=0.0 default, optional auth/group allowlist)
  with DETERMINISTIC hash-of-request_id sampling (reproducible/testable). The
  Policy interface is the seam for a future control-plane ControlPlanePolicy
  (flagged loudly in-code).

Proxy wiring (zero hot-path cost when off):
- handleProxy computes shouldLog ONCE; only then captures the ORIGINAL client
  request body (pre-rewrite, for tenant-troubleshooting fidelity) and enables a
  bounded response-body tee (256KiB cap, truncate+flag) alongside the existing
  usage scan. When off, no extra read/alloc — captureReader.logBuf stays nil.
- Fires sink.Log async from the same onDone completion; never blocks the client.
- main.go wires NopSink + deny-all policy by default; PostgresSink only when
  ioLog.enabled. Config gains ioLog block (fail-closed validation).

Schema: migrations/0002_io_log.sql + alembic stub + README, following the
drainer's migration-ownership pattern (down_revision set at copy time).

Deps: pgx v5.7.5, go-sqlmock v1.5.2 (both go<=1.23); go-internal pinned v1.14.1
so tidy doesn't bump the toolchain. go directive unchanged (go 1.23.0).

Tests (30 new): policy off-by-default/allowlist/rate-boundaries/determinism;
sink async-nonblocking/overflow-drops/NopSink/sqlmock SQL shape; proxy
shouldLog=false no-buffering, shouldLog=true full Record, streaming verbatim,
response cap truncation; config fail-closed validation. go test -race clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The revenue path. A batch job (cmd/rater) reads billing_event over a time
window, joins an effective-dated price book, computes cost in integer
micro-USD, and upserts per-(auth_id, model, hour) cost rollups into rated_usage.

Money correctness is the product; the non-negotiable invariants, each tested
by name:

- INTEGER micro-USD (1e-6 USD) everywhere, never float. 1 Atlas
  hourly_usage_record unit (1e-4 USD) == 100 micro-USD; finer base avoids
  rounding tiny per-token prices to zero before the multiply.
- Billable-prompt formula (the highest-risk line): vLLM's prompt_tokens is the
  TOTAL and cached_tokens is the cache-hit SUBSET, so
  cost = (prompt-cached)*prompt_rate + cached*cached_rate + completion*comp_rate.
  Cached tokens are charged ONCE (at the discounted rate), never double-counted.
- Effective-dated prices: an event is rated with the price in effect at its
  event_ts (fallback created_at); rating now never retroactively reprices old
  traffic.
- Idempotent re-runs: rollups upsert ON CONFLICT (auth_id, model, window_start)
  DO UPDATE, recomputing totals from scratch — re-rating a window reconciles,
  never doubles.
- Fail-closed on missing price: a model with no price-book entry at the event's
  time is counted as unpriced and logged loudly, NEVER silently billed $0
  (that is lost revenue). The rater exits 2 so a CronJob can alert.

Schema (migrations + ready-to-copy Alembic, chained after billing_event's
b1f0c2d3e4a5): model_price (effective-dated per-token price book) and
rated_usage (the hourly rollup). Prices are DATA, not code — a clearly-labelled
non-binding seed lives in seed_example_prices.sql; no prices ship in the schema.

No new dependencies. go build/vet/test -race/golangci-lint/gofmt all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ReadWindow previously filtered out billing_event rows with NULL auth_id or
model via `AND auth_id IS NOT NULL AND model IS NOT NULL`, with a comment
claiming they were "surfaced elsewhere." That elsewhere did not exist, so such
a row vanished from rating with zero count and zero alert — a silent
revenue-loss / data-loss path, exactly what we fail LOUD on for missing prices.

Make the exclusion loud, mirroring UnpricedEvents:

- store: drop the NULL filter from the WHERE; scan every in-window row with
  sql.NullString and, when auth_id/model is NULL, count it and skip it (atomic
  single-scan count). ReadWindow now returns (events, unattributable, error).
- rater: add Result.UnattributableEvents, HasUnattributable(), and a combined
  HasAnomaly() = HasUnpriced() || HasUnattributable(); thread the count through
  Run(); log it loudly (ERROR) and fold it into the loud summary.
- cmd/rater: exit 2 on HasAnomaly() (was HasUnpriced()); rename exitUnpriced ->
  exitAnomaly and update the exit-code doc. A nonzero unattributable count now
  alerts a CronJob the same as unpriced events.
- replace the stale "surfaced elsewhere" comment with the truth.

This should never be nonzero (the interceptor's fail-closed billing gate rejects
requests missing auth_id before they are metered) — which is precisely why a
nonzero count must be loudly surfaced: it means something upstream is broken and
revenue is leaking.

Tests: TestRater_UnattributableEventsCountedNotSilent (not rated, counted,
triggers anomaly/exit-2) and TestPostgresStore_ReadWindowCountsUnattributable
(NULL rows skipped+counted in the scan). Existing money-rule and ReadWindow
tests updated for the new signature; all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Everything in the usage-capture path is unit-tested against SYNTHETIC vLLM
payloads + source research — necessary but not sufficient for a billing
product. If the real engine names a field differently (esp.
prompt_tokens_details.cached_tokens, flagged in metering.go), every bill is
wrong and the unit tests wouldn't catch it.

cmd/engine-conformance fires real OpenAI-style requests at a live engine and
asserts the bytes parse into the SAME metering.Usage struct Phoebe bills on:
- non-streaming usage block present + sane counts
- streaming trailing usage chunk (choices:[]) before data: [DONE]
- (opt-in) prefix-cache: cached_tokens > 0 on a repeated long prefix — the
  field metering.go flags for live verification, doubling as the check for the
  --enable-prompt-tokens-details deployment requirement

Exit 0 = safe to bill against this engine; 1 = a billed field mismatched.

Needs NO GPU to develop against (targets any OpenAI-compatible URL); validated
green against a synthetic vLLM-shaped stub including the cached path. The
one-command real-vLLM run (the trust sign-off) is documented in
validation/README.md and gated only on a running engine — no GPU is spent
until someone deliberately stands one up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A flaky CI failure on TestAbortMidStreamEmitsAbortedEvent turned out to be a
REAL billing bug, not test flakiness: on a client abort, the metering event
was sometimes never emitted — lost revenue.

Root cause: the abort signal and the finalisation raced. An abort-watcher
goroutine called markAborted() to set result.Aborted=true, while the body's
Close() (triggered by ReverseProxy on context cancel) independently called
finish(). If Close()→finish() won the race, onDone fired with Aborted=false →
emit() hit the "no usage, not aborted" early-return → NO event. markAborted()
then ran too late (done already true). Reproduced ~1 in 20 runs.

Fix: make abort determination independent of goroutine scheduling. The
captureReader now holds the request context and finish() reads ctx.Err() as the
single source of truth for "did the client disconnect" — whichever path reaches
finish() first observes the same correct value. net/http cancels the request
context on disconnect, and ReverseProxy cancels the upstream from it, so by the
time finish() runs on an abort, ctx.Err() is non-nil.

This removes the racy machinery entirely (the abort-watcher goroutine,
markAborted, finishedCh) — net-simpler and correct. proxy.go just constructs the
reader with r.Context().

Also hardens the tests against the flakiness that exposed this: a recordingEmitter
.waitForEvents(n, timeout) replaces fixed time.Sleep-then-assert (the async emit
could land after a fixed sleep under CI load). Tee abort tests now drive abort
via a cancelled context, matching the real path.

Verified: the abort/tee suite passes 50x under -race; the three previously-CI-
failing tests pass 50x consecutively. Full gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fine-tune derivation

Replace the obsolete rating-v1 (integer micro-USD, Go-side per-event Rate loop,
price keyed on model name) with the locked v2 design. Money correctness is the
product; this is the deepest money path in phoebe.

Schema (migrations/0002_rating.sql + atlas/c2f1a3b4d5e6_add_rating.py):
  - model_price: effective-dated price book keyed on a STABLE model_id, with a
    self-referencing derived_from (one hop), NUMERIC(20,9) per-token rates
    (nullable for derived rows), created_by audit column. A GiST EXCLUDE
    (model_id WITH =, tsrange(effective_from, effective_to) WITH &&) makes
    overlapping price windows IMPOSSIBLE, so the rating join cannot fan out and
    over-bill. CHECK enforces rate-or-derived and all-or-none rate components.
  - derivation_policy: single GLOBAL fine-tune rule (identity|multiplier|markup),
    effective-dated, with its own GiST EXCLUDE (one policy per instant) and param
    CHECKs. Per-base override is a deliberate v1 non-goal.
  - rated_usage: per-(auth_id, model_id, hour) rollup, NUMERIC cost, idempotent
    natural-key upsert. Adds a partial event_ts index for the window scan.
  - Notes CREATE EXTENSION btree_gist.

Money model: all money is NUMERIC and ALL money math is in SQL. The rater is a
single INSERT ... SELECT that resolves the effective price (own rate wins; else
base via derived_from through the effective policy, ONE hop; else unpriced),
computes per-event cost via the billable-prompt formula, and SUM()s per rollup —
Go never holds a running money total; it carries NUMERIC as text. The
billable-prompt formula (cached ⊆ prompt, charged once, clamped >= 0) is the
highest-risk line and is implemented identically in SQL and in the Rate() oracle.

Code (internal/rating, cmd/rater): rate.go keeps a PURE Rate() reference + an
exact big.Rat decimal (Dec) as the spec; pricebook.go is the resolution oracle;
store.go is the SQL rater (resolvedEventsCTE shared by the rate insert and the
anomaly count, so "unpriced" means the same in both); rater.go is orchestration
only; cmd/rater exit codes 0/1/2 unchanged. Unpriced and unattributable rows are
counted, logged loudly, excluded from rollups (never $0-billed), and exit 2.

Deleted v1 surface: int64 micro-USD Price/cost, the Go-side per-event Rate loop
and rollup aggregation, PriceBook over model-name, Store.LoadPrices/ReadWindow/
UpsertRollups, the cost_micro_usd schema.

Tests (43 in internal/rating): the named money rules — cached-subset-no-double-
count, effective-dated-selection, derivation identity/multiplier-1.5x/markup,
derived_from-inheritance, own-rate-bypasses-policy, missing-price-fails-loud-not-
zero, unattributable-counted-not-silent, idempotent-rerun-no-doubling, numeric-
exactness-no-float, one-hop-only — plus a conformance test (in-Go SQL model ==
Rate() oracle) and a build-tagged live-Postgres conformance/idempotency test
(store_integration_test.go) running the REAL SQL. sqlmock pins the INSERT...SELECT
shape, the effective-dating predicates, the policy CASE, and ON CONFLICT.

Out of scope (noted in code/migrations): the price-write authz / deploy-time gate
(Atlas control-plane); per-base derivation override; derived_from chains > 1 hop
(treated as unpriced, never recursed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	internal/proxy/proxy.go
#	internal/proxy/tee.go
#	migrations/README.md
Build /app/phoebe, /app/phoebe-drainer, and /app/phoebe-rater in one image so a
single tag ships the whole system; the k8s command selects which runs. The rater
is the batch rating job (cmd/rater) that a CronJob invokes hourly.

Supersedes the two-binary Dockerfile in #4 (this branch is stacked on rating-v1,
which has cmd/rater). Static CGO-off builds on alpine + ca-certificates.
Verified: docker build succeeds, all three binaries run, image ~46MB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Go cost-formula code in this package — Rate(), the Dec exact-decimal type,
the in-memory PriceBook resolver, ApplyPolicy — is an ORACLE: a reference
reimplementation that exists only to pin the production SQL rater (a conformance
test asserts the SQL matches it row-for-row). The production rater is pure SQL;
NO production code path references Rate3, Dec, RatedEvent, PriceBook, Rate, or
ApplyPolicy (verified: every caller is a _test.go file, and cmd/rater touches
only New/Run/OpenPostgres/Config).

But it lived in production files (rate.go, pricebook.go), so "which of this
actually bills a customer?" required tracing call sites to answer — twice, in
review. Go has a mechanical answer to that question: the _test.go suffix
excludes a file from normal builds. Putting the oracle there lets the compiler
guarantee it never ships, and makes the production surface self-evident.

  rate.go      -> oracle_test.go   (rename + comment edits only; no logic change)
  pricebook.go -> resolve_test.go  (pure rename)
  doc.go       (new)               package doc + the production-vs-oracle split

Behaviour is unchanged: production builds without the oracle (proving it was
never needed there), and the full test suite — including the conformance test
that pins the SQL to the oracle — still passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rater prices on billing_event.model, which store.go aliases to model_id —
"the stable model identity, NOT the ephemeral deployment id." But the proxy was
writing Event.Model = id.ResourceID (the X-Saturn-Resource-Id routing header,
i.e. exactly the deployment id). So every event's price key was a deployment id
that matches no model_price row → every event unpriced → the rater bills $0 and
exits fail-loud. The metering→rating seam was severed: producer and consumer
each documented the invariant as if the other upheld it; neither did.

The root cause was a missing field in the capture contract: capture.Result had
no model name, and the tee discarded the `model` field that every OpenAI/vLLM
chunk carries at the top level. Fix it at the seam:
  - add Model to the tee's chunk struct + capture.Result, populated in
    inspectChunk from the engine's own response (the authoritative source)
  - plumb res.Model into Event.Model and the iolog Record (was id.ResourceID
    in both); ResourceID stays the routing/attribution key it always was
  - store.go's `model AS model_id` is now TRUE

Tests: the streaming end-to-end and iolog fixtures now carry the engine model
name ("llama-3-8b") distinct from the resource id ("model-abc"), and the
assertions require Event.Model == the engine name AND != ResourceID — the
regression guard that was missing (the old tests asserted Model == resource id,
encoding the bug as correct).

Also drop the stale package-doc reference to a markAborted() watcher goroutine
that the M3 abort-fix already deleted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
drainWAL did readAll (release lock) → ship over the network → Truncate(0)
(zero the whole file). The lock was dropped across the multi-second ship, and
append() runs concurrently on the hot path (Valkey down → channel full →
append) — exactly when the WAL is hot. Any event appended during the ship
window was destroyed by Truncate(0) without ever being shipped: a silently
lost metering event on a LIVE, healthy pod. DESIGN.md §5 defends pod-death WAL
loss (correct) in a way that laundered this steady-state loss (not correct).

Fix: rotate, don't truncate. rotate() atomically renames the WAL aside to a
`.draining` snapshot and opens a fresh WAL under the lock; the drain ships the
snapshot and dropRotated() deletes it on success. An append concurrent with the
ship lands in the fresh WAL, never in the snapshot being shipped — so dropping
the snapshot cannot destroy an unshipped event. recoverRotated() re-ships a
snapshot orphaned by a crash mid-ship on the next tick (at-least-once; the
consumer dedups on request_id).

Tests: TestWAL_RotateDoesNotLoseConcurrentAppend is the regression guard — an
append after rotate/before drop survives (the old truncate would have eaten it);
TestWAL_RecoverRotatedReshipsOrphan covers crash-mid-ship recovery. DESIGN.md §5
now distinguishes the tolerated loss (pod death) from the fixed bug (in-process
drain race).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d iolog

Four review fixes where being wrong is expensive, plus one related guard:

④ Reject non-hour-aligned --since/--until in the rater (cmd/rater). The rollup
   buckets by date_trunc('hour') and the upsert REPLACES a bucket, so a sub-hour
   window would overwrite a complete hourly rollup with a partial sum → silent
   under-bill. Fail loud (consistent with the inverted-window guard) rather than
   snap, so an operator never re-rates hours they didn't name.

⑤ tsrange → tstzrange in the model_price/derivation_policy exclusion constraints.
   The columns are TIMESTAMPTZ; tsrange coerces them to local time via the session
   TimeZone, making the no-overlap guarantee session-TZ-dependent. The rater's
   price lookups use LIMIT 1 with no ORDER BY, trusting that constraint for
   at-most-one-match — that bedrock cannot be TZ-sensitive. tstzrange compares in
   absolute time. (Also add CHECK(effective_from < effective_to): an equal-bound
   row is an empty range the exclusion won't catch — a silently inert dead price.)

⑥ iolog allowlist is now FAIL-CLOSED. An empty allowlist opted in EVERY tenant —
   so enabling sampling for one tenant but forgetting the allowlist silently
   captured all tenants' (possibly-PII) bodies, contradicting the package's own
   "explicitly opted in / fail closed" posture. Empty now opts in no one; a new
   explicit AllowAllTenants flag is required for deliberate fleet-wide debug.

⑦ Already fixed in the model-capture commit: onDone emits on
   context.WithoutCancel(r.Context()) so an aborted request (whose context is
   cancelled — that's how abort is detected) can't have its durable bill dropped
   if Emit ever honors ctx.

Tests: hour-aligned-window rejection (minutes/seconds rejected, whole hours
accepted); iolog empty-allowlist-means-NONE + AllowAllTenants-is-explicit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + rounding)

The conformance story looked more rigorous than it was. Three gaps, all closed:

1. ROUNDING ORDER didn't match production. The SQL sums the EXACT per-event
   products and rounds the SUM once (NUMERIC(20,9)); the oracle rounded EACH
   event then summed. For a sub-nano per-token residue (e.g. a derived price
   0.000000001 × 1.5 = 0.0000000015) these diverge by up to ~1 nano/event. The
   oracle now sums exact (rateExact) and rounds once per rollup, matching the SQL.

2. The in-Go "conformance" test checked the oracle against ITSELF — both sides
   pure Go, the production SQL never ran in the default lane. Its doc comment
   (and doc.go) claimed it "pins the SQL"; corrected to say what it actually does
   (internal consistency), and pointed at the integration tests as the real pin.

3. The integration conformance test now runs in CI. Added an integration-test
   job with a Postgres service container + `make integration-test`, so the REAL
   rateWindowSQL is exercised on every push — money correctness can't regress
   silently behind a build tag that never runs.

New TestConformance_SubNanoRoundingMatchesSQL builds a multi-event rollup with a
sub-nano residue where round-each-then-sum (0.000000006) and sum-then-round-once
(0.000000005) genuinely differ, asserts the SQL equals round-once, and guards
that the fixture actually exercises the divergence. Verified against a live
Postgres: it PASSES reconciled and FAILS (5 vs 6 nano) if the oracle reverts to
round-per-event. Also caught and fixed a stale `tsrange` (vs tstzrange) in the
test's inline schema — Postgres rejects tsrange(timestamptz,...) outright, the
same TZ-safety fix as the migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Runs `make integration-test` against a postgres:16 service so the rating SQL
(rateWindowSQL + the GiST/tstzrange constraints + the sum-then-round behavior)
is exercised on every push — money correctness can't regress silently behind a
build tag that never runs in CI. Held out of the review-fixes push because the
bot PAT lacks 'workflow' scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…el sentinel

X-Request-Id is NOT on the Traefik auth-server allowlist, so it is
client-controlled — yet it becomes billing_event's PRIMARY KEY (the
billing idempotency key). Three holes, one gate:

- Absent header: previously forwarded verbatim as "", which the drainer
  poison-drops — a client that simply omitted the header was served but
  never billed. Now the proxy GENERATES an id (16 bytes crypto/rand,
  hex, "phoebe-" prefix), uses it on the metering event, forwards it
  upstream, and echoes it on the response so correlation still works.
- Invalid header: an oversize (>VARCHAR(255)) or non-printable id would
  fail the drainer's all-or-nothing batch INSERT and wedge innocent
  events. Now fail closed with 400: max 200 chars, printable ASCII
  ([\x21-\x7e]) only — same class as the identity gate above it.
- Deliberate reuse of a valid id is dedup'd by the PK (free inference);
  that cannot be fixed here without breaking at-least-once redelivery
  dedup, so it is documented as a known limitation in proxy.go and
  DESIGN.md §1 (true fix: an edge-stamped, allowlisted id).

Also two latent context bugs in the same handler:

- ErrorHandler compared err == context.Canceled; the transport delivers
  the cancellation wrapped (*net.OpError), so aborts were logged as
  upstream errors. Use errors.Is.
- onDone passed the raw (cancelled-on-abort) r.Context() to ioSink.Log;
  the first sink that honours ctx would drop every aborted request's
  record. Hoist one context.WithoutCancel for both emit and Log.

Tests: TestProxyRequestID_GeneratedWhenAbsent,
TestProxyRequestID_RejectsInvalid (upstream never sees a rejected id;
no billing event emitted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
hhuuggoo and others added 29 commits June 10, 2026 23:50
…maly bucket

eventArgs stored e.Model verbatim; its "NOT NULL in practice" comment
was false — capture.Result.Model is "" whenever the upstream never
reports a model. The rater's unattributable predicate is `model_id IS
NULL`, so a stored '' dodged it and the event was misreported as
UNPRICED, sending operators down the wrong runbook ("backfill prices"
instead of "fix the capture gap"). Route Model through nullStr like
every other identity column.

migrations/README.md gains a one-line backfill for pre-fix rows:
UPDATE billing_event SET model = NULL WHERE model = '';

Test: TestPostgresStore_EmptyModelStoredAsNull (sqlmock asserts the
model arg binds nil).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…truncate-after-ship)

The rotate-to-.draining snapshot scheme had four confirmed loss modes:

1. rotate() renamed onto the FIXED .draining path, clobbering an
   unshipped orphan snapshot during a multi-tick Valkey outage —
   destroying every tick's events but the last.
2. A partial read error shipped the prefix then dropRotated() the WHOLE
   snapshot, deleting the unread suffix.
3. rotate's rollback double-fault left w.f on an unlinked inode.
4. readEvents used a default bufio.Scanner, so any >64KB line
   (unbounded header/model strings) killed the whole scan.

Replace the mechanism with github.com/tidwall/wal v1.2.1 (small, MIT,
well-tested, go1.13): entries are written at strictly increasing
indexes and only reclaimed by TruncateFront AFTER the shipper confirms
delivery through their index. Concurrent appends land at higher
indexes that truncation never touches, so clobber/partial-read loss is
structurally impossible. The drain now truncates after EACH shipped
batch, so partial-outage progress is retained (strictly better than
the old all-or-nothing drop).

tidwall cannot truncate a log to empty (TruncateFront(last+1) is out
of range), so a fully-shipped log retains its tail entry; an in-memory
shippedThrough watermark stops it from re-shipping every tick. On
restart the watermark resets and that one entry re-ships once —
harmless, at-least-once, the drainer dedups on request_id.

Also handled, with tests:
- Corrupt log dir on open → quarantined to <dir>.corrupt.<ts> for
  forensics, fresh log created; a corrupt billing buffer must not
  block serving and the loss is bounded + loudly logged.
- Legacy single-file JSONL WAL at the same configured path → imported
  (unbounded line reader, torn final line tolerated) and renamed aside
  to <path>.imported; lost-on-upgrade events would be silent revenue
  loss.
- Corrupt individual entries are skipped with a counted ERROR log and
  truncated past — never wedge the drain.

Regression tests: multi-tick-outage-no-loss (the clobber bug),
partial-batch-progress, crash-reopen, corrupt-entry-skipped,
corrupt-dir-quarantine, legacy-import (unit + end-to-end ship),
oversize-event (~100KB field).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The rater filters billing_event on COALESCE(event_ts, created_at), but the
index was on bare (event_ts), partial on NOT NULL. Postgres matches index
expressions structurally, so that index can never serve the COALESCE
predicate — every rater run seq-scanned a table that only grows, and both
migration files' comments wrongly claimed the index covered the scan.

Replace it with an expression index on (COALESCE(event_ts, created_at)) in
both the reference .sql and the Alembic artifact, and fix the comments.
Verified against live Postgres 16 with EXPLAIN (enable_seqscan=off): the
plan uses billing_event_rating_instant_ix for the rater's window predicate
(the integration test pinning this lands with the rating-package commit).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rministic ids

Three money-correctness fixes in the SQL rater, one statement-shape change:

ONE STATEMENT, ONE SNAPSHOT (fail-loud could silently pass): Run() issued
CountAnomalies and RateWindow as two independent READ COMMITTED statements,
so a billing_event row the drainer committed between them was excluded from
the rollups yet uncounted — exit 0 with revenue silently missing. The
anomaly counts now ride rateWindowSQL itself: the upsert is a data-modifying
CTE and the final SELECT reads both it and the resolution CTEs, and every
CTE of one statement sees the same snapshot, so rated + unpriced +
unattributable provably partitions the in-window events. Run() makes ONE
store call; CountAnomalies stays as an ad-hoc/ops query only.

SESSION-TZ-INDEPENDENT HOUR BUCKETS (double-bill on re-rate):
date_trunc('hour', tstz) truncates in the session TimeZone, and nothing
pinned it — a fractional-offset session (+05:30) wrote off-boundary
window_start values whose ON CONFLICT keys disagree with a UTC session's,
i.e. duplicate/overlapping rollups. The bucket is now
date_trunc('hour', ev_ts AT TIME ZONE 'UTC') AT TIME ZONE 'UTC' (portable to
all supported PG, unlike 3-arg date_trunc) — same hazard class the schema
already closed with tstzrange in the exclusion constraint. Belt-and-braces:
OpenPostgres appends timezone=UTC to the DSN unless the operator set one.

DETERMINISTIC SURROGATE IDS: the id was md5(random()||clock_timestamp()),
with a comment claiming a collision would be a "benign PK retry" — there is
no retry; a PK collision aborts the statement. The id is now the md5 of the
natural key (auth_id|model_id|epoch(window_start)), so re-runs regenerate
the same id and cross-run collisions are impossible by construction (epoch,
not ::text, because timestamptz::text renders in the session TZ).

Also: the upsert's SELECT is ORDERed BY the natural key so concurrent raters
lock rows in deterministic order (no ABBA deadlock).

Tests: oracleStore mirrors the one-pass shape; conformance asserts the
partition invariant and that Run never calls CountAnomalies; new live-PG
integration tests pin UTC bucketing under SET TIME ZONE 'Asia/Kolkata'
(exact hour boundaries + same conflict keys from a UTC re-run), id stability
across re-runs, and EXPLAIN using the new expression index. All verified
against Postgres 16.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ains)

The default cron window was only the last complete hour, so an event
drained late — a Valkey outage recovered from the WAL — with an event_ts in
an already-rated hour fell in no future window and was never rated: silent
revenue loss, and exit 0 forever since the anomaly counts scan the same
window.

The default is now the TRAILING N complete hours, [floor(now)-N*1h,
floor(now)), N = rateTrailingHours (default 24, must be >= 1; a pointer in
the YAML so an explicit 0 fails loud). Re-rating closed hours is safe by
construction: the upsert REPLACES each (auth_id, model_id, hour) bucket
with a freshly recomputed total, so late events fold in and nothing
doubles. Explicit --since/--until still wins, with the existing
hour-alignment fence.

Residual risk, stated honestly in the docs: an event arriving more than N
hours late still slips the default window — the DESIGN.md reconciliation
backstop is the eventual answer; until then widen rateTrailingHours or
re-rate the hour with --since/--until.

Tests: resolveWindow default is trailing-24 (and honors a custom N),
N < 1 rejected; the rating package's late-arrival test (previous commit)
pins that a later trailing-window run rates an event drained after its
hour was already rated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…T, count insert losses

Three holes in one path: io_log records could fail their INSERT (and
vanish uncounted) because the bytes weren't storable in Postgres TEXT.

- tee.appendLog truncated the body-log copy at a raw byte offset, which
  can split a multibyte rune at the cap and leave the buffered copy
  invalid UTF-8. Back up to the rune start (<= 3 bytes, utf8.UTFMax-1)
  before cutting, and stop appending permanently once truncated.
- The postgres sink now sanitizes both bodies before INSERT: Postgres
  TEXT rejects invalid UTF-8 outright and NUL (0x00) even though it is
  a valid code point. Invalid runs become U+FFFD; NULs are removed.
  Bodies are tenant-controlled bytes — either would otherwise fail the
  insert and silently lose the record.
- insert() failures now increment the dropped counter (extended to mean
  "records lost for any reason": overflow or failed INSERT), so loss is
  visible instead of buried in warn logs.

Tests: TestTeeBodyLogTruncationKeepsRuneBoundary,
TestPostgresSink_NulAndInvalidUTF8BodySanitized,
TestPostgresSink_InsertFailureCountsDropped.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two pre-existing shutdown bugs, both of which could silently drop
metered (billable) events:

1. DurableEmitter.Close: workers exited via `default: return` the
   instant the channel was momentarily empty, while producers were
   never stopped — an Emit racing Close could queue an event into the
   buffered channel after the last worker left, stranding it with no
   WAL entry and no log floor. Now Close marks the emitter closed and
   closes the channel under a write lock (paired with a read lock
   around Emit's send, so no send ever hits a closed channel); workers
   range the channel to completion; Emit after Close goes straight to
   the WAL (or the log floor once the WAL is closed). The final WAL
   drain moved from the shipper into Close, AFTER the workers finish,
   so events that fell to disk during shutdown still ship.

2. cmd/interceptor: a server error hit log.Error.Fatalf, whose os.Exit
   skips deferred cleanup — the emitter was never Closed (no flush, no
   final drain) and the iolog sink never flushed its batch. Cleanups
   now run unconditionally before a nonzero exit.

Tests: close-vs-emit race (every event accounted for in WAL ∪ floor),
emit-after-close lands on the log floor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- internal/config: document that walPath is the WAL directory, that a
  pre-upgrade single-file JSONL WAL at the same path is auto-imported
  and renamed aside, and that the configured path must not change
  across the upgrade.
- settings.example.yaml: same note on the emit block.
- DESIGN.md §5: replace the rotation paragraph with the tidwall/wal
  design — sequential indexed entries, truncate-after-each-shipped-
  batch watermark, why clobber/partial-read loss is structurally
  impossible now, the retained-tail/at-least-once wrinkle, corrupt-dir
  quarantine, legacy import, and the library-choice rationale (small,
  MIT, tested, go1.13).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…> Postgres -> rater

The three worst review bugs were producer/consumer contract mismatches
between components whose unit tests each passed (Event.Model carrying the
deployment id, empty request_id poison-dropped by the drainer, empty model
misfiled as UNPRICED instead of UNATTRIBUTABLE) — because nothing tested
the actual pipe. internal/e2e wires the REAL components in one process:

  - httptest fake vLLM streaming the realistic SSE fixture (engine model
    name + trailing usage chunk with cached_tokens)
  - real proxy.Server with full identity headers, resource id distinct
    from the model name, and deliberately NO X-Request-Id (generated-id
    path end to end)
  - real emit.DurableEmitter against miniredis
  - real drain.Drainer via its production Run loop into a real Postgres
    billing_event, schema created from the migration .sql files on disk
  - real rating.Rater over the event's hour window

Asserts the money: one rated_usage row keyed on the ENGINE name (never the
deployment id), token sums from the usage chunk, cost equal to the
hand-derived NUMERIC (compared in SQL, never as a Go float), zero
unpriced/unattributable, and a phoebe-prefixed request_id surviving to the
billing_event PK. A second scenario pins the nullStr(model) contract:
a model-less event drains to model = NULL and lands in UNATTRIBUTABLE,
not UNPRICED.

Gated behind the integration tag + PHOEBE_TEST_DATABASE_URL (picked up by
make integration-test); isolated DROP/CREATE schemas so it can run in
parallel with the rating integration tests; ~1.6s wall clock.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A failed tidwall Write left the wrapper and the library divergent in
three ways, all verified broken:

  (a) fsync fails after the data hit the file: the library's lastIndex
      advanced but the wrapper's nextIndex did not, so every subsequent
      Write(nextIndex) returned ErrOutOfOrder FOREVER — the WAL wedged
      permanently and every event for the rest of the process floored,
      even after the disk recovered.
  (b) partial write (ENOSPC — the exact scenario the WAL exists for):
      torn bytes landed mid-segment, the next append "succeeded" after
      them, and the NEXT restart hit ErrCorrupt at Open and quarantined
      the whole buffer — maximal, delayed, silent loss when fullest.
  (c) clean write failure left a ghost entry in the library's in-memory
      segment buffer — index-shifted reads until restart.

Fix: the library's own prescription (documented on its truncate error
paths) — on ANY Write error, and on ErrCorrupt from reads/truncates,
close and reopen the log via the existing quarantine-capable open path,
resync nextIndex from disk, and conservatively reset the shippedThrough
watermark (re-shipping retained entries is safe: at-least-once, deduped
downstream on request_id). The failed event still floors, as before.

Under ENOSPC the honest trade is: the reopen finds the torn tail
immediately (verified empirically: torn bytes at the very tail fail
Open with ErrCorrupt, same as mid-file) and quarantines the pre-failure
buffer ONCE, loudly, at failure time — bounded and recoverable from the
.corrupt dir — instead of planting a silent time bomb for the next
restart. The quarantine log line now includes the count of prior
.corrupt siblings so a flapping pod's disk growth is visible.

If the reopen itself fails the wal is poisoned: appends error
immediately (the emitter floors them), pending/markShipped no-op
cleanly, and the reopen is retried on every append — it is already the
failure path, so the extra open attempt is cheap relative to the loss
it can stop.

Also fixes the legacy-import crash window: the old order renamed
wal.jsonl to .imported BEFORE appending its events to the new log, so a
crash in that window parked events invisibly (silent revenue loss). Now
the file is staged to .importing, the events are appended FIRST, and
only then is it parked as .imported — a crash re-imports on the next
startup, producing duplicates, which is the correct side of the trade
because the drainer dedups on request_id. A leftover .imported file
warns on every startup (parked forensic data), and a wholly unreadable
legacy file is left in place as evidence instead of being renamed
aside.

Tests (fault injection via a test-only failNextWrite hook that receives
the live log, so tests reproduce the real divergent states):
  - TestWAL_WriteErrorRecovers_NoPermanentWedge   (a)
  - TestWAL_WriteErrorNoSilentCorruption          (b)
  - TestWAL_PoisonedFloorsAndRecovers
  - TestWAL_LegacyImportCrashWindow (both crash points)
  - TestWAL_ImportedLeftoverWarns
  - TestWAL_LegacyUnreadable

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The ReverseProxy ErrorHandler fires for both RoundTrip errors and
ModifyResponse-returned errors. Go delivers a mid-flight client
cancellation WRAPPED (a *url.Error around context.Canceled on the common
RoundTrip path), so an identity compare misses it and the abort is
misclassified as an upstream fault: logged at error AND 502'd over a
connection that is already gone.

Introduce a single isClientAbort(err) predicate using errors.Is, covering
both context.Canceled and context.DeadlineExceeded, and extract the
ErrorHandler into s.errorHandler(upstream, id, requestID) — a method seam
that makes the classification directly testable and threads identity
through for the abort-emit that follows (Fix A). Mechanism only here; the
emit is wired next.

Test TestErrorHandlerClassifiesWrappedCancel drives the handler with
*url.Error{Err: context.Canceled} (and DeadlineExceeded) and asserts no
502; TestErrorHandlerUpstreamFaultStill502 pins that a genuine upstream
fault still 502s; TestIsClientAbort locks the predicate's truth table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ModifyResponse installs the captureReader whose onDone emits the metering
event. But a client that disconnects BEFORE the upstream writes response
headers fails RoundTrip, so ModifyResponse never runs and onDone never
fires — a request that passed the billing-identity gate would emit
NOTHING, leaving served-or-attempted traffic invisible to billing and
reconciliation.

The ErrorHandler now emits exactly ONE zero-token, attributable event
(Aborted=true, UsageFound=false) with the already-resolved identity, via
the SAME s.emit path the completion path uses — so the pre-header abort
obeys the SAME BillPartialOnAbort policy (no second, policy-bypassing
emit). The emit is gated on isClientAbort, so a genuine upstream fault
never writes a bogus zero-token billing row. The context is decoupled from
the cancelled client ctx (WithoutCancel), mirroring onDone.

No double-emit: ModifyResponse always returns nil in phoebe, so
ErrorHandler fires ONLY on a RoundTrip error (pre-header) — mutually
exclusive with the onDone (post-header) path.

Establishes the invariant: every request past the billing-identity gate
emits exactly one attributable event — usage on completion, or zero-token
Aborted on disconnect (pre- or post-header).

Tests: TestPreHeaderAbortEmitsAttributableEvent (exactly one event,
Aborted=true, zero tokens, AuthID+ResourceID populated);
TestPreHeaderAbortBillPartialFalseNoEvent (policy still honored);
TestNormalCompletionEmitsExactlyOnce (no double-emit on the clean path).
Updated TestAbortRaceStress's now-stale "no event is correct" comment to
the new contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The captured request body flowed UNCAPPED into the io_log INSERT, whose
body_tsv column is populated by to_tsvector over the request+response
bodies at write time. Postgres rejects a tsvector input past ~1 MiB
("string is too long for tsvector ..., max 1048575 bytes"), so any
long-context prompt failed the whole INSERT and the sink silently
drop-and-counted the record. (Confirmed against live Postgres 16 — see the
integration test.) The response body was already capped + flagged; the
request body was not.

Cap the LOGGED request-body copy at the SAME MaxBodyBytes bound the
response uses, add Record.RequestTruncated mirroring ResponseTruncated,
and collapse the rune-safe truncation into ONE shared helper
(truncateAtRuneBoundary) consumed by both the response copy
(captureReader.appendLog) and the request copy (captureRequestBody) —
one bound, one truncation, driven by the one MaxBodyBytes config value.
The FORWARDED request to the upstream is always the full body; the cap
bounds only what is stored. Truncation is at a rune boundary so the
stored TEXT is valid UTF-8.

Wires request_truncated through the INSERT column list/args, the
reference SQL migration, and the Alembic artifact (request_body still
$8 / response_body now $10 in the tsvector). Also corrects the
captureRequestBody doc, which previously claimed the copy was "bounded by
inbound request size" — io.ReadAll enforced no such bound; now it is real.

Tests: TestRequestBodyCappedBeforeTsvector + TestRequestBodyUnderCapNotFlagged
(proxy path: capped + flagged, full body still forwarded);
TestTruncateAtRuneBoundary (rune-safe bound, valid UTF-8); and the
integration pair TestRequestBodyCappedBeforeTsvector_Integration (capped
body INSERTs cleanly with a populated tsvector) +
TestUncappedRequestBodyWouldFailTsvector_Integration (pins the >1 MiB
failure the cap prevents).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inline comment at the per-tenant opt-in gate read "An empty allowlist
means all tenants are eligible" — the exact INVERSE of the fail-closed
code it annotates (optedIn returns false for an empty allowlist; fleet-wide
capture requires the explicit AllowAllTenants flag). For the most sensitive
data in the system (prompts/completions, possibly PII) a comment that
states the opposite of the privacy rule is a latent footgun for the next
editor. Comment-only; matches the code and the struct docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
In process(), poison (decode-failed) entries were collected during the
decode loop but ACK'd only AFTER the store work — and the store-outage
fallback early-returns BEFORE that block. So on a store outage the poison
entries were never ACK'd: they redelivered every pass until an outage-free
run, re-logging the same malformed entries and wasting reclaim work, with
no upside (they can never be stored).

A malformed entry is unstorable BY DEFINITION, so dropping it is correct
regardless of store health. Move the poison-ACK to immediately after the
decode loop, before the store work, so every return path — including the
outage early-return — has already dropped it. The good rows batched
alongside still obey store-before-ACK and stay pending on an outage.

Sentinel compares in this file already use errors.Is (redis.Nil); the
BUSYGROUP check is a plain reply string, not a wrapped sentinel, so its
string compare is correct and unchanged.

Test TestDrain_PoisonAckedDuringStoreOutage: a malformed entry + a good
entry, store DOWN — asserts the poison is ACK'd (pending drops to 1, not
2) while the good entry stays pending and lands after recovery.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…flow

rateWindowSQL returned rollups_written, events_rated, and the two anomaly
counts as COUNT(*)::int / SUM(event_count)::int — 32-bit. Over an arbitrary
backfill window these counts can exceed 2^31 (a single re-rate of a long
history, or a large unpriced/unattributable spill), which a 32-bit cast
silently overflows into a negative/garbage value and defeats the
rated-count reconciliation and the fail-loud anomaly thresholds.

Widen the four COUNT/SUM casts to ::bigint and the Go side to int64
(RateResult, rater.Result, and Anomalies, kept consistent so they compose
without lossy conversions). TotalCost is already NUMERIC text. Trivially
correct, low blast radius; verified against live Postgres.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CountAnomalies (the Store interface method, the PostgresStore impl, the
countAnomaliesSQL constant, and its tests) had NO non-test caller. The
rater's fail-loud contract gets its anomaly counts from RateWindow's
SINGLE snapshot (UnpricedEvents / UnattributableEvents on RateResult) —
precisely because a separate counting statement could see a different
snapshot than the upsert and disagree with what the rollups excluded. The
doc itself flagged it as "ad-hoc / ops use" that "the rater's fail-loud
contract does NOT use." That hypothetical ops query never acquired a
caller, so it is unexercised surface that can silently rot out of sync with
the resolution CTE.

Delete it (interface + impl + SQL + the two SQL-shape/scan tests + the
oracle's CountAnomalies method and the now-moot countCalls guard). The
small Anomalies value type stays — it is the resolve-pass result the test
oracle mirrors, and RateResult carries the same two counts. Resolution CTE
and integration-test docs updated to drop the countAnomaliesSQL reference.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TestIOLog_NonBlockingSink wired a sink that never blocked, so it asserted
nothing about non-blocking — a test name that lies is a defect. The proxy
calls Sink.Log SYNCHRONOUSLY from onDone (which can fire inside the final
body Read, before the last bytes are copied), so a genuinely-blocking sink
DOES stall the response — the proxy cannot honestly promise "the client
returns without waiting." The non-blocking property lives in the Sink
implementation and is already pinned by
iolog.TestPostgresSink_LogIsNonBlocking (buffered-channel send, drop on
full).

Rename to TestIOLog_SinkReceivesRecordViaOnDone and assert what the
proxy-level test CAN verify: an opted-in request hands exactly one record
to the sink from onDone, response forwarded verbatim. The doc points to
where the real non-blocking guarantee is tested.

(The companion overclaiming doc — captureRequestBody's "bounded by inbound
request size" — was corrected in the request-body cap fix, where the bound
became real.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fast-follow's errors.Is switch ALSO widened isClientAbort to match
context.DeadlineExceeded — a regression the battery caught. In this proxy a
DeadlineExceeded is ALWAYS an upstream/transport fault (no client-side request
deadline exists; it's what a dial timeout / response-header timeout returns), so
treating it as a client abort silently downgraded an upstream outage to a debug
log, dropped the client's 502, and emitted a spurious zero-token Aborted billing
row attributing the outage to the tenant. Revert to context.Canceled-only (the
known-correct base behavior).

Tests flipped to pin BOTH directions (they previously asserted the regression as
correct — a bug-pinning test): TestIsClientAbort now requires DeadlineExceeded
(bare + wrapped) to be NOT-an-abort; TestErrorHandlerUpstreamFaultStill502 now
also drives a wrapped DeadlineExceeded and asserts 502 + zero emitted events.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Phoebe: complete token-metering + billing system (Rating v2)
Phoebe fast-follow: abort observability, iolog body cap, battery mechanical fixes
The complete token-rating system (PRs #14#21, batteried to production-dry).
phoebe's hourly rater turns billing_event token counts into rated_usage money:

- Prices are a YAML config file (operator-authored, enterprise/air-gapped
  friendly), not a DB table — base per-token rates, a global fine-tune premium
  policy, and per-GPU floor rates. No effective-dating: the rater reads the
  current file, rates the last hour, and stores the APPLIED rate + cost on the
  row (self-auditing).
- Money is NUMERIC, computed in SQL. quantize-then-multiply (the rate on the row
  reconstructs the cost). Fail-loud on any unpriced/unattributable event.
- Model identity: base = HF id; fine-tune = ft:<checkpoint_artifact_id> with the
  base carried as a separate field on the event (X-Saturn-Base-Model) and priced
  base × premium. One hop only, enforced at load. ft: ids globally unique.
- Reconcile-on-re-rate: a re-run deletes superseded rollups (latest run is what
  bills); routine reconcile-delete is loud (ERROR + exit nonzero), explicit
  backfill is quiet. Single-flight rater (Atlas CronJob concurrencyPolicy:Forbid).
- iolog request body capped before to_tsvector + truncation logging.

Reviewed via the machine-review battery to production-dry (zero production-code
findings across the final rounds); the remaining iteration was test-quality only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… (E2)

The rater dropped resource_id; billing needs it to identify the customer.
rated_usage now keys on (auth_id, resource_id, model_id, window_start) so a
rollup maps to the deployment → org that gets charged (E2). resource_id was
already on billing_event (metering → drain) — the rater just discards it.

- Grain widens by resource_id (model_id stays: prices resolve per-model, one
  deployment can serve several). Two deployments of the same model by the same
  auth in one hour now bill as two rows — correct (different orgs).
- Fail-closed attribution: a NULL resource_id is counted UNATTRIBUTABLE (exits
  nonzero), never $0-billed or billed to a NULL org. The partition invariant
  rated + unpriced + unattributable + ambiguous == total holds.
- Migration edited in place (no prod data, unapplied). Speculative
  (resource_id, window_start) index deferred to the PR that adds the E2 reader.

PRs #22#23, batteried to production-clean (3 rounds zero prod findings).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hhuuggoo hhuuggoo merged commit 19e38b9 into release-2026.02.01 Jun 17, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant