feat: sozu top btop-style operator TUI + cardinality-lease verb #1256
Merged
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…p, getrandom
Close out four remaining lease-primitive findings from the PR #1256 review pass.
2A: Loud TTL reject in lease_apply (was: silent clamp). The aggregator no longer transparently caps `ttl > LEASE_TTL_MAX` to LEASE_TTL_MAX. It now returns `LeaseApplyOutcome::TtlOutOfRange`, which the worker dispatch arm surfaces as `WorkerResponse::error`. The dispatch site itself already rejects out-of-range TTLs before reaching the aggregator; the new error arm catches any caller that bypasses dispatch (proto fuzzing, future internal use) instead of silently capping their intent.
2B: Master-side TTL pre-validate before fan-out. `worker_request` checks the operator-supplied `ttl_seconds` against `LEASE_TTL_MAX` BEFORE scattering to workers. A malicious or buggy `SetMetricDetail{ttl_seconds: u32::MAX}` no longer reaches the worker fan-out, eliminating the N×rejected-fan-out + N×audit-line amplifier.
2E: Lease HashMap cap + client_id length cap. New `LEASE_TABLE_CAP = 64` and `LEASE_CLIENT_ID_MAX_BYTES = 64` constants. `lease_apply` returns `LeaseApplyOutcome::TableFull` for fresh inserts that would overflow the table (renewals of existing entries still succeed, so an active operator never loses their lease just because the table is full) and `LeaseApplyOutcome::ClientIdTooLong` for oversized lease keys. Closes the CWE-770 vector where a same-UID attacker could roll `client_id` faster than expiry to grow the map unbounded toward worker OOM. The master applies the same length cap before fan-out via the new `sozu_lib::metrics::LEASE_CLIENT_ID_MAX_BYTES` export.
2D: Direct `libc::getrandom` on Linux for the lease-id random suffix. Replaces the `/dev/urandom` File::open path on Linux with the `getrandom(2)` syscall + `GRND_NONBLOCK` flag (no fs dependency, no chroot/sandbox failure path). Non-Linux Unix targets keep the `/dev/urandom` read because `getrandom`'s ABI varies (FreeBSD vs OpenBSD's `getentropy(2)` vs macOS's `SecRandomCopyBytes`). Switched from `u32::from_ne_bytes` to `u32::from_le_bytes` for cross-arch reproducibility of the rendered hex. Last-resort fallback to `subsec_nanos()` remains for total-entropy-failure environments and the fallback is now silent at the data layer (the `app.status` surface covers operator visibility in a later commit).
Four new tests cover ClientIdTooLong, TableFull (including renewal-succeeds-when-full), TtlOutOfRange, and the LEASE_TTL_MAX boundary.
Build/clippy clean; 29/29 metric tests pass (4 new).
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
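The outcomes described in 2A and 2E can be illustrated with a small sketch. The constant names and the three reject variants come from the commit message; the `LeaseTable` type, the bare `Applied` variant, the omission of the lease level, and the order of the checks are assumptions made for illustration, not the code in `sozu_lib::metrics`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// 300 s is the LEASE_TTL_MAX quoted in the lease commit; the two caps are
// the values quoted in this commit message.
const LEASE_TTL_MAX: Duration = Duration::from_secs(300);
const LEASE_TABLE_CAP: usize = 64;
const LEASE_CLIENT_ID_MAX_BYTES: usize = 64;

#[derive(Debug, PartialEq, Eq)]
enum LeaseApplyOutcome {
    Applied, // hypothetical: the real success variant may carry more data
    TtlOutOfRange,
    TableFull,
    ClientIdTooLong,
}

// Hypothetical stand-in for the aggregator's lease table (level omitted).
#[derive(Default)]
struct LeaseTable {
    leases: HashMap<String, Instant>,
}

impl LeaseTable {
    fn lease_apply(&mut self, client_id: &str, ttl: Duration, now: Instant) -> LeaseApplyOutcome {
        // Oversized keys are rejected before they can occupy memory.
        if client_id.len() > LEASE_CLIENT_ID_MAX_BYTES {
            return LeaseApplyOutcome::ClientIdTooLong;
        }
        // Reject loudly instead of clamping to LEASE_TTL_MAX.
        if ttl > LEASE_TTL_MAX {
            return LeaseApplyOutcome::TtlOutOfRange;
        }
        // Renewals of an existing entry are allowed even when the table is full.
        let is_renewal = self.leases.contains_key(client_id);
        if !is_renewal && self.leases.len() >= LEASE_TABLE_CAP {
            return LeaseApplyOutcome::TableFull;
        }
        self.leases.insert(client_id.to_owned(), now + ttl);
        LeaseApplyOutcome::Applied
    }
}
```

In this sketch a TTL exactly equal to `LEASE_TTL_MAX` is accepted, which matches the `ttl > LEASE_TTL_MAX` wording above.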
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Three operator-facing TUI hardening items from the PR #1256 review.
2C: Skin loader fails closed when the parent anchor can't be resolved. Previously, when `skins_anchor()` returned `None` (race-delete of the parent, weird `/proc` paths, unusual fs mounts) the confinement check was silently skipped and `from_toml` ran on the bare resolved path, defeating the symlink-escape defence the canonicalize block was added to provide. Now the anchor failure short-circuits to the default skin with a status-bar diagnostic.
2C (companion): Close the skin loader TOCTOU window. The previous shape canonicalized the file, then went `from_toml(&resolved)` which called `std::fs::read_to_string(&Path)`, re-resolving the path through the kernel and re-opening the symlink chain. An attacker with write access to a shared skins dir could swap a symlink in the gap. Replaced with `from_open_file(&Path)` which calls `File::open` once after the parent-anchor check and reads from the `&mut File` handle. The path-based `from_toml` API is removed; tests parse via `toml::from_str` on a raw string.
3E: `ctrlc::set_handler` degrades gracefully on `MultipleHandlers` error. The previous `.expect()` aborted the TUI on programmatic re-entry (any caller that invokes `run` twice in the same process address space hits the second-install path). The crossterm event loop already observes Ctrl-C as a keypress, so the dedicated signal handler is belt-and-braces rather than the primary path; falling through with a status-bar note keeps embedded callers viable.
Status-bar precedence: skin > lease-elevation > signal-handler diagnostic.
(Renewer-thread status surface, review M-7, deferred: the worker status channel needs threading through `DetailGuard` and the render-loop poll. Tracked as a follow-up to keep this commit surgical.)
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
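A minimal sketch of the "check the anchor, open once, read from the handle" shape described in 2C, using only the standard library. The function names `read_confined` and `read_from_open_file` are hypothetical, and the real loader falls back to the default skin with a diagnostic instead of returning an error; this only illustrates why reading through the already-open handle avoids a second path resolution.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Resolve `candidate`, require it to live under `anchor`, then open it once.
/// Any failure to resolve the anchor is an error (fail closed).
fn read_confined(anchor: &Path, candidate: &Path) -> io::Result<String> {
    let anchor = anchor.canonicalize()?;
    let resolved = candidate.canonicalize()?;
    if !resolved.starts_with(&anchor) {
        return Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            "skin path escapes the skins directory",
        ));
    }
    // Single open: every later read goes through this handle, so the path
    // is not resolved again after the confinement check.
    let mut file = File::open(&resolved)?;
    read_from_open_file(&mut file)
}

/// Read the whole file from an already-open handle.
fn read_from_open_file(file: &mut File) -> io::Result<String> {
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents)
}
```

The returned string would then be handed to a string-based parser, in the spirit of the `toml::from_str` call the tests use.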
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…pe, privacy
Three documentation fixes from the PR #1256 review.
3G: CHANGELOG referenced "three synchronous transport threads"; the CERTS-pane collector added a fourth in eec8422 but the entry was not updated. Fixed to "four synchronous transport threads (snapshots, listeners, certs, events)".
2G: CHANGELOG audit-scope claim updated to reflect this PR's actual behaviour: every cardinality transition emits METRIC_DETAIL_CHANGED, including the previously-silent worker-local janitor expiries and post-fan-out apply/clear paths, thanks to the new worker→master audit IPC. Lease ownership binding (SO_PEERCRED peer pid + session ULID) is called out so SecNumCloud-style reviewers see the trust model in plain text. The unsupported_workers field's wiring status is described honestly: the proto field + per-worker version snapshot ship in this release; capability-aware dispatch is tracked as follow-up.
3H + D5: `doc/sozu-top.md` gains an explicit transport layout note (four threads, six unix-socket connections per invocation) and a privacy paragraph: the operator-supplied `--reason` text flows to the audit log and to SubscribeEvents subscribers, so PII / customer IDs should not be embedded. Length and character caps applied server-side are documented.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Two cleanup items from the PR #1256 review:
4A: Remove the empty `bin/api` zero-byte file. The diff added it as a new empty file (`new file mode 100644, index 00000000..e69de29`), clearly a stray `touch bin/api` from local experimentation that got `git add`-ed. No callers in tree.
Lisa L-017: Add a dedicated `cargo audit` CI job that runs on every push + PR. The `tui` Cargo feature pulled in 12 new optional crates (color-eyre, crossterm, ratatui, throbber-widgets-tui, tui-input, tui-big-text, tui-popup, tui-scrollview, tui-tree-widget, ctrlc, crossbeam-channel, toml) plus their transitive closure outside what the default-features build covers. Without an audit gate any RUSTSEC advisory landing against one of those crates can ship in a release unnoticed.
Job: cache-all-crates, prefix-key `ci-audit`, `--deny warnings` so the gate fails fast on any new advisory. Two invocations to keep symmetry with the per-feature build matrix above; both read the same Cargo.lock that already contains the TUI deps.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Two leftover items from PR #1256 review round-2.
3F: Regression guard for the C1 dispatch-whitelist fix. A future refactor that drops `SetMetricDetail` from the no-op match arm in `ConfigState::dispatch` and falls through to the catch-all `UndispatchableRequest` arm would silently re-break TUI cardinality elevation entirely (the original C1 bug). New test exercises the happy path with a fully-populated `SetMetricDetail` request and asserts `dispatch` returns `Ok`.
4B: Replace `LEASE_TTL_DEFAULT.as_secs() as u32` lossy cast in the SetMetricDetail worker arm with `u32::try_from(...).unwrap_or(60)`. The default fits in u32 by construction (60 s); the explicit checked conversion documents the bound and shields against any future tweak that grows LEASE_TTL_DEFAULT past `u32::MAX` seconds (≈ 136 years).
D6: Confirmed via `grep -nE "^## \[" CHANGELOG.md`: a single `## [Unreleased]` section. No edit needed.
Build/clippy clean; 28/28 state tests pass (1 new).
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
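The difference between the lossy cast and the checked conversion in 4B can be shown in a few lines. `LEASE_TTL_DEFAULT` and the fallback of 60 come from the commit message; the helper name `ttl_seconds_checked` is hypothetical.

```rust
use std::convert::TryFrom;
use std::time::Duration;

const LEASE_TTL_DEFAULT: Duration = Duration::from_secs(60);

/// Checked u64 -> u32 conversion with the documented 60 s fallback.
fn ttl_seconds_checked(ttl: Duration) -> u32 {
    u32::try_from(ttl.as_secs()).unwrap_or(60)
}

/// The previous shape: `as` silently wraps values above u32::MAX.
fn ttl_seconds_lossy(ttl: Duration) -> u32 {
    ttl.as_secs() as u32
}
```

For the 60 s default both helpers agree; they only diverge for a duration longer than `u32::MAX` seconds, where the cast wraps around and the checked form falls back to 60.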
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
PR #1256 review M-7: the renewer thread's `eprintln!`-on-error wrote to the wiped alt-screen, so the operator never saw "renewer dropped" diagnostics until the lease silently lapsed and per-backend panes went sparse.
Wire a shared status slot:
- New `cardinality::StatusSlot = Arc<Mutex<Option<String>>>` plus `new_status_slot()` / `take_status(slot)` / private `publish_status` helpers. Poisoned-lock recovery via `into_inner` so a panic in one background thread does not silently strand the next message.
- `DetailGuard::apply` takes a `StatusSlot` parameter. The renewer thread receives a clone and writes through `publish_status` on its two error paths (channel open + renewal send), replacing the prior silent `eprintln!`. The `status` field on `DetailGuard` keeps the slot alive for the renewer's lifetime; the read path lives on the render-loop side (hence the `#[allow(dead_code)]` for the reader-not-on-guard pattern).
- `RenderConfig` gains a `lease_status: StatusSlot` field; the render loop drains it once per tick and overwrites `App::status` so the F-key bar repaints the message on the next frame. New `App::mark_dirty()` triggers the redraw when no snapshot landed.
Same plumbing is ready to receive the four transport collectors' `eprintln!` sites in a follow-up, kept out of this commit to keep the change focused on the renewer (review L-2 is the companion finding).
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
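The status slot is small enough to sketch in full. The type alias and the three helper names are taken from the commit message; the function bodies and exact signatures are assumptions about how such helpers are typically written with `std::sync`.

```rust
use std::sync::{Arc, Mutex};

/// Single-message mailbox shared between a background thread and the render loop.
type StatusSlot = Arc<Mutex<Option<String>>>;

fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Overwrite the slot with the latest diagnostic. A poisoned lock is
/// recovered with `into_inner`, so a panic elsewhere does not strand messages.
fn publish_status(slot: &StatusSlot, message: impl Into<String>) {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(message.into());
}

/// Drain the slot; returns `None` when nothing was published since the last call.
fn take_status(slot: &StatusSlot) -> Option<String> {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.take()
}
```

A render loop would call `take_status` once per tick and copy any message into its status line, while the renewer thread holds a clone of the slot and calls `publish_status` on its error paths.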
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…CKENDS panes (A4)
PR #1256 simplify A4. `panes/clusters.rs` and `panes/backends.rs` each carried an identical 19-LOC `sort_header` helper that produced the styled `ratatui::widgets::Cell` for an active-or-inactive sort column header. Only the column-key enum (`ClusterSortKey` vs `BackendSortKey`) differed.
Replace the two copies with one shared `pub(super) fn sort_header(label, active, reverse, skin) -> Cell<'static>` in `bin/src/ctl/top/panes/mod.rs:27`. Call sites compute the boolean `active = current_key == key` themselves, so the helper stays generic-free.
Insta snapshot tests for both panes pass unchanged: the rendered bytes are byte-for-byte identical (snapshot tests are the byte-equality guard). Pure refactor, no behaviour change.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…ric helper (A1)
PR #1256 simplify A1. `spawn_collector` (snapshots), `spawn_listeners`, and `spawn_certs` shared an identical `loop { match poll { Ok(v) => try_send; Err(e) => eprintln; } sleep_remaining }` shape, ~90 LOC of triplicated control flow.
Extract a single private `poll_loop<T, F>(label, interval, tx, channel, poll)` helper that closes over the channel ownership and the per-thread polling closure. Each spawn site keeps its own `thread::Builder::new().spawn(...)` entry point, its per-thread `Channel` ownership, and the `bounded(1)` publish-or-skip discipline locked by the auditor; only the inner polling skeleton is shared.
Constraints preserved:
- Four `thread::spawn` sites: `spawn_collector`, `spawn_listeners`, `spawn_certs`, `spawn_events`. `spawn_events` has a different drainer shape and stays untouched.
- `bounded(1)` channel capacity on the polling threads.
- Publish-or-skip on `TrySendError::Full`; clean thread exit on `TrySendError::Disconnected`.
- Per-thread `Channel` ownership (no `Mutex<Channel>`).
Net LOC delta in `transport.rs`: -116 lines (442 → 326).
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
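The shared polling skeleton can be sketched as follows. To stay dependency-free this version swaps `crossbeam_channel::bounded(1)` for the standard library's `sync_channel(1)`, whose `try_send` exposes the same `Full` / `Disconnected` distinction, and it drops the `channel` argument of the real helper. The signature is therefore an approximation, not the one in `transport.rs`.

```rust
use std::fmt::Display;
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Poll on a fixed cadence and publish each result into a capacity-1 channel.
/// A full channel means the consumer is behind, so the sample is skipped;
/// a disconnected channel means the consumer is gone, so the thread exits.
fn poll_loop<T, E, F>(label: &str, interval: Duration, tx: SyncSender<T>, mut poll: F)
where
    E: Display,
    F: FnMut() -> Result<T, E>,
{
    loop {
        let started = Instant::now();
        match poll() {
            Ok(value) => match tx.try_send(value) {
                Ok(()) => {}
                Err(TrySendError::Full(_)) => {} // publish-or-skip
                Err(TrySendError::Disconnected(_)) => return, // clean exit
            },
            Err(e) => eprintln!("{}: poll failed: {}", label, e),
        }
        // Sleep only for what is left of the interval after the poll itself.
        if let Some(remaining) = interval.checked_sub(started.elapsed()) {
            thread::sleep(remaining);
        }
    }
}
```

Each spawn site would wrap a call to this helper in its own thread and pass a closure that owns that thread's connection, which keeps per-thread ownership without a `Mutex`.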
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…spatcher (A2 + A9)
Two PR #1256 Tier-A simplifications.
A2: PulseTracker.tick / tick_one used a two-pass `Vec<K>::collect()` + `for k in to_drop { map.remove(&k) }` loop to drop zero-aged entries. Same shape in two places. Replaced both with in-place `HashMap::retain(|_, v| { if *v == 0 { false } else { *v -= 1; true } })`: same semantics, no `String` key clones for the dropped set, no intermediate `Vec` allocation per render frame.
A9: `apply_palette` carried eight near-identical `match` arms mapping a string alias (`"overview"` / `"o"`, `"cluster"` / `"clusters"` / `"c"`, …) to a tab. Adding a new tab required patching `ActiveTab` AND remembering to also patch `apply_palette`. Centralised the alias table in a new `ActiveTab::from_alias(s) -> Option<Self>` resolver and let `apply_palette` fall back to a small fixed-cmd match (`help` / `quit` / empty / other) only when the alias resolver returns `None`. Future tab additions touch `ActiveTab` only.
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
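The A2 change is the standard single-pass `HashMap::retain` idiom. The sketch below uses a free function and a `u8` age for illustration; the real `PulseTracker` field types are not shown in the commit message and are assumed here.

```rust
use std::collections::HashMap;

/// Age every entry by one frame; entries that were already at zero are dropped.
/// One pass, no intermediate Vec of keys, no key clones.
fn tick(pulses: &mut HashMap<String, u8>) {
    pulses.retain(|_, age| {
        if *age == 0 {
            false
        } else {
            *age -= 1;
            true
        }
    });
}
```

An entry inserted with age 2 therefore survives two calls (2 → 1 → 0) and is removed by the third.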
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
… pane (A3)
PR #1256 simplify A3. `bin/src/ctl/top/app.rs` carries `count_value` / `gauge_value` helpers that decode a `FilteredMetrics -> Option<{i64,u64}>`; `bin/src/ctl/top/panes/h2.rs` re-implemented the same two functions (different names: `count` / `gauge`) verbatim: same body, divergent identifiers.
Promote the app-side helpers to `pub(super)` and import them into the H2 pane under their pane-local names (`use … as count, … as gauge`). Removes ~15 LOC of duplicated body and the `filtered_metrics::Inner` / `FilteredMetrics` imports the H2 pane no longer needs directly. Future renames or signature changes touch one site instead of two.
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
…lStatus reply (2H)
Closes the PR #1256 follow-up where the `proto_version` field on `WorkerInfo` was wired (commit `b0ad5728`) but no dispatcher consumed it: `MetricDetailStatus.unsupported_workers[]` stayed permanently empty regardless of the master/worker version skew.
New dedicated dispatch path `set_metric_detail_request` plus a `SetMetricDetailTask` GatheringTask in `bin/src/command/requests.rs`:
- Snapshots master-side `(configured, effective)` before the request fans out so the synthesised reply carries `previous_effective`.
- Mirrors `worker_request`'s peer-binding population (peer_pid + session_ulid from `ClientSession`) and length / TTL pre-validation.
- Walks `server.workers`, partitioning by `proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL`. Capable workers are scattered to one-by-one via `scatter_on(Some(worker_id))`. Unsupported workers (typically inherited-after-`UpgradeMain` from a pre-tag-55 binary) skip the fan-out entirely and land in the response's `unsupported_workers[]` field.
- Emits attempt-time + completion-time audit rows in the same shape the generic WorkerTask flow used, threading the operator-supplied `client_id` (lease_id column) + `reason` (metric_detail_reason column) through `MetricDetailAuditFields::into_extras`.
- Synthesises a full `MetricDetailStatus` for the client: `configured`/`effective`/`previous_effective` from the master's Aggregator + `workers: BTreeMap<worker_id, WorkerMetricDetailStatus>` populated for every ACK'd worker + `unsupported_workers` from the pre-filter. Returns via `client.finish_ok_with_content`.
Per-worker `(configured, effective, previous_effective, active_lease_count)` quartets currently mirror the master's view because the worker arm in `lib/src/server.rs` replies with `WorkerResponse::ok` (no content payload). A follow-up plumbing the actual per-worker view through a new `ResponseContent` variant is documented in the SetMetricDetailTask `on_finish` body; the wire schema is populated now, with master-view stand-ins, so consumers (TUI status bar) see a non-empty `workers` map even before that plumbing lands.
`MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL = 1` constant declared local to the dispatcher so future per-verb gating doesn't get coupled to the global SOZU_PROTO_VERSION monotonic bump.
Build/clippy clean; 1075/1075 workspace tests pass (12 suites, ~6 min).
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
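The capability pre-filter amounts to a partition over the worker list. In the sketch below the constant name and its value come from the commit message, while the trimmed `WorkerInfo` struct, its field types, and the function name `partition_by_capability` are hypothetical.

```rust
const MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL: u32 = 1;

// Hypothetical, trimmed-down view of a worker entry; field types are assumed.
struct WorkerInfo {
    id: u32,
    proto_version: u32,
}

/// Split worker ids into (capable, unsupported). Capable workers receive the
/// fan-out; unsupported ones are only reported back to the client.
fn partition_by_capability(workers: &[WorkerInfo]) -> (Vec<u32>, Vec<u32>) {
    let mut capable = Vec::new();
    let mut unsupported = Vec::new();
    for worker in workers {
        if worker.proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL {
            capable.push(worker.id);
        } else {
            unsupported.push(worker.id);
        }
    }
    (capable, unsupported)
}
```

Keeping the threshold in a verb-local constant, as the commit does, means a later verb can pick its own minimum without touching this one.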
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Six small simplifications from PR #1256 review round-2's residual Tier-A / Tier-B list. Pure stylistic; no behaviour change.
A6: Hoist the HTTP-5xx error-status set out of the two inline `[S500, S502, S503, S504, S507]` iterators (cluster_rows + fold_overview) into a module-level `ERRORS_5XX: [&str; 5]` constant. Adding a new 5xx variant now touches one place. The dynamic sort comparator is left as-is: the heterogeneous `BackendSortKey` variant types (u64 / String) do not compose into a single tuple key without adding wrapper enums that would be heavier than the gain.
A8: Replace the five `(Self::default_dark(), Some(...))` fallback tuples in `Skin::resolve` with a local `let default_with = |msg| (Self::default_dark(), Some(msg));` closure. The fail-closed policy from commit `5b098d9b` and the `from_open_file` TOCTOU mitigation stay verbatim; diagnostic strings are byte-identical.
A7: Audited. The current `App::new` is already tight at 25 lines; adding `#[derive(Default)]` would require introducing arbitrary "first variant is default" choices on four enums (ActiveTab, ClusterSortKey, BackendSortKey, GlyphMode), which just relocates the explicit-default question rather than removing it. Net win is negligible; left as-is.
L-8: Clear `palette_input` on the unknown-command path in `App::apply_palette`. The success path already resets the input; the typo path used to leave the operator's previous text in place so the next `:` keypress re-opened the palette pre-populated with the bad command. Now both paths exit with a fresh input.
B1: Convert `.clone()` on `&String → String` sites to `.to_owned()` across the panes layer (listeners.rs, certs.rs). User preference per CLAUDE.md ("prefer ToOwned::to_owned() over Clone::clone() when going &str → String for ownership-intent clarity"). Same allocation behaviour; clearer intent.
B5 / B9: Collapse `format!(...) + &format!(" · {trend} 60 s")` in `panes/overview.rs::subtitle_for_rps` into a single `format!` call. Saves one allocation per subtitle render. Events pane: `Option<String>::clone().unwrap_or_default()` → `.as_deref().unwrap_or("").to_owned()`, which skips the `String::new` allocation on the `None` branch (the kernel emits backend events without `cluster_id` populated for proxy-process-level transitions).
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Final piece of the PR #1256 follow-through.
Previously the capability-aware dispatcher in `bin/src/command/requests.rs` (`SetMetricDetailTask::on_finish`) synthesised `MetricDetailStatus.workers[<worker_id>]` using the master's aggregator view as a stand-in for each worker because workers replied with `WorkerResponse::ok(message.id)` carrying no payload. Each worker holds an independent `Aggregator` with its own lease table, so that stand-in obscured real per-worker drift (different configured floors, different active lease counts after a partial fan-out).
Wire it properly end-to-end:
- New `ResponseContent::WorkerMetricDetailStatus` oneof variant (tag 17, proto additive). Carries the worker's own `(configured, effective, previous_effective, active_lease_count)` quartet, semantically distinct from the aggregated `MetricDetailStatus` at tag 16.
- New `lib/src/server.rs::worker_metric_detail_status_content` helper that builds the response payload from a `(configured, effective, previous_effective, lease_count)` snapshot captured BEFORE the `METRICS.borrow_mut` scope ends (so the per-request snapshot is consistent with the transition that just happened).
- The three ok-paths in the worker's SetMetricDetail arm (clear-Cleared, clear-NotFound, apply-Applied) now reply via `WorkerResponse::ok_with_content` with the freshly-built payload instead of the payload-less `ok`. The `clear-NotFound` path reports `previous_effective == effective` (no transition).
- Master-side `SetMetricDetailTask::on_finish` collects the per-worker payload from `response.content` and only falls back to skipping the worker entry when the response has no payload (e.g. an older worker that never went through `ok_with_content`). Removes the master-view stand-in noted as a follow-up in commit `70cd24af` (`set_metric_detail_request`).
- `command/src/proto/display.rs` adds a silent OK match arm for the new variant: the per-worker payload flows master-side and is never printed directly on the operator's terminal.
Build/clippy clean; 1075/1075 workspace tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request on May 11, 2026
Strip the in-tree comment cross-references to review tags (`PR #1256 review M-7`, `simplify A3`, `C1 fix`), round labels (`round-4 follow-through`), and short-SHA references to other commits on this same branch. The technical rationale stays in place, inlined where the cross-reference used to be, so the committed text reads self-contained for any contributor without access to the local review pipeline.
Affected sites:
- `bin/src/command/requests.rs::handle_request` (SetMetricDetail match arm)
- `bin/src/command/server.rs::handle_worker_response` (METRIC_DETAIL_CHANGED branch)
- `bin/src/ctl/top/cardinality.rs` (DetailGuard.status, short_random_suffix)
- `bin/src/ctl/top/panes/h2.rs` (gauge/count helper import note)
- `bin/src/ctl/top/render.rs` (RenderConfig.lease_status)
- `bin/src/ctl/top/theme.rs::Skin::resolve` (default_with closure doc)
- `command/src/state.rs::dispatch_passes_through_set_metric_detail`
- `lib/src/server.rs::worker_metric_detail_status_content`
No behavioural change. Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Wires ratatui 0.30, crossterm 0.29, crossbeam-channel, ctrlc, color-eyre, and the tui-* polish crates (tui-input, tui-popup, tui-big-text, tui-tree-widget, tui-scrollview, throbber-widgets-tui) as workspace deps, all marked `optional = true` on `bin/`. The `tui` feature on `bin/` activates them; default builds (`jemallocator`, `crypto-ring`) keep the production binary lean.
Deliberately omits `tokio` and `tokio-util`: the `sozu top` TUI runs on two synchronous transport threads + the UI thread, matching the existing `bin/` style and avoiding an async runtime in v1. Crossterm features pin to `events`/`bracketed-paste`; ratatui to `crossterm`/`macros`/`underline-color`. `tui-logger` is intentionally excluded: the EVENTS pane (week 3) will surface proxy events directly.
Adds `insta` to `[dev-dependencies]` for upcoming snapshot tests of the TUI panes (week 4); strictly dev-only, never production.
Verified: `cargo check -p sozu --no-default-features --features crypto-ring` and `cargo check -p sozu --features tui` both pass; `cargo tree -p sozu --features tui | grep -c '^tokio'` returns 0.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`sozu top` (and any future TUI client) needs to elevate the metrics cardinality knob (`MetricDetail`) from the configured `Cluster` floor up to `Backend` for the duration of an interactive session. A blind global set + restore-on-Drop fails on crash and doesn't compose with multiple concurrent clients.
This commit adds a TTL-leased model on `Aggregator`:
- `configured` is the static floor from `MetricsConfig.detail`.
- `leases: HashMap<client_id, (level, expires_at)>` tracks active leases.
- `effective = max(configured, max(active leases))` is recomputed off the metric-emission hot path (only on apply/clear/expire).
- `lease_apply(client_id, level, ttl)` registers or renews a lease with the TTL clamped at `LEASE_TTL_MAX = 300s` to bound the worst-case effect of a stuck renewer; returns `(previous, new)` so callers can decide whether to emit a `MetricDetailChanged` audit event.
- `lease_clear(&client_id)` releases by id; mismatched ids are silent no-ops so other clients' leases are unaffected.
- `lease_tick(now)` is a polled expiry janitor: cheap when nothing has expired, returns `Some(previous)` when expiry actually moved the effective level. `lease_tick_due(now)` gates the call so the worker only walks the lease table every 5 s.
Crash safety falls out of the design: a dead `sozu top` cannot permanently elevate cardinality because its lease self-expires after `ttl_seconds`.
`Aggregator::receive_metric` now reads `effective` (single field load on the hot path) instead of `detail`.
12 unit tests cover monotonic apply, the configured floor (a lower lease cannot push effective below the floor), renewal-replaces-not-duplicates, max-merge across multiple clients, silent no-op on unknown id, deterministic expiry via `lease_tick(now)`, and the TTL clamp. All 22 metrics-module tests pass; existing `set_up_detail` semantics preserved.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
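The lease bookkeeping described above can be sketched as a small state machine. Method names, the `(previous, new)` return shape, and the 300 s clamp follow the commit message; the `MetricDetail` variant order (taken from the `--detail` flag's listing), the explicit `now` parameter on `lease_apply`, and the private `recompute` helper are assumptions, and `lease_tick_due` is omitted.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const LEASE_TTL_MAX: Duration = Duration::from_secs(300);

// Variant order is assumed to encode increasing cardinality.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum MetricDetail {
    Process,
    Frontend,
    Cluster,
    Backend,
}

struct Aggregator {
    configured: MetricDetail,
    effective: MetricDetail,
    leases: HashMap<String, (MetricDetail, Instant)>,
}

impl Aggregator {
    fn new(configured: MetricDetail) -> Self {
        Aggregator { configured, effective: configured, leases: HashMap::new() }
    }

    /// effective = max(configured, max(active leases)); returns (previous, new).
    fn recompute(&mut self) -> (MetricDetail, MetricDetail) {
        let previous = self.effective;
        let floor = self.configured;
        let leased = self.leases.values().map(|(level, _)| *level).max();
        self.effective = leased.map_or(floor, |level| level.max(floor));
        (previous, self.effective)
    }

    /// Register or renew a lease; the TTL is clamped, as in this first commit.
    fn lease_apply(&mut self, client_id: &str, level: MetricDetail, ttl: Duration, now: Instant) -> (MetricDetail, MetricDetail) {
        let ttl = ttl.min(LEASE_TTL_MAX);
        self.leases.insert(client_id.to_owned(), (level, now + ttl));
        self.recompute()
    }

    /// Release by id; an unknown id leaves every other lease untouched.
    fn lease_clear(&mut self, client_id: &str) -> (MetricDetail, MetricDetail) {
        self.leases.remove(client_id);
        self.recompute()
    }

    /// Expiry janitor: `Some(previous)` only when expiry moved the effective level.
    fn lease_tick(&mut self, now: Instant) -> Option<MetricDetail> {
        self.leases.retain(|_, (_, expires_at)| *expires_at > now);
        let (previous, new) = self.recompute();
        if previous != new { Some(previous) } else { None }
    }
}
```

Passing `now` explicitly keeps expiry deterministic in tests, which is the property the "deterministic expiry via `lease_tick(now)`" test relies on.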
Adds a runtime cardinality lease verb so `sozu top` can elevate the
metrics drain to `MetricDetailLevel::Backend` for the duration of an
interactive session. The lease design (TTL-bounded, `client_id`-keyed,
self-expiring) is crash-safe and composes with multiple concurrent
clients — see `Aggregator` lease bookkeeping in the previous commit.
Proto (additive, backwards-compat):
- `Request.request_type::SetMetricDetail = 55` carries
`{ client_id, detail?, ttl_seconds?, clear?, reason? }`.
- `ResponseContent::metric_detail_status = 16` returns
`MetricDetailStatus { configured, effective, previous_effective,
workers: map<id, WorkerMetricDetailStatus>, unsupported_workers[] }`
for mixed-version-fleet safety.
- `EventKind::METRIC_DETAIL_CHANGED = 30` on the `SubscribeEvents`
audit stream; distinct from `METRICS_CONFIGURED` (Enabled/Disabled/Clear)
since the cause is different.
- `command/build.rs` re-attaches `Hash, Eq` for `MetricDetailStatus`
(the embedded `map<string, WorkerMetricDetailStatus>` strips the
prost auto-derive, which propagates to `ResponseContent.content_type`
and `Request.request_type`).
- `command/src/proto/display.rs` adds arms for
`RequestType::SetMetricDetail`, `ContentType::MetricDetailStatus` (with a
prettytable renderer that lists per-worker
configured/effective/previous_effective + unsupported workers), and
`EventKind::MetricDetailChanged`.
- `command/src/request.rs` routes `SetMetricDetail` through the
worker-level dispatch group (mirrors `ConfigureMetrics`).
Master + worker plumbing:
- `bin/src/command/requests.rs::is_mutating_verb` learns the new verb
so the master brackets it with `RELOADING=1`/`READY=1` systemd hints.
- The dispatch match routes through the existing `worker_request`
fan-out path (same shape as `ConfigureMetrics`); per-worker
`MetricDetailStatus` aggregation lands in week 2 when the TUI
starts consuming it.
- `lib/src/server.rs::notify` adds two hooks: a polled lease-expiry
janitor at the top of every dispatch (gated by `lease_tick_due` so it
only walks the lease table every 5 s), and the `SetMetricDetail` arm
itself. The arm clamps the TTL via the `Aggregator` setter, decodes
the `MetricDetail` enum defensively, and acks with
`WorkerResponse::ok()` for now (week 2 will return the per-worker
`WorkerMetricDetailStatus` payload).
`MetricDetailChanged` audit emission is left as a `TODO(sozu-top week
2)` at both `lease_apply`/`lease_clear` call sites and the janitor —
plumbed through `bin/src/command/requests.rs::audit_emit_inline` once
the master collects per-worker effective levels back.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Wires the `sozu top` clap subcommand behind `#[cfg(feature = "tui")]`.
Default `sozu` builds keep the subcommand hidden — `sozu --help` does
not list it without `--features tui`.
Clap surface (10 flags):
- `--refresh-ms` (default 1000): data poll interval; render is capped
at 30 fps independently.
- `--no-color`, `--no-mouse`: disable ANSI colour / mouse capture
(auto-detected by default; honours `NO_COLOR` and `TERM=dumb`).
- `--skin <name>`: looks up `$XDG_CONFIG_HOME/sozu/skins/<name>.toml`,
with `SOZU_TOP_SKIN` env override (k9s parity).
- `--detail` (Process|Frontend|Cluster|Backend): override the lease
level; default `Backend` (auto-elevate, lease self-expires
server-side after `--lease-ttl-seconds`, default 60s).
- `--snapshot N`, `--tick-once`: test affordances that drive a fixed
number of frames or one tick and exit (no terminal control).
- `--log-file`: ship internal TUI logs to a file (avoids stomping the
rendered screen).
- `--glyphs` (Braille|Block|Tty): force a glyph mode; auto-detect by
default.
`bin/src/ctl/top/mod.rs` is a placeholder that prints the resolved
argument bag and exits cleanly — the render loop, transport threads,
DetailGuard lease lifecycle, and pane implementations land in
subsequent steps. `TopArgs` mirrors the clap variant so reviewers can
confirm the wiring before the renderer is built.
`bin/build.rs` adds `("TUI", "tui")` to the feature-flag table so
`sozu --version` reports `+tui` (with feature) or `-tui` (without). The
flag materially changes the binary size and dep graph, so operators
need to spot it on the banner before deploying.
Verified: `sozu top --help` lists every flag; default `cargo build`
hides the subcommand; `cargo build -p sozu --features tui` builds
clean; `cargo clippy --all-targets --features tui` is clean;
`cargo +nightly fmt --check` is clean;
`cargo test --workspace --features tui` passes 1024 unit/integration tests + 7 doctests.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Lays in three skeleton modules used by the upcoming render loop and panes.
`theme.rs` — single hard-coded `Skin` with Okabe-Ito categorical palette
(colour-blind safe in isolation; pairs distinguishable across the three
common dichromatic types) plus a Viridis-shaped sparkline gradient
(cool/warm/hot tiers via `Skin::spark_color`). `GlyphMode` enum with
three sets of sparkline ramp characters (Braille / Block / TTY-ASCII)
and three status glyphs (`▲ ▼ ●`) so the colour-blind cue is always
backed up by a shape. The `--skin` lookup, `SOZU_TOP_SKIN` env override,
TOML loader, and `LANG`/`TERM` capability cascade land in week 3.
`transport.rs` — synchronous dual-channel design (no tokio in v1, per
the Codex cross-check in `tasks/todo.md`). Two threads on two separate
unix-socket `Channel` connections to the master:
- `spawn_collector` polls `RequestType::QueryMetrics` on the configurable
`--refresh-ms` ticker; pushes each `AggregatedMetrics` into a
`crossbeam_channel::bounded::<Snapshot>(1)` with newest-wins overwrite,
so a slow UI tick never queues a fan-out pile-up.
- `spawn_events` opens `RequestType::SubscribeEvents` once and forwards
every inbound `Event` into a `crossbeam_channel::bounded::<TopEvent>(64)`.
The unix `Channel<W,R>` carries no message-id correlation, so multiplexing
the streamed events with the discrete metrics round-trip on one socket is
unsafe — hence the second connection. Both threads exit cleanly when the
UI drops the receivers; transient socket errors log via `eprintln!` and
shut the thread down rather than crashing the UI.
`cardinality.rs` — `DetailGuard` RAII handle. On `apply` it sends an
initial `SetMetricDetail{ client_id, detail = Backend, ttl_seconds = 60 }`
on a dedicated channel and spawns a renewer thread that re-sends every
`ttl/2` seconds. On `Drop` the guard sends a best-effort
`SetMetricDetail{ client_id, clear: true }` so the cardinality drops back
to its configured floor on clean exit. Crash safety: even if `Drop` never
runs (panic, kill -9), the lease self-expires server-side after
`ttl_seconds`, so a dead `sozu top` cannot permanently elevate cardinality.
The `client_id` shape `top:<pid>:<8 hex>` keeps multiple concurrent TUIs
isolated without pulling `rand`.
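The guard lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in `cardinality.rs`: a `std::sync::mpsc::Sender` stands in for the dedicated unix-socket channel, `LeaseMsg` is a hypothetical message type, and the renewal period is passed in as `renew_every` (the commit uses `ttl/2`) so the sketch can be exercised quickly.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the `SetMetricDetail` wire message.
#[derive(Debug, Clone, PartialEq)]
pub enum LeaseMsg {
    Apply { client_id: String, ttl_seconds: u32 },
    Clear { client_id: String },
}

/// RAII handle: applies the lease, renews it in the background, clears it on drop.
pub struct DetailGuard {
    client_id: String,
    tx: Sender<LeaseMsg>,
    stop: Arc<AtomicBool>,
    renewer: Option<JoinHandle<()>>,
}

impl DetailGuard {
    pub fn apply(
        tx: Sender<LeaseMsg>,
        client_id: String,
        ttl_seconds: u32,
        renew_every: Duration,
    ) -> Self {
        let apply = LeaseMsg::Apply { client_id: client_id.clone(), ttl_seconds };
        // Initial apply; a send failure is ignored in this sketch.
        let _ = tx.send(apply.clone());

        let stop = Arc::new(AtomicBool::new(false));
        let renewer = {
            let (tx, stop) = (tx.clone(), Arc::clone(&stop));
            thread::spawn(move || loop {
                // A spurious wake-up only causes an early (harmless) renewal.
                thread::park_timeout(renew_every);
                if stop.load(Ordering::SeqCst) || tx.send(apply.clone()).is_err() {
                    break;
                }
            })
        };
        Self { client_id, tx, stop, renewer: Some(renewer) }
    }
}

impl Drop for DetailGuard {
    fn drop(&mut self) {
        // Stop the renewer first so the clear is the last message sent.
        self.stop.store(true, Ordering::SeqCst);
        if let Some(handle) = self.renewer.take() {
            handle.thread().unpark();
            let _ = handle.join();
        }
        // Best-effort clear; the server-side TTL covers the case where this never runs.
        let _ = self.tx.send(LeaseMsg::Clear { client_id: self.client_id.clone() });
    }
}
```

Joining the renewer before sending the clear keeps the ordering deterministic: no renewal can race past the clear and re-elevate the lease.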
All three modules are wired into `bin/src/ctl/top/mod.rs` but not yet
consumed by the placeholder `run_top` — the renderer + app state land in
the next two commits, at which point the `dead_code` warnings on the
public APIs clear naturally. Clippy is clean.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`App` is the pure-data root of the UI: tabs, the OVERVIEW state, recent events, threshold table, last-snapshot anchor, status text, and a `RateCalculator`. The render loop snapshots it for each frame; the transport threads push into it via `App::ingest_snapshot` and `App::ingest_event`. No I/O, no rendering — the split keeps the data fold testable without spinning up a terminal. Three primitives: - `SparkRing` — fixed-capacity `VecDeque<u64>` (default 60 samples = one minute at the 1 s data tick). Newest at the back; oldest drops off the front when capacity is hit. Matches the proto `FilteredTimeSerie.last_minute` cadence so a future server-side time series swap is a one-line change. - `RateCalculator` — turns Sōzu's cumulative `Count` metrics into per-second deltas. Detects the hourly `LocalDrain::clear` (which drops `Count`/`Time` while preserving Gauges) by looking for monotonic-decrease between samples and emits `0` for that tick instead of a negative spike. First observation returns `None` so the caller can show a "no baseline yet" placeholder. - `ThresholdTable` — colour-coding boundaries: 5xx ratio > 1 %, `slab.usage_percent` > 80, p99 > 500 ms, etc. Defaults are sane starting points; revisited in the docs commit (week 4) once operators see them in anger. `App::ingest_snapshot` folds an `AggregatedMetrics` into the OVERVIEW ring buffers: - RPS — sum of per-cluster `requests` counters via `RateCalculator`. Falls back to `proxying.requests` when no cluster metrics are exposed (worker configured at `MetricDetail::Process`). - 5xx ratio — derived ratio of summed `http.status.5xx` counters over the request rate; first sample shows 0 % (no baseline). - p99 latency — max p99 across clusters (averaging percentiles is meaningless; operators want "is anyone slow"). - Saturation — prefers `slab.usage_percent`; falls back to `client.connections / connections_max`. `ActiveTab` mirrors numbered tabs at the top of the screen (`1 OVERVIEW … 7 EVENTS`). 
`from_digit` handles `1`-`7` muscle memory; `cycle` powers `Tab` / `Shift-Tab`. `EVENTS` (no `LOGS` pane in v1, per Codex's `tui-logger` rejection) carries a 200-event ring sliding window. 5 unit tests cover ring overflow, the RateCalculator's three branches (first observation, monotonic increase, hourly-reset clamp to 0), and the tab digit/cycle helpers. Build clean, clippy clean. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
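As an illustration of the two data primitives above, here is a minimal sketch of a fixed-capacity ring and a counter-to-delta calculator. The type names come from the commit message, but the method names (`push`, `to_vec`, `observe`) and field layout are assumptions; with the 1 s data tick, a per-tick delta is also a per-second rate.

```rust
use std::collections::VecDeque;

/// Fixed-capacity ring: newest sample at the back, oldest dropped from the front.
pub struct SparkRing {
    samples: VecDeque<u64>,
    capacity: usize,
}

impl SparkRing {
    pub fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    pub fn push(&mut self, value: u64) {
        if self.capacity == 0 {
            return;
        }
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    pub fn to_vec(&self) -> Vec<u64> {
        self.samples.iter().copied().collect()
    }
}

/// Turns a cumulative counter into per-tick deltas.
#[derive(Default)]
pub struct RateCalculator {
    previous: Option<u64>,
}

impl RateCalculator {
    /// Returns `None` on the first observation (no baseline yet), `Some(0)` when
    /// the counter went backwards (treated as a reset), else the increase.
    pub fn observe(&mut self, current: u64) -> Option<u64> {
        let delta = match self.previous {
            None => None,
            Some(prev) if current < prev => Some(0),
            Some(prev) => Some(current - prev),
        };
        self.previous = Some(current);
        delta
    }
}
```

Emitting `0` on a decrease trades one tick of under-reporting for never drawing a negative or wrapped spike after a reset.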
OVERVIEW pane is the first impression of the proxy's health: four sparklines arranged in a 2x2 grid (REQUESTS / SEC, LATENCY p99 ms, 5xx ERRORS, SATURATION %) with big-numeral headers and trend glyphs (`▲ ▼ ●`) so the colour signal always has a redundant shape cue. Each cell uses a rounded `Block::ROUNDED` border and pulls its gradient tier (cool / warm / hot) from `Skin::spark_color` based on the latest sample's normalised position. Latency and saturation sparklines are anchored at fixed thresholds so a quiet system draws short bars and a spike pegs the cell — operators care about "is anyone slow" not the wandering autoscale. CLUSTERS pane is a sortable table with one row per cluster_id. Six columns (cluster_id, rps, err %, p50, p99, backends_available/total). Default sort: 5xx error rate descending, then RPS — Codex's recommendation in the cross-check. `s` cycles the sort column, `S` reverses; the active column header carries an arrow glyph (▼/▲) and the accent colour. Critical rows (5xx ratio > threshold, p99 > threshold, all backends down) flip to `Skin::row_critical`. `App` gains: - `last_metrics: Option<AggregatedMetrics>` — the freshest snapshot is retained verbatim so table-shaped panes don't have to maintain their own derivation. `ingest_snapshot` clones the wire payload (cheap relative to what the master already paid). - `ClusterRow`/`cluster_rows()` — pure builder that materialises the per-cluster summary on demand, sorted per the active key. Computed per frame so changes to `cluster_sort` / `cluster_sort_reverse` take effect on the next paint without a re-fold. - `ClusterSortKey` enum + cycle helper so the renderer can map the `s` keypress to a new sort. `panes/mod.rs::render_placeholder` is the consistent empty-state widget that BACKENDS / LISTENERS / CERTS / H2 / EVENTS use until their real implementations land in week 3. Same rounded border, same muted/secondary colour layering — visually it's "more sōzu top", not a bug. 
Build clean, clippy clean, fmt clean. Renderer wiring lands in the next commit. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Synchronous 30-fps render loop driving the OVERVIEW + CLUSTERS panes plus placeholder cells for the remaining tabs. No tokio runtime in v1 by design (Codex cross-check); the loop is a hand-rolled `crossterm::event::poll(timeout)` against a `RENDER_INTERVAL = 33 ms` budget. Each iteration drains the bounded `crossbeam_channel` snapshot + event receivers via `try_recv`, polls input with the smaller of `next_render - now()` and 50 ms (so an idle UI still drains channels at 20 Hz), then redraws if the frame budget has elapsed.
Synchronized output (DEC mode 2026 via `BeginSynchronizedUpdate` / `EndSynchronizedUpdate`) wraps each `terminal.draw` so tmux + iTerm2 see a single atomic paint instead of per-cell flicker. Anthropic's own TUI team had to do the same; see `tasks/ask-sozu-top-research-external.md` §6.
Layout (Layout::Vertical):
- 3-line tabs row: numbered `1 OVERVIEW … 7 EVENTS` with the focused tab carrying the accent background, others in muted grey. Title text reports the live/no-snapshot status anchored on `last_snapshot_at`.
- flex middle pane (the active tab's render).
- 1-line htop-style F-key bar: `F1 Help · F2 Theme · F3 Find · F4 Filter · F5 Pause · F6 Sort · F7 Detail- · F8 Detail+ · F9 Config · F10 Quit`. Wired today: F1 toggles help, F10 quits, plus Tab cycle / `1`-`7` direct, `q`/`Q`/`Ctrl-C` exit, `s`/`S` sort cycle/reverse on CLUSTERS. The bar reserves slots for the rest so week-3 panes plug in without re-laying out.
`RawModeGuard` is the RAII handle that owns the terminal: install enables raw mode + alt-screen + (optional) mouse capture + cursor hide; Drop reverses the same sequence. Combined with the `ctrlc::try_set_handler` shared `AtomicBool` flag the loop checks each iteration, panic / SIGINT / SIGTERM / clean-quit all restore the terminal predictably. Mouse capture is opt-out via `--no-mouse` for multiplexers that mis-route SGR events.
`run_top` glues everything: spawns the two transport threads (own `Channel` connections, no multiplexing), applies the `SetMetricDetail` lease via `DetailGuard::apply` (continuing without elevation if the master is too old to decode the verb), and runs the render loop until the user quits or `--snapshot N` / `--tick-once` exhausts. Drop order: lease → transport receivers → guard, so the terminal restores last and a panic mid-render never strands the shell. Verified: `sozu --features tui top --tick-once` renders one frame and exits cleanly; `cargo build`/`clippy --all-targets`/`+nightly fmt --check`/`test --workspace --features tui` all pass. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
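The poll-timeout arithmetic of the render loop (the smaller of `next_render - now()` and 50 ms) might look like the sketch below. `RENDER_INTERVAL` is named in the commit; `MAX_POLL`, `poll_timeout` and `frame_due` are hypothetical names chosen for this illustration.

```rust
use std::time::{Duration, Instant};

/// Frame budget for the 30 fps cap.
pub const RENDER_INTERVAL: Duration = Duration::from_millis(33);
/// Upper bound on one input poll, so channels still drain at 20 Hz when idle.
pub const MAX_POLL: Duration = Duration::from_millis(50);

/// How long the input poll may block before the loop must come back around.
pub fn poll_timeout(now: Instant, next_render: Instant) -> Duration {
    // Saturates to zero when the deadline is already in the past.
    next_render.saturating_duration_since(now).min(MAX_POLL)
}

/// Whether the frame budget has elapsed and a redraw is allowed.
pub fn frame_due(now: Instant, next_render: Instant) -> bool {
    now >= next_render
}
```

A zero timeout on an overdue frame means the poll returns immediately, so a late frame is never delayed further by input handling.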
Pickup from `cargo +nightly fmt` after the renderer landed: a single chained `Channel::write_message().map_err()` reflowed onto three lines per the project's nightly rustfmt rules. No behaviour change. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Renders the recent-events ring populated by the events transport thread. Newest events at the top so the eye lands on "what just happened". Five-column table (when, event, cluster, backend, address) with colour-coded rows: - hot tier: BACKEND_DOWN, NO_AVAILABLE_BACKENDS, HEALTH_CHECK_UNHEALTHY, CLUSTER_REMOVED, WORKER_KILLED, SOZU_STOP_REQUESTED — operational red flags that need attention. - cool tier: BACKEND_UP, CLUSTER_RECOVERED, HEALTH_CHECK_HEALTHY, CLUSTER_ADDED — recoveries. - accent: METRIC_DETAIL_CHANGED — the audit signal the SetMetricDetail lease emits, surfaced so operators can see another `sozu top` (or scraper) elevate cardinality on the same fleet. - secondary muted: every other audit / mutation event so they don't drown out the actionable rows. Empty-state copy explains where events come from (the events thread subscribes on startup; first event arrives whenever the master emits one). `format_relative_age` shows `Ns ago` / `NmMs` / `NhMm` so the operator sees freshness at a glance without sub-second precision noise. Wires `panes::events::render` in `render::draw_pane` so the EVENTS tab (numbered `7`) now shows live data instead of the placeholder. Build clean, clippy clean, fmt clean. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
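For the `Ns ago` / `NmMs` / `NhMm` age format, one plausible shape is sketched below. The function name comes from the commit message; the signature (whole seconds in, `String` out) and the exact tier boundaries at 60 s and 3600 s are assumptions.

```rust
/// Formats an age in whole seconds without sub-second noise.
pub fn format_relative_age(age_secs: u64) -> String {
    if age_secs < 60 {
        format!("{age_secs}s ago")
    } else if age_secs < 3600 {
        format!("{}m{}s", age_secs / 60, age_secs % 60)
    } else {
        format!("{}h{}m", age_secs / 3600, (age_secs % 3600) / 60)
    }
}
```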
Three-section vertical layout pulling H2 metrics directly from the
freshest snapshot's `proxying` map.
Section 1 — streams: big-numeral active-streams gauge plus accent-tier
"H2 share of accepts" percentage (alpn.h2 / (alpn.h2 + alpn.http11)).
Companion table: active streams, H2 connections accepted, HTTP/1.1
accepted, client.connections gauge.
Section 2 — flow control: gauges for h2.connection.window_bytes and
h2.connection.pending_window_updates plus rate-of-change for
flow_control_stall, frames.tx.{window_update,rst_stream,goaway},
headers.rejected.budget_overrun. Counters that warn when non-zero
(stalls, RST_STREAM transmissions, GOAWAY transmissions, header
budget overruns) flip to the row-critical style.
Section 3 — flood mitigations: standalone hot-tier-titled block
listing every CVE-mitigation counter (h2.flood.violation.{glitch_window,
rapid_reset, continuation, made_you_reset, ping, settings, priority},
h2.window_update_dropped, h2.close_with_active_streams). Any non-zero
value is a documented attack mitigation firing — bold red so the eye
lands on it. Title names the three CVEs Sōzu tracks
(CVE-2023-44487 Rapid Reset, CVE-2024-27316 CONTINUATION flood,
CVE-2025-8671 MadeYouReset).
Wires `panes::h2::render` into `render::draw_pane` so tab `5` (H2)
shows live data instead of the placeholder. The "trend (60 s)" column
header reserves space for per-row sparklines once the renderer can
afford the second pass without copying. Build clean, clippy clean.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
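The "H2 share of accepts" figure from Section 1 is the ratio alpn.h2 / (alpn.h2 + alpn.http11). A small sketch of that arithmetic follows; the function name is hypothetical, and returning `None` when no connection has been accepted yet is an assumption about how the empty case is handled.

```rust
/// Percentage of accepted connections that negotiated HTTP/2 via ALPN.
pub fn h2_share_percent(alpn_h2: u64, alpn_http11: u64) -> Option<f64> {
    let total = alpn_h2.checked_add(alpn_http11)?;
    if total == 0 {
        // No accepts yet: there is no meaningful ratio to show.
        return None;
    }
    Some(alpn_h2 as f64 * 100.0 / total as f64)
}
```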
Renders one row per `(cluster_id, backend_id)` pair across the freshest snapshot. Default sort: bandwidth descending (sum of `back_bytes_in` + `back_bytes_out`), so the busiest backend lands at the top, which is the "who is on fire" question operators ask first.
Seven columns: cluster, backend, bw down/up B, conn, p50, p99, req. Bandwidth is rendered with a human-friendly suffix (`K`/`M`/`G`); the rest are raw numerics. Rows where `p99 ≥ thresholds.latency_p99_critical_ms` flip to the row-critical hot/bold style so a slow backend visually stands out even when it isn't the busiest.
Sort columns: cluster, backend, bandwidth, connections, latency_p99, requests. `s` cycles, `S` reverses (same shortcuts the CLUSTERS pane already uses; the renderer dispatches by active tab).
Empty-state copy explains the two reasons backends might be missing: no traffic yet, or `metrics.detail < backend` and the SetMetricDetail lease hasn't acknowledged. Points at the EVENTS pane for the METRIC_DETAIL_CHANGED audit signal.
`App` gains `backend_sort` / `backend_sort_reverse` and a `BackendRow` + `backend_rows()` builder that flattens the per-backend metric maps in the snapshot. Sort tie-breaks deterministically on (cluster_id, backend_id) so the on-screen layout doesn't jiggle when two backends share the same primary key.
In-table cluster-scope filter (drill from the CLUSTERS row's `cluster_id`) is week-4 polish; the flat view already gives the "which backend is loaded" answer at a glance.
Build clean, clippy clean, fmt clean.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
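The deterministic tie-break can be illustrated with a short sketch. `BackendRow` is named in the commit, but the fields shown here and the `sort_by_bandwidth` helper are hypothetical; the point is only that the (cluster_id, backend_id) tie-break stays ascending whichever way the primary key is ordered.

```rust
/// Hypothetical, trimmed-down row: only the fields the sort needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BackendRow {
    pub cluster_id: String,
    pub backend_id: String,
    pub bytes_in: u64,
    pub bytes_out: u64,
}

impl BackendRow {
    pub fn bandwidth(&self) -> u64 {
        self.bytes_in.saturating_add(self.bytes_out)
    }
}

/// Bandwidth descending by default; `reverse` flips only the primary key.
pub fn sort_by_bandwidth(rows: &mut [BackendRow], reverse: bool) {
    rows.sort_by(|a, b| {
        let primary = b.bandwidth().cmp(&a.bandwidth());
        let primary = if reverse { primary.reverse() } else { primary };
        primary
            .then_with(|| a.cluster_id.cmp(&b.cluster_id))
            .then_with(|| a.backend_id.cmp(&b.backend_id))
    });
}
```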
Third transport thread (`spawn_listeners`) opens its own `Channel` and polls `RequestType::ListListeners` every 5 s. The slower cadence matches the brief's "cold subjects" tier: listener state changes are operator-paced and the EVENTS pane already surfaces LISTENER_ADDED / LISTENER_REMOVED / LISTENER_UPDATED / LISTENER_ACTIVATED / LISTENER_DEACTIVATED audit events as they happen. Pushed snapshots land in a `crossbeam_channel::bounded::<ListenersSnapshot>(1)` with the same newest-wins semantics the metrics collector uses.
`App` gains a `last_listeners: Option<ListenersSnapshot>` slot, populated by `App::ingest_listeners` from the render-loop drain.
LISTENERS pane renders a flat three-column table (proto, address, status) across the freshest `ListenersList`:
- HTTP listeners: `active=<bool>`.
- HTTPS listeners: `active=<bool> · alpn=<csv>`.
- TCP listeners: `active=<bool>`.
Title strip shows refresh cadence + live/no-snapshot status so the operator can immediately see whether the inventory is current. Empty-state copy points at the EVENTS pane for audit transitions while the first 5-s poll is in flight.
`render::run` signature takes the third receiver; `run_top` spawns the third thread and threads the receiver through. Build clean, clippy clean.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
…port The events transport thread blocked indefinitely on read_message_blocking_timeout(None). The other three transport threads use bounded intervals and observe try_send's Disconnected to exit on receiver drop; the events thread had no such wake path because dropping the crossbeam receiver does not propagate into the unix socket. Consequences: - The thread held its SubscribeEvents subscription open forever. - The master's per-session event ring kept queueing for a never-draining consumer (memory leak proportional to event traffic for the worker's lifetime). - Programmatic re-entry (tests, embedded callers, repeated run_top) leaked one Channel per invocation. Wire a finite read timeout (EVENTS_READ_TIMEOUT = 1s) plus an Arc<AtomicBool> shutdown flag owned by run_top. The events loop checks the flag between reads and exits cleanly when run_top sets it on the way out. Join all four transport handles on shutdown for symmetry. Module-level docs at transport.rs already promised "all threads exit cleanly when their crossbeam_channel peer is dropped" — the events thread now honours that contract. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
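The finite-timeout-plus-flag pattern can be sketched with standard-library pieces. Here `std::sync::mpsc::Receiver::recv_timeout` stands in for the socket read with `EVENTS_READ_TIMEOUT`, and `spawn_events_loop` is a hypothetical name; the real thread reads from a unix-socket `Channel`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Reads events until `shutdown` is set or the source goes away.
/// Returns how many events were seen, so callers can join and inspect it.
pub fn spawn_events_loop(
    source: Receiver<String>,
    shutdown: Arc<AtomicBool>,
    read_timeout: Duration,
) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut forwarded = 0;
        // The flag is re-checked at least once per `read_timeout`.
        while !shutdown.load(Ordering::SeqCst) {
            match source.recv_timeout(read_timeout) {
                Ok(_event) => forwarded += 1,
                Err(RecvTimeoutError::Timeout) => continue,
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        forwarded
    })
}
```

The worst-case shutdown latency is one read timeout, which is why a bounded value matters more than its exact size.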
Two related terminal-lifecycle bugs. 1. RawModeGuard::install enabled raw mode BEFORE constructing the guard. If a subsequent EnterAlternateScreen / Hide / EnableMouseCapture failed (rare but real on EBADF stdout in nspawn / systemd-run shells, EOF on closed pty after session detach), the function returned Err with raw mode still on but no Drop scheduled. Operator's shell was stuck in raw mode without echo; recovery via 'reset' or 'stty sane'. Invert the construction: build the guard immediately after enable_raw_mode, track partial-success state on alt_entered / mouse_enabled flags, and progressively set them as each step succeeds. Drop checks each flag before issuing the reverse, so a partial install also gets a partial teardown. 2. ctrlc::set_handler only catches SIGINT by default. The module-level docs at bin/src/ctl/top/mod.rs advertised SIGTERM coverage too, but the ctrlc dep was on default features only. 'kill -TERM <pid>' (which systemd-run --user --pty deliberately sends on close) left the terminal in alt-screen + raw mode. Enable ctrlc's 'termination' feature so the same handler also catches SIGTERM and SIGHUP. Update the module docstring to call out the feature dependency explicitly. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
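The inverted construction from item 1 can be sketched without a real terminal. In this illustration a shared log of step names replaces the crossterm calls, and the `fail_at` parameter is a hypothetical hook used to simulate a failing step; only the `alt_entered` / `mouse_enabled` flags come from the commit message.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records terminal operations instead of performing them.
pub type Log = Rc<RefCell<Vec<&'static str>>>;

pub struct RawModeGuard {
    log: Log,
    alt_entered: bool,
    mouse_enabled: bool,
}

fn step(log: &Log, name: &'static str, fail_at: Option<&'static str>) -> Result<(), String> {
    if fail_at == Some(name) {
        return Err(format!("{name} failed"));
    }
    log.borrow_mut().push(name);
    Ok(())
}

impl RawModeGuard {
    pub fn install(log: Log, fail_at: Option<&'static str>) -> Result<Self, String> {
        step(&log, "enable_raw_mode", fail_at)?;
        // Build the guard right away: any later `?` now runs Drop.
        let mut guard = RawModeGuard { log, alt_entered: false, mouse_enabled: false };
        step(&guard.log, "enter_alt_screen", fail_at)?;
        guard.alt_entered = true;
        step(&guard.log, "enable_mouse", fail_at)?;
        guard.mouse_enabled = true;
        Ok(guard)
    }
}

impl Drop for RawModeGuard {
    fn drop(&mut self) {
        // Undo only what succeeded, in reverse order.
        let mut log = self.log.borrow_mut();
        if self.mouse_enabled {
            log.push("disable_mouse");
        }
        if self.alt_entered {
            log.push("leave_alt_screen");
        }
        log.push("disable_raw_mode");
    }
}
```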
…tely
The render loop's dirty-gate skipped terminal.draw when
!app.take_dirty() && !app.pulse.has_active(). Several handle_key arms
mutated visible state without setting the dirty flag, so on a quiet
system (1 s snapshot tick by default) the operator pressed Tab, sort,
':', '?', or F1 and saw nothing for up to one second.
Affected arms:
- KeyCode::Char(':') -> open_palette
- KeyCode::Char('s' | 'S') on Clusters and Backends panes (sort cycle /
reverse)
- KeyCode::Tab and BackTab
- KeyCode::Char('1'..='7') (direct tab jump)
- KeyCode::Char('?') and F1 (help toggle)
- the palette typing path
Add app.mark_dirty() at each site. The 30 fps render cap still bounds
the paint rate; the dirty flag is the canonical 'frame needs repaint'
signal independent of snapshot arrival.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
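A minimal sketch of the dirty-gate contract follows. `mark_dirty` and `take_dirty` are named in the commit; the `App` fields, `next_tab`, and the free function `should_draw` (with the pulse state reduced to a boolean) are simplifications made for this illustration.

```rust
#[derive(Default)]
pub struct App {
    dirty: bool,
    active_tab: usize,
}

impl App {
    pub fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns the flag and clears it in one step.
    pub fn take_dirty(&mut self) -> bool {
        std::mem::take(&mut self.dirty)
    }

    /// Any handler that changes visible state must also mark the frame dirty.
    pub fn next_tab(&mut self, tab_count: usize) {
        self.active_tab = (self.active_tab + 1) % tab_count.max(1);
        self.mark_dirty();
    }
}

/// Mirrors the gate: draw when state changed or an animation is running.
pub fn should_draw(app: &mut App, pulse_active: bool) -> bool {
    app.take_dirty() || pulse_active
}
```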
The four transport-thread spawn sites in bin/src/ctl/top/transport.rs
used .expect("spawn sozu-top <label>"). The thread::Builder::spawn path
fails under RLIMIT_NPROC pressure or transient OOM; the previous
behaviour panicked the binary with a backtrace.
The RawModeGuard Drop restores the terminal even on panic, but the
operator sees an abort rather than a clean error. Map the spawn failure
to a new CtlError::SpawnFailed { label, source } variant and propagate
it through the Result<...> return so the caller (run_top) surfaces a
one-line "failed to spawn thread 'sozu-top-collector': ..." message
instead of a panic banner.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
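Mapping the spawn failure into a typed error might look like this sketch. The `SpawnFailed { label, source }` variant and the message format come from the commit; the `spawn_named` helper, the `String` label type, and the trimmed-down `CtlError` enum are assumptions.

```rust
use std::fmt;
use std::io;
use std::thread::{self, JoinHandle};

#[derive(Debug)]
pub enum CtlError {
    SpawnFailed { label: String, source: io::Error },
}

impl fmt::Display for CtlError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CtlError::SpawnFailed { label, source } => {
                write!(f, "failed to spawn thread '{label}': {source}")
            }
        }
    }
}

impl std::error::Error for CtlError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            CtlError::SpawnFailed { source, .. } => Some(source),
        }
    }
}

/// Spawns a named thread, returning an error instead of panicking on failure.
pub fn spawn_named<F, T>(label: &str, body: F) -> Result<JoinHandle<T>, CtlError>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name(label.to_owned())
        .spawn(body)
        .map_err(|source| CtlError::SpawnFailed { label: label.to_owned(), source })
}
```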
Three eprintln! sites in bin/src/ctl/top/transport.rs wrote transient errors to stderr after the alt-screen takeover redirected the terminal. The operator never saw them; recovery required leaving the TUI and inspecting the parent shell's stderr buffer. Reuse the StatusSlot pattern already used by the DetailGuard renewer thread (bin/src/ctl/top/cardinality.rs). The four transport spawn functions now accept a shared StatusSlot; on a recoverable error (poll round-trip failure, SubscribeEvents write/read error) they push a message to the slot, which the render loop drains once per tick and surfaces in the status bar. publish_status was promoted from private to pub(super) so the transport module can call it. The render loop's existing take_status drain already handles the unified mailbox — no rendering changes needed. run_top now constructs the shared StatusSlot before the spawn calls and threads a clone into each transport thread + the DetailGuard apply. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
The variant is only constructed from `bin/src/ctl/top/transport.rs`, which is itself gated behind the `tui` Cargo feature. Without the gate the lean default-features build (`--no-default-features --features crypto-ring`) emits `warning: variant `SpawnFailed` is never constructed`, which CI's `-D warnings` clippy gate would catch. Add `#[cfg(feature = "tui")]` to the variant declaration so the lean build no longer sees the dead variant and the TUI build keeps it for the spawn-failure propagation path. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Four small documentation fixes uncovered during the PR-review pass. 1. doc/configure_cli.md flag table no longer documents --no-color and --log-file; both were removed from the clap surface earlier in this PR but the table still listed them. Operators reading the table would get a 'clap: unexpected argument' error if they tried the advertised flags. The --no-mouse row is kept and clarified since that flag is still wired in. 2. doc/sozu-top.md narrative referenced NO_COLOR/--no-color in the accessibility section; strike the sentence so the doc matches the trimmed clap surface (neither the env var nor the flag is wired into the TUI today). 3. bin/src/ctl/top/transport.rs had two inline comments claiming "drop oldest semantics on overflow", but try_send on a full bounded channel drops the *newest* sample (publish-or-skip). The publish-or-skip contract is the documented behaviour shared with the snapshot channels above; fix both events-channel comments to match. 4. lib/src/metrics/mod.rs doc comment near lease_apply claimed TTLs above LEASE_TTL_MAX are "clamped", but the actual behaviour is to return LeaseApplyOutcome::TtlOutOfRange. Update to say "rejected" so callers handle the right outcome arm; the existing rationale paragraph below already explains the rejection but the lead-in contradicted it. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
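The publish-or-skip behaviour from item 3 can be demonstrated with the standard library's bounded channel, used here as a stand-in for `crossbeam_channel::bounded`. The `Publish` enum and `publish_or_skip` helper are hypothetical names for this sketch.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

#[derive(Debug, PartialEq, Eq)]
pub enum Publish {
    Sent,
    /// Channel was full: the NEW sample is dropped, the queued one stays.
    Skipped,
    /// Receiver was dropped: the producer thread should exit.
    PeerGone,
}

pub fn publish_or_skip<T>(tx: &SyncSender<T>, sample: T) -> Publish {
    match tx.try_send(sample) {
        Ok(()) => Publish::Sent,
        Err(TrySendError::Full(_)) => Publish::Skipped,
        Err(TrySendError::Disconnected(_)) => Publish::PeerGone,
    }
}
```

With a capacity of 1, a second `try_send` before the consumer drains leaves the first sample in place, which is why "drop oldest" was the wrong description.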
Five small clean-ups uncovered during the PR-review pass. 1. bin/README.md gains a sozu top feature highlight matching the root README style and cross-linking the operator guide. 2. bin/src/ctl/top/app.rs::cluster_rows had a dead .max(0) arm at the end of the backends_available chain. Flatten to one if/else so the final no-rollup-no-per-backend-gauge fallback is the only path that uses cm.backends.len(). Same semantics; one less branch and the intent reads top-down rather than relying on a degenerate max. 3. lib/src/metrics/mod.rs::record_backend_metrics! macro now references names::backend::* constants instead of raw "bytes_in" / "bytes_out" / "backend_response_time" / "backend_connection_time" / "requests" literals. Values are unchanged on the wire; emitter and reader sides now go through the same typed-constant rename guard. Adds names::backend::CONNECTION_TIME for the previously-unconstantised "backend_connection_time" metric so the full set is captured. 4. bin/src/ctl/top/render.rs PanicHookGuard restores the prior panic hook on Drop. Repeated render::run calls in the same process (tests, embedded callers) no longer stack hook layers indefinitely. The new hook still chains the prior for banner emission; Drop just takes ownership of the prior back and reinstalls it as the active hook. 5. SOZU_TOP_SYNC=0 env opt-out is now actually wired at the BeginSynchronizedUpdate / EndSynchronizedUpdate call sites. Without this, operators on terminals that don't speak DEC mode 2026 had no escape hatch from the synchronised-output frame wrap. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
…heck reset `BackendMap::record_cluster_availability` is the sole emission site for the `cluster.available_backends` and `cluster.total_backends` rollup gauges, called from `add_backend`, `remove_backend`, `backend_from_cluster_id`, and the health-check completion path. Two control-plane paths populated or reset the backend map without invoking it, leaving the rollup absent until something else mutated the cluster: - `import_configuration_state` extended the map via `HashMap::extend` with no follow-up. A worker that loaded its cluster topology via SCM-fd hot-restart or `LoadState` therefore reported the gauges as missing on the first `QueryMetrics` poll until traffic or an admin add/remove fired. The TUI's three-tier `cluster_rows` fallback resolves to `cm.backends.len() = 0` under default `Cluster` detail, surfacing as `backends_available/total = 0/0` in the operator view. - `set_health_check_config(None)` resets every backend in the cluster to `HealthState::default()` (so the load balancer routes again after the operator drops the probe) but did not re-emit. The metrics view stayed pinned at the last health-check value while the routing view reflected the reset. Both paths now call `record_cluster_availability(cluster_id)` for each affected cluster. Adds unit tests asserting (1) the rollup gauges land in the Aggregator immediately after `import_configuration_state`, and (2) `set_health_check_config(None)` flips the cell back to `Available` after the health-state reset. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`Aggregator::lease_apply` previously inserted unconditionally on a renewal (same `client_id` already present), overwriting the recorded `PeerBinding` with whatever the renewer presented. The single-owner-thread refactor on the client side made the legitimate `DetailGuard` always present the same `(peer_pid, peer_session_ulid)`, but it does not constrain other callers: any same-UID process that learns the victim's `client_id` (PID is enumerable through `/proc`; the random hex suffix lands in the audit log `lease_id=` column and any tail of `sozu top --debug` stderr) can re-apply against the lease, replace the binding to its own session, and lock the original owner out of their `Drop`-time `clear`.
`lease_apply` now gates renewals: when an existing lease's apply-time binding is fully known (per `PeerBinding::is_known`), the renewer's presented binding MUST match. Unknown apply-time bindings (no `SO_PEERCRED`, pre-binding callers, intermediate proxies) keep their "accept any renewer" behaviour per the proto contract on `SetMetricDetail.peer_pid` / `peer_session_ulid`.
The new `LeaseApplyOutcome::Unauthorized` variant feeds a `WorkerResponse::error` in the `SetMetricDetail` worker arm; the message intentionally does not echo `client_id` so operator-controlled bytes don't reach the freeform reason field. The dedicated `lease_id` audit column still carries the client id through the strict sanitiser.
Adds two regression tests:
- `lease_apply_renewal_rejects_foreign_binding` reproduces the collision attack and asserts the renewal is refused AND the original owner's subsequent `clear` still succeeds.
- `lease_apply_renewal_with_matching_binding_succeeds` covers the symmetry case so the TUI's own renewer thread doesn't regress.
The existing `lease_apply_renewal_replaces_previous_for_same_client` test (`PeerBinding::default()` on both sides) continues to exercise the unknown-binding path.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
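The renewal gate can be sketched as below. The names `PeerBinding`, `is_known`, `lease_apply` and `LeaseApplyOutcome::Unauthorized` come from the commit; the field layout of `PeerBinding`, the `Leases` wrapper, and the reduced two-variant outcome enum are assumptions made to keep the example small.

```rust
use std::collections::HashMap;

/// Assumed layout: both fields optional, "known" only when both are present.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct PeerBinding {
    pub peer_pid: Option<u32>,
    pub peer_session_ulid: Option<String>,
}

impl PeerBinding {
    pub fn is_known(&self) -> bool {
        self.peer_pid.is_some() && self.peer_session_ulid.is_some()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseApplyOutcome {
    Applied,
    Unauthorized,
}

#[derive(Default)]
pub struct Leases {
    table: HashMap<String, PeerBinding>,
}

impl Leases {
    pub fn lease_apply(&mut self, client_id: &str, presented: PeerBinding) -> LeaseApplyOutcome {
        if let Some(existing) = self.table.get(client_id) {
            // A fully known apply-time binding pins the lease to its owner.
            if existing.is_known() && *existing != presented {
                return LeaseApplyOutcome::Unauthorized;
            }
        }
        // Fresh insert, matching renewal, or unknown-binding renewal.
        self.table.insert(client_id.to_owned(), presented);
        LeaseApplyOutcome::Applied
    }
}
```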
`is_unsafe_line` already stripped C0 + DEL + C1 (via `char::is_control`), BOM, and the Unicode line/paragraph separators. The bidirectional override / isolate controls, U+202A..=U+202E (LRE/RLE/PDF/LRO/RLO) and U+2066..=U+2069 (LRI/RLI/FSI/PDI), were not covered. An operator-controlled value containing a Right-to-Left Override visually reverses the bytes that follow when an operator tails the audit log in a Unicode-aware terminal (`less`, `cat` under a UTF-8 locale, `journalctl`), so a row that legitimately attributes an action to one field can appear to attribute it to a different one. This is the Trojan-Source class (CVE-2021-42574) applied to audit rows rather than source code.
Extend `is_unsafe_line` to reject the full bidi override / isolate range. `sanitize_for_audit_kv` inherits the additional coverage through the `is_unsafe_line` call inside `is_unsafe_kv`.
Adds four regression tests:
- `line_strips_rtl_override` for the canonical U+202E attack.
- `line_strips_bidi_override_range` table-tests all five U+202A..U+202E.
- `line_strips_bidi_isolate_range` table-tests all four U+2066..U+2069.
- `line_preserves_legitimate_bidi_text` asserts that Hebrew/Arabic script content round-trips unchanged; only the explicit controls are rejected.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
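A predicate covering the ranges listed above could be written as follows. This is a sketch of the check only: the real `is_unsafe_line` may differ in signature and in whether it strips or rejects, and `is_bidi_control` is a hypothetical helper name.

```rust
/// Explicit bidirectional embedding, override and isolate controls.
pub fn is_bidi_control(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// True when the string contains a character unsafe for a one-line audit row.
pub fn is_unsafe_line(s: &str) -> bool {
    s.chars().any(|c| {
        c.is_control()            // C0, DEL and C1 controls
            || c == '\u{FEFF}'    // byte-order mark
            || c == '\u{2028}'    // line separator
            || c == '\u{2029}'    // paragraph separator
            || is_bidi_control(c)
    })
}
```

Only the explicit control characters are matched, so right-to-left script content itself passes through untouched.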
Both `WorkerTask::on_finish` and `SetMetricDetailTask::on_finish` join
the per-worker `response.message` strings into `extras.reason` via
`messages.join(", ")`. The rendered audit-log row passes that field
through the weak `sanitize_for_audit`, which leaves `,` and `=`
untouched — so a worker message that itself contained either character
forged adjacent KV columns when a SIEM splits on `, ` / `=`.
The worker error templates in `lib/src/server.rs::notify`'s
`SetMetricDetail` arm already redacted the operator-supplied
`client_id` from the freeform reason string, but a future internal
caller (or a non-`SetMetricDetail` verb whose worker arm legitimately
echoes operator-controlled bytes) would still smuggle past the join
site without sanitisation.
Two changes:
- `WorkerTask::on_finish` now runs each worker's `response.message`
through `sanitize_for_audit_kv` before formatting `{worker_id}: {msg}`.
- `SetMetricDetailTask::on_finish` does the same. This task does NOT
share the generic `WorkerTask` path, so the fix has to land at both
sites.
Adds a regression test asserting the canonical attack payload
(`x,actor_user=mallory,sozu_version=hijacked`) cannot survive the
format-then-sanitise sequence at the join sites.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
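To make the column-forging risk concrete, here is a sketch of sanitising each worker message before the join. The body of `sanitize_for_audit_kv` shown here is a hypothetical stand-in (it replaces `,`, `=` and control characters with `_`); the real sanitiser's rules are not described in this commit, and `join_worker_messages` is an illustrative helper rather than the actual `on_finish` code.

```rust
/// Hypothetical strict sanitiser: neutralises KV delimiters and controls.
pub fn sanitize_for_audit_kv(raw: &str) -> String {
    raw.chars()
        .map(|c| if c == ',' || c == '=' || c.is_control() { '_' } else { c })
        .collect()
}

/// Formats `{worker_id}: {msg}` per worker, sanitising each message first,
/// so the only `, ` separators in the result are the ones added by the join.
pub fn join_worker_messages(messages: &[(u32, &str)]) -> String {
    messages
        .iter()
        .map(|(worker_id, msg)| format!("{worker_id}: {}", sanitize_for_audit_kv(msg)))
        .collect::<Vec<_>>()
        .join(", ")
}
```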
Multi-line destructure of `Some(sozu_command_lib::proto::command::response_content::ContentType::WorkerMetrics(wm))` exceeds the per-line width budget; rustfmt prefers the parenthesised `Some(...)` form spread across three lines. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
f94ae33 to
4caf001
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…p, getrandom Close out four remaining lease-primitive findings from the PR #1256 review pass. 2A — Loud TTL reject in lease_apply (was: silent clamp). The aggregator no longer transparently caps `ttl > LEASE_TTL_MAX` to LEASE_TTL_MAX. It now returns `LeaseApplyOutcome::TtlOutOfRange`, which the worker dispatch arm surfaces as `WorkerResponse::error`. The dispatch site itself already rejects out-of-range TTLs before reaching the aggregator; the new error arm catches any caller that bypasses dispatch (proto fuzzing, future internal use) instead of silently capping their intent. 2B — Master-side TTL pre-validate before fan-out. `worker_request` checks the operator-supplied `ttl_seconds` against `LEASE_TTL_MAX` BEFORE scattering to workers. A malicious or buggy `SetMetricDetail{ttl_seconds: u32::MAX}` no longer reaches the worker fan-out, eliminating the N×rejected-fan-out + N×audit-line amplifier. 2E — Lease HashMap cap + client_id length cap. New `LEASE_TABLE_CAP = 64` and `LEASE_CLIENT_ID_MAX_BYTES = 64` constants. `lease_apply` returns `LeaseApplyOutcome::TableFull` for fresh inserts that would overflow the table (renewals of existing entries still succeed, so an active operator never loses their lease just because the table is full) and `LeaseApplyOutcome::ClientIdTooLong` for oversized lease keys. Closes the CWE-770 vector where a same-UID attacker could roll `client_id` faster than expiry to grow the map unbounded toward worker OOM. The master applies the same length cap before fan-out via the new `sozu_lib::metrics::LEASE_CLIENT_ID_MAX_BYTES` export. 2D — Direct `libc::getrandom` on Linux for the lease-id random suffix. Replaces the `/dev/urandom` File::open path on Linux with the `getrandom(2)` syscall + `GRND_NONBLOCK` flag (no fs dependency, no chroot/sandbox failure path). Non-Linux Unix targets keep the `/dev/urandom` read because `getrandom`'s ABI varies (FreeBSD vs OpenBSD's `getentropy(2)` vs macOS's `SecRandomCopyBytes`). 
Switched from `u32::from_ne_bytes` to `u32::from_le_bytes` for cross-arch reproducibility of the rendered hex. Last-resort fallback to `subsec_nanos()` remains for total-entropy-failure environments and the fallback is now silent at the data layer (the `app.status` surface covers operator visibility in a later commit).
Four new tests cover ClientIdTooLong, TableFull (including renewal-succeeds-when-full), TtlOutOfRange, and the LEASE_TTL_MAX boundary. Build/clippy clean; 29/29 metric tests pass (4 new).
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
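The three rejection arms from items 2A and 2E can be sketched together. The two cap constants and the outcome variant names come from the commit message; the value of `LEASE_TTL_MAX`, the order of the checks, and the free-function shape over a plain `HashMap` are assumptions for illustration.

```rust
use std::collections::HashMap;

pub const LEASE_TABLE_CAP: usize = 64;
pub const LEASE_CLIENT_ID_MAX_BYTES: usize = 64;
/// Assumed value; the commit does not state the real maximum.
pub const LEASE_TTL_MAX: u32 = 300;

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseApplyOutcome {
    Applied,
    TtlOutOfRange,
    TableFull,
    ClientIdTooLong,
}

/// `table` maps client_id to the granted TTL in seconds.
pub fn lease_apply(
    table: &mut HashMap<String, u32>,
    client_id: &str,
    ttl_seconds: u32,
) -> LeaseApplyOutcome {
    if client_id.len() > LEASE_CLIENT_ID_MAX_BYTES {
        return LeaseApplyOutcome::ClientIdTooLong;
    }
    // Reject loudly instead of clamping to the maximum.
    if ttl_seconds > LEASE_TTL_MAX {
        return LeaseApplyOutcome::TtlOutOfRange;
    }
    // Renewals of existing entries succeed even when the table is full.
    let is_renewal = table.contains_key(client_id);
    if !is_renewal && table.len() >= LEASE_TABLE_CAP {
        return LeaseApplyOutcome::TableFull;
    }
    table.insert(client_id.to_owned(), ttl_seconds);
    LeaseApplyOutcome::Applied
}
```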
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Three operator-facing TUI hardening items from the PR #1256 review.
2C: Skin loader fails closed when the parent anchor can't be resolved. Previously, when `skins_anchor()` returned `None` (race-delete of the parent, weird `/proc` paths, unusual fs mounts) the confinement check was silently skipped and `from_toml` ran on the bare resolved path, defeating the symlink-escape defence the canonicalize block was added to provide. Now the anchor failure short-circuits to the default skin with a status-bar diagnostic.
2C (companion): Close the skin loader TOCTOU window. The previous shape canonicalized the file, then went `from_toml(&resolved)` which called `std::fs::read_to_string(&Path)`, re-resolving the path through the kernel and re-opening the symlink chain. An attacker with write access to a shared skins dir could swap a symlink in the gap. Replaced with `from_open_file(&Path)` which calls `File::open` once after the parent-anchor check and reads from the `&mut File` handle. The path-based `from_toml` API is removed; tests parse via `toml::from_str` on a raw string.
3E: `ctrlc::set_handler` degrades gracefully on `MultipleHandlers` error. The previous `.expect()` aborted the TUI on programmatic re-entry (any caller that invokes `run` twice in the same process address space hits the second-install path). The crossterm event loop already observes Ctrl-C as a keypress, so the dedicated signal handler is belt-and-braces rather than the primary path; falling through with a status-bar note keeps embedded callers viable.
Status-bar precedence: skin > lease-elevation > signal-handler diagnostic. (Renewer-thread status surface, review M-7, is deferred: the worker status channel needs threading through `DetailGuard` and the render-loop poll. Tracked as a follow-up to keep this commit surgical.)
Build/clippy clean; 53/53 TUI unit tests pass.
Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…pe, privacy Three documentation fixes from the PR #1256 review. 3G — CHANGELOG referenced "three synchronous transport threads"; the CERTS-pane collector added a fourth in eec8422 but the entry was not updated. Fixed to "four synchronous transport threads (snapshots, listeners, certs, events)". 2G — CHANGELOG audit-scope claim updated to reflect this PR's actual behaviour: every cardinality transition emits METRIC_DETAIL_CHANGED, including the previously-silent worker-local janitor expiries and post-fan-out apply/clear paths, thanks to the new worker→master audit IPC. Lease ownership binding (SO_PEERCRED peer pid + session ULID) is called out so SecNumCloud-style reviewers see the trust model in plain text. The unsupported_workers field's wiring status is described honestly — the proto field + per-worker version snapshot ship in this release; capability-aware dispatch is tracked as follow-up. 3H + D5 — `doc/sozu-top.md` gains an explicit transport layout note (four threads, six unix-socket connections per invocation) and a privacy paragraph: the operator-supplied `--reason` text flows to the audit log and to SubscribeEvents subscribers, so PII / customer IDs should not be embedded. Length and character caps applied server-side are documented. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Two cleanup items from the PR #1256 review: 4A — Remove the empty `bin/api` zero-byte file. The diff added it as a new empty file (`new file mode 100644, index 00000000..e69de29`) — clearly a stray `touch bin/api` from local experimentation that got `git add`-ed. No callers in tree. Lisa L-017 — Add a dedicated `cargo audit` CI job that runs on every push + PR. The `tui` Cargo feature pulled in 12 new optional crates (color-eyre, crossterm, ratatui, throbber-widgets-tui, tui-input, tui-big-text, tui-popup, tui-scrollview, tui-tree-widget, ctrlc, crossbeam-channel, toml) plus their transitive closure outside what the default-features build covers. Without an audit gate any RUSTSEC advisory landing against one of those crates can ship in a release unnoticed. Job: cache-all-crates, prefix-key `ci-audit`, `--deny warnings` so the gate fails fast on any new advisory. Two invocations to keep symmetry with the per-feature build matrix above; both read the same Cargo.lock that already contains the TUI deps. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Two leftover items from PR #1256 review round-2. 3F — Regression guard for the C1 dispatch-whitelist fix. A future refactor that drops `SetMetricDetail` from the no-op match arm in `ConfigState::dispatch` and falls through to the catch-all `UndispatchableRequest` arm would silently re-break TUI cardinality elevation entirely (the original C1 bug). New test exercises the happy path with a fully-populated `SetMetricDetail` request and asserts `dispatch` returns `Ok`. 4B — Replace `LEASE_TTL_DEFAULT.as_secs() as u32` lossy cast in the SetMetricDetail worker arm with `u32::try_from(...).unwrap_or(60)`. The default fits in u32 by construction (60 s); the explicit checked conversion documents the bound and shields against any future tweak that grows LEASE_TTL_DEFAULT past `u32::MAX` seconds (≈ 136 years). D6 — Confirmed via `grep -nE "^## \[" CHANGELOG.md`: a single `## [Unreleased]` section. No edit needed. Build/clippy clean; 28/28 state tests pass (1 new). Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
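The 4B change can be illustrated with a minimal sketch. `default_ttl_seconds` is a hypothetical helper name; only `LEASE_TTL_DEFAULT`, the 60 s value, and the `u32::try_from(...).unwrap_or(60)` expression come from the commit message.

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Default lease TTL (60 s per the commit message).
pub const LEASE_TTL_DEFAULT: Duration = Duration::from_secs(60);

/// Checked u64 -> u32 conversion: falls back to 60 instead of silently
/// truncating if the duration ever exceeds u32::MAX seconds.
pub fn default_ttl_seconds(ttl: Duration) -> u32 {
    u32::try_from(ttl.as_secs()).unwrap_or(60)
}
```

With an `as u32` cast, a duration of `u32::MAX + 1` seconds would wrap to 0; the checked form returns the documented fallback instead.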
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
PR #1256 review M-7: the renewer thread's `eprintln!`-on-error wrote to the wiped alt-screen, so the operator never saw "renewer dropped" diagnostics until the lease silently lapsed and per-backend panes went sparse. Wire a shared status slot:

- New `cardinality::StatusSlot = Arc<Mutex<Option<String>>>` plus `new_status_slot()` / `take_status(slot)` / private `publish_status` helpers. Poisoned-lock recovery via `into_inner` so a panic in one background thread does not silently strand the next message.
- `DetailGuard::apply` takes a `StatusSlot` parameter. The renewer thread receives a clone and writes through `publish_status` on its two error paths (channel open + renewal send), replacing the prior silent `eprintln!`. The `status` field on `DetailGuard` keeps the slot alive for the renewer's lifetime; the read path lives on the render-loop side (hence the `#[allow(dead_code)]` for the reader-not-on-guard pattern).
- `RenderConfig` gains a `lease_status: StatusSlot` field; the render loop drains it once per tick and overwrites `App::status` so the F-key bar repaints the message on the next frame. New `App::mark_dirty()` triggers the redraw when no snapshot landed.

Same plumbing is ready to receive the four transport collectors' `eprintln!` sites in a follow-up. That is kept out of this commit to keep the change focused on the renewer (review L-2 is the companion finding).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
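A minimal sketch of the status-slot pattern named above, using the type alias and helper names from the commit message. The bodies are an assumption about how such helpers might look, not the PR's code.

```rust
use std::sync::{Arc, Mutex};

/// Shared slot holding at most one pending status message.
pub type StatusSlot = Arc<Mutex<Option<String>>>;

pub fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Writer side: overwrite the slot with the latest message. A poisoned
/// lock is recovered via `into_inner` so the message is not lost.
fn publish_status(slot: &StatusSlot, message: impl Into<String>) {
    let mut guard = slot
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(message.into());
}

/// Reader side: drain the slot, leaving it empty for the next message.
pub fn take_status(slot: &StatusSlot) -> Option<String> {
    let mut guard = slot
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.take()
}
```

A render loop would call `take_status` once per tick; a background thread holding a clone of the slot would call `publish_status` on its error paths. Because the slot holds a single `Option`, a newer message replaces an older one that has not been drained yet.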
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…CKENDS panes (A4)

PR #1256 simplify A4. `panes/clusters.rs` and `panes/backends.rs` each carried an identical 19-LOC `sort_header` helper that produced the styled `ratatui::widgets::Cell` for an active-or-inactive sort column header. Only the column-key enum (`ClusterSortKey` vs `BackendSortKey`) differed.

Replace the two copies with one shared `pub(super) fn sort_header(label, active, reverse, skin) -> Cell<'static>` in `bin/src/ctl/top/panes/mod.rs:27`. Call sites compute the boolean `active = current_key == key` themselves, so the helper stays generic-free.

Insta snapshot tests for both panes pass unchanged: the rendered bytes are byte-for-byte identical (snapshot tests are the byte-equality guard). Pure refactor, no behaviour change.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…ric helper (A1)

PR #1256 simplify A1. `spawn_collector` (snapshots), `spawn_listeners`, and `spawn_certs` shared an identical `loop { match poll { Ok(v) => try_send; Err(e) => eprintln; } sleep_remaining }` shape: ~90 LOC of triplicated control flow.

Extract a single private `poll_loop<T, F>(label, interval, tx, channel, poll)` helper that closes over the channel ownership and the per-thread polling closure. Each spawn site keeps its own `thread::Builder::new().spawn(...)` entry point, its per-thread `Channel` ownership, and the `bounded(1)` publish-or-skip discipline locked by the auditor; only the inner polling skeleton is shared.

Constraints preserved:

- Four `thread::spawn` sites: `spawn_collector`, `spawn_listeners`, `spawn_certs`, `spawn_events`. `spawn_events` has a different drainer shape and stays untouched.
- `bounded(1)` channel capacity on the polling threads.
- Publish-or-skip on `TrySendError::Full`; clean thread exit on `TrySendError::Disconnected`.
- Per-thread `Channel` ownership (no `Mutex<Channel>`).

Net LOC delta in `transport.rs`: -116 lines (442 → 326).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
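The polling skeleton described above could be sketched roughly as follows. This is an assumption-laden illustration: it swaps the PR's `crossbeam_channel::bounded(1)` for the standard library's `sync_channel(1)` so the block is dependency-free, it omits the `channel` parameter, and the `String` error type is a placeholder.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Generic polling loop: call `poll` once per `interval`, publish the
/// result if the capacity-1 channel has room, skip it otherwise.
pub fn poll_loop<T, F>(label: &str, interval: Duration, tx: SyncSender<T>, mut poll: F)
where
    F: FnMut() -> Result<T, String>,
{
    loop {
        let started = Instant::now();
        match poll() {
            Ok(value) => match tx.try_send(value) {
                Ok(()) => {}
                // Publish-or-skip: the consumer has not drained the
                // previous value yet, so this one is dropped.
                Err(TrySendError::Full(_)) => {}
                // The receiver is gone: exit the thread cleanly.
                Err(TrySendError::Disconnected(_)) => return,
            },
            Err(err) => eprintln!("{label}: poll failed: {err}"),
        }
        // Sleep for whatever is left of the interval, if anything.
        thread::sleep(interval.saturating_sub(started.elapsed()));
    }
}
```

Each spawn site would create its own `sync_channel(1)` pair and pass a closure that owns its per-thread connection, which mirrors the per-thread ownership constraint listed in the commit message.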
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…spatcher (A2 + A9) Two PR #1256 Tier-A simplifications. A2 — PulseTracker.tick / tick_one used a two-pass `Vec<K>::collect()` + `for k in to_drop { map.remove(&k) }` loop to drop zero-aged entries. Same shape in two places. Replaced both with in-place `HashMap::retain(|_, v| { if *v == 0 { false } else { *v -= 1; true } })` — same semantics, no `String` key clones for the dropped set, no intermediate `Vec` allocation per render frame. A9 — `apply_palette` carried eight near-identical `match` arms mapping a string alias (`"overview"` / `"o"`, `"cluster"` / `"clusters"` / `"c"`, …) to a tab. Adding a new tab required patching `ActiveTab` AND remembering to also patch `apply_palette`. Centralised the alias table in a new `ActiveTab::from_alias(s) -> Option<Self>` resolver and let `apply_palette` fall back to a small fixed-cmd match (`help` / `quit` / empty / other) only when the alias resolver returns `None`. Future tab additions touch `ActiveTab` only. Build/clippy clean; 53/53 TUI unit tests pass. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
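Both simplifications can be sketched in a few lines. The `retain` closure is the one quoted in the commit message; the `tick` signature, the `u8` age type, the reduced three-variant `ActiveTab`, and the `"backend"` / `"backends"` aliases are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// A2: drop zero-aged entries and decrement the rest in one in-place pass,
/// with no intermediate Vec of keys to remove.
pub fn tick(pulses: &mut HashMap<String, u8>) {
    pulses.retain(|_, age| {
        if *age == 0 {
            false
        } else {
            *age -= 1;
            true
        }
    });
}

/// Reduced stand-in for the TUI's tab enum (the real one has more tabs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveTab {
    Overview,
    Clusters,
    Backends,
}

impl ActiveTab {
    /// A9: single alias table; callers fall back to their own handling
    /// of non-tab commands when this returns `None`.
    pub fn from_alias(s: &str) -> Option<Self> {
        match s {
            "overview" | "o" => Some(Self::Overview),
            "cluster" | "clusters" | "c" => Some(Self::Clusters),
            "backend" | "backends" | "b" => Some(Self::Backends),
            _ => None,
        }
    }
}
```

Keeping the alias table on the enum means a new tab and its aliases are added in one place.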
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
… pane (A3) PR #1256 simplify A3. `bin/src/ctl/top/app.rs` carries `count_value` / `gauge_value` helpers that decode a `FilteredMetrics -> Option<{i64,u64}>`; `bin/src/ctl/top/panes/h2.rs` re-implemented the same two functions (different names: `count` / `gauge`) verbatim — same body, divergent identifiers. Promote the app-side helpers to `pub(super)` and import them into the H2 pane under their pane-local names (`use … as count, … as gauge`). Removes ~15 LOC of duplicated body and the `filtered_metrics::Inner` / `FilteredMetrics` imports the H2 pane no longer needs directly. Future renames or signature changes touch one site instead of two. Build/clippy clean; 53/53 TUI unit tests pass. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
…lStatus reply (2H) Closes the PR #1256 follow-up where the `proto_version` field on `WorkerInfo` was wired (commit `b0ad5728`) but no dispatcher consumed it: `MetricDetailStatus.unsupported_workers[]` stayed permanently empty regardless of the master/worker version skew. New dedicated dispatch path `set_metric_detail_request` plus a `SetMetricDetailTask` GatheringTask in `bin/src/command/requests.rs`: - Snapshots master-side `(configured, effective)` before the request fans out so the synthesised reply carries `previous_effective`. - Mirrors `worker_request`'s peer-binding population (peer_pid + session_ulid from `ClientSession`) and length / TTL pre-validation. - Walks `server.workers`, partitioning by `proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL`. Capable workers are scattered to one-by-one via `scatter_on(Some(worker_id))`. Unsupported workers (typically inherited-after-`UpgradeMain` from a pre-tag-55 binary) skip the fan-out entirely and land in the response's `unsupported_workers[]` field. - Emits attempt-time + completion-time audit rows in the same shape the generic WorkerTask flow used, threading the operator-supplied `client_id` (lease_id column) + `reason` (metric_detail_reason column) through `MetricDetailAuditFields::into_extras`. - Synthesises a full `MetricDetailStatus` for the client: `configured`/`effective`/`previous_effective` from the master's Aggregator + `workers: BTreeMap<worker_id, WorkerMetricDetailStatus>` populated for every ACK'd worker + `unsupported_workers` from the pre-filter. Returns via `client.finish_ok_with_content`. Per-worker `(configured, effective, previous_effective, active_lease_count)` quartets currently mirror the master's view because the worker arm in `lib/src/server.rs` replies with `WorkerResponse::ok` (no content payload). 
A follow-up plumbing the actual per-worker view through a new `ResponseContent` variant is documented in the SetMetricDetailTask `on_finish` body; the wire schema is populated now, with master-view stand-ins, so consumers (TUI status bar) see a non-empty `workers` map even before that plumbing lands. `MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL = 1` constant declared local to the dispatcher so future per-verb gating doesn't get coupled to the global SOZU_PROTO_VERSION monotonic bump. Build/clippy clean; 1075/1075 workspace tests pass (12 suites, ~6 min). Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Six small simplifications from PR #1256 review round-2's residual Tier-A / Tier-B list. Pure stylistic; no behaviour change. A6 — Hoist the HTTP-5xx error-status set out of the two inline `[S500, S502, S503, S504, S507]` iterators (cluster_rows + fold_overview) into a module-level `ERRORS_5XX: [&str; 5]` constant. Adding a new 5xx variant now touches one place. The dynamic sort comparator is left as-is — the heterogeneous `BackendSortKey` variant types (u64 / String) do not compose into a single tuple key without adding wrapper enums that would be heavier than the gain. A8 — Replace the five `(Self::default_dark(), Some(...))` fallback tuples in `Skin::resolve` with a local `let default_with = |msg| (Self::default_dark(), Some(msg));` closure. The fail-closed policy from commit `5b098d9b` and the `from_open_file` TOCTOU mitigation stay verbatim; diagnostic strings are byte-identical. A7 — Audited. The current `App::new` is already tight at 25 lines; adding `#[derive(Default)]` would require introducing arbitrary "first variant is default" choices on four enums (ActiveTab, ClusterSortKey, BackendSortKey, GlyphMode), which just relocates the explicit-default question rather than removing it. Net win is negligible; left as-is. L-8 — Clear `palette_input` on the unknown-command path in `App::apply_palette`. The success path already resets the input; the typo path used to leave the operator's previous text in place so the next `:` keypress re-opened the palette pre-populated with the bad command. Now both paths exit with a fresh input. B1 — Convert `.clone()` on `&String → String` sites to `.to_owned()` across the panes layer (listeners.rs, certs.rs). User preference per CLAUDE.md ("prefer ToOwned::to_owned() over Clone::clone() when going & str → String for ownership-intent clarity"). Same allocation behaviour; clearer intent. B5 / B9 — Collapse `format!(...) + &format!(" · {trend} 60 s")` in `panes/overview.rs::subtitle_for_rps` into a single `format!` call. 
Saves one allocation per subtitle render. Events pane: `Option<String>::clone().unwrap_or_default()` → `.as_deref().unwrap_or("").to_owned()` — skips the `String::new` allocation on the `None` branch (the kernel emits backend events without `cluster_id` populated for proxy-process-level transitions). Build/clippy clean; 53/53 TUI unit tests pass. Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Final piece of the PR #1256 follow-through. Previously the capability-aware dispatcher in `bin/src/command/requests.rs` (`SetMetricDetailTask::on_finish`) synthesised `MetricDetailStatus.workers[<worker_id>]` using the master's aggregator view as a stand-in for each worker because workers replied with `WorkerResponse::ok(message.id)` carrying no payload. Each worker holds an independent `Aggregator` with its own lease table, so that stand-in obscured real per-worker drift (different configured floors, different active lease counts after a partial fan-out).

Wire it properly end-to-end:

- New `ResponseContent::WorkerMetricDetailStatus` oneof variant (tag 17, proto additive). Carries the worker's own `(configured, effective, previous_effective, active_lease_count)` quartet, semantically distinct from the aggregated `MetricDetailStatus` at tag 16.
- New `lib/src/server.rs::worker_metric_detail_status_content` helper that builds the response payload from a `(configured, effective, previous_effective, lease_count)` snapshot captured BEFORE the `METRICS.borrow_mut` scope ends (so the per-request snapshot is consistent with the transition that just happened).
- The three ok-paths in the worker's SetMetricDetail arm (clear-Cleared, clear-NotFound, apply-Applied) now reply via `WorkerResponse::ok_with_content` with the freshly-built payload instead of the payload-less `ok`. The `clear-NotFound` path reports `previous_effective == effective` (no transition).
- Master-side `SetMetricDetailTask::on_finish` collects the per-worker payload from `response.content` and only falls back to skipping the worker entry when the response has no payload (e.g. an older worker that never went through `ok_with_content`). Removes the master-view stand-in noted as a follow-up in commit `70cd24af` (`set_metric_detail_request`).
- `command/src/proto/display.rs` adds a silent OK match arm for the new variant: the per-worker payload flows master-side and is never printed directly on the operator's terminal.

Build/clippy clean; 1075/1075 workspace tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS
added a commit
that referenced
this pull request
May 20, 2026
Strip the in-tree comment cross-references to review tags (`PR #1256 review M-7`, `simplify A3`, `C1 fix`), round labels (`round-4 follow-through`), and short-SHA references to other commits on this same branch. The technical rationale stays in place, inlined where the cross-reference used to be, so the committed text reads self-contained for any contributor without access to the local review pipeline.

Affected sites:

- `bin/src/command/requests.rs::handle_request` (SetMetricDetail match arm)
- `bin/src/command/server.rs::handle_worker_response` (METRIC_DETAIL_CHANGED branch)
- `bin/src/ctl/top/cardinality.rs` (DetailGuard.status, short_random_suffix)
- `bin/src/ctl/top/panes/h2.rs` (gauge/count helper import note)
- `bin/src/ctl/top/render.rs` (RenderConfig.lease_status)
- `bin/src/ctl/top/theme.rs::Skin::resolve` (default_with closure doc)
- `command/src/state.rs::dispatch_passes_through_set_metric_detail`
- `lib/src/server.rs::worker_metric_detail_status_content`

No behavioural change. Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Summary
Adds a btop/htop-style live operator TUI behind a new `tui` Cargo feature on `bin/`. Seven panes (OVERVIEW, CLUSTERS, BACKENDS, LISTENERS, CERTS, H2, EVENTS) surface metrics Sōzu already emits over four synchronous transport threads, plus a colon command palette, F-key bar, TOML skin loader, and big-text alert banner. No `tokio` runtime in v1 by design.

Also introduces `SetMetricDetail`, a runtime proto verb that lets clients lease elevated metric cardinality. The lease is TTL-bounded (default 60 s, hard cap 300 s), `client_id`-keyed, bound to the connecting peer's `SO_PEERCRED` PID + session ULID so cross-operator clears are refused, length-capped on the wire (≤ 64-byte `client_id`, ≤ 256-byte `reason`), and self-expires server-side so a crashed `sozu top` cannot permanently elevate cardinality. The TUI auto-leases `Backend` detail on startup, then clears on exit. Every transition, operator-initiated AND worker-local janitor-driven, is audited via `EventKind::METRIC_DETAIL_CHANGED`. Workers that pre-date the verb return the standard `unknown request type` error which surfaces in the normal fan-out error tally; production keeps master + workers in sync via `UpgradeMain` so the mixed-version state is transient.

Review-pass fixes (2026-05-12)
Fourteen follow-up commits address findings from a multi-agent review pass (Claude + Codex + Lisa security threat-hunt + plan-vs-PR diff + guidelines + simplify).
Metrics-wiring root cause the operator reported (OVERVIEW pane stuck at RPS=0 and CLUSTERS pane at backends_up/total=0/0) traced to: `bin/src/command/requests.rs::query_metrics` skipped `merge_metrics()` whenever `server.workers.len() <= 1`. Single-worker fleets had every per-worker metric stranded in `AggregatedMetrics.workers` while the merged `clusters` / `proxying` maps stayed empty. Three companion fixes also landed: per-cluster counter reading under backend-detail filing (`004a1965`), the `cluster.available_backends` rollup gauge name constant (`b9625b42`), and the cluster RPS rate / backends rollup / service-time / cert addr / H2 trend wiring (`f5e90156`).

Ship-blockers (P1) closed:
- SIEM column-smuggling in the `SetMetricDetail` failure-reason aggregation. Worker error templates at `lib/src/server.rs` previously echoed the operator-supplied `client_id` verbatim into `WorkerResponse::error` strings, which the master joined into `extras.reason` and rendered through the weak `sanitize_for_audit` (control bytes only). A `client_id` like `"x,actor_user=mallory"` (legal under the 64-byte cap, unrestricted alphabet) could forge sibling KV pairs in the audit row. The fix drops the raw `client_id` echo from worker error templates and tightens both `sanitize_for_audit` and `sanitize_for_audit_kv` to also reject the full Unicode control category (NEL `U+0085`, CSI `U+009B`, all C1 controls), BOM (`U+FEFF`), and Unicode line/paragraph separators (`U+2028`, `U+2029`), closing C1 control + Unicode line-break injection paths the prior C0-only predicate missed.
- Lease-renewal binding overwrite. The previous `DetailGuard` design opened a separate `Channel` for the renewer thread; the master stamps a fresh `session_ulid` on each `ClientSession`, so renewer writes overwrote the apply-time `PeerBinding` in the worker's lease table. After the first renewal (~30 s into a session), the apply channel's own `clear`-on-Drop failed as `Unauthorized` because the binding no longer matched; same-UID lease takeover became possible from another session with the same `client_id`. The refactor moves to a single-owner-thread topology: one thread owns the `Channel` and serialises Apply / Renew / Clear writes via `crossbeam_channel::Sender<DetailRequest>`. Worker always sees the same `session_ulid` for every write from a given `DetailGuard`; binding overwrite is structurally impossible from the legitimate path.

Other should-fix items (P2) closed:
- `lease_clear` now enforces `LEASE_CLIENT_ID_MAX_BYTES` (defence-in-depth against direct-worker-IPC bypass).
- […] `Arc<AtomicBool>` shutdown flag and a 1 s read timeout: channel leak closed; `run_top` now joins all four transport handles on exit.
- `RawModeGuard::install` inverted so a partial-install failure (e.g. `EnterAlternateScreen` errors after `enable_raw_mode` succeeded) cannot leak raw mode to the operator's shell.
- `ctrlc/termination` feature enabled so `kill -TERM` (and SIGHUP) restores the terminal cleanly; the previous default-feature setup only caught SIGINT despite the docs claiming SIGTERM coverage.
- `handle_key` arms (Tab, BackTab, sort cycle / reverse, palette open, palette typing, help toggle, direct tab jump) now call `app.mark_dirty()` so the dirty-gated render loop redraws on input within one frame instead of waiting up to one snapshot tick (1 s default).
- `O_NOFOLLOW` on the leaf open: closes the TOCTOU window between `canonicalize() + starts_with(anchor)` and `File::open(&resolved)` that a writable-skins-dir scenario could exploit via symlink swap.
- […] `CtlError::SpawnFailed { label, source }` and propagate cleanly; previously `.expect("spawn …")` panicked under `RLIMIT_NPROC` / transient OOM.
- […] `eprintln!` sites in `transport.rs` (write to a wiped alt-screen, operator never sees them) reuse the existing `StatusSlot` pattern so transport errors surface in the status bar.

Simplification (P2.4) accepted by the operator:
- `proto_version` capability handshake is deleted entirely (`WorkerInfo.proto_version`, `MetricDetailStatus.unsupported_workers[]`, `sozu_command_lib::SOZU_PROTO_VERSION`, `WorkerSession.proto_version`, `MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL`, and the capability-partition logic in `set_metric_detail_request`). Production deployments keep master + workers in sync via the existing `UpgradeMain` flow, so the mixed-version-fleet state the field was designed for does not occur in practice. The implementation was also structurally broken: the master stamped every worker's `proto_version` from its own compile-time constant rather than the worker's own. Workers without tag 55 surface as `WorkerResponse::error("unknown request type")` which folds into the standard fan-out error tally (`extras.fanout.workers_err`). The two proto tag slots are `reserved 4;` / `reserved 5;` per project convention.

Polish (P3):
- `record_backend_metrics!` macro emits via `names::backend::*` constants instead of raw literals (adds `names::backend::CONNECTION_TIME` for completeness; on-wire metric names unchanged).
- `cluster_rows` `backends_available` flattens the dead `.max(0)` arm into a single `if` / `else`.
- `PanicHookGuard` restores the prior panic hook on Drop; repeated `render::run` calls (tests, embedded callers) no longer stack hook layers indefinitely.
- `SOZU_TOP_SYNC=0` env opt-out now actually wired at the `BeginSynchronizedUpdate` / `EndSynchronizedUpdate` call sites.
- `bin/README.md` gains a `sozu top` highlight; `doc/configure_cli.md` and `doc/sozu-top.md` scrubbed of stale `--no-color` / `--log-file` references; `bin/tests/sozu_top_e2e.rs` ignored test stripped of the same removed flag (it would have failed when run with `--ignored`); two doc/code-mismatch comments fixed (`transport.rs` "drop oldest" → "drop newest"; `metrics/mod.rs` "TTL clamped" → "TTL rejected with `TtlOutOfRange`").

Behaviour and surface
`sozu top` TUI (opt-in `tui` Cargo feature on `bin/`)

- […] `ip:port` rendering), H2 (active streams + flow-control gauges + CVE flood-mitigation counters with per-row 60-sample Unicode-bar trend), EVENTS (colour-coded SubscribeEvents tail).
- […] `:overview`, `:clusters`, `:b`, `:help`, `:quit`, …) via `tui-input`, plus a numbered F-key bar: F1 Help, F2 Glyphs (cycle Block/Braille/Tty), F3/F4 open the palette, F5/`p` Pause (freezes the visible state without dropping the transport lease), F6 Sort (cycles the active pane's sort column), F10 Quit.
- `--glyphs` resolves into `GlyphMode` → fed into ratatui's `Sparkline::bar_set` AND the inline H2 trend renderer so Block / Braille / Tty alphabets land in both surfaces; resize events redraw via `App::mark_dirty` so the dirty-gate cannot swallow a post-resize repaint; OVERVIEW sparklines render `RightToLeft` so the newest sample anchors on the right edge and history scrolls leftward like `vmstat` / `htop`.
- […] `$XDG_CONFIG_HOME/sozu/skins/<name>.toml` with `SOZU_TOP_SKIN` env override (k9s parity). Loader is fail-closed: anchor canonicalize failure → default skin + status; outside-anchor escape → default + status; parse error → default + status. Leaf `File::open` uses `O_NOFOLLOW` so a symlink swap on the resolved path returns `ELOOP` instead of opening the wrong file.
- […] `take_dirty() || pulse.has_active()` at a 30 fps cap; quiet-system CPU drops from ratatui buffer-diff cost to near zero. Every keypress that mutates visible state marks the app dirty so input redraws within one frame.
- `ctrlc::set_handler` with the `termination` feature catches SIGINT, SIGTERM, and SIGHUP, so `kill -TERM <pid>` no longer leaves the terminal in alt-screen + raw mode. Graceful degrade when a prior handler exists in the same process address space (multi-`run_top` callers). The crossterm event loop also observes Ctrl-C as a keypress, so the dedicated signal handler is belt-and-braces; failure surfaces on the status bar instead of aborting the binary.
- `RawModeGuard::install` constructs the guard immediately after `enable_raw_mode()` succeeds and progressively flips `alt_entered` / `mouse_enabled` flags as the follow-on `execute!` calls succeed, so a partial-install failure cleanly tears down whatever did succeed instead of leaking raw mode.
- `PanicHookGuard` restores the prior panic hook on Drop so repeated `render::run` invocations in the same process don't stack hook layers.
- `SOZU_TOP_SYNC=0` opts the synchronised-output (DEC mode 2026) frame wrap out at the operator's discretion.
- […] `eprintln!` to the wiped alt-screen. The four transport threads (collector, listeners, certs, events) use the same status surface, so no more `eprintln!` invisible to the operator.
- […] `Arc<AtomicBool>` shutdown flag and a 1 s read timeout, so `run_top` can drop receivers and join all four transport handles cleanly (no leaked SubscribeEvents subscriptions; tests can re-enter `run_top` without leaking a `Channel` per call).
- […] `CertificatesByAddress` and `CertificatesWithFingerprints` response variants (master returns the latter today for `QueryCertificatesFromTheState`); the conversion drops PEM + private-key fields immediately so key material never reaches the renderer or scrollback. Unexpected-variant errors print a discriminator name only, never the payload.

`SetMetricDetail` proto verb

- […] `Request`. TTL-bounded lease keyed by `client_id` (the TUI uses `top:<pid>:<8-hex>`). `ttl_seconds` is capped at 300 s server-side; the master rejects out-of-range TTLs BEFORE fan-out so a buggy or malicious request can't N×amplify worker-side rejections + audit lines. The worker also enforces the cap on both apply AND clear branches (defence-in-depth).
- […] `SO_PEERCRED` PID + session ULID (`peer_pid` / `peer_session_ulid` fields, master-populated; clients leave them empty). Subsequent `clear` requests are authorised against the binding; cross-operator clears return `FAILURE`. The `DetailGuard` now uses a single owner thread that holds the one `Channel` for apply, renew, and clear; the worker always observes the same `session_ulid` for every write, so the binding-overwrite class of bug (which broke clear-on-Drop AND enabled same-UID takeover) is structurally impossible from the legitimate path.
- `LEASE_TABLE_CAP = 64` rejects new inserts (renewals of existing entries still succeed); `LEASE_CLIENT_ID_MAX_BYTES = 64` rejects oversized lease keys on both apply AND clear. Master enforces the same caps before fan-out.
- […] `client_id` uses `libc::getrandom(GRND_NONBLOCK)` on Linux (no fs dependency); non-Linux Unix targets fall back to `/dev/urandom`; `u32::from_le_bytes` for cross-arch reproducibility.
- `MetricDetailStatus` reply (new `ResponseContent` variant tag 16) carries the master's `(configured, effective, previous_effective)` + per-worker `WorkerMetricDetailStatus` (tag 17: each worker's own quartet, returned via `ContentType::WorkerMetricDetailStatus` in the worker's response payload). Workers that pre-date the verb return the standard `WorkerResponse::error("unknown request type")` which folds into the fan-out's existing failure tally (`extras.fanout.workers_err`); operators see "succeeded with errors" rather than a dedicated capability-skip list.
- […] `is_mutating_verb`: the verb is a runtime observability knob, not a state transition. The TUI auto-renews its lease every `ttl/2` seconds, so including it in the systemd `RELOADING=1` / `READY=1` bracket set would flap the unit through `reloading` every 30 s for the whole TUI session lifetime. The audit-log trail (`EventKind::METRIC_DETAIL_CHANGED`) still records every apply, clear, and TTL expiry, so SOC visibility is preserved without flapping unit state.

Audit log
- […] `EventKind::METRIC_DETAIL_CHANGED` (tag 30). Operator-initiated transitions are audited at the master dispatch site; worker-local transitions (TTL janitor expiry, worker-arm apply/clear) emit a `MetricDetailTransition` payload through the existing `Event.metric_detail` field (tag 5). The master folds those into the same audit pipeline so SOC tooling sees a complete picture regardless of origin.
- […] `client_id` and `reason` in dedicated columns (`lease_id=` / `metric_detail_reason=`) rather than embedding them in `target`. `sanitize_for_audit_kv` strips control bytes, `,`, `=`, BOM (`U+FEFF`), and Unicode line/paragraph separators (`U+2028` / `U+2029`) from column-boundary fields so attacker-controlled values cannot smuggle a forged adjacent KV pair into a SIEM ingesting on `,` / `=`. `sanitize_for_audit` (the weak variant used for non-column-boundary fields) also now rejects the full Unicode control category. `target` itself stays master-controlled (just `metric_detail:<level>`).
- […] `client_id`; the dedicated `lease_id=` audit column carries the operator string through the strict sanitiser. A `client_id` containing `,`, `=`, C1 controls, BOM, or Unicode line breaks cannot smuggle a forged KV pair through `extras.reason`.
- `client_id` is capped at 64 chars and `reason` at 256 chars at render time, matching the wire caps.

Dependency hygiene
Two `unmaintained` RUSTSEC advisories that the `cargo audit` gate would have flagged are now closed by removing the offending crates outright rather than ignoring them:

- `paste` (RUSTSEC-2024-0436): one call site in `e2e/src/tests/protocol_pair_matrix.rs` using `paste::paste!{ [<…>] }` token concatenation. Replaced with a dep-free `macro_rules!` shape that nests each matrix into a `pub mod $name { fn try_h1_h1() … #[test] fn test_h1_h1() … }` child module.
- `rustls-pemfile` (RUSTSEC-2025-0134): three call sites (production cert resolver, one unit test, one bench). Each `rustls_pemfile::read_one` + 3-variant `Item::{Pkcs1Key,Pkcs8Key,Sec1Key}` match collapses to one `PrivateKeyDer::from_pem_slice(bytes)?` via `rustls::pki_types::pem::PemObject`.

The `.github/workflows/ci.yml` `cargo audit` job runs plain `cargo audit --deny warnings` with no per-ID ignores, so any future advisory will block CI immediately. Local audit: 0 advisories across 448 scanned dependencies.

## Wire compat
All proto additions are field-additive: no existing field is renamed or retyped. Two field tags have been freed by this PR and are now `reserved`:

- `WorkerInfo.proto_version` (tag 4) → `reserved 4; reserved "proto_version";`
- `MetricDetailStatus.unsupported_workers` (tag 5) → `reserved 5; reserved "unsupported_workers";`

Both fields were added earlier in this PR and removed in the review-pass clean-up after the operator confirmed the `UpgradeMain` flow keeps master + workers in sync (the mixed-version state the fields were designed for does not occur in practice).

New tags introduced and KEPT by this PR:

- `Request.set_metric_detail = 55` + `SetMetricDetail.peer_pid` (6) + `SetMetricDetail.peer_session_ulid` (7)
- `EventKind::METRIC_DETAIL_CHANGED = 30`
- `Event.metric_detail = 5` carrying `MetricDetailTransition`
- `ResponseContent.metric_detail_status = 16` + `ResponseContent.worker_metric_detail_status = 17`

Metric names produced on the wire are unchanged byte-for-byte by the `lib::metrics::names::<family>::<NAME>` refactor: StatsD/Prometheus/dashboard consumers see the exact same dotted strings. The `record_backend_metrics!` macro now also references the typed constants on the emit side, matching the read side.

## Lean-baseline preservation
- Default features (`jemallocator`, `crypto-ring`) unchanged. `tui` is opt-in; not in `default = [...]`.
- TUI-only dependencies are `optional = true` and aggregated under the `tui` feature: `color-eyre`, `crossbeam-channel`, `crossterm`, `ctrlc` (with `termination` feature), `ratatui`, `throbber-widgets-tui`, `toml`, `tui-big-text`, `tui-input`, `tui-popup`, `tui-scrollview`, `tui-tree-widget`.
- `insta` lives in `[dev-dependencies]` only.
- `sozu --version` reports `+tui` / `-tui` so build matrices can distinguish the variants.
- `cargo tree -p sozu --features tui | grep -c '^[├└]── tokio'` → `0`.

## Validation run locally
CI matrix exercises stable/beta/nightly + the four crypto cells (`crypto-ring`, `crypto-aws-lc-rs`, `crypto-openssl`, `fips`), `msrv-bare`, `msrv-full`, `Fuzz (nightly)`, `cargo audit`, and the build / Docker / benchmark cells.

## Test plan
- `cargo test -p sozu --features tui --tests -- --ignored sozu_top` runs the subprocess e2e (spawns a real master, runs `sozu top --snapshot 1`).
- Run `sozu top`, verify Backend rows render and CLUSTERS rps / backends_up/total populate within one refresh tick. Inspect the audit log: `lease_id=` and `metric_detail_reason=` show up as their own columns; the `target=` field is just `metric_detail:<level>`; no `client_id=...` substring appears in `reason=`.
- Quit `sozu top`. Terminal is restored cleanly; the worker emits a `lease_clear` event.
- `kill -TERM` a running `sozu top`. Same clean restore (now covered by `ctrlc`/`termination`).
- Terminate `sozu top` without a clean shutdown. The polled janitor expires the lease within `ttl_seconds + LEASE_TICK_INTERVAL`; the master emits a `lease_tick_expired` audit row.
- Keep `sozu top` running past one full lease-renewal cycle (~30 s default), then quit cleanly. The worker emits a `lease_clear` audit row (not `Unauthorized`), proving the single-owner channel preserves binding identity across renewals.
- From a second operator session, send `SetMetricDetail{client_id: <first-operator>, clear: true}`. The worker refuses with `peer binding does not match`.
- Send an oversized `reason` (`reason = "A".repeat(1_000_000)`). The master rejects it before fan-out.
- Send a `client_id` containing `,` and `=` (e.g. `"x,actor_user=mallory"`). On any failure path (force the lease table to capacity, send a duplicate-binding clear, etc.) verify the rendered audit-line `reason=` column does NOT contain `actor_user=mallory` as a separate KV pair.
- Inspect the `MetricDetailStatus.workers[<id>]` reply: each worker reports its own quartet, not the master's view. No `unsupported_workers[]` field anymore (rolled into the standard fan-out error tally).
- Symlink `~/.config/sozu/skins/test.toml` to `/etc/shadow`. `sozu top --skin test` refuses to load it (`O_NOFOLLOW` → `ELOOP`) and falls back to the default skin with a status-bar note.
- `0.0.0.0:0` until a follow-up plumbs the listener address. Private-key PEM never appears in the rendered output or in any error message.

## Files of interest
In priority order:
1. `command/src/command.proto`: `SetMetricDetail`, `Event.metric_detail` / `MetricDetailTransition`, `WorkerMetricDetailStatus` (response payload tag 17). `WorkerInfo.proto_version` and `MetricDetailStatus.unsupported_workers` are `reserved`.
2. `lib/src/metrics/mod.rs`: `LeaseEntry`, `PeerBinding`, `LeaseApplyOutcome` / `LeaseClearOutcome`, `LEASE_TABLE_CAP`, `LEASE_CLIENT_ID_MAX_BYTES`. The `record_backend_metrics!` macro now references `names::backend::*` constants.
3. `lib/src/server.rs`: `SetMetricDetail` worker arm with `LEASE_CLIENT_ID_MAX_BYTES` enforcement on both apply AND clear; `worker_metric_detail_status_content`; `push_metric_detail_transition`. Error templates no longer echo `client_id`.
4. `bin/src/command/requests.rs`: `set_metric_detail_request` (now unconditional scatter, no capability partition); `SetMetricDetailTask`; `MetricDetailAuditFields`; `audit_worker_metric_detail_transition`; `query_metrics` (now unconditionally calls `merge_metrics()`).
5. `bin/src/command/server.rs`: `METRIC_DETAIL_CHANGED` event handling in `handle_worker_response`.
6. `bin/src/command/sessions.rs`: `sanitize_for_audit` / `sanitize_for_audit_kv` (now Unicode-control + BOM + line-separator aware).
7. `bin/src/ctl/top/cardinality.rs`: single-owner channel topology (`DetailRequest` enum + owner thread + renewer-as-tick-emitter); `libc::getrandom` Linux path; `StatusSlot` renewer + transport status surface.
8. `bin/src/ctl/top/transport.rs`: `poll_loop<T, F>` with `StatusSlot` parameter; events thread observes `Arc<AtomicBool>` shutdown + 1 s read timeout; `CtlError::SpawnFailed` propagation; no more `eprintln!` to wiped alt-screen.
9. `bin/src/ctl/top/theme.rs`: fail-closed anchor; `O_NOFOLLOW` on leaf open.
10. `bin/src/ctl/top/render.rs`: `RawModeGuard` (partial-success aware); `PanicHookGuard` (Drop restores prior hook); `SOZU_TOP_SYNC=0` opt-out; `app.mark_dirty()` on all input mutations; graceful `ctrlc` degrade.
11. `bin/src/ctl/top/panes/{mod,h2,clusters,backends}.rs`: shared `sort_header`, `names::*` migration, consolidated helpers.
12. `bin/src/ctl/top/app.rs`: `ActiveTab::from_alias`, `PulseTracker::retain`, `ERRORS_5XX` constant; `cluster_rows` `backends_available` flattened.
13. `command/src/state.rs`: dispatch-whitelist regression guard.
14. `lib/src/tls.rs` + `lib/src/crypto.rs` + `lib/benches/crypto_provider.rs`: `rustls-pemfile` → `rustls::pki_types::pem::PemObject` migration.
15. `e2e/src/tests/protocol_pair_matrix.rs`: `paste` → dep-free per-matrix module macro.
16. `.github/workflows/ci.yml`: `cargo audit` job without per-ID ignores.
17. `doc/sozu-top.md` + `CHANGELOG.md` + `doc/configure_cli.md`: transport accuracy, audit scope, privacy, scrubbed stale flag references.
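For reviewers starting with `lib/src/metrics/mod.rs`, the lease-table rules can be pictured with the minimal sketch below. It is not the PR's code. `LEASE_TABLE_CAP`, `LEASE_CLIENT_ID_MAX_BYTES`, and the `TtlOutOfRange` / `TableFull` / `ClientIdTooLong` outcomes are taken from this PR's text; the `LeaseTable` type, the `Applied` / `Renewed` variants, the order of the checks, and the 300 s value for `LEASE_TTL_MAX` are assumptions made purely for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const LEASE_TABLE_CAP: usize = 64; // value quoted in the PR
const LEASE_CLIENT_ID_MAX_BYTES: usize = 64; // value quoted in the PR
const LEASE_TTL_MAX: Duration = Duration::from_secs(300); // ASSUMED value

#[derive(Debug, PartialEq)]
pub enum LeaseApplyOutcome {
    Applied, // hypothetical variant: fresh lease inserted
    Renewed, // hypothetical variant: existing lease extended
    TtlOutOfRange,
    TableFull,
    ClientIdTooLong,
}

/// Hypothetical lease table keyed by `client_id`, storing each expiry time.
pub struct LeaseTable {
    entries: HashMap<String, Instant>,
}

impl LeaseTable {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    pub fn apply(&mut self, client_id: &str, ttl: Duration, now: Instant) -> LeaseApplyOutcome {
        if client_id.len() > LEASE_CLIENT_ID_MAX_BYTES {
            return LeaseApplyOutcome::ClientIdTooLong;
        }
        // Loud reject instead of silently clamping the TTL.
        if ttl > LEASE_TTL_MAX {
            return LeaseApplyOutcome::TtlOutOfRange;
        }
        let expiry = now + ttl;
        // Renewals are checked before the cap, so a full table never
        // evicts an operator who already holds a lease.
        if let Some(slot) = self.entries.get_mut(client_id) {
            *slot = expiry;
            return LeaseApplyOutcome::Renewed;
        }
        if self.entries.len() >= LEASE_TABLE_CAP {
            return LeaseApplyOutcome::TableFull;
        }
        self.entries.insert(client_id.to_owned(), expiry);
        LeaseApplyOutcome::Applied
    }

    /// Janitor pass: drops expired leases and returns how many were removed.
    pub fn expire(&mut self, now: Instant) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, expiry| *expiry > now);
        before - self.entries.len()
    }
}
```

Checking for an existing entry before the capacity test is what lets renewals succeed on a full table, while a client that rolls fresh `client_id` values hits `TableFull` once 64 leases are live.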