
feat: sozu top btop-style operator TUI + cardinality-lease verb #1256

Merged
FlorentinDUBOIS merged 95 commits into main from feat/sozu-top
May 20, 2026

FlorentinDUBOIS commented May 11, 2026

Summary

Adds a btop/htop-style live operator TUI behind a new tui Cargo feature on bin/. Seven panes (OVERVIEW, CLUSTERS, BACKENDS, LISTENERS, CERTS, H2, EVENTS) surface metrics Sōzu already emits over four synchronous transport threads, plus a colon command palette, F-key bar, TOML skin loader, and big-text alert banner. No tokio runtime in v1 by design.

Also introduces SetMetricDetail, a runtime proto verb that lets clients lease elevated metric cardinality. The lease is TTL-bounded (default 60 s, hard cap 300 s), client_id-keyed, bound to the connecting peer's SO_PEERCRED PID + session ULID so cross-operator clears are refused, length-capped on the wire (≤ 64-byte client_id, ≤ 256-byte reason), and self-expires server-side so a crashed sozu top cannot permanently elevate cardinality. The TUI auto-leases Backend detail on startup, then clears on exit. Every transition — operator-initiated AND worker-local janitor-driven — is audited via EventKind::METRIC_DETAIL_CHANGED. Workers that pre-date the verb return the standard unknown request type error which surfaces in the normal fan-out error tally; production keeps master + workers in sync via UpgradeMain so the mixed-version state is transient.

Review-pass fixes (2026-05-12)

Fourteen follow-up commits address findings from a multi-agent review pass (Claude + Codex + Lisa security threat-hunt + plan-vs-PR diff + guidelines + simplify).

The metrics-wiring symptoms the operator reported (OVERVIEW pane stuck at RPS=0, CLUSTERS pane stuck at backends_up/total=0/0) traced to one root cause:

  • bin/src/command/requests.rs::query_metrics skipped merge_metrics() whenever server.workers.len() <= 1. Single-worker fleets had every per-worker metric stranded in AggregatedMetrics.workers while the merged clusters / proxying maps stayed empty. Three companion fixes also landed: per-cluster counter reading under backend-detail filing (004a1965), the cluster.available_backends rollup gauge name constant (b9625b42), and the cluster RPS rate / backends rollup / service-time / cert addr / H2 trend wiring (f5e90156).
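
The single-worker merge gap described above can be illustrated with a minimal sketch. The type and field names here (AggregatedSketch, workers, merged) are hypothetical stand-ins, not the real AggregatedMetrics layout; the point is only that the merge step runs for any worker count, including one.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the aggregated metrics reply.
#[derive(Default, Debug)]
pub struct AggregatedSketch {
    /// Raw per-worker counters, keyed by worker id, then metric name.
    pub workers: BTreeMap<String, BTreeMap<String, u64>>,
    /// Fleet-wide view that the panes would read from.
    pub merged: BTreeMap<String, u64>,
}

impl AggregatedSketch {
    /// Sum every worker's counters into `merged`. There is no
    /// `workers.len() > 1` guard, so a single-worker fleet still
    /// populates the merged view.
    pub fn merge_metrics(&mut self) {
        self.merged.clear();
        for counters in self.workers.values() {
            for (name, value) in counters {
                *self.merged.entry(name.clone()).or_insert(0) += *value;
            }
        }
    }
}
```

With one worker reporting `requests = 7`, `merged` ends up holding `requests = 7` rather than staying empty.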

Ship-blockers (P1) closed:

  • SIEM column-smuggling in the SetMetricDetail failure-reason aggregation. Worker error templates at lib/src/server.rs previously echoed the operator-supplied client_id verbatim into WorkerResponse::error strings, which the master joined into extras.reason and rendered through the weak sanitize_for_audit (control bytes only). A client_id like "x,actor_user=mallory" (legal under the 64-byte cap, unrestricted alphabet) could forge sibling KV pairs in the audit row. The fix drops the raw client_id echo from worker error templates and tightens both sanitize_for_audit and sanitize_for_audit_kv to also reject the full Unicode control category (NEL U+0085, CSI U+009B, all C1 controls), BOM (U+FEFF), and Unicode line/paragraph separators (U+2028, U+2029) — closing C1 control + Unicode line-break injection paths the prior C0-only predicate missed.

  • Lease-renewal binding overwrite. The previous DetailGuard design opened a separate Channel for the renewer thread; the master stamps a fresh session_ulid on each ClientSession, so renewer writes overwrote the apply-time PeerBinding in the worker's lease table. After the first renewal (~30 s into a session), the apply channel's own clear-on-Drop failed as Unauthorized because the binding no longer matched; same-UID lease takeover became possible from another session with the same client_id. The refactor moves to a single-owner-thread topology: one thread owns the Channel and serialises Apply / Renew / Clear writes via crossbeam_channel::Sender<DetailRequest>. Worker always sees the same session_ulid for every write from a given DetailGuard; binding overwrite is structurally impossible from the legitimate path.
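
The single-owner-thread topology described above can be sketched as follows. This is an illustration under stated assumptions: std::sync::mpsc stands in for crossbeam_channel, FakeChannel is a hypothetical placeholder for the real command Channel, and a plain u64 stands in for the session ULID.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical request kinds funnelled to the owner thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetailRequestSketch {
    Apply,
    Renew,
    Clear,
}

/// Placeholder for the one command channel. It carries the session id
/// the master would stamp on the connection.
pub struct FakeChannel {
    pub session_id: u64,
    pub writes: Vec<(u64, DetailRequestSketch)>,
}

impl FakeChannel {
    fn write(&mut self, req: DetailRequestSketch) {
        self.writes.push((self.session_id, req));
    }
}

/// Spawn the single owner thread. Only this thread touches the channel,
/// so every write is serialised and carries the same session id.
pub fn spawn_owner(session_id: u64) -> (Sender<DetailRequestSketch>, JoinHandle<FakeChannel>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || {
        let mut chan = FakeChannel { session_id, writes: Vec::new() };
        // The loop ends after a Clear, or once every Sender is dropped.
        for req in rx {
            chan.write(req);
            if req == DetailRequestSketch::Clear {
                break;
            }
        }
        chan
    });
    (tx, handle)
}
```

A renewer would hold a clone of the Sender and emit Renew ticks; it never opens a connection of its own, so there is no second session id that could overwrite the binding.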

Other should-fix items (P2) closed:

  • Worker lease_clear now enforces LEASE_CLIENT_ID_MAX_BYTES (defence-in-depth against direct-worker-IPC bypass).
  • Events transport thread observes an Arc<AtomicBool> shutdown flag and a 1 s read timeout — channel leak closed; run_top now joins all four transport handles on exit.
  • RawModeGuard::install inverted so a partial-install failure (e.g. EnterAlternateScreen errors after enable_raw_mode succeeded) cannot leak raw mode to the operator's shell.
  • ctrlc/termination feature enabled so kill -TERM (and SIGHUP) restores the terminal cleanly; the previous default-feature setup only caught SIGINT despite the docs claiming SIGTERM coverage.
  • Seven handle_key arms (Tab, BackTab, sort cycle / reverse, palette open, palette typing, help toggle, direct tab jump) now call app.mark_dirty() so the dirty-gated render loop redraws on input within one frame instead of waiting up to one snapshot tick (1 s default).
  • Skin loader uses O_NOFOLLOW on the leaf open — closes the TOCTOU window between canonicalize() + starts_with(anchor) and File::open(&resolved) that a writable-skins-dir scenario could exploit via symlink swap.
  • Background-thread spawn failures map to a new CtlError::SpawnFailed { label, source } and propagate cleanly; previously .expect("spawn …") panicked under RLIMIT_NPROC / transient OOM.
  • Three eprintln! sites in transport.rs (write to a wiped alt-screen, operator never sees them) reuse the existing StatusSlot pattern so transport errors surface in the status bar.
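
The StatusSlot pattern mentioned in the last item can be sketched with the standard library alone. The alias shape (Arc<Mutex<Option<String>>>) and the helper names come from the commit message for review item M-7; the exact signatures below are assumptions.

```rust
use std::sync::{Arc, Mutex};

/// Shared single-message slot between background threads and the render loop.
pub type StatusSlot = Arc<Mutex<Option<String>>>;

pub fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Background threads publish their latest diagnostic here instead of
/// printing to a terminal the operator cannot see.
pub fn publish_status(slot: &StatusSlot, msg: impl Into<String>) {
    // Recover the guard from a poisoned lock rather than losing the message.
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(msg.into());
}

/// The render loop drains the slot once per tick.
pub fn take_status(slot: &StatusSlot) -> Option<String> {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.take()
}
```

Because the slot holds a single Option, a newer message replaces an older undrained one; a queue would be needed if every message had to be shown.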

Simplification (P2.4) accepted by the operator:

  • The proto_version capability handshake is deleted entirely (WorkerInfo.proto_version, MetricDetailStatus.unsupported_workers[], sozu_command_lib::SOZU_PROTO_VERSION, WorkerSession.proto_version, MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL, and the capability-partition logic in set_metric_detail_request). Production deployments keep master + workers in sync via the existing UpgradeMain flow, so the mixed-version-fleet state the field was designed for does not occur in practice. The implementation was also structurally broken — the master stamped every worker's proto_version from its own compile-time constant rather than the worker's own. Workers without tag 55 surface as WorkerResponse::error("unknown request type") which folds into the standard fan-out error tally (extras.fanout.workers_err). The two proto tag slots are reserved 4; / reserved 5; per project convention.

Polish (P3):

  • record_backend_metrics! macro emits via names::backend::* constants instead of raw literals (adds names::backend::CONNECTION_TIME for completeness; on-wire metric names unchanged).
  • cluster_rows backends_available flattens the dead .max(0) arm into a single if/else.
  • PanicHookGuard restores the prior panic hook on Drop; repeated render::run calls (tests, embedded callers) no longer stack hook layers indefinitely.
  • The SOZU_TOP_SYNC=0 env opt-out is now actually wired at the BeginSynchronizedUpdate / EndSynchronizedUpdate call sites.
  • bin/README.md gains a sozu top highlight; doc/configure_cli.md and doc/sozu-top.md scrubbed of stale --no-color / --log-file references; bin/tests/sozu_top_e2e.rs ignored test stripped of the same removed flag (it would have failed when run with --ignored); two doc/code-mismatch comments fixed (transport.rs "drop oldest" → "drop newest"; metrics/mod.rs "TTL clamped" → "TTL rejected with TtlOutOfRange").
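
The PanicHookGuard item above (restore the prior hook on Drop) can be sketched with std::panic::take_hook and std::panic::set_hook. The type name PanicHookGuardSketch and the `on_panic` callback are hypothetical; the real guard presumably restores the terminal inside its hook.

```rust
use std::panic;
use std::thread;

/// Sketch of a guard that swaps in a panic hook and puts the previous
/// one back when it is dropped.
pub struct PanicHookGuardSketch {
    // Boxing the restore step avoids naming the hook's concrete type.
    restore: Option<Box<dyn FnOnce() + Send>>,
}

impl PanicHookGuardSketch {
    pub fn install<F>(on_panic: F) -> Self
    where
        F: Fn() + Send + Sync + 'static,
    {
        // Take the current hook out so it can be reinstated later.
        let previous = panic::take_hook();
        panic::set_hook(Box::new(move |_info| on_panic()));
        Self {
            restore: Some(Box::new(move || panic::set_hook(previous))),
        }
    }
}

impl Drop for PanicHookGuardSketch {
    fn drop(&mut self) {
        // set_hook itself panics on a panicking thread, so skip it then.
        if thread::panicking() {
            return;
        }
        if let Some(restore) = self.restore.take() {
            restore();
        }
    }
}
```

Each install/drop pair leaves the process with exactly the hook it started with, so repeated runs do not accumulate layers.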

Behaviour and surface

sozu top TUI (opt-in tui Cargo feature on bin/)

  • Seven panes: OVERVIEW (sparklines + active sessions + RPS + backend-latency p99 + sozu service-time p99 + saturation), CLUSTERS (sortable table — default sort 5xx-rate desc; rps column auto-scales K/M/G), BACKENDS (flat per-backend rows; bw down/up rendered as Mbps with Kbps/Gbps auto-scale), LISTENERS (HTTP/HTTPS/TCP listener config), CERTS (per-address certificate summaries with ip:port rendering), H2 (active streams + flow-control gauges + CVE flood-mitigation counters with per-row 60-sample Unicode-bar trend), EVENTS (colour-coded SubscribeEvents tail).
  • Colon command palette (:overview, :clusters, :b, :help, :quit, …) via tui-input, plus a numbered F-key bar: F1 Help, F2 Glyphs (cycle Block/Braille/Tty), F3/F4 open the palette, F5/p Pause (freezes the visible state without dropping the transport lease), F6 Sort (cycles the active pane's sort column), F10 Quit.
  • --glyphs resolves into GlyphMode → fed into ratatui's Sparkline::bar_set AND the inline H2 trend renderer so Block / Braille / Tty alphabets land in both surfaces; resize events redraw via App::mark_dirty so the dirty-gate cannot swallow a post-resize repaint; OVERVIEW sparklines render RightToLeft so the newest sample anchors on the right edge and history scrolls leftward like vmstat / htop.
  • TOML skin loader at $XDG_CONFIG_HOME/sozu/skins/<name>.toml with SOZU_TOP_SKIN env override (k9s parity). Loader is fail-closed: anchor canonicalize failure → default skin + status; outside-anchor escape → default + status; parse error → default + status. Leaf File::open uses O_NOFOLLOW so a symlink swap on the resolved path returns ELOOP instead of opening the wrong file.
  • Render loop dirty-gates on take_dirty() || pulse.has_active() at a 30 fps cap; quiet-system CPU drops from ratatui buffer-diff cost to near zero. Every keypress that mutates visible state marks the app dirty so input redraws within one frame.
  • Big-text alert banner at the top when the threshold table flags HIGH LATENCY / ERROR SURGE / SATURATION; pulse tint on transitions (cluster disappeared, backend down, new cluster).
  • ctrlc::set_handler with the termination feature catches SIGINT, SIGTERM, and SIGHUP — kill -TERM <pid> no longer leaves the terminal in alt-screen + raw mode. Graceful degrade when a prior handler exists in the same process address space (multi-run_top callers). The crossterm event loop also observes Ctrl-C as a keypress, so the dedicated signal handler is belt-and-braces; failure surfaces on the status bar instead of aborting the binary.
  • RawModeGuard::install constructs the guard immediately after enable_raw_mode() succeeds and progressively flips alt_entered / mouse_enabled flags as the follow-on execute! calls succeed, so a partial-install failure cleanly tears down whatever did succeed instead of leaking raw mode.
  • PanicHookGuard restores the prior panic hook on Drop so repeated render::run invocations in the same process don't stack hook layers.
  • SOZU_TOP_SYNC=0 lets the operator opt out of the synchronised-output (DEC mode 2026) frame wrap.
  • Renewer thread errors land on a shared status slot the render loop drains once per tick, instead of eprintln! to the wiped alt-screen. The four transport threads (collector, listeners, certs, events) use the same status surface — no more eprintln! invisible to the operator.
  • Events transport thread observes an Arc<AtomicBool> shutdown flag and a 1 s read timeout, so run_top can drop receivers and join all four transport handles cleanly (no leaked SubscribeEvents subscriptions; tests can re-enter run_top without leaking a Channel per call).
  • Certs poll accepts both CertificatesByAddress and CertificatesWithFingerprints response variants (master returns the latter today for QueryCertificatesFromTheState); the conversion drops PEM + private-key fields immediately so key material never reaches the renderer or scrollback. Unexpected-variant errors print a discriminator name only — never the payload.
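
The shutdown-flag plus read-timeout pattern described for the events transport thread can be sketched as below. An mpsc Receiver stands in for the event socket, and the function name spawn_events_reader is hypothetical; only the shape of the loop is the point.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawn a reader that wakes at least once per `read_timeout` to check
/// the shutdown flag, so the caller can always join it.
pub fn spawn_events_reader(
    source: Receiver<String>,
    shutdown: Arc<AtomicBool>,
    read_timeout: Duration,
) -> JoinHandle<Vec<String>> {
    thread::spawn(move || {
        let mut seen = Vec::new();
        while !shutdown.load(Ordering::Relaxed) {
            match source.recv_timeout(read_timeout) {
                Ok(event) => seen.push(event),
                // Nothing arrived in time: loop and re-check the flag.
                Err(RecvTimeoutError::Timeout) => continue,
                // The peer went away: exit instead of spinning.
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        seen
    })
}
```

Without the timeout, a blocking read on a quiet stream would never observe the flag, which is the leak the item above closes.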

SetMetricDetail proto verb

  • Tag 55 on Request. TTL-bounded lease keyed by client_id (the TUI uses top:<pid>:<8-hex>). ttl_seconds is capped at 300 s server-side; the master rejects out-of-range TTLs BEFORE fan-out so a buggy or malicious request can't N×amplify worker-side rejections + audit lines. The worker also enforces the cap on both apply AND clear branches (defence-in-depth).
  • Lease ownership is bound at apply-time to the connecting peer's SO_PEERCRED PID + session ULID (peer_pid / peer_session_ulid fields, master-populated; clients leave them empty). Subsequent clear requests are authorised against the binding — cross-operator clears return FAILURE. The DetailGuard now uses a single owner thread that holds the one Channel for apply, renew, and clear; the worker always observes the same session_ulid for every write, so the binding-overwrite class of bug (which broke clear-on-Drop AND enabled same-UID takeover) is structurally impossible from the legitimate path.
  • Server-side caps: LEASE_TABLE_CAP = 64 rejects new inserts (renewals of existing entries still succeed); LEASE_CLIENT_ID_MAX_BYTES = 64 rejects oversized lease keys on both apply AND clear. Master enforces the same caps before fan-out.
  • Random suffix in client_id uses libc::getrandom(GRND_NONBLOCK) on Linux (no fs dependency); non-Linux Unix targets fall back to /dev/urandom; u32::from_le_bytes for cross-arch reproducibility.
  • MetricDetailStatus reply (new ResponseContent variant tag 16) carries the master's (configured, effective, previous_effective) + per-worker WorkerMetricDetailStatus (tag 17 — each worker's own quartet, returned via ContentType::WorkerMetricDetailStatus in the worker's response payload). Workers that pre-date the verb return the standard WorkerResponse::error("unknown request type") which folds into the fan-out's existing failure tally (extras.fanout.workers_err); operators see "succeeded with errors" rather than a dedicated capability-skip list.
  • Deliberately NOT in is_mutating_verb: the verb is a runtime observability knob, not a state transition. The TUI auto-renews its lease every ttl/2 seconds, so including it in the systemd RELOADING=1 / READY=1 bracket set would flap the unit through reloading every 30 s for the whole TUI session lifetime. The audit-log trail (EventKind::METRIC_DETAIL_CHANGED) still records every apply, clear, and TTL expiry, so SOC visibility is preserved without flapping unit state.
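
The lease rules listed above (TTL cap, table cap, client_id cap, peer binding on clear, janitor expiry) can be sketched as a small in-memory table. The constants use the values quoted in this description; the type names, the merged outcome enum, and the refusal of an apply from a non-matching peer are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub const LEASE_TTL_MAX: Duration = Duration::from_secs(300);
pub const LEASE_TABLE_CAP: usize = 64;
pub const LEASE_CLIENT_ID_MAX_BYTES: usize = 64;

/// Identity captured at apply time (peer PID plus session id).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PeerBindingSketch {
    pub pid: i32,
    pub session: String,
}

/// One merged outcome enum, for brevity.
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseOutcomeSketch {
    Applied,
    Cleared,
    TtlOutOfRange,
    ClientIdTooLong,
    TableFull,
    Unauthorized,
    NotFound,
}

#[derive(Default)]
pub struct LeaseTableSketch {
    entries: HashMap<String, (PeerBindingSketch, Instant)>,
}

impl LeaseTableSketch {
    pub fn apply(&mut self, id: &str, peer: PeerBindingSketch, ttl: Duration, now: Instant) -> LeaseOutcomeSketch {
        if id.len() > LEASE_CLIENT_ID_MAX_BYTES {
            return LeaseOutcomeSketch::ClientIdTooLong;
        }
        // Reject loudly instead of clamping to the maximum.
        if ttl > LEASE_TTL_MAX {
            return LeaseOutcomeSketch::TtlOutOfRange;
        }
        let is_renewal = match self.entries.get(id) {
            // Assumption: a different peer may not take over an existing lease.
            Some((owner, _)) if *owner != peer => return LeaseOutcomeSketch::Unauthorized,
            Some(_) => true,
            None => false,
        };
        // Only fresh inserts are subject to the cap; renewals still succeed.
        if !is_renewal && self.entries.len() >= LEASE_TABLE_CAP {
            return LeaseOutcomeSketch::TableFull;
        }
        self.entries.insert(id.to_owned(), (peer, now + ttl));
        LeaseOutcomeSketch::Applied
    }

    pub fn clear(&mut self, id: &str, peer: &PeerBindingSketch) -> LeaseOutcomeSketch {
        if id.len() > LEASE_CLIENT_ID_MAX_BYTES {
            return LeaseOutcomeSketch::ClientIdTooLong;
        }
        let owner_matches = match self.entries.get(id) {
            None => return LeaseOutcomeSketch::NotFound,
            Some((owner, _)) => owner == peer,
        };
        if !owner_matches {
            return LeaseOutcomeSketch::Unauthorized;
        }
        self.entries.remove(id);
        LeaseOutcomeSketch::Cleared
    }

    /// Janitor pass: drop expired leases and report how many were removed.
    pub fn expire(&mut self, now: Instant) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, (_, deadline)| *deadline > now);
        before - self.entries.len()
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic; a real janitor would read the clock on each tick.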

Audit log

  • Every cardinality transition emits EventKind::METRIC_DETAIL_CHANGED (tag 30). Operator-initiated transitions are audited at the master dispatch site; worker-local transitions (TTL janitor expiry, worker-arm apply/clear) emit a MetricDetailTransition payload through the existing Event.metric_detail field (tag 5) — the master folds those into the same audit pipeline so SOC tooling sees a complete picture regardless of origin.
  • Audit text + JSON sinks render the operator-supplied client_id and reason in dedicated columns (lease_id= / metric_detail_reason=) rather than embedding them in target. sanitize_for_audit_kv strips control bytes, ,, =, BOM (U+FEFF), and Unicode line/paragraph separators (U+2028/U+2029) from column-boundary fields so attacker-controlled values cannot smuggle a forged adjacent KV pair into a SIEM ingesting on , / =. sanitize_for_audit (the weak variant used for non-column-boundary fields) also now rejects the full Unicode control category. target itself stays master-controlled (just metric_detail:<level>).
  • Worker error messages no longer echo the operator-supplied client_id; the dedicated lease_id= audit column carries the operator string through the strict sanitiser. A client_id containing ,, =, C1 controls, BOM, or Unicode line breaks cannot smuggle a forged KV pair through extras.reason.
  • client_id is capped at 64 chars and reason at 256 chars at render time, matching the wire caps.
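
The strict column-boundary sanitiser described above can be sketched as a character filter. This is an assumption-laden illustration: the function name carries a `_sketch` suffix, and the order of operations (strip first, then cap) is a guess. `char::is_control` matches the Unicode Cc category, which covers C0, DEL, and the C1 range including U+0085 and U+009B.

```rust
/// True for characters a column-boundary audit field must not carry.
fn is_forbidden_in_kv(c: char) -> bool {
    c.is_control() || matches!(c, ',' | '=' | '\u{FEFF}' | '\u{2028}' | '\u{2029}')
}

/// Strip forbidden characters, then cap the result at `max_chars`.
pub fn sanitize_for_audit_kv_sketch(raw: &str, max_chars: usize) -> String {
    raw.chars()
        .filter(|c| !is_forbidden_in_kv(*c))
        .take(max_chars)
        .collect()
}
```

For the client_id from the ship-blocker above, `"x,actor_user=mallory"` comes out as `"xactor_usermallory"`, so a SIEM splitting on `,` and `=` sees a single value.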

Dependency hygiene

Two RUSTSEC "unmaintained" advisories that the cargo audit gate would have flagged are now closed by removing the offending crates outright rather than ignoring them:

  • paste (RUSTSEC-2024-0436) — one call site in e2e/src/tests/protocol_pair_matrix.rs using paste::paste!{ [<…>] } token concatenation. Replaced with a dep-free macro_rules! shape that nests each matrix into a pub mod $name { fn try_h1_h1() … #[test] fn test_h1_h1() … } child module.
  • rustls-pemfile (RUSTSEC-2025-0134) — three call sites (production cert resolver, one unit test, one bench). Each rustls_pemfile::read_one + 3-variant Item::{Pkcs1Key,Pkcs8Key,Sec1Key} match collapses to one PrivateKeyDer::from_pem_slice(bytes)? via rustls::pki_types::pem::PemObject.
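
The paste replacement described above relies on the child module name for uniqueness instead of concatenating identifiers. A minimal sketch follows, assuming a hypothetical `$label` parameter and placeholder function bodies; the real matrix presumably drives actual protocol pairs.

```rust
/// Generate one child module per matrix. The function names inside stay
/// fixed; the module path makes each one unique.
macro_rules! protocol_matrix {
    ($name:ident, $label:expr) => {
        pub mod $name {
            pub const LABEL: &str = $label;

            /// Placeholder scenario body.
            pub fn try_h1_h1() -> String {
                format!("{}: h1 -> h1", LABEL)
            }

            #[test]
            fn test_h1_h1() {
                assert!(try_h1_h1().ends_with("h1 -> h1"));
            }
        }
    };
}

protocol_matrix!(plain_matrix, "plain");
protocol_matrix!(tls_matrix, "tls");
```

Both invocations define `try_h1_h1` and `test_h1_h1` without clashing, because they live at `plain_matrix::` and `tls_matrix::` respectively.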

The .github/workflows/ci.yml cargo audit job runs plain cargo audit --deny warnings with no per-ID ignores — any future advisory will block CI immediately. Local audit: 0 advisories across 448 scanned dependencies.

Wire compat

All proto additions are field-additive — no existing field renamed or retyped. Two field tags have been freed by this PR and are now reserved:

  • WorkerInfo.proto_version (tag 4) → reserved 4; reserved "proto_version";
  • MetricDetailStatus.unsupported_workers (tag 5) → reserved 5; reserved "unsupported_workers";

Both fields were added earlier in this PR and removed in the review-pass clean-up after the operator confirmed the UpgradeMain flow keeps master + workers in sync (the mixed-version state the fields were designed for does not occur in practice).

New tags introduced and KEPT by this PR:

  • Request.set_metric_detail = 55 + SetMetricDetail.peer_pid (6) + SetMetricDetail.peer_session_ulid (7)
  • EventKind::METRIC_DETAIL_CHANGED = 30
  • Event.metric_detail = 5 carrying MetricDetailTransition
  • ResponseContent.metric_detail_status = 16 + ResponseContent.worker_metric_detail_status = 17

Metric names produced on the wire are unchanged byte-for-byte by the lib::metrics::names::<family>::<NAME> refactor — StatsD/Prometheus/dashboard consumers see the exact same dotted strings. The record_backend_metrics! macro now also references the typed constants on the emit side, matching the read side.

Lean-baseline preservation

  • Default features (jemallocator, crypto-ring) unchanged.
  • tui is opt-in; not in default = [...].
  • All TUI deps declared optional = true and aggregated under the tui feature: color-eyre, crossbeam-channel, crossterm, ctrlc (with termination feature), ratatui, throbber-widgets-tui, toml, tui-big-text, tui-input, tui-popup, tui-scrollview, tui-tree-widget.
  • insta lives in [dev-dependencies] only.
  • sozu --version reports +tui / -tui so build matrices can distinguish the variants.
  • cargo tree -p sozu --features tui | grep -c '^[├└]── tokio' → 0.

Validation run locally

cargo +nightly fmt --all -- --check                                                # PASS
cargo clippy --all-features --all-targets --locked -- -D warnings                  # No issues found
cargo build --all-features --locked                                                # PASS
cargo build -p sozu --no-default-features --features crypto-ring --locked          # PASS (lean baseline)
cargo build -p sozu --features tui --locked                                        # PASS
cargo test --workspace --locked --lib                                              # 1051 / 1051 passed (4 suites)
cargo test -p sozu --features tui --lib --locked                                   # 75 / 75
cargo test -p sozu --features tui --tests --locked                                 # 152 / 152 (+ 1 ignored)
cargo audit --deny warnings                                                        # 0 advisories on 448 deps
cargo tree -p sozu --features tui --no-default-features | grep -c '^[├└]── tokio'   # 0

CI matrix exercises stable/beta/nightly + the four crypto cells (crypto-ring, crypto-aws-lc-rs, crypto-openssl, fips), msrv-bare, msrv-full, Fuzz (nightly), cargo audit, build / Docker / benchmark cells.

Test plan

  • CI green across all matrix cells.
  • cargo test -p sozu --features tui --tests -- --ignored sozu_top runs the subprocess e2e (spawns a real master, runs sozu top --snapshot 1).
  • Manual smoke: sozu top, verify Backend rows render and CLUSTERS rps / backends_up/total populate within one refresh tick. Inspect the audit log: lease_id= and metric_detail_reason= show up as their own columns; the target= field is just metric_detail:<level>; no client_id=... substring appears in reason=.
  • Manual smoke: SIGINT a running sozu top. Terminal is restored cleanly; the worker emits a lease_clear event.
  • Manual smoke: kill -TERM a running sozu top. Same clean restore (now covered by ctrlc/termination).
  • Manual smoke: SIGKILL sozu top. The polled janitor expires the lease within ttl_seconds + LEASE_TICK_INTERVAL; the master emits a lease_tick_expired audit row.
  • Manual smoke: leave sozu top running past one full lease-renewal cycle (~30 s default), then quit cleanly. The worker emits a lease_clear audit row (not Unauthorized) — proving the single-owner channel preserves binding identity across renewals.
  • Manual smoke: from a second operator (different session, same UID) send SetMetricDetail{client_id: <first-operator>, clear: true}. The worker refuses with peer binding does not match.
  • Manual smoke: post an oversized payload (reason = "A".repeat(1_000_000)). The master rejects it before fan-out.
  • Manual smoke: post a client_id containing , and = (e.g. "x,actor_user=mallory"). On any failure path (force the lease table to capacity, send a duplicate-binding clear, etc.) verify the rendered audit-line reason= column does NOT contain actor_user=mallory as a separate KV pair.
  • Manual smoke: inspect the MetricDetailStatus.workers[<id>] reply — each worker reports its own quartet, not the master's view. No unsupported_workers[] field anymore (rolled into the standard fan-out error tally).
  • Manual smoke: drop a symlink under ~/.config/sozu/skins/test.toml pointing to /etc/shadow. sozu top --skin test refuses to load it (O_NOFOLLOW → ELOOP) and falls back to the default skin with a status-bar note.
  • CERTS pane renders one row per certificate; the bound address column shows 0.0.0.0:0 until a follow-up plumbs the listener address. Private-key PEM never appears in the rendered output or in any error message.

Files of interest

In priority order:

  1. command/src/command.proto: SetMetricDetail, Event.metric_detail / MetricDetailTransition, WorkerMetricDetailStatus (response payload tag 17). WorkerInfo.proto_version and MetricDetailStatus.unsupported_workers are reserved.
  2. lib/src/metrics/mod.rs: LeaseEntry, PeerBinding, LeaseApplyOutcome / LeaseClearOutcome, LEASE_TABLE_CAP, LEASE_CLIENT_ID_MAX_BYTES. record_backend_metrics! macro now references names::backend::* constants.
  3. lib/src/server.rs: SetMetricDetail worker arm with both apply AND clear LEASE_CLIENT_ID_MAX_BYTES enforcement; worker_metric_detail_status_content; push_metric_detail_transition. Error templates no longer echo client_id.
  4. bin/src/command/requests.rs: set_metric_detail_request (now unconditional scatter, no capability partition); SetMetricDetailTask; MetricDetailAuditFields; audit_worker_metric_detail_transition; query_metrics (now unconditionally calls merge_metrics()).
  5. bin/src/command/server.rs: METRIC_DETAIL_CHANGED event handling in handle_worker_response.
  6. bin/src/command/sessions.rs: sanitize_for_audit / sanitize_for_audit_kv (now Unicode-control + BOM + line-separator aware).
  7. bin/src/ctl/top/cardinality.rs: single-owner channel topology (DetailRequest enum + owner thread + renewer-as-tick-emitter); libc::getrandom Linux path; StatusSlot renewer + transport status surface.
  8. bin/src/ctl/top/transport.rs: poll_loop<T, F> with StatusSlot parameter; events thread observes Arc<AtomicBool> shutdown + 1 s read timeout; CtlError::SpawnFailed propagation; no more eprintln! to wiped alt-screen.
  9. bin/src/ctl/top/theme.rs: fail-closed anchor; O_NOFOLLOW on leaf open.
  10. bin/src/ctl/top/render.rs: RawModeGuard (partial-success aware); PanicHookGuard (Drop restores prior hook); SOZU_TOP_SYNC=0 opt-out; app.mark_dirty() on all input mutations; graceful ctrlc degrade.
  11. bin/src/ctl/top/panes/{mod,h2,clusters,backends}.rs: shared sort_header, names::* migration, consolidated helpers.
  12. bin/src/ctl/top/app.rs: ActiveTab::from_alias, PulseTracker::retain, ERRORS_5XX constant; cluster_rows backends_available flattened.
  13. command/src/state.rs: dispatch-whitelist regression guard.
  14. lib/src/tls.rs + lib/src/crypto.rs + lib/benches/crypto_provider.rs: rustls-pemfile → rustls::pki_types::pem::PemObject migration.
  15. e2e/src/tests/protocol_pair_matrix.rs: paste → dep-free per-matrix module macro.
  16. .github/workflows/ci.yml: cargo audit job without per-ID ignores.
  17. doc/sozu-top.md + CHANGELOG.md + doc/configure_cli.md: transport accuracy, audit scope, privacy, scrubbed stale flag references.

FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…p, getrandom

Close out four remaining lease-primitive findings from the PR #1256
review pass.

2A — Loud TTL reject in lease_apply (was: silent clamp). The aggregator
no longer transparently caps `ttl > LEASE_TTL_MAX` to LEASE_TTL_MAX. It
now returns `LeaseApplyOutcome::TtlOutOfRange`, which the worker
dispatch arm surfaces as `WorkerResponse::error`. The dispatch site
itself already rejects out-of-range TTLs before reaching the
aggregator; the new error arm catches any caller that bypasses
dispatch (proto fuzzing, future internal use) instead of silently
capping their intent.

2B — Master-side TTL pre-validate before fan-out. `worker_request`
checks the operator-supplied `ttl_seconds` against `LEASE_TTL_MAX`
BEFORE scattering to workers. A malicious or buggy
`SetMetricDetail{ttl_seconds: u32::MAX}` no longer reaches the worker
fan-out, eliminating the N×rejected-fan-out + N×audit-line amplifier.

2E — Lease HashMap cap + client_id length cap. New
`LEASE_TABLE_CAP = 64` and `LEASE_CLIENT_ID_MAX_BYTES = 64` constants.
`lease_apply` returns `LeaseApplyOutcome::TableFull` for fresh inserts
that would overflow the table (renewals of existing entries still
succeed, so an active operator never loses their lease just because the
table is full) and `LeaseApplyOutcome::ClientIdTooLong` for
oversized lease keys. Closes the CWE-770 vector where a same-UID
attacker could roll `client_id` faster than expiry to grow the map
unbounded toward worker OOM. The master applies the same length cap
before fan-out via the new `sozu_lib::metrics::LEASE_CLIENT_ID_MAX_BYTES`
export.

2D — Direct `libc::getrandom` on Linux for the lease-id random suffix.
Replaces the `/dev/urandom` File::open path on Linux with the
`getrandom(2)` syscall + `GRND_NONBLOCK` flag (no fs dependency, no
chroot/sandbox failure path). Non-Linux Unix targets keep the
`/dev/urandom` read because `getrandom`'s ABI varies (FreeBSD vs
OpenBSD's `getentropy(2)` vs macOS's `SecRandomCopyBytes`). Switched
from `u32::from_ne_bytes` to `u32::from_le_bytes` for cross-arch
reproducibility of the rendered hex. Last-resort fallback to
`subsec_nanos()` remains for total-entropy-failure environments and the
fallback is now silent at the data layer (the `app.status` surface
covers operator visibility in a later commit).
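
The lease-id rendering in 2D can be sketched as below. The `top:<pid>:<8-hex>` shape comes from the PR description; the function names are hypothetical, and only the /dev/urandom and `subsec_nanos()` fallbacks are shown because the getrandom(2) path needs the libc crate.

```rust
use std::fs::File;
use std::io::Read;
use std::time::{SystemTime, UNIX_EPOCH};

/// Render the lease id from a PID and four random bytes.
pub fn lease_client_id(pid: u32, random: [u8; 4]) -> String {
    // from_le_bytes fixes the byte order, so the same four bytes print
    // the same hex on little-endian and big-endian hosts.
    let suffix = u32::from_le_bytes(random);
    format!("top:{}:{:08x}", pid, suffix)
}

/// Four bytes from /dev/urandom, or the clock's sub-second nanoseconds
/// when no entropy source can be read.
pub fn random_suffix_bytes() -> [u8; 4] {
    let mut buf = [0u8; 4];
    let from_urandom = File::open("/dev/urandom").and_then(|mut f| f.read_exact(&mut buf));
    if from_urandom.is_err() {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.subsec_nanos())
            .unwrap_or(0);
        buf = nanos.to_le_bytes();
    }
    buf
}
```

With `from_ne_bytes`, the bytes `[0x78, 0x56, 0x34, 0x12]` would render as `12345678` on a little-endian host and `78563412` on a big-endian one; `from_le_bytes` gives `12345678` on both.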

Four new tests cover ClientIdTooLong, TableFull (including renewal-
succeeds-when-full), TtlOutOfRange, and the LEASE_TTL_MAX boundary.

Build/clippy clean; 29/29 metric tests pass (4 new).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Three operator-facing TUI hardening items from the PR #1256 review.

2C — Skin loader fails closed when the parent anchor can't be resolved.
Previously, when `skins_anchor()` returned `None` (race-delete of the
parent, weird `/proc` paths, unusual fs mounts) the confinement check
was silently skipped and `from_toml` ran on the bare resolved path,
defeating the symlink-escape defence the canonicalize block was added
to provide. Now the anchor failure short-circuits to the default skin
with a status-bar diagnostic.

2C (companion) — Close the skin loader TOCTOU window. The previous
shape canonicalized the file, then went `from_toml(&resolved)` which
called `std::fs::read_to_string(&Path)` — re-resolving the path through
the kernel and re-opening the symlink chain. An attacker with write
access to a shared skins dir could swap a symlink in the gap. Replaced
with `from_open_file(&Path)` which calls `File::open` once after the
parent-anchor check and reads from the `&mut File` handle. The
path-based `from_toml` API is removed; tests parse via `toml::from_str`
on a raw string.
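
The fail-closed anchor check plus open-once read described in both 2C items can be sketched with the standard library. The function name read_confined_skin and the error message are hypothetical. The O_NOFOLLOW flag that a later commit adds on the leaf open is not shown, since the standard library does not expose that constant portably.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Read `<anchor>/<name>.toml` only if it resolves inside `anchor`.
/// Every failure is returned as an error so the caller can fall back to
/// the default skin.
pub fn read_confined_skin(anchor: &Path, name: &str) -> io::Result<String> {
    // Fail closed: if the anchor cannot be resolved, read nothing.
    let anchor = fs::canonicalize(anchor)?;
    let resolved = fs::canonicalize(anchor.join(format!("{}.toml", name)))?;
    if !resolved.starts_with(&anchor) {
        return Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            "skin resolves outside the skins directory",
        ));
    }
    // Open once and read from the handle, with no second path-based read.
    let mut file = File::open(&resolved)?;
    let mut text = String::new();
    file.read_to_string(&mut text)?;
    Ok(text)
}
```

A caller would map any Err to the default skin plus a status-bar note, and would parse the returned text with a TOML parser rather than re-reading the path.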

3E — `ctrlc::set_handler` degrades gracefully on `MultipleHandlers`
error. The previous `.expect()` aborted the TUI on programmatic re-
entry (any caller that invokes `run` twice in the same process address
space hits the second-install path). The crossterm event loop already
observes Ctrl-C as a keypress, so the dedicated signal handler is
belt-and-braces rather than the primary path; falling through with a
status-bar note keeps embedded callers viable. Status-bar precedence:
skin > lease-elevation > signal-handler diagnostic.

(Renewer-thread status surface — review M-7 — deferred: the worker
status channel needs threading through `DetailGuard` and the
render-loop poll. Tracked as a follow-up to keep this commit
surgical.)

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…pe, privacy

Three documentation fixes from the PR #1256 review.

3G — CHANGELOG referenced "three synchronous transport threads"; the
CERTS-pane collector added a fourth in eec8422 but the entry was not
updated. Fixed to "four synchronous transport threads (snapshots,
listeners, certs, events)".

2G — CHANGELOG audit-scope claim updated to reflect this PR's actual
behaviour: every cardinality transition emits METRIC_DETAIL_CHANGED,
including the previously-silent worker-local janitor expiries and
post-fan-out apply/clear paths, thanks to the new worker→master audit
IPC. Lease ownership binding (SO_PEERCRED peer pid + session ULID) is
called out so SecNumCloud-style reviewers see the trust model in
plain text. The unsupported_workers field's wiring status is described
honestly — the proto field + per-worker version snapshot ship in this
release; capability-aware dispatch is tracked as follow-up.

3H + D5 — `doc/sozu-top.md` gains an explicit transport layout note
(four threads, six unix-socket connections per invocation) and a
privacy paragraph: the operator-supplied `--reason` text flows to the
audit log and to SubscribeEvents subscribers, so PII / customer IDs
should not be embedded. Length and character caps applied server-side
are documented.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Two cleanup items from the PR #1256 review:

4A — Remove the empty `bin/api` zero-byte file. The diff added it as a
new empty file (`new file mode 100644, index 00000000..e69de29`) —
clearly a stray `touch bin/api` from local experimentation that got
`git add`-ed. No callers in tree.

Lisa L-017 — Add a dedicated `cargo audit` CI job that runs on every
push + PR. The `tui` Cargo feature pulled in 12 new optional crates
(color-eyre, crossterm, ratatui, throbber-widgets-tui, tui-input,
tui-big-text, tui-popup, tui-scrollview, tui-tree-widget, ctrlc,
crossbeam-channel, toml) plus their transitive closure outside what
the default-features build covers. Without an audit gate any RUSTSEC
advisory landing against one of those crates can ship in a release
unnoticed. Job: cache-all-crates, prefix-key `ci-audit`, `--deny
warnings` so the gate fails fast on any new advisory. Two
invocations to keep symmetry with the per-feature build matrix above;
both read the same Cargo.lock that already contains the TUI deps.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Two leftover items from PR #1256 review round-2.

3F — Regression guard for the C1 dispatch-whitelist fix. A future
refactor that drops `SetMetricDetail` from the no-op match arm in
`ConfigState::dispatch` and falls through to the catch-all
`UndispatchableRequest` arm would silently re-break TUI cardinality
elevation entirely (the original C1 bug). New test exercises the
happy path with a fully-populated `SetMetricDetail` request and
asserts `dispatch` returns `Ok`.

4B — Replace `LEASE_TTL_DEFAULT.as_secs() as u32` lossy cast in the
SetMetricDetail worker arm with `u32::try_from(...).unwrap_or(60)`.
The default fits in u32 by construction (60 s); the explicit checked
conversion documents the bound and shields against any future tweak
that grows LEASE_TTL_DEFAULT past `u32::MAX` seconds (≈ 136 years).
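
A minimal sketch of the checked conversion described in 4B, using only the standard library. The constant mirrors the 60 s default named above; the helper name `ttl_secs_checked` is hypothetical and not taken from the tree.

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Stand-in for the crate's default lease TTL (60 s per the commit message).
const LEASE_TTL_DEFAULT: Duration = Duration::from_secs(60);

/// Convert a `Duration` to whole seconds as `u32`.
/// Unlike `as u32`, an out-of-range value does not wrap silently:
/// the caller-supplied `fallback` is returned instead.
fn ttl_secs_checked(ttl: Duration, fallback: u32) -> u32 {
    u32::try_from(ttl.as_secs()).unwrap_or(fallback)
}
```

With an `as u32` cast, a value of `u32::MAX + 1` seconds would wrap to `0`; the checked form returns the fallback instead.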

D6 — Confirmed via `grep -nE "^## \[" CHANGELOG.md`: a single
`## [Unreleased]` section. No edit needed.

Build/clippy clean; 28/28 state tests pass (1 new).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
PR #1256 review M-7: the renewer thread's `eprintln!`-on-error wrote
to the wiped alt-screen, so the operator never saw "renewer dropped"
diagnostics until the lease silently lapsed and per-backend panes
went sparse.

Wire a shared status slot:

- New `cardinality::StatusSlot = Arc<Mutex<Option<String>>>` plus
  `new_status_slot()` / `take_status(slot)` / private `publish_status`
  helpers. Poisoned-lock recovery via `into_inner` so a panic in one
  background thread does not silently strand the next message.
- `DetailGuard::apply` takes a `StatusSlot` parameter. The renewer
  thread receives a clone and writes through `publish_status` on its
  two error paths (channel open + renewal send), replacing the prior
  silent `eprintln!`. The `status` field on `DetailGuard` keeps the
  slot alive for the renewer's lifetime — the read path lives on the
  render-loop side (hence the `#[allow(dead_code)]` for the
  reader-not-on-guard pattern).
- `RenderConfig` gains a `lease_status: StatusSlot` field; the render
  loop drains it once per tick and overwrites `App::status` so the
  F-key bar repaints the message on the next frame. New
  `App::mark_dirty()` triggers the redraw when no snapshot landed.
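
The status-slot idea can be sketched with the standard library alone. The type alias and function names follow the commit message, but the exact signatures are assumptions made for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Shared one-message mailbox between a background thread and the render loop.
type StatusSlot = Arc<Mutex<Option<String>>>;

fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Overwrite the slot with the latest message (newest wins).
fn publish_status(slot: &StatusSlot, msg: impl Into<String>) {
    // A poisoned lock only means another thread panicked while holding it.
    // The Option<String> inside is still usable, so recover the guard.
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(msg.into());
}

/// Drain the slot: return the pending message and leave `None` behind.
fn take_status(slot: &StatusSlot) -> Option<String> {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.take()
}
```

In this sketch the render loop would call `take_status` once per tick, so a message is shown at most once and a later message replaces an undrained earlier one.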

Same plumbing is ready to receive the four transport collectors'
`eprintln!` sites in a follow-up — kept out of this commit to keep
the change focused on the renewer (review L-2 is the companion
finding).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…CKENDS panes (A4)

PR #1256 simplify A4. `panes/clusters.rs` and `panes/backends.rs`
each carried an identical 19-LOC `sort_header` helper that produced
the styled `ratatui::widgets::Cell` for an active-or-inactive sort
column header. Only the column-key enum (`ClusterSortKey` vs
`BackendSortKey`) differed.

Replace the two copies with one shared `pub(super) fn sort_header(
label, active, reverse, skin) -> Cell<'static>` in
`bin/src/ctl/top/panes/mod.rs:27`. Call sites compute the boolean
`active = current_key == key` themselves, so the helper stays
generic-free. Insta snapshot tests for both panes pass unchanged —
the rendered bytes are byte-for-byte identical (snapshot tests are
the byte-equality guard).

Pure refactor, no behaviour change.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…ric helper (A1)

PR #1256 simplify A1. `spawn_collector` (snapshots),
`spawn_listeners`, and `spawn_certs` shared an identical
`loop { match poll { Ok(v) => try_send; Err(e) => eprintln; }
sleep_remaining }` shape — ~90 LOC of triplicated control flow.

Extract a single private `poll_loop<T, F>(label, interval, tx,
channel, poll)` helper that closes over the channel ownership and
the per-thread polling closure. Each spawn site keeps its own
`thread::Builder::new().spawn(...)` entry point, its per-thread
`Channel` ownership, and the `bounded(1)` publish-or-skip discipline
locked by the auditor — only the inner polling skeleton is shared.

Constraints preserved:

- Four `thread::spawn` sites: `spawn_collector`, `spawn_listeners`,
  `spawn_certs`, `spawn_events`. `spawn_events` has a different
  drainer shape and stays untouched.
- `bounded(1)` channel capacity on the polling threads.
- Publish-or-skip on `TrySendError::Full`; clean thread exit on
  `TrySendError::Disconnected`.
- Per-thread `Channel` ownership (no `Mutex<Channel>`).
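
The publish-or-skip discipline listed above can be illustrated as follows. This sketch substitutes `std::sync::mpsc::sync_channel` for `crossbeam_channel::bounded` to stay dependency-free; the `Publish` enum and `publish_or_skip` name are hypothetical.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// What a polling loop should do after one publish attempt.
#[derive(Debug, PartialEq)]
enum Publish {
    /// The value was handed to the consumer side.
    Sent,
    /// The single slot is still occupied: drop this sample and keep polling.
    Skipped,
    /// The receiver is gone: the polling thread should exit cleanly.
    Stop,
}

/// Non-blocking publish onto a capacity-1 channel.
fn publish_or_skip<T>(tx: &SyncSender<T>, value: T) -> Publish {
    match tx.try_send(value) {
        Ok(()) => Publish::Sent,
        Err(TrySendError::Full(_)) => Publish::Skipped,
        Err(TrySendError::Disconnected(_)) => Publish::Stop,
    }
}
```

A polling loop built on this would `break` on `Publish::Stop` and otherwise sleep for the remainder of its interval, so a slow UI never causes a backlog.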

Net LOC delta in `transport.rs`: -116 lines (442 → 326).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…spatcher (A2 + A9)

Two PR #1256 Tier-A simplifications.

A2 — PulseTracker.tick / tick_one used a two-pass `Vec<K>::collect()`
+ `for k in to_drop { map.remove(&k) }` loop to drop zero-aged
entries. Same shape in two places. Replaced both with in-place
`HashMap::retain(|_, v| { if *v == 0 { false } else { *v -= 1; true } })`
— same semantics, no `String` key clones for the dropped set, no
intermediate `Vec` allocation per render frame.
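
For illustration, the in-place ageing pass looks like this on a plain map. The key and counter types (`String`, `u8`) are assumptions; the real `PulseTracker` fields are not shown in this PR text.

```rust
use std::collections::HashMap;

/// Age every entry by one frame and drop entries whose counter is already
/// zero, in a single pass with no intermediate `Vec` of keys.
fn tick(pulses: &mut HashMap<String, u8>) {
    pulses.retain(|_, frames_left| {
        if *frames_left == 0 {
            false // zero-aged: remove
        } else {
            *frames_left -= 1;
            true // still pulsing: keep with a decremented counter
        }
    });
}
```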

A9 — `apply_palette` carried eight near-identical `match` arms
mapping a string alias (`"overview"` / `"o"`, `"cluster"` /
`"clusters"` / `"c"`, …) to a tab. Adding a new tab required
patching `ActiveTab` AND remembering to also patch `apply_palette`.
Centralised the alias table in a new `ActiveTab::from_alias(s)
-> Option<Self>` resolver and let `apply_palette` fall back to a
small fixed-cmd match (`help` / `quit` / empty / other) only when
the alias resolver returns `None`. Future tab additions touch
`ActiveTab` only.
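
A reduced sketch of the alias-resolver shape described in A9. Only the `overview` and `cluster` aliases come from the commit message; the three-variant enum, the `events` aliases, and the `PaletteOutcome` type are assumptions made to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActiveTab {
    Overview,
    Clusters,
    Events,
}

impl ActiveTab {
    /// Single place where palette aliases map to tabs.
    fn from_alias(s: &str) -> Option<Self> {
        match s {
            "overview" | "o" => Some(ActiveTab::Overview),
            "cluster" | "clusters" | "c" => Some(ActiveTab::Clusters),
            "events" | "e" => Some(ActiveTab::Events), // assumed aliases
            _ => None,
        }
    }
}

/// Hypothetical result type for a palette command.
#[derive(Debug, PartialEq, Eq)]
enum PaletteOutcome {
    Switch(ActiveTab),
    Help,
    Quit,
    Nothing,
    Unknown(String),
}

/// Try the alias table first; fall back to the small fixed command set.
fn apply_palette(input: &str) -> PaletteOutcome {
    if let Some(tab) = ActiveTab::from_alias(input) {
        return PaletteOutcome::Switch(tab);
    }
    match input {
        "help" => PaletteOutcome::Help,
        "quit" => PaletteOutcome::Quit,
        "" => PaletteOutcome::Nothing,
        other => PaletteOutcome::Unknown(other.to_owned()),
    }
}
```

Adding a tab in this shape means adding one enum variant and one `from_alias` arm; `apply_palette` stays unchanged.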

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
… pane (A3)

PR #1256 simplify A3. `bin/src/ctl/top/app.rs` carries
`count_value` / `gauge_value` helpers that decode a
`FilteredMetrics -> Option<{i64,u64}>`; `bin/src/ctl/top/panes/h2.rs`
re-implemented the same two functions (different names: `count` /
`gauge`) verbatim — same body, divergent identifiers.

Promote the app-side helpers to `pub(super)` and import them into
the H2 pane under their pane-local names (`use … as count, … as
gauge`). Removes ~15 LOC of duplicated body and the
`filtered_metrics::Inner` / `FilteredMetrics` imports the H2 pane
no longer needs directly. Future renames or signature changes touch
one site instead of two.

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
…lStatus reply (2H)

Closes the PR #1256 follow-up where the `proto_version` field on
`WorkerInfo` was wired (commit `b0ad5728`) but no dispatcher consumed
it: `MetricDetailStatus.unsupported_workers[]` stayed permanently
empty regardless of the master/worker version skew.

New dedicated dispatch path `set_metric_detail_request` plus a
`SetMetricDetailTask` GatheringTask in `bin/src/command/requests.rs`:

- Snapshots master-side `(configured, effective)` before the request
  fans out so the synthesised reply carries `previous_effective`.
- Mirrors `worker_request`'s peer-binding population (peer_pid +
  session_ulid from `ClientSession`) and length / TTL pre-validation.
- Walks `server.workers`, partitioning by
  `proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL`. The request
  is scattered to capable workers one by one via `scatter_on(Some(worker_id))`.
  Unsupported workers (typically inherited-after-`UpgradeMain` from a
  pre-tag-55 binary) skip the fan-out entirely and land in the
  response's `unsupported_workers[]` field.
- Emits attempt-time + completion-time audit rows in the same shape
  the generic WorkerTask flow used, threading the operator-supplied
  `client_id` (lease_id column) + `reason` (metric_detail_reason
  column) through `MetricDetailAuditFields::into_extras`.
- Synthesises a full `MetricDetailStatus` for the client:
  `configured`/`effective`/`previous_effective` from the master's
  Aggregator + `workers: BTreeMap<worker_id, WorkerMetricDetailStatus>`
  populated for every ACK'd worker + `unsupported_workers` from the
  pre-filter. Returns via `client.finish_ok_with_content`.
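
The version pre-filter in the list above amounts to a partition over the worker table. The sketch below assumes a simplified `WorkerInfo` with integer `id` and `proto_version` fields; the `FanOutPlan` type and `plan_fan_out` name are hypothetical.

```rust
/// Lowest worker proto version that understands the verb (value from the
/// commit message).
const MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL: u32 = 1;

/// Simplified stand-in for the master's per-worker record.
struct WorkerInfo {
    id: u32,
    proto_version: u32,
}

/// Which workers receive the request and which are only reported back.
struct FanOutPlan {
    capable: Vec<u32>,
    unsupported: Vec<u32>,
}

fn plan_fan_out(workers: &[WorkerInfo]) -> FanOutPlan {
    let mut plan = FanOutPlan {
        capable: Vec::new(),
        unsupported: Vec::new(),
    };
    for w in workers {
        if w.proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL {
            plan.capable.push(w.id); // will be sent the request
        } else {
            plan.unsupported.push(w.id); // skipped, listed in the reply
        }
    }
    plan
}
```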

Per-worker `(configured, effective, previous_effective,
active_lease_count)` quartets currently mirror the master's view
because the worker arm in `lib/src/server.rs` replies with
`WorkerResponse::ok` (no content payload). A follow-up plumbing the
actual per-worker view through a new `ResponseContent` variant is
documented in the SetMetricDetailTask `on_finish` body; the wire
schema is populated now, with master-view stand-ins, so consumers
(TUI status bar) see a non-empty `workers` map even before that
plumbing lands.

`MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL = 1` constant declared
local to the dispatcher so future per-verb gating doesn't get coupled
to the global SOZU_PROTO_VERSION monotonic bump.

Build/clippy clean; 1075/1075 workspace tests pass (12 suites,
~6 min).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Six small simplifications from PR #1256 review round-2's residual
Tier-A / Tier-B list. Pure stylistic; no behaviour change.

A6 — Hoist the HTTP-5xx error-status set out of the two inline
`[S500, S502, S503, S504, S507]` iterators (cluster_rows +
fold_overview) into a module-level `ERRORS_5XX: [&str; 5]` constant.
Adding a new 5xx variant now touches one place. The dynamic sort
comparator is left as-is — the heterogeneous `BackendSortKey` variant
types (u64 / String) do not compose into a single tuple key without
adding wrapper enums that would be heavier than the gain.

A8 — Replace the five `(Self::default_dark(), Some(...))` fallback
tuples in `Skin::resolve` with a local `let default_with = |msg|
(Self::default_dark(), Some(msg));` closure. The fail-closed policy
from commit `5b098d9b` and the `from_open_file` TOCTOU mitigation
stay verbatim; diagnostic strings are byte-identical.

A7 — Audited. The current `App::new` is already tight at 25 lines;
adding `#[derive(Default)]` would require introducing arbitrary
"first variant is default" choices on four enums (ActiveTab,
ClusterSortKey, BackendSortKey, GlyphMode), which just relocates the
explicit-default question rather than removing it. Net win is
negligible; left as-is.

L-8 — Clear `palette_input` on the unknown-command path in
`App::apply_palette`. The success path already resets the input; the
typo path used to leave the operator's previous text in place so the
next `:` keypress re-opened the palette pre-populated with the bad
command. Now both paths exit with a fresh input.

B1 — Convert `.clone()` on `&String → String` sites to `.to_owned()`
across the panes layer (listeners.rs, certs.rs). User preference per
CLAUDE.md ("prefer ToOwned::to_owned() over Clone::clone() when going
&str → String for ownership-intent clarity"). Same allocation
behaviour; clearer intent.

B5 / B9 — Collapse `format!(...) + &format!(" · {trend} 60 s")` in
`panes/overview.rs::subtitle_for_rps` into a single `format!` call.
Saves one allocation per subtitle render.

Events pane: `Option<String>::clone().unwrap_or_default()` →
`.as_deref().unwrap_or("").to_owned()` — skips the `String::new`
allocation on the `None` branch (the kernel emits backend events
without `cluster_id` populated for proxy-process-level transitions).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Final piece of the PR #1256 follow-through. Previously the
capability-aware dispatcher in `bin/src/command/requests.rs`
(`SetMetricDetailTask::on_finish`) synthesised
`MetricDetailStatus.workers[<worker_id>]` using the master's
aggregator view as a stand-in for each worker because workers replied
with `WorkerResponse::ok(message.id)` carrying no payload. Each
worker holds an independent `Aggregator` with its own lease table, so
that stand-in obscured real per-worker drift (different configured
floors, different active lease counts after a partial fan-out).

Wire it properly end-to-end:

- New `ResponseContent::WorkerMetricDetailStatus` oneof variant
  (tag 17 — proto additive). Carries the worker's own
  `(configured, effective, previous_effective, active_lease_count)`
  quartet, semantically distinct from the aggregated
  `MetricDetailStatus` at tag 16.
- New `lib/src/server.rs::worker_metric_detail_status_content`
  helper that builds the response payload from a
  `(configured, effective, previous_effective, lease_count)` snapshot
  captured BEFORE the `METRICS.borrow_mut` scope ends (so the per-
  request snapshot is consistent with the transition that just
  happened).
- The three ok-paths in the worker's SetMetricDetail arm (clear-Cleared,
  clear-NotFound, apply-Applied) now reply via
  `WorkerResponse::ok_with_content` with the freshly-built payload instead of the
  payload-less `ok`. The `clear-NotFound` path reports
  `previous_effective == effective` (no transition).
- Master-side `SetMetricDetailTask::on_finish` collects the per-worker
  payload from `response.content` and only falls back to skipping the
  worker entry when the response has no payload (e.g. an older
  worker that never went through `ok_with_content`). Removes the
  master-view stand-in noted as a follow-up in
  commit `70cd24af` (`set_metric_detail_request`).
- `command/src/proto/display.rs` adds a silent OK match arm for the
  new variant — the per-worker payload flows master-side and is
  never printed directly on the operator's terminal.

Build/clippy clean; 1075/1075 workspace tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 11, 2026
Strip the in-tree comment cross-references to review tags
(`PR #1256 review M-7`, `simplify A3`, `C1 fix`), round labels
(`round-4 follow-through`), and short-SHA references to other commits
on this same branch. The technical rationale stays in place, inlined
where the cross-reference used to be — committed text reads
self-contained for any contributor without access to the local
review pipeline.

Affected sites:
- `bin/src/command/requests.rs::handle_request` (SetMetricDetail
  match arm)
- `bin/src/command/server.rs::handle_worker_response`
  (METRIC_DETAIL_CHANGED branch)
- `bin/src/ctl/top/cardinality.rs` (DetailGuard.status, short_random_suffix)
- `bin/src/ctl/top/panes/h2.rs` (gauge/count helper import note)
- `bin/src/ctl/top/render.rs` (RenderConfig.lease_status)
- `bin/src/ctl/top/theme.rs::Skin::resolve` (default_with closure
  doc)
- `command/src/state.rs::dispatch_passes_through_set_metric_detail`
- `lib/src/server.rs::worker_metric_detail_status_content`

No behavioural change. Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
@FlorentinDUBOIS FlorentinDUBOIS self-assigned this May 12, 2026
Wires ratatui 0.30, crossterm 0.29, crossbeam-channel, ctrlc, color-eyre,
and the tui-* polish crates (tui-input, tui-popup, tui-big-text,
tui-tree-widget, tui-scrollview, throbber-widgets-tui) as workspace deps,
all marked `optional = true` on `bin/`. The `tui` feature on `bin/`
activates them; default builds (`jemallocator`, `crypto-ring`) keep the
production binary lean.

Deliberately omits `tokio` and `tokio-util`: the `sozu top` TUI runs on
two synchronous transport threads + the UI thread, matching the existing
`bin/` style and avoiding an async runtime in v1. Crossterm features pin
to `events`/`bracketed-paste`; ratatui to `crossterm`/`macros`/
`underline-color`. `tui-logger` is intentionally excluded — the EVENTS
pane (week 3) will surface proxy events directly.

Adds `insta` to `[dev-dependencies]` for upcoming snapshot tests of the
TUI panes (week 4); strictly dev-only, never production.

Verified: `cargo check -p sozu --no-default-features --features
crypto-ring` and `cargo check -p sozu --features tui` both pass;
`cargo tree -p sozu --features tui | grep -c '^tokio'` returns 0.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`sozu top` (and any future TUI client) needs to elevate the metrics
cardinality knob (`MetricDetail`) from the configured `Cluster` floor up
to `Backend` for the duration of an interactive session. A blind global
set + restore-on-Drop fails on crash and doesn't compose with multiple
concurrent clients. This commit adds a TTL-leased model on `Aggregator`:

- `configured` is the static floor from `MetricsConfig.detail`.
- `leases: HashMap<client_id, (level, expires_at)>` tracks active leases.
- `effective = max(configured, max(active leases))` is recomputed off
  the metric-emission hot path (only on apply/clear/expire).
- `lease_apply(client_id, level, ttl)` registers or renews a lease with
  the TTL clamped at `LEASE_TTL_MAX = 300s` to bound the worst-case
  effect of a stuck renewer; returns `(previous, new)` so callers can
  decide whether to emit a `MetricDetailChanged` audit event.
- `lease_clear(&client_id)` releases by id; mismatched ids are silent
  no-ops so other clients' leases are unaffected.
- `lease_tick(now)` is a polled expiry janitor: cheap when nothing has
  expired, returns `Some(previous)` when expiry actually moved the
  effective level. `lease_tick_due(now)` gates the call so the worker
  only walks the lease table every 5 s.

Crash safety falls out of the design: a dead `sozu top` cannot
permanently elevate cardinality because its lease self-expires after
`ttl_seconds`. `Aggregator::receive_metric` now reads `effective` (single
field load on the hot path) instead of `detail`.
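
A compact sketch of the lease bookkeeping described above, under stated assumptions: the level ordering `Process < Frontend < Cluster < Backend`, the `Leases` struct name, and the explicit `now` parameter on `lease_apply` (added for deterministic testing) are illustrative and do not reproduce the actual `Aggregator` code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Assumed ordering: later variants mean higher cardinality.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Detail {
    Process,
    Frontend,
    Cluster,
    Backend,
}

const LEASE_TTL_MAX: Duration = Duration::from_secs(300);

struct Leases {
    configured: Detail,
    effective: Detail,
    leases: HashMap<String, (Detail, Instant)>,
}

impl Leases {
    fn new(configured: Detail) -> Self {
        Leases { configured, effective: configured, leases: HashMap::new() }
    }

    /// effective = max(configured, max(active leases)); returns the old value.
    fn recompute(&mut self) -> Detail {
        let previous = self.effective;
        let leased = self.leases.values().map(|(level, _)| *level).max();
        self.effective = leased.map_or(self.configured, |l| l.max(self.configured));
        previous
    }

    /// Register or renew a lease; the TTL is clamped to `LEASE_TTL_MAX`.
    fn lease_apply(&mut self, id: &str, level: Detail, ttl: Duration, now: Instant) -> (Detail, Detail) {
        let ttl = ttl.min(LEASE_TTL_MAX);
        self.leases.insert(id.to_owned(), (level, now + ttl));
        let previous = self.recompute();
        (previous, self.effective)
    }

    /// Release by id; an unknown id is a silent no-op.
    fn lease_clear(&mut self, id: &str) {
        self.leases.remove(id);
        self.recompute();
    }

    /// Expire stale leases; `Some(previous)` only if the effective level moved.
    fn lease_tick(&mut self, now: Instant) -> Option<Detail> {
        self.leases.retain(|_, (_, expires_at)| *expires_at > now);
        let previous = self.recompute();
        if previous != self.effective { Some(previous) } else { None }
    }
}
```

Because `effective` is only rewritten inside `recompute`, the emission path in this sketch reads a single field and never walks the lease table.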

12 unit tests cover monotonic apply, the configured floor (a lower lease
cannot push effective below the floor), renewal-replaces-not-duplicates,
max-merge across multiple clients, silent no-op on unknown id,
deterministic expiry via `lease_tick(now)`, and the TTL clamp. All 22
metrics-module tests pass; existing `set_up_detail` semantics preserved.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Adds a runtime cardinality lease verb so `sozu top` can elevate the
metrics drain to `MetricDetailLevel::Backend` for the duration of an
interactive session. The lease design (TTL-bounded, `client_id`-keyed,
self-expiring) is crash-safe and composes with multiple concurrent
clients — see `Aggregator` lease bookkeeping in the previous commit.

Proto (additive, backwards-compat):

- `Request.request_type::SetMetricDetail = 55` carries
  `{ client_id, detail?, ttl_seconds?, clear?, reason? }`.
- `ResponseContent::metric_detail_status = 16` returns
  `MetricDetailStatus { configured, effective, previous_effective,
  workers: map<id, WorkerMetricDetailStatus>, unsupported_workers[] }`
  for mixed-version-fleet safety.
- `EventKind::METRIC_DETAIL_CHANGED = 30` on the `SubscribeEvents`
  audit stream; distinct from `METRICS_CONFIGURED` (Enabled/Disabled
  /Clear) since the cause is different.
- `command/build.rs` re-attaches `Hash, Eq` for `MetricDetailStatus`
  (the embedded `map<string, WorkerMetricDetailStatus>` strips the
  prost auto-derive, which propagates to `ResponseContent.content_type`
  and `Request.request_type`).
- `command/src/proto/display.rs` adds arms for `RequestType::
  SetMetricDetail`, `ContentType::MetricDetailStatus` (with a
  prettytable renderer that lists per-worker configured/effective
  /previous_effective + unsupported workers), and `EventKind::
  MetricDetailChanged`.
- `command/src/request.rs` routes `SetMetricDetail` through the
  worker-level dispatch group (mirrors `ConfigureMetrics`).

Master + worker plumbing:

- `bin/src/command/requests.rs::is_mutating_verb` learns the new verb
  so the master brackets it with `RELOADING=1`/`READY=1` systemd hints.
- The dispatch match routes through the existing `worker_request`
  fan-out path (same shape as `ConfigureMetrics`); per-worker
  `MetricDetailStatus` aggregation lands in week 2 when the TUI
  starts consuming it.
- `lib/src/server.rs::notify` adds two hooks: a polled lease-expiry
  janitor at the top of every dispatch (gated by `lease_tick_due` so it
  only walks the lease table every 5 s), and the `SetMetricDetail` arm
  itself. The arm clamps the TTL via the `Aggregator` setter, decodes
  the `MetricDetail` enum defensively, and acks with `WorkerResponse::
  ok()` for now (week 2 will return the per-worker
  `WorkerMetricDetailStatus` payload).

`MetricDetailChanged` audit emission is left as a `TODO(sozu-top week
2)` at both `lease_apply`/`lease_clear` call sites and the janitor —
plumbed through `bin/src/command/requests.rs::audit_emit_inline` once
the master collects per-worker effective levels back.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Wires the `sozu top` clap subcommand behind `#[cfg(feature = "tui")]`.
Default `sozu` builds keep the subcommand hidden — `sozu --help` does
not list it without `--features tui`.

Clap surface (10 flags):

- `--refresh-ms` (default 1000): data poll interval; render is capped
  at 30 fps independently.
- `--no-color`, `--no-mouse`: disable ANSI colour / mouse capture
  (auto-detected by default; honours `NO_COLOR` and `TERM=dumb`).
- `--skin <name>`: looks up `$XDG_CONFIG_HOME/sozu/skins/<name>.toml`,
  with `SOZU_TOP_SKIN` env override (k9s parity).
- `--detail` (Process|Frontend|Cluster|Backend): override the lease
  level; default `Backend` (auto-elevate, lease self-expires
  server-side after `--lease-ttl-seconds`, default 60s).
- `--snapshot N`, `--tick-once`: test affordances that drive a fixed
  number of frames or one tick and exit (no terminal control).
- `--log-file`: ship internal TUI logs to a file (avoids stomping the
  rendered screen).
- `--glyphs` (Braille|Block|Tty): force a glyph mode; auto-detect by
  default.

`bin/src/ctl/top/mod.rs` is a placeholder that prints the resolved
argument bag and exits cleanly — the render loop, transport threads,
DetailGuard lease lifecycle, and pane implementations land in
subsequent steps. `TopArgs` mirrors the clap variant so reviewers can
confirm the wiring before the renderer is built.

`bin/build.rs` adds `("TUI", "tui")` to the feature-flag table so
`sozu --version` reports `+tui` (with feature) or `-tui` (without). The
flag materially changes the binary size and dep graph, so operators
need to spot it on the banner before deploying.

Verified: `sozu top --help` lists every flag; default `cargo build`
hides the subcommand; `cargo build -p sozu --features tui` builds
clean; `cargo clippy --all-targets --features tui` is clean;
`cargo +nightly fmt --check` is clean;
`cargo test --workspace --features tui` passes 1024 unit/integration tests + 7 doctests.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Lays in three skeleton modules used by the upcoming render loop and panes.

`theme.rs` — single hard-coded `Skin` with Okabe-Ito categorical palette
(colour-blind safe in isolation; pairs distinguishable across the three
common dichromatic types) plus a Viridis-shaped sparkline gradient
(cool/warm/hot tiers via `Skin::spark_color`). `GlyphMode` enum with
three sets of sparkline ramp characters (Braille / Block / TTY-ASCII)
and three status glyphs (`▲ ▼ ●`) so the colour-blind cue is always
backed up by a shape. The `--skin` lookup, `SOZU_TOP_SKIN` env override,
TOML loader, and `LANG`/`TERM` capability cascade land in week 3.

`transport.rs` — synchronous dual-channel design (no tokio in v1, per
the Codex cross-check in `tasks/todo.md`). Two threads on two separate
unix-socket `Channel` connections to the master:

- `spawn_collector` polls `RequestType::QueryMetrics` on the configurable
  `--refresh-ms` ticker; pushes each `AggregatedMetrics` into a
  `crossbeam_channel::bounded::<Snapshot>(1)` with newest-wins overwrite,
  so a slow UI tick never queues a fan-out pile-up.
- `spawn_events` opens `RequestType::SubscribeEvents` once and forwards
  every inbound `Event` into a `crossbeam_channel::bounded::<TopEvent>(64)`.

The unix `Channel<W,R>` carries no message-id correlation, so multiplexing
the streamed events with the discrete metrics round-trip on one socket is
unsafe — hence the second connection. Both threads exit cleanly when the
UI drops the receivers; transient socket errors log via `eprintln!` and
shut the thread down rather than crashing the UI.

`cardinality.rs` — `DetailGuard` RAII handle. On `apply` it sends an
initial `SetMetricDetail{ client_id, detail = Backend, ttl_seconds = 60 }`
on a dedicated channel and spawns a renewer thread that re-sends every
`ttl/2` seconds. On `Drop` the guard sends a best-effort
`SetMetricDetail{ client_id, clear: true }` so the cardinality drops back
to its configured floor on clean exit. Crash safety: even if `Drop` never
runs (panic, kill -9), the lease self-expires server-side after
`ttl_seconds`, so a dead `sozu top` cannot permanently elevate cardinality.
The `client_id` shape `top:<pid>:<8 hex>` keeps multiple concurrent TUIs
isolated without pulling `rand`.

All three modules are wired into `bin/src/ctl/top/mod.rs` but not yet
consumed by the placeholder `run_top` — the renderer + app state land in
the next two commits, at which point the `dead_code` warnings on the
public APIs clear naturally. Clippy is clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`App` is the pure-data root of the UI: tabs, the OVERVIEW state, recent
events, threshold table, last-snapshot anchor, status text, and a
`RateCalculator`. The render loop snapshots it for each frame; the
transport threads push into it via `App::ingest_snapshot` and
`App::ingest_event`. No I/O, no rendering — the split keeps the data
fold testable without spinning up a terminal.

Three primitives:

- `SparkRing` — fixed-capacity `VecDeque<u64>` (default 60 samples = one
  minute at the 1 s data tick). Newest at the back; oldest drops off
  the front when capacity is hit. Matches the proto
  `FilteredTimeSerie.last_minute` cadence so a future server-side time
  series swap is a one-line change.
- `RateCalculator` — turns Sōzu's cumulative `Count` metrics into
  per-second deltas. Detects the hourly `LocalDrain::clear` (which
  drops `Count`/`Time` while preserving Gauges) by looking for a
  decrease between consecutive samples and emits `0` for that tick
  instead of a negative spike. First observation returns `None` so the
  caller can show a "no baseline yet" placeholder.
- `ThresholdTable` — colour-coding boundaries: 5xx ratio > 1 %,
  `slab.usage_percent` > 80, p99 > 500 ms, etc. Defaults are sane
  starting points; revisited in the docs commit (week 4) once
  operators see them in anger.
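
The first two primitives can be sketched as below. Field names and method names (`push`, `observe`) are assumptions, and the rate calculator here assumes exactly one sample per second so that the delta equals the per-second rate.

```rust
use std::collections::VecDeque;

/// Fixed-capacity ring: newest at the back, oldest evicted from the front.
struct SparkRing {
    cap: usize,
    samples: VecDeque<u64>,
}

impl SparkRing {
    fn new(cap: usize) -> Self {
        SparkRing { cap, samples: VecDeque::with_capacity(cap) }
    }

    fn push(&mut self, value: u64) {
        if self.cap == 0 {
            return; // degenerate ring keeps nothing
        }
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }
}

/// Turns a cumulative counter into per-tick deltas.
struct RateCalculator {
    last: Option<u64>,
}

impl RateCalculator {
    fn new() -> Self {
        RateCalculator { last: None }
    }

    /// `None` on the first observation (no baseline yet), `Some(0)` when the
    /// counter went backwards (treated as a reset), otherwise the delta.
    fn observe(&mut self, cumulative: u64) -> Option<u64> {
        let delta = match self.last {
            None => None,
            Some(prev) if cumulative < prev => Some(0),
            Some(prev) => Some(cumulative - prev),
        };
        self.last = Some(cumulative);
        delta
    }
}
```

After a reset the new, lower value becomes the baseline, so the tick following the reset reports a normal delta again.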

`App::ingest_snapshot` folds an `AggregatedMetrics` into the OVERVIEW
ring buffers:

- RPS — sum of per-cluster `requests` counters via `RateCalculator`.
  Falls back to `proxying.requests` when no cluster metrics are exposed
  (worker configured at `MetricDetail::Process`).
- 5xx ratio — derived ratio of summed `http.status.5xx` counters over
  the request rate; first sample shows 0 % (no baseline).
- p99 latency — max p99 across clusters (averaging percentiles is
  meaningless; operators want "is anyone slow").
- Saturation — prefers `slab.usage_percent`; falls back to
  `client.connections / connections_max`.

`ActiveTab` mirrors numbered tabs at the top of the screen
(`1 OVERVIEW … 7 EVENTS`). `from_digit` handles `1`-`7` muscle memory;
`cycle` powers `Tab` / `Shift-Tab`. `EVENTS` (no `LOGS` pane in v1, per
Codex's `tui-logger` rejection) carries a 200-event ring sliding window.

5 unit tests cover ring overflow, the RateCalculator's three branches
(first observation, monotonic increase, hourly-reset clamp to 0), and
the tab digit/cycle helpers. Build clean, clippy clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
OVERVIEW pane is the first impression of the proxy's health: four
sparklines arranged in a 2x2 grid (REQUESTS / SEC, LATENCY p99 ms,
5xx ERRORS, SATURATION %) with big-numeral headers and trend glyphs
(`▲ ▼ ●`) so the colour signal always has a redundant shape cue.
Each cell uses a rounded border (`BorderType::Rounded`) and pulls its
gradient tier (cool / warm / hot) from `Skin::spark_color` based on
the latest sample's normalised position. Latency and saturation
sparklines are anchored at fixed thresholds so a quiet system
draws short bars and a spike pegs the cell — operators care about
"is anyone slow" not the wandering autoscale.

CLUSTERS pane is a sortable table with one row per cluster_id. Six
columns (cluster_id, rps, err %, p50, p99, backends_available/total).
Default sort: 5xx error rate descending, then RPS — Codex's
recommendation in the cross-check. `s` cycles the sort column,
`S` reverses; the active column header carries an arrow glyph (▼/▲)
and the accent colour. Critical rows (5xx ratio > threshold,
p99 > threshold, all backends down) flip to `Skin::row_critical`.

`App` gains:

- `last_metrics: Option<AggregatedMetrics>` — the freshest snapshot
  is retained verbatim so table-shaped panes don't have to maintain
  their own derivation. `ingest_snapshot` clones the wire payload
  (cheap relative to what the master already paid).
- `ClusterRow`/`cluster_rows()` — pure builder that materialises the
  per-cluster summary on demand, sorted per the active key. Computed
  per frame so changes to `cluster_sort` / `cluster_sort_reverse`
  take effect on the next paint without a re-fold.
- `ClusterSortKey` enum + cycle helper so the renderer can map the
  `s` keypress to a new sort.

`panes/mod.rs::render_placeholder` is the consistent empty-state
widget that BACKENDS / LISTENERS / CERTS / H2 / EVENTS use until
their real implementations land in week 3. Same rounded border, same
muted/secondary colour layering — visually it's "more sōzu top",
not a bug.

Build clean, clippy clean, fmt clean. Renderer wiring lands in the
next commit.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Synchronous 30-fps render loop driving the OVERVIEW + CLUSTERS panes
plus placeholder cells for the remaining tabs. No tokio runtime in v1
by design (Codex cross-check); the loop is a hand-rolled
`crossterm::event::poll(timeout)` against a `RENDER_INTERVAL = 33 ms` budget.

Each iteration drains the bounded `crossbeam_channel` snapshot + event
receivers via `try_recv`, polls input with the smaller of
`next_render - now()` and 50 ms (so an idle UI still drains channels at 20 Hz),
then redraws if the frame budget has elapsed. Synchronized output
(DEC mode 2026 via `BeginSynchronizedUpdate` / `EndSynchronizedUpdate`)
wraps each `terminal.draw` so tmux + iTerm2 see a single atomic paint
instead of per-cell flicker. Anthropic's own TUI team had to do the
same — see `tasks/ask-sozu-top-research-external.md` §6.
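
The timing arithmetic of the loop described above can be isolated from any terminal code. The sketch below only computes the input-poll timeout and the next frame deadline; the function names and the `IDLE_DRAIN_CAP` constant name are assumptions, while the 33 ms and 50 ms values come from the text.

```rust
use std::time::{Duration, Instant};

const RENDER_INTERVAL: Duration = Duration::from_millis(33);
const IDLE_DRAIN_CAP: Duration = Duration::from_millis(50);

/// How long the input poll may block: the time left until the next frame,
/// capped so channels are still drained regularly when nothing is due.
/// A deadline already in the past yields a zero timeout.
fn poll_timeout(next_render: Instant, now: Instant) -> Duration {
    next_render
        .saturating_duration_since(now)
        .min(IDLE_DRAIN_CAP)
}

/// Decide whether to redraw and compute the following deadline.
fn after_poll(next_render: Instant, now: Instant) -> (bool, Instant) {
    if now >= next_render {
        (true, now + RENDER_INTERVAL)
    } else {
        (false, next_render)
    }
}
```

In a real loop the `Duration` from `poll_timeout` would be passed to the terminal's event poll, and a redraw would happen only when `after_poll` returns `true`.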

Layout (Layout::Vertical):

- 3-line tabs row: numbered `1 OVERVIEW … 7 EVENTS` with the focused
  tab carrying the accent background, others in muted grey. Title
  text reports the live/no-snapshot status anchored on
  `last_snapshot_at`.
- flex middle pane (the active tab's render).
- 1-line htop-style F-key bar: `F1 Help · F2 Theme · F3 Find · F4
  Filter · F5 Pause · F6 Sort · F7 Detail- · F8 Detail+ · F9
  Config · F10 Quit`. Wired today: F1 toggles help, F10 quits, plus
  Tab cycle / `1`-`7` direct, `q`/`Q`/`Ctrl-C` exit, `s`/`S` sort
  cycle/reverse on CLUSTERS. The bar reserves slots for the rest so
  week-3 panes plug in without re-laying out.

`RawModeGuard` is the RAII handle that owns the terminal: install
enables raw mode + alt-screen + (optional) mouse capture + cursor
hide; Drop reverses the same sequence. Combined with the
`ctrlc::try_set_handler` shared `AtomicBool` flag the loop checks each
iteration, panic / SIGINT / SIGTERM / clean-quit all restore the
terminal predictably. Mouse capture is opt-out via `--no-mouse` for
multiplexers that mis-route SGR events.

`run_top` glues everything: spawns the two transport threads (own
`Channel` connections, no multiplexing), applies the
`SetMetricDetail` lease via `DetailGuard::apply` (continuing without
elevation if the master is too old to decode the verb), and runs the
render loop until the user quits or `--snapshot N` / `--tick-once`
exhausts. Drop order: lease → transport receivers → guard, so the
terminal restores last and a panic mid-render never strands the
shell.

Verified: `sozu --features tui top --tick-once` renders one frame and
exits cleanly; `cargo build`/`clippy --all-targets`/`+nightly fmt
--check`/`test --workspace --features tui` all pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Pickup from `cargo +nightly fmt` after the renderer landed: a single
chained `Channel::write_message().map_err()` reflowed onto three
lines per the project's nightly rustfmt rules. No behaviour change.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Renders the recent-events ring populated by the events transport thread.
Newest events at the top so the eye lands on "what just happened".

Five-column table (when, event, cluster, backend, address) with
colour-coded rows:

- hot tier: BACKEND_DOWN, NO_AVAILABLE_BACKENDS, HEALTH_CHECK_UNHEALTHY,
  CLUSTER_REMOVED, WORKER_KILLED, SOZU_STOP_REQUESTED — operational red
  flags that need attention.
- cool tier: BACKEND_UP, CLUSTER_RECOVERED, HEALTH_CHECK_HEALTHY,
  CLUSTER_ADDED — recoveries.
- accent: METRIC_DETAIL_CHANGED — the audit signal the SetMetricDetail
  lease emits, surfaced so operators can see another `sozu top` (or
  scraper) elevate cardinality on the same fleet.
- secondary muted: every other audit / mutation event so they don't
  drown out the actionable rows.

Empty-state copy explains where events come from (the events thread
subscribes on startup; first event arrives whenever the master emits
one). `format_relative_age` shows `Ns ago` / `NmMs` / `NhMm` so the
operator sees freshness at a glance without sub-second precision noise.
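
A relative-age formatter of this shape might look like the sketch below. The bucket boundaries (60 s and 3600 s) and the exact strings are assumptions inferred from the `Ns ago` / `NmMs` / `NhMm` description; the real `format_relative_age` may differ.

```rust
/// Hypothetical stand-in for `format_relative_age`.
/// Sub-second precision is dropped on purpose: the input is whole seconds.
fn format_relative_age(age_secs: u64) -> String {
    if age_secs < 60 {
        // Under a minute: "Ns ago".
        format!("{}s ago", age_secs)
    } else if age_secs < 3600 {
        // Under an hour: "NmMs".
        format!("{}m{}s", age_secs / 60, age_secs % 60)
    } else {
        // An hour or more: "NhMm", seconds are no longer shown.
        format!("{}h{}m", age_secs / 3600, (age_secs % 3600) / 60)
    }
}
```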

Wires `panes::events::render` in `render::draw_pane` so the EVENTS tab
(numbered `7`) now shows live data instead of the placeholder. Build
clean, clippy clean, fmt clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Three-section vertical layout pulling H2 metrics directly from the
freshest snapshot's `proxying` map.

Section 1 — streams: big-numeral active-streams gauge plus accent-tier
"H2 share of accepts" percentage (alpn.h2 / (alpn.h2 + alpn.http11)).
Companion table: active streams, H2 connections accepted, HTTP/1.1
accepted, client.connections gauge.

Section 2 — flow control: gauges for h2.connection.window_bytes and
h2.connection.pending_window_updates plus rate-of-change for
flow_control_stall, frames.tx.{window_update,rst_stream,goaway},
headers.rejected.budget_overrun. Counters that warn when non-zero
(stalls, RST_STREAM transmissions, GOAWAY transmissions, header
budget overruns) flip to the row-critical style.

Section 3 — flood mitigations: standalone hot-tier-titled block
listing every CVE-mitigation counter (h2.flood.violation.{glitch_window,
rapid_reset, continuation, made_you_reset, ping, settings, priority},
h2.window_update_dropped, h2.close_with_active_streams). Any non-zero
value is a documented attack mitigation firing — bold red so the eye
lands on it. Title names the three CVEs Sōzu tracks
(CVE-2023-44487 Rapid Reset, CVE-2024-27316 CONTINUATION flood,
CVE-2025-8671 MadeYouReset).

Wires `panes::h2::render` into `render::draw_pane` so tab `6` (H2)
shows live data instead of the placeholder. The "trend (60 s)" column
header reserves space for per-row sparklines once the renderer can
afford the second pass without copying. Build clean, clippy clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Renders one row per `(cluster_id, backend_id)` pair across the freshest
snapshot. Default sort: bandwidth descending (sum of `back_bytes_in` +
`back_bytes_out`) — the busiest backend lands at the top, which is the
"who is on fire" question operators ask first.

Seven columns: cluster, backend, bw down/up B, conn, p50, p99, req.
Bandwidth is rendered with a human-friendly suffix (`K`/`M`/`G`); the
rest are raw numerics. Rows where
`p99 ≥ thresholds.latency_p99_critical_ms` flip to the row-critical
hot/bold style so a slow backend visually stands out even when it
isn't the busiest.

Sort columns: cluster, backend, bandwidth, connections, latency_p99,
requests. `s` cycles, `S` reverses (same shortcuts the CLUSTERS pane
already uses; the renderer dispatches by active tab).

Empty-state copy explains the two reasons backends might be missing:
no traffic yet, or `metrics.detail < backend` and the SetMetricDetail
lease hasn't acknowledged. Points at the EVENTS pane for the
METRIC_DETAIL_CHANGED audit signal.

`App` gains `backend_sort` / `backend_sort_reverse` and a `BackendRow`
+ `backend_rows()` builder that flattens the per-backend metric maps
in the snapshot. Sort tie-breaks deterministically on
(cluster_id, backend_id) so the on-screen layout doesn't jiggle when
two backends share the same primary key.
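
The deterministic tie-break can be illustrated with a small sketch. The `BackendRow` fields and the `sort_by_bandwidth` helper here are simplified assumptions for illustration; the real row type carries more columns.

```rust
/// Simplified, hypothetical row: only the fields the sort needs.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BackendRow {
    cluster_id: String,
    backend_id: String,
    bytes_in: u64,
    bytes_out: u64,
}

impl BackendRow {
    /// Bandwidth is the sum of both directions.
    fn bandwidth(&self) -> u64 {
        self.bytes_in.saturating_add(self.bytes_out)
    }
}

/// Sorts busiest-first (or quietest-first when `reverse` is set), then
/// falls back to (cluster_id, backend_id) so equal rows keep a stable order.
fn sort_by_bandwidth(rows: &mut [BackendRow], reverse: bool) {
    rows.sort_by(|a, b| {
        let primary = b.bandwidth().cmp(&a.bandwidth());
        let primary = if reverse { primary.reverse() } else { primary };
        primary
            .then_with(|| a.cluster_id.cmp(&b.cluster_id))
            .then_with(|| a.backend_id.cmp(&b.backend_id))
    });
}
```

Only the primary key is reversed, so tied rows stay in the same relative order whichever direction the operator picks.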

In-table cluster-scope filter (drill from the CLUSTERS row's
`cluster_id`) is week-4 polish; the flat view already gives the
"which backend is loaded" answer at a glance. Build clean, clippy
clean, fmt clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Third transport thread (`spawn_listeners`) opens its own `Channel` and
polls `RequestType::ListListeners` every 5 s. The slower cadence matches
the brief's "cold subjects" tier — listener state changes are operator-
paced and the EVENTS pane already surfaces LISTENER_ADDED /
LISTENER_REMOVED / LISTENER_UPDATED / LISTENER_ACTIVATED /
LISTENER_DEACTIVATED audit events as they happen.

Pushed snapshots land in a
`crossbeam_channel::bounded::<ListenersSnapshot>(1)` with the same
newest-wins semantics the metrics collector uses.
`App` gains a `last_listeners: Option<ListenersSnapshot>` slot, populated
by `App::ingest_listeners` from the render-loop drain.

LISTENERS pane renders a flat three-column table (proto, address,
status) across the freshest `ListenersList`:

- HTTP listeners: `active=<bool>`.
- HTTPS listeners: `active=<bool> · alpn=<csv>`.
- TCP listeners: `active=<bool>`.

Title strip shows refresh cadence + live/no-snapshot status so the
operator can immediately see whether the inventory is current.
Empty-state copy points at the EVENTS pane for audit transitions
while the first 5-s poll is in flight.

`render::run` signature takes the third receiver; `run_top` spawns the
third thread and threads the receiver through. Build clean, clippy
clean.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
…port

The events transport thread blocked indefinitely on
read_message_blocking_timeout(None). The other three transport threads
use bounded intervals and observe try_send's Disconnected to exit on
receiver drop; the events thread had no such wake path because dropping
the crossbeam receiver does not propagate into the unix socket.

Consequences:
- The thread held its SubscribeEvents subscription open forever.
- The master's per-session event ring kept queueing for a never-draining
  consumer (memory leak proportional to event traffic for the worker's
  lifetime).
- Programmatic re-entry (tests, embedded callers, repeated run_top)
  leaked one Channel per invocation.

Wire a finite read timeout (EVENTS_READ_TIMEOUT = 1s) plus an
Arc<AtomicBool> shutdown flag owned by run_top. The events loop checks
the flag between reads and exits cleanly when run_top sets it on the
way out. Join all four transport handles on shutdown for symmetry.

Module-level docs at transport.rs already promised "all threads exit
cleanly when their crossbeam_channel peer is dropped" — the events
thread now honours that contract.
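
The wake path can be sketched with the standard library alone. In this illustration an `mpsc::Receiver` with `recv_timeout` stands in for the unix-socket read with a finite timeout; `events_loop` is a hypothetical name, and the real loop reads from a `Channel`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::time::Duration;

/// Drains `source` until `shutdown` is set or the sender goes away.
/// Returns how many events were seen before exiting.
fn events_loop(
    source: Receiver<String>,
    shutdown: Arc<AtomicBool>,
    read_timeout: Duration,
) -> usize {
    let mut seen = 0;
    while !shutdown.load(Ordering::Relaxed) {
        match source.recv_timeout(read_timeout) {
            Ok(_event) => seen += 1,
            // The bounded wait is what lets the loop re-check the flag.
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    seen
}
```

With an unbounded read the flag would never be observed; the timeout bounds shutdown latency to roughly one read interval.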

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Two related terminal-lifecycle bugs.

1. RawModeGuard::install enabled raw mode BEFORE constructing the guard.
   If a subsequent EnterAlternateScreen / Hide / EnableMouseCapture
   failed (rare but real on EBADF stdout in nspawn / systemd-run shells,
   EOF on closed pty after session detach), the function returned Err
   with raw mode still on but no Drop scheduled. The operator's shell
   was left in raw mode without echo, recoverable only via 'reset' or
   'stty sane'.

   Invert the construction: build the guard immediately after
   enable_raw_mode, track partial-success state on alt_entered /
   mouse_enabled flags, and progressively set them as each step
   succeeds. Drop checks each flag before issuing the reverse, so a
   partial install also gets a partial teardown.

2. ctrlc::set_handler only catches SIGINT by default. The module-level
   docs at bin/src/ctl/top/mod.rs advertised SIGTERM coverage too, but
   the ctrlc dep was on default features only. 'kill -TERM <pid>'
   (which systemd-run --user --pty deliberately sends on close) left
   the terminal in alt-screen + raw mode.

   Enable ctrlc's 'termination' feature so the same handler also catches
   SIGTERM and SIGHUP. Update the module docstring to call out the
   feature dependency explicitly.
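
The inverted construction from item 1 can be sketched without a real terminal. Here a shared log of step names replaces the crossterm calls, and the `fail_at` parameter is a test-only assumption used to simulate a partial install; only the `alt_entered` / `mouse_enabled` flags come from the commit message.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records which terminal operations ran, in order (mock for illustration).
type Log = Rc<RefCell<Vec<&'static str>>>;

struct RawModeGuard {
    log: Log,
    alt_entered: bool,
    mouse_enabled: bool,
}

/// Runs one mock step; fails if `fail_at` names it.
fn step(log: &Log, fail_at: Option<&str>, name: &'static str) -> Result<(), String> {
    if fail_at == Some(name) {
        return Err(format!("{} failed", name));
    }
    log.borrow_mut().push(name);
    Ok(())
}

impl RawModeGuard {
    fn install(log: Log, fail_at: Option<&str>) -> Result<Self, String> {
        step(&log, fail_at, "enable_raw_mode")?;
        // The guard exists from here on, so any later `?` runs Drop.
        let mut guard = RawModeGuard { log, alt_entered: false, mouse_enabled: false };
        step(&guard.log, fail_at, "enter_alt_screen")?;
        guard.alt_entered = true;
        step(&guard.log, fail_at, "enable_mouse")?;
        guard.mouse_enabled = true;
        Ok(guard)
    }
}

impl Drop for RawModeGuard {
    fn drop(&mut self) {
        // Reverse only the steps that actually succeeded.
        let mut log = self.log.borrow_mut();
        if self.mouse_enabled {
            log.push("disable_mouse");
        }
        if self.alt_entered {
            log.push("leave_alt_screen");
        }
        log.push("disable_raw_mode");
    }
}
```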

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
…tely

The render loop's dirty-gate skipped terminal.draw when
!app.take_dirty() && !app.pulse.has_active(). Several handle_key arms
mutated visible state without setting the dirty flag, so on a quiet
system (1 s snapshot tick by default) the operator pressed Tab, sort,
':', '?', or F1 and saw nothing for up to one second.

Affected arms:
- KeyCode::Char(':') -> open_palette
- KeyCode::Char('s' | 'S') on Clusters and Backends panes (sort cycle /
  reverse)
- KeyCode::Tab and BackTab
- KeyCode::Char('1'..='7') (direct tab jump)
- KeyCode::Char('?') and F1 (help toggle)
- the palette typing path

Add app.mark_dirty() at each site. The 30 fps render cap still bounds
the paint rate; the dirty flag is the canonical 'frame needs repaint'
signal independent of snapshot arrival.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
The four transport-thread spawn sites in bin/src/ctl/top/transport.rs
used .expect("spawn sozu-top <label>"). The thread::Builder::spawn path
fails under RLIMIT_NPROC pressure or transient OOM; the previous
behaviour panicked the binary with a backtrace.

The RawModeGuard Drop restores the terminal even on panic, but the
operator sees an abort rather than a clean error. Map the spawn failure
to a new CtlError::SpawnFailed { label, source } variant and propagate
it through the Result<...> return so the caller (run_top) surfaces a
one-line "failed to spawn thread 'sozu-top-collector': ..." message
instead of a panic banner.
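
A sketch of the spawn-failure mapping, assuming a trimmed-down `CtlError` with only the new variant and a hypothetical `spawn_named` helper. The `Display` text mirrors the one-line message quoted above; the real enum has more variants and may derive its messages differently.

```rust
use std::fmt;
use std::io;
use std::thread::{self, JoinHandle};

/// Trimmed-down error type: only the variant discussed here.
#[derive(Debug)]
enum CtlError {
    SpawnFailed { label: &'static str, source: io::Error },
}

impl fmt::Display for CtlError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CtlError::SpawnFailed { label, source } => {
                write!(f, "failed to spawn thread '{}': {}", label, source)
            }
        }
    }
}

/// Spawns a named thread and maps the OS-level failure into `CtlError`
/// instead of panicking through `.expect(...)`.
fn spawn_named<F, T>(label: &'static str, body: F) -> Result<JoinHandle<T>, CtlError>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name(label.to_string())
        .spawn(body)
        .map_err(|source| CtlError::SpawnFailed { label, source })
}
```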

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Three eprintln! sites in bin/src/ctl/top/transport.rs wrote transient
errors to stderr after the alt-screen takeover redirected the terminal.
The operator never saw them; recovery required leaving the TUI and
inspecting the parent shell's stderr buffer.

Reuse the StatusSlot pattern already used by the DetailGuard renewer
thread (bin/src/ctl/top/cardinality.rs). The four transport spawn
functions now accept a shared StatusSlot; on a recoverable error (poll
round-trip failure, SubscribeEvents write/read error) they push a
message to the slot, which the render loop drains once per tick and
surfaces in the status bar.

publish_status was promoted from private to pub(super) so the transport
module can call it. The render loop's existing take_status drain
already handles the unified mailbox — no rendering changes needed.

run_top now constructs the shared StatusSlot before the spawn calls and
threads a clone into each transport thread + the DetailGuard apply.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
The variant is only constructed from `bin/src/ctl/top/transport.rs`,
which is itself gated behind the `tui` Cargo feature. Without the gate
the lean default-features build (`--no-default-features --features
crypto-ring`) emits `warning: variant `SpawnFailed` is never
constructed`, which CI's `-D warnings` clippy gate would catch.

Add `#[cfg(feature = "tui")]` to the variant declaration so the lean
build no longer sees the dead variant and the TUI build keeps it for
the spawn-failure propagation path.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Four small documentation fixes uncovered during the PR-review pass.

1. doc/configure_cli.md flag table no longer documents --no-color and
   --log-file; both were removed from the clap surface earlier in this
   PR but the table still listed them. Operators reading the table
   would get a 'clap: unexpected argument' error if they tried the
   advertised flags. The --no-mouse row is kept and clarified since
   that flag is still wired in.

2. doc/sozu-top.md narrative referenced NO_COLOR/--no-color in the
   accessibility section; strike the sentence so the doc matches the
   trimmed clap surface (neither the env var nor the flag is wired
   into the TUI today).

3. bin/src/ctl/top/transport.rs had two inline comments claiming "drop
   oldest semantics on overflow", but try_send on a full bounded
   channel drops the *newest* sample (publish-or-skip). The
   publish-or-skip contract is the documented behaviour shared with
   the snapshot channels above; fix both events-channel comments to
   match.

4. lib/src/metrics/mod.rs doc comment near lease_apply claimed TTLs
   above LEASE_TTL_MAX are "clamped", but the actual behaviour is to
   return LeaseApplyOutcome::TtlOutOfRange. Update to say "rejected"
   so callers handle the right outcome arm; the existing rationale
   paragraph below already explains the rejection but the lead-in
   contradicted it.
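
The publish-or-skip contract from item 3 can be demonstrated with the standard library's bounded channel, used here as a stand-in for `crossbeam_channel::bounded` (an assumption made for self-containment). `publish_or_skip` is a hypothetical helper name.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Tries to queue `sample` without blocking.
/// Ok(true): queued. Ok(false): channel full, the NEW sample was skipped.
/// Err(()): the receiver is gone and the producer should exit.
fn publish_or_skip<T>(tx: &SyncSender<T>, sample: T) -> Result<bool, ()> {
    match tx.try_send(sample) {
        Ok(()) => Ok(true),
        // The value handed back is the one we just tried to send:
        // the already-queued (older) sample stays in the channel.
        Err(TrySendError::Full(_skipped)) => Ok(false),
        Err(TrySendError::Disconnected(_)) => Err(()),
    }
}
```

With a capacity of one, a slow consumer therefore reads the older queued sample, which is why "drop oldest" was the wrong description.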

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Five small clean-ups uncovered during the PR-review pass.

1. bin/README.md gains a sozu top feature highlight matching the root
   README style and cross-linking the operator guide.

2. bin/src/ctl/top/app.rs::cluster_rows had a dead .max(0) arm at the
   end of the backends_available chain. Flatten to one if/else so the
   final no-rollup-no-per-backend-gauge fallback is the only path that
   uses cm.backends.len(). Same semantics; one less branch and the
   intent reads top-down rather than relying on a degenerate max.

3. lib/src/metrics/mod.rs::record_backend_metrics! macro now references
   names::backend::* constants instead of raw "bytes_in" / "bytes_out"
   / "backend_response_time" / "backend_connection_time" / "requests"
   literals. Values are unchanged on the wire; emitter and reader sides
   now go through the same typed-constant rename guard. Adds
   names::backend::CONNECTION_TIME for the previously-unconstantised
   "backend_connection_time" metric so the full set is captured.

4. bin/src/ctl/top/render.rs PanicHookGuard restores the prior panic
   hook on Drop. Repeated render::run calls in the same process (tests,
   embedded callers) no longer stack hook layers indefinitely. The new
   hook still chains the prior for banner emission; Drop just takes
   ownership of the prior back and reinstalls it as the active hook.

5. SOZU_TOP_SYNC=0 env opt-out is now actually wired at the
   BeginSynchronizedUpdate / EndSynchronizedUpdate call sites. Without
   this, operators on terminals that don't speak DEC mode 2026 had no
   escape hatch from the synchronised-output frame wrap.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
…heck reset

`BackendMap::record_cluster_availability` is the sole emission site for
the `cluster.available_backends` and `cluster.total_backends` rollup
gauges, called from `add_backend`, `remove_backend`,
`backend_from_cluster_id`, and the health-check completion path. Two
control-plane paths populated or reset the backend map without invoking
it, leaving the rollup absent until something else mutated the cluster:

- `import_configuration_state` extended the map via `HashMap::extend`
  with no follow-up. A worker that loaded its cluster topology via
  SCM-fd hot-restart or `LoadState` therefore reported the gauges as
  missing on the first `QueryMetrics` poll until traffic or an admin
  add/remove fired. The TUI's three-tier `cluster_rows` fallback
  resolves to `cm.backends.len() = 0` under default `Cluster` detail,
  surfacing as `backends_available/total = 0/0` in the operator view.

- `set_health_check_config(None)` resets every backend in the cluster
  to `HealthState::default()` (so the load balancer routes again after
  the operator drops the probe) but did not re-emit. The metrics view
  stayed pinned at the last health-check value while the routing view
  reflected the reset.

Both paths now call `record_cluster_availability(cluster_id)` for each
affected cluster. Adds unit tests asserting (1) the rollup gauges land
in the Aggregator immediately after `import_configuration_state`, and
(2) `set_health_check_config(None)` flips the cell back to `Available`
after the health-state reset.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`Aggregator::lease_apply` previously inserted unconditionally on a
renewal (same `client_id` already present), overwriting the recorded
`PeerBinding` with whatever the renewer presented. The single-owner-
thread refactor on the client side made the legitimate `DetailGuard`
always present the same `(peer_pid, peer_session_ulid)`, but it does
not constrain other callers: any same-UID process that learns the
victim's `client_id` (PID is enumerable through `/proc`; the random
hex suffix lands in the audit log `lease_id=` column and any tail of
`sozu top --debug` stderr) can re-apply against the lease, replace
the binding to its own session, and lock the original owner out of
their `Drop`-time `clear`.

`lease_apply` now gates renewals: when an existing lease's apply-time
binding is fully known (per `PeerBinding::is_known`), the renewer's
presented binding MUST match. Unknown apply-time bindings (no
`SO_PEERCRED`, pre-binding callers, intermediate proxies) keep their
"accept any renewer" behaviour per the proto contract on
`SetMetricDetail.peer_pid` / `peer_session_ulid`. The new
`LeaseApplyOutcome::Unauthorized` variant feeds a `WorkerResponse::error`
in the `SetMetricDetail` worker arm. The message intentionally
does not echo `client_id` so operator-controlled bytes don't reach
the freeform reason field. The dedicated `lease_id` audit column
still carries the client id through the strict sanitiser.
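
The renewal gate can be sketched as below. The field types on `PeerBinding`, the `Leases` wrapper, and the `Applied` / `Renewed` success variants are assumptions for illustration; only `is_known`, `lease_apply`, and `Unauthorized` are named in the commit message.

```rust
use std::collections::HashMap;

/// Assumed shape: both halves optional, "known" only when both are set.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct PeerBinding {
    peer_pid: Option<u32>,
    peer_session_ulid: Option<String>,
}

impl PeerBinding {
    fn is_known(&self) -> bool {
        self.peer_pid.is_some() && self.peer_session_ulid.is_some()
    }
}

#[derive(Debug, PartialEq, Eq)]
enum LeaseApplyOutcome {
    Applied,      // hypothetical name for a fresh insert
    Renewed,      // hypothetical name for an accepted renewal
    Unauthorized, // named in the commit message
}

#[derive(Default)]
struct Leases {
    by_client: HashMap<String, PeerBinding>,
}

impl Leases {
    fn lease_apply(&mut self, client_id: &str, presented: PeerBinding) -> LeaseApplyOutcome {
        let existing = self.by_client.get(client_id).cloned();
        let outcome = match existing {
            // A fully known apply-time binding must match the renewer's.
            Some(bound) if bound.is_known() && bound != presented => {
                return LeaseApplyOutcome::Unauthorized;
            }
            // Unknown apply-time binding: accept any renewer.
            Some(_) => LeaseApplyOutcome::Renewed,
            None => LeaseApplyOutcome::Applied,
        };
        self.by_client.insert(client_id.to_string(), presented);
        outcome
    }
}
```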

Adds two regression tests:

- `lease_apply_renewal_rejects_foreign_binding` reproduces the
  collision attack and asserts the renewal is refused AND the
  original owner's subsequent `clear` still succeeds.
- `lease_apply_renewal_with_matching_binding_succeeds` covers the
  symmetry case so the TUI's own renewer thread doesn't regress.

The existing `lease_apply_renewal_replaces_previous_for_same_client`
test (`PeerBinding::default()` on both sides) continues to exercise
the unknown-binding path.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
`is_unsafe_line` already stripped C0 + DEL + C1 (via `char::is_control`),
BOM, and the Unicode line/paragraph separators. The bidirectional
override / isolate controls — U+202A..=U+202E (LRE/RLE/PDF/LRO/RLO)
and U+2066..=U+2069 (LRI/RLI/FSI/PDI) — were not covered. An operator-
controlled value containing a Right-to-Left Override visually reverses
the bytes that follow when an operator tails the audit log in a
Unicode-aware terminal (`less`, `cat` under a UTF-8 locale,
`journalctl`), so a row that legitimately attributes an action to one
field can appear to attribute it to a different one. This is the
Trojan-Source class (CVE-2021-42574) applied to audit rows rather
than source code.

Extend `is_unsafe_line` to reject the full bidi override / isolate
range. `sanitize_for_audit_kv` inherits the additional coverage
through the `is_unsafe_line` call inside `is_unsafe_kv`.
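
A minimal sketch of the extended check, assuming `is_unsafe_line` is a per-character predicate over the classes listed above. The helper names `is_bidi_control` and `is_unsafe_char` are hypothetical; the code point ranges are the ones given in the commit message.

```rust
/// Explicit bidirectional override / isolate controls.
fn is_bidi_control(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// Characters that must not reach an audit row.
fn is_unsafe_char(c: char) -> bool {
    c.is_control()          // C0, DEL, and C1
        || c == '\u{FEFF}'  // BOM
        || c == '\u{2028}'  // line separator
        || c == '\u{2029}'  // paragraph separator
        || is_bidi_control(c)
}

/// True if any character in `s` is unsafe for a single-line audit field.
fn is_unsafe_line(s: &str) -> bool {
    s.chars().any(is_unsafe_char)
}
```

Right-to-left script itself is untouched because Hebrew and Arabic letters are ordinary characters; only the explicit formatting controls match.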

Adds four regression tests:

- `line_strips_rtl_override` for the canonical U+202E attack.
- `line_strips_bidi_override_range` table-tests all five U+202A..U+202E.
- `line_strips_bidi_isolate_range` table-tests all four U+2066..U+2069.
- `line_preserves_legitimate_bidi_text` asserts that Hebrew/Arabic
  script content round-trips unchanged — only the explicit controls
  are rejected.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Both `WorkerTask::on_finish` and `SetMetricDetailTask::on_finish` join
the per-worker `response.message` strings into `extras.reason` via
`messages.join(", ")`. The rendered audit-log row passes that field
through the weak `sanitize_for_audit`, which leaves `,` and `=`
untouched — so a worker message that itself contained either character
forged adjacent KV columns when a SIEM splits on `, ` / `=`.

The worker error templates in `lib/src/server.rs::notify`'s
`SetMetricDetail` arm already redacted the operator-supplied
`client_id` from the freeform reason string, but a future internal
caller (or a non-`SetMetricDetail` verb whose worker arm legitimately
echoes operator-controlled bytes) would still smuggle past the join
site without sanitisation.

Two changes:

- `WorkerTask::on_finish` now runs each worker's `response.message`
  through `sanitize_for_audit_kv` before formatting `{worker_id}: {msg}`.
- `SetMetricDetailTask::on_finish` does the same. This task does NOT
  share the generic `WorkerTask` path, so the fix has to land at both
  sites.

Adds a regression test asserting the canonical attack payload
(`x,actor_user=mallory,sozu_version=hijacked`) cannot survive the
format-then-sanitise sequence at the join sites.
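
The join-site fix can be illustrated as follows. This is a sketch under assumptions: the real `sanitize_for_audit_kv` may strip or escape rather than substitute, and `join_worker_messages` is a hypothetical stand-in for the `on_finish` formatting code.

```rust
/// Illustrative strict sanitiser: neutralises the KV delimiters and
/// control characters by replacing them with '_'. The replacement
/// character is an assumption, not the library's documented behaviour.
fn sanitize_for_audit_kv(input: &str) -> String {
    input
        .chars()
        .map(|c| if c == ',' || c == '=' || c.is_control() { '_' } else { c })
        .collect()
}

/// Sanitises each worker message BEFORE formatting and joining, so the
/// only ", " separators in the result are the ones this function adds.
fn join_worker_messages(messages: &[(u32, &str)]) -> String {
    messages
        .iter()
        .map(|(worker_id, msg)| format!("{}: {}", worker_id, sanitize_for_audit_kv(msg)))
        .collect::<Vec<_>>()
        .join(", ")
}
```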

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
Multi-line destructure of `Some(sozu_command_lib::proto::command::response_content::ContentType::WorkerMetrics(wm))` exceeds the per-line width budget; rustfmt prefers the parenthesised `Some(...)` form spread across three lines.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS merged commit 830a609 into main on May 20, 2026
21 checks passed
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…p, getrandom

Close out four remaining lease-primitive findings from the PR #1256
review pass.

2A — Loud TTL reject in lease_apply (was: silent clamp). The aggregator
no longer transparently caps `ttl > LEASE_TTL_MAX` to LEASE_TTL_MAX. It
now returns `LeaseApplyOutcome::TtlOutOfRange`, which the worker
dispatch arm surfaces as `WorkerResponse::error`. The dispatch site
itself already rejects out-of-range TTLs before reaching the
aggregator; the new error arm catches any caller that bypasses
dispatch (proto fuzzing, future internal use) instead of silently
capping their intent.

2B — Master-side TTL pre-validate before fan-out. `worker_request`
checks the operator-supplied `ttl_seconds` against `LEASE_TTL_MAX`
BEFORE scattering to workers. A malicious or buggy
`SetMetricDetail{ttl_seconds: u32::MAX}` no longer reaches the worker
fan-out, eliminating the N×rejected-fan-out + N×audit-line amplifier.

2E — Lease HashMap cap + client_id length cap. New
`LEASE_TABLE_CAP = 64` and `LEASE_CLIENT_ID_MAX_BYTES = 64` constants.
`lease_apply` returns `LeaseApplyOutcome::TableFull` for fresh inserts
that would overflow the table (renewals of existing entries still
succeed, so an active operator never loses their lease just because the
table is full) and `LeaseApplyOutcome::ClientIdTooLong` for
oversized lease keys. Closes the CWE-770 vector where a same-UID
attacker could roll `client_id` faster than expiry to grow the map
unbounded toward worker OOM. The master applies the same length cap
before fan-out via the new `sozu_lib::metrics::LEASE_CLIENT_ID_MAX_BYTES`
export.
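
The three rejection paths (2A and 2E) can be sketched together. The table value type, the order of the checks, and the `Applied` success variant are assumptions; the constants and the other variant names come from the commit message, with `LEASE_TTL_MAX` taken as the 300 s hard cap stated in the PR summary.

```rust
use std::collections::HashMap;
use std::time::Duration;

const LEASE_TTL_MAX: Duration = Duration::from_secs(300);
const LEASE_TABLE_CAP: usize = 64;
const LEASE_CLIENT_ID_MAX_BYTES: usize = 64;

#[derive(Debug, PartialEq, Eq)]
enum LeaseApplyOutcome {
    Applied, // hypothetical success variant
    TtlOutOfRange,
    TableFull,
    ClientIdTooLong,
}

/// Simplified table: client_id -> requested TTL.
fn lease_apply(
    table: &mut HashMap<String, Duration>,
    client_id: &str,
    ttl: Duration,
) -> LeaseApplyOutcome {
    if client_id.len() > LEASE_CLIENT_ID_MAX_BYTES {
        return LeaseApplyOutcome::ClientIdTooLong;
    }
    // Reject loudly instead of clamping to LEASE_TTL_MAX.
    if ttl > LEASE_TTL_MAX {
        return LeaseApplyOutcome::TtlOutOfRange;
    }
    // Renewals of existing entries bypass the cap; only fresh inserts count.
    let is_renewal = table.contains_key(client_id);
    if !is_renewal && table.len() >= LEASE_TABLE_CAP {
        return LeaseApplyOutcome::TableFull;
    }
    table.insert(client_id.to_string(), ttl);
    LeaseApplyOutcome::Applied
}
```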

2D — Direct `libc::getrandom` on Linux for the lease-id random suffix.
Replaces the `/dev/urandom` File::open path on Linux with the
`getrandom(2)` syscall + `GRND_NONBLOCK` flag (no fs dependency, no
chroot/sandbox failure path). Non-Linux Unix targets keep the
`/dev/urandom` read because `getrandom`'s ABI varies (FreeBSD vs
OpenBSD's `getentropy(2)` vs macOS's `SecRandomCopyBytes`). Switched
from `u32::from_ne_bytes` to `u32::from_le_bytes` for cross-arch
reproducibility of the rendered hex. Last-resort fallback to
`subsec_nanos()` remains for total-entropy-failure environments and the
fallback is now silent at the data layer (the `app.status` surface
covers operator visibility in a later commit).
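
The byte-order point can be shown in isolation. `render_suffix` is a hypothetical helper and the eight-digit hex width is an assumption; the sketch only illustrates why `from_le_bytes` gives the same rendered string on every architecture for the same random bytes.

```rust
/// Renders four random bytes as a fixed-width hex suffix.
/// `from_le_bytes` fixes the interpretation, so the output does not
/// depend on the host's endianness the way `from_ne_bytes` would.
fn render_suffix(bytes: [u8; 4]) -> String {
    format!("{:08x}", u32::from_le_bytes(bytes))
}
```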

Four new tests cover ClientIdTooLong, TableFull (including renewal-
succeeds-when-full), TtlOutOfRange, and the LEASE_TTL_MAX boundary.

Build/clippy clean; 29/29 metric tests pass (4 new).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Three operator-facing TUI hardening items from the PR #1256 review.

2C — Skin loader fails closed when the parent anchor can't be resolved.
Previously, when `skins_anchor()` returned `None` (race-delete of the
parent, weird `/proc` paths, unusual fs mounts) the confinement check
was silently skipped and `from_toml` ran on the bare resolved path,
defeating the symlink-escape defence the canonicalize block was added
to provide. Now the anchor failure short-circuits to the default skin
with a status-bar diagnostic.

2C (companion) — Close the skin loader TOCTOU window. The previous
shape canonicalized the file, then went `from_toml(&resolved)` which
called `std::fs::read_to_string(&Path)` — re-resolving the path through
the kernel and re-opening the symlink chain. An attacker with write
access to a shared skins dir could swap a symlink in the gap. Replaced
with `from_open_file(&Path)` which calls `File::open` once after the
parent-anchor check and reads from the `&mut File` handle. The
path-based `from_toml` API is removed; tests parse via `toml::from_str`
on a raw string.

3E — `ctrlc::set_handler` degrades gracefully on `MultipleHandlers`
error. The previous `.expect()` aborted the TUI on programmatic re-
entry (any caller that invokes `run` twice in the same process address
space hits the second-install path). The crossterm event loop already
observes Ctrl-C as a keypress, so the dedicated signal handler is
belt-and-braces rather than the primary path; falling through with a
status-bar note keeps embedded callers viable. Status-bar precedence:
skin > lease-elevation > signal-handler diagnostic.

(Renewer-thread status surface — review M-7 — deferred: the worker
status channel needs threading through `DetailGuard` and the
render-loop poll. Tracked as a follow-up to keep this commit
surgical.)

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…pe, privacy

Three documentation fixes from the PR #1256 review.

3G — CHANGELOG referenced "three synchronous transport threads"; the
CERTS-pane collector added a fourth in eec8422 but the entry was not
updated. Fixed to "four synchronous transport threads (snapshots,
listeners, certs, events)".

2G — CHANGELOG audit-scope claim updated to reflect this PR's actual
behaviour: every cardinality transition emits METRIC_DETAIL_CHANGED,
including the previously-silent worker-local janitor expiries and
post-fan-out apply/clear paths, thanks to the new worker→master audit
IPC. Lease ownership binding (SO_PEERCRED peer pid + session ULID) is
called out so SecNumCloud-style reviewers see the trust model in
plain text. The unsupported_workers field's wiring status is described
honestly — the proto field + per-worker version snapshot ship in this
release; capability-aware dispatch is tracked as follow-up.

3H + D5 — `doc/sozu-top.md` gains an explicit transport layout note
(four threads, six unix-socket connections per invocation) and a
privacy paragraph: the operator-supplied `--reason` text flows to the
audit log and to SubscribeEvents subscribers, so PII / customer IDs
should not be embedded. Length and character caps applied server-side
are documented.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Two cleanup items from the PR #1256 review:

4A — Remove the empty `bin/api` zero-byte file. The diff added it as a
new empty file (`new file mode 100644, index 00000000..e69de29`) —
clearly a stray `touch bin/api` from local experimentation that got
`git add`-ed. No callers in tree.

Lisa L-017 — Add a dedicated `cargo audit` CI job that runs on every
push + PR. The `tui` Cargo feature pulled in 12 new optional crates
(color-eyre, crossterm, ratatui, throbber-widgets-tui, tui-input,
tui-big-text, tui-popup, tui-scrollview, tui-tree-widget, ctrlc,
crossbeam-channel, toml) plus their transitive closure outside what
the default-features build covers. Without an audit gate any RUSTSEC
advisory landing against one of those crates can ship in a release
unnoticed. Job: cache-all-crates, prefix-key `ci-audit`, `--deny
warnings` so the gate fails fast on any new advisory. Two
invocations to keep symmetry with the per-feature build matrix above;
both read the same Cargo.lock that already contains the TUI deps.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Two leftover items from PR #1256 review round-2, plus one confirmation (D6).

3F — Regression guard for the C1 dispatch-whitelist fix. A future
refactor that drops `SetMetricDetail` from the no-op match arm in
`ConfigState::dispatch` and falls through to the catch-all
`UndispatchableRequest` arm would silently re-break TUI cardinality
elevation entirely (the original C1 bug). New test exercises the
happy path with a fully-populated `SetMetricDetail` request and
asserts `dispatch` returns `Ok`.

4B — Replace `LEASE_TTL_DEFAULT.as_secs() as u32` lossy cast in the
SetMetricDetail worker arm with `u32::try_from(...).unwrap_or(60)`.
The default fits in u32 by construction (60 s); the explicit checked
conversion documents the bound and shields against any future tweak
that grows LEASE_TTL_DEFAULT past `u32::MAX` seconds (≈ 136 years).
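
The checked conversion from 4B, as a small sketch. `ttl_seconds` is a hypothetical wrapper; the fallback of 60 and the `u32::try_from(...).unwrap_or(60)` shape come from the commit message.

```rust
use std::convert::TryFrom;
use std::time::Duration;

const LEASE_TTL_DEFAULT: Duration = Duration::from_secs(60);

/// Converts a TTL to whole seconds as `u32`, falling back to 60 when the
/// value does not fit, instead of silently truncating with `as u32`.
fn ttl_seconds(ttl: Duration) -> u32 {
    u32::try_from(ttl.as_secs()).unwrap_or(60)
}
```

An `as u32` cast would wrap an oversized value to an arbitrary small number; the checked form makes the out-of-range case explicit.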

D6 — Confirmed via `grep -nE "^## \[" CHANGELOG.md`: a single
`## [Unreleased]` section. No edit needed.

Build/clippy clean; 28/28 state tests pass (1 new).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
PR #1256 review M-7: the renewer thread's `eprintln!`-on-error wrote
to the wiped alt-screen, so the operator never saw "renewer dropped"
diagnostics until the lease silently lapsed and per-backend panes
went sparse.

Wire a shared status slot:

- New `cardinality::StatusSlot = Arc<Mutex<Option<String>>>` plus
  `new_status_slot()` / `take_status(slot)` / private `publish_status`
  helpers. Poisoned-lock recovery via `into_inner` so a panic in one
  background thread does not silently strand the next message.
- `DetailGuard::apply` takes a `StatusSlot` parameter. The renewer
  thread receives a clone and writes through `publish_status` on its
  two error paths (channel open + renewal send), replacing the prior
  silent `eprintln!`. The `status` field on `DetailGuard` keeps the
  slot alive for the renewer's lifetime — the read path lives on the
  render-loop side (hence the `#[allow(dead_code)]` for the
  reader-not-on-guard pattern).
- `RenderConfig` gains a `lease_status: StatusSlot` field; the render
  loop drains it once per tick and overwrites `App::status` so the
  F-key bar repaints the message on the next frame. New
  `App::mark_dirty()` triggers the redraw when no snapshot landed.
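
The status mailbox described in the list above can be sketched with the standard library. The type alias and function names follow the commit message; the bodies are an assumed minimal implementation, including the `into_inner` recovery on a poisoned lock.

```rust
use std::sync::{Arc, Mutex};

/// One-message mailbox shared between background threads and the render loop.
type StatusSlot = Arc<Mutex<Option<String>>>;

fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Overwrites the pending message. A poisoned lock is recovered so a
/// panic in one writer does not silence every later message.
fn publish_status(slot: &StatusSlot, message: impl Into<String>) {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(message.into());
}

/// Removes and returns the pending message, if any.
fn take_status(slot: &StatusSlot) -> Option<String> {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.take()
}
```

The slot holds only the latest message, so a burst of errors between two ticks shows up as the most recent one.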

Same plumbing is ready to receive the four transport collectors'
`eprintln!` sites in a follow-up — kept out of this commit to keep
the change focused on the renewer (review L-2 is the companion
finding).
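
The status-slot plumbing above could look roughly like the sketch below. The type alias and function names come from the commit text; the exact signatures (borrowing the slot, accepting `impl Into<String>`) are assumptions made for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Shared slot holding at most one pending status message.
type StatusSlot = Arc<Mutex<Option<String>>>;

fn new_status_slot() -> StatusSlot {
    Arc::new(Mutex::new(None))
}

/// Writer side (the renewer thread in the commit). A poisoned lock is
/// recovered with `into_inner` so a panic elsewhere does not strand
/// the next message.
fn publish_status(slot: &StatusSlot, msg: impl Into<String>) {
    let mut guard = slot.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard = Some(msg.into());
}

/// Reader side (the render loop in the commit): drains the slot, so a
/// message is surfaced once and then cleared.
fn take_status(slot: &StatusSlot) -> Option<String> {
    slot.lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
        .take()
}
```

A newer message overwrites an undrained older one, which suits a status bar that only shows the latest diagnostic.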

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…CKENDS panes (A4)

PR #1256 simplify A4. `panes/clusters.rs` and `panes/backends.rs`
each carried an identical 19-LOC `sort_header` helper that produced
the styled `ratatui::widgets::Cell` for an active-or-inactive sort
column header. Only the column-key enum (`ClusterSortKey` vs
`BackendSortKey`) differed.

Replace the two copies with one shared `pub(super) fn sort_header(
label, active, reverse, skin) -> Cell<'static>` in
`bin/src/ctl/top/panes/mod.rs:27`. Call sites compute the boolean
`active = current_key == key` themselves, so the helper stays
generic-free. Insta snapshot tests for both panes pass unchanged —
the rendered bytes are byte-for-byte identical (snapshot tests are
the byte-equality guard).

Pure refactor, no behaviour change.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…ric helper (A1)

PR #1256 simplify A1. `spawn_collector` (snapshots),
`spawn_listeners`, and `spawn_certs` shared an identical
`loop { match poll { Ok(v) => try_send; Err(e) => eprintln; }
sleep_remaining }` shape — ~90 LOC of triplicated control flow.

Extract a single private `poll_loop<T, F>(label, interval, tx,
channel, poll)` helper that closes over the channel ownership and
the per-thread polling closure. Each spawn site keeps its own
`thread::Builder::new().spawn(...)` entry point, its per-thread
`Channel` ownership, and the `bounded(1)` publish-or-skip discipline
locked by the auditor — only the inner polling skeleton is shared.

Constraints preserved:

- Four `thread::spawn` sites: `spawn_collector`, `spawn_listeners`,
  `spawn_certs`, `spawn_events`. `spawn_events` has a different
  drainer shape and stays untouched.
- `bounded(1)` channel capacity on the polling threads.
- Publish-or-skip on `TrySendError::Full`; clean thread exit on
  `TrySendError::Disconnected`.
- Per-thread `Channel` ownership (no `Mutex<Channel>`).
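
A rough sketch of the shared polling skeleton, under stated assumptions: it uses the standard library's `sync_channel(1)` as a stand-in for the `bounded(1)` channel named above, drops the `channel` parameter (the closure is assumed to own whatever connection it polls), and uses `String` as a placeholder error type.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Polls once per `interval` and publishes the result on `tx`.
/// Publish-or-skip: a full channel drops the fresh value; a
/// disconnected receiver ends the loop so the thread exits cleanly.
fn poll_loop<T, F>(label: &str, interval: Duration, tx: SyncSender<T>, mut poll: F)
where
    F: FnMut() -> Result<T, String>,
{
    loop {
        let started = Instant::now();
        match poll() {
            Ok(value) => match tx.try_send(value) {
                Ok(()) => {}
                // Consumer has not drained the previous value yet: skip.
                Err(TrySendError::Full(_)) => {}
                // Consumer is gone: stop polling.
                Err(TrySendError::Disconnected(_)) => return,
            },
            Err(e) => eprintln!("{}: poll failed: {}", label, e),
        }
        // Sleep only for what is left of the interval after polling.
        thread::sleep(interval.saturating_sub(started.elapsed()));
    }
}
```

Each spawn site would call this from inside its own thread closure, passing a closure that owns that thread's connection, which keeps the per-thread ownership constraint intact.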

Net LOC delta in `transport.rs`: -116 lines (442 → 326).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…spatcher (A2 + A9)

Two PR #1256 Tier-A simplifications.

A2 — PulseTracker.tick / tick_one used a two-pass `Vec<K>::collect()`
+ `for k in to_drop { map.remove(&k) }` loop to drop zero-aged
entries. Same shape in two places. Replaced both with in-place
`HashMap::retain(|_, v| { if *v == 0 { false } else { *v -= 1; true } })`
— same semantics, no `String` key clones for the dropped set, no
intermediate `Vec` allocation per render frame.
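
The in-place form from A2 can be shown on a plain map. This is a sketch only: the free function `tick` and the `u8` counter type are assumptions, not the `PulseTracker` definition.

```rust
use std::collections::HashMap;

/// Ages every entry by one frame. Entries that are already at zero
/// are dropped; all others are decremented and kept.
fn tick(pulses: &mut HashMap<String, u8>) {
    pulses.retain(|_, remaining| {
        if *remaining == 0 {
            false
        } else {
            *remaining -= 1;
            true
        }
    });
}
```

An entry inserted with a count of 2 therefore survives two ticks (2 → 1 → 0) and is removed on the third.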

A9 — `apply_palette` carried eight near-identical `match` arms
mapping a string alias (`"overview"` / `"o"`, `"cluster"` /
`"clusters"` / `"c"`, …) to a tab. Adding a new tab required
patching `ActiveTab` AND remembering to also patch `apply_palette`.
Centralised the alias table in a new `ActiveTab::from_alias(s)
-> Option<Self>` resolver and let `apply_palette` fall back to a
small fixed-cmd match (`help` / `quit` / empty / other) only when
the alias resolver returns `None`. Future tab additions touch
`ActiveTab` only.
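
A reduced sketch of the A9 shape, assuming a three-variant `ActiveTab` and a hypothetical `PaletteOutcome` return type. Only the `overview` and `cluster` aliases are taken from the text above; the `backend` aliases are illustrative guesses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActiveTab {
    Overview,
    Clusters,
    Backends,
}

impl ActiveTab {
    /// Single alias table: adding a tab means adding one arm here.
    fn from_alias(s: &str) -> Option<Self> {
        match s {
            "overview" | "o" => Some(Self::Overview),
            "cluster" | "clusters" | "c" => Some(Self::Clusters),
            // Assumed aliases, not confirmed by the commit text.
            "backend" | "backends" | "b" => Some(Self::Backends),
            _ => None,
        }
    }
}

/// Hypothetical result type used only for this illustration.
#[derive(Debug, PartialEq, Eq)]
enum PaletteOutcome {
    Switch(ActiveTab),
    Help,
    Quit,
    Noop,
    Unknown(String),
}

/// Tries the alias resolver first, then the small fixed-command match.
fn apply_palette(input: &str) -> PaletteOutcome {
    let cmd = input.trim();
    if let Some(tab) = ActiveTab::from_alias(cmd) {
        return PaletteOutcome::Switch(tab);
    }
    match cmd {
        "help" => PaletteOutcome::Help,
        "quit" => PaletteOutcome::Quit,
        "" => PaletteOutcome::Noop,
        other => PaletteOutcome::Unknown(other.to_owned()),
    }
}
```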

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
… pane (A3)

PR #1256 simplify A3. `bin/src/ctl/top/app.rs` carries
`count_value` / `gauge_value` helpers that decode a
`FilteredMetrics -> Option<{i64,u64}>`; `bin/src/ctl/top/panes/h2.rs`
re-implemented the same two functions (different names: `count` /
`gauge`) verbatim — same body, divergent identifiers.

Promote the app-side helpers to `pub(super)` and import them into
the H2 pane under their pane-local names (`use … as count, … as
gauge`). Removes ~15 LOC of duplicated body and the
`filtered_metrics::Inner` / `FilteredMetrics` imports the H2 pane
no longer needs directly. Future renames or signature changes touch
one site instead of two.

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
…lStatus reply (2H)

Closes the PR #1256 follow-up where the `proto_version` field on
`WorkerInfo` was wired (commit `b0ad5728`) but no dispatcher consumed
it: `MetricDetailStatus.unsupported_workers[]` stayed permanently
empty regardless of the master/worker version skew.

New dedicated dispatch path `set_metric_detail_request` plus a
`SetMetricDetailTask` GatheringTask in `bin/src/command/requests.rs`:

- Snapshots master-side `(configured, effective)` before the request
  fans out so the synthesised reply carries `previous_effective`.
- Mirrors `worker_request`'s peer-binding population (peer_pid +
  session_ulid from `ClientSession`) and length / TTL pre-validation.
- Walks `server.workers`, partitioning by
  `proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL`. Capable
  workers are each sent the request individually via `scatter_on(Some(worker_id))`.
  Unsupported workers (typically inherited-after-`UpgradeMain` from a
  pre-tag-55 binary) skip the fan-out entirely and land in the
  response's `unsupported_workers[]` field.
- Emits attempt-time + completion-time audit rows in the same shape
  the generic WorkerTask flow used, threading the operator-supplied
  `client_id` (lease_id column) + `reason` (metric_detail_reason
  column) through `MetricDetailAuditFields::into_extras`.
- Synthesises a full `MetricDetailStatus` for the client:
  `configured`/`effective`/`previous_effective` from the master's
  Aggregator + `workers: BTreeMap<worker_id, WorkerMetricDetailStatus>`
  populated for every ACK'd worker + `unsupported_workers` from the
  pre-filter. Returns via `client.finish_ok_with_content`.
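
The capability pre-filter in the list above amounts to a partition on `proto_version`. The sketch below assumes a simplified `WorkerInfo` with `u32` fields; the real struct and the `scatter_on` fan-out are not reproduced here.

```rust
/// Simplified stand-in for the real `WorkerInfo`; field types are assumed.
struct WorkerInfo {
    id: u32,
    proto_version: u32,
}

/// Value taken from the commit text.
const MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL: u32 = 1;

/// Returns `(capable, unsupported)` worker ids. Capable workers would
/// receive the request; unsupported ones would only be reported back.
fn partition_workers(workers: &[WorkerInfo]) -> (Vec<u32>, Vec<u32>) {
    let (capable, unsupported): (Vec<&WorkerInfo>, Vec<&WorkerInfo>) = workers
        .iter()
        .partition(|w| w.proto_version >= MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL);
    (
        capable.iter().map(|w| w.id).collect(),
        unsupported.iter().map(|w| w.id).collect(),
    )
}
```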

Per-worker `(configured, effective, previous_effective,
active_lease_count)` quartets currently mirror the master's view
because the worker arm in `lib/src/server.rs` replies with
`WorkerResponse::ok` (no content payload). A follow-up plumbing the
actual per-worker view through a new `ResponseContent` variant is
documented in the SetMetricDetailTask `on_finish` body; the wire
schema is populated now, with master-view stand-ins, so consumers
(TUI status bar) see a non-empty `workers` map even before that
plumbing lands.

`MIN_PROTO_VERSION_FOR_SET_METRIC_DETAIL = 1` constant declared
local to the dispatcher so future per-verb gating doesn't get coupled
to the global SOZU_PROTO_VERSION monotonic bump.

Build/clippy clean; 1075/1075 workspace tests pass (12 suites,
~6 min).

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Six small simplifications from PR #1256 review round-2's residual
Tier-A / Tier-B list. Purely stylistic apart from L-8, which changes what the palette shows after a typo.

A6 — Hoist the HTTP-5xx error-status set out of the two inline
`[S500, S502, S503, S504, S507]` iterators (cluster_rows +
fold_overview) into a module-level `ERRORS_5XX: [&str; 5]` constant.
Adding a new 5xx variant now touches one place. The dynamic sort
comparator is left as-is — the heterogeneous `BackendSortKey` variant
types (u64 / String) do not compose into a single tuple key without
adding wrapper enums that would be heavier than the gain.

A8 — Replace the five `(Self::default_dark(), Some(...))` fallback
tuples in `Skin::resolve` with a local `let default_with = |msg|
(Self::default_dark(), Some(msg));` closure. The fail-closed policy
from commit `5b098d9b` and the `from_open_file` TOCTOU mitigation
stay verbatim; diagnostic strings are byte-identical.
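
The closure pattern from A8, sketched on a toy `Skin`. The struct layout, the accepted skin names, and the diagnostic strings are all hypothetical; only `default_dark`, `resolve`, and the `default_with` closure shape come from the text above.

```rust
#[derive(Debug, PartialEq)]
struct Skin {
    name: &'static str,
}

impl Skin {
    fn default_dark() -> Self {
        Skin { name: "dark" }
    }

    /// Returns the resolved skin plus an optional diagnostic. Every
    /// failure path falls back to the default skin (fail-closed).
    fn resolve(requested: Option<&str>) -> (Self, Option<String>) {
        // One place that pairs the fallback skin with a message.
        let default_with = |msg: String| (Self::default_dark(), Some(msg));
        match requested {
            None | Some("dark") => (Self::default_dark(), None),
            Some("light") => (Skin { name: "light" }, None),
            Some("") => default_with("skin name is empty".to_owned()),
            Some(other) => default_with(format!("unknown skin {:?}", other)),
        }
    }
}
```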

A7 — Audited. The current `App::new` is already tight at 25 lines;
adding `#[derive(Default)]` would require introducing arbitrary
"first variant is default" choices on four enums (ActiveTab,
ClusterSortKey, BackendSortKey, GlyphMode), which just relocates the
explicit-default question rather than removing it. Net win is
negligible; left as-is.

L-8 — Clear `palette_input` on the unknown-command path in
`App::apply_palette`. The success path already resets the input; the
typo path used to leave the operator's previous text in place so the
next `:` keypress re-opened the palette pre-populated with the bad
command. Now both paths exit with a fresh input.

B1 — Convert `.clone()` on `&String → String` sites to `.to_owned()`
across the panes layer (listeners.rs, certs.rs). User preference per
CLAUDE.md ("prefer ToOwned::to_owned() over Clone::clone() when going
&str → String for ownership-intent clarity"). Same allocation
behaviour; clearer intent.

B5 / B9 — Collapse `format!(...) + &format!(" · {trend} 60 s")` in
`panes/overview.rs::subtitle_for_rps` into a single `format!` call.
Saves one allocation per subtitle render.

Events pane: `Option<String>::clone().unwrap_or_default()` →
`.as_deref().unwrap_or("").to_owned()` — skips the `String::new`
allocation on the `None` branch (the kernel emits backend events
without `cluster_id` populated for proxy-process-level transitions).

Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Final piece of the PR #1256 follow-through. Previously the
capability-aware dispatcher in `bin/src/command/requests.rs`
(`SetMetricDetailTask::on_finish`) synthesised
`MetricDetailStatus.workers[<worker_id>]` using the master's
aggregator view as a stand-in for each worker because workers replied
with `WorkerResponse::ok(message.id)` carrying no payload. Each
worker holds an independent `Aggregator` with its own lease table, so
that stand-in obscured real per-worker drift (different configured
floors, different active lease counts after a partial fan-out).

Wire it properly end-to-end:

- New `ResponseContent::WorkerMetricDetailStatus` oneof variant
  (tag 17 — proto additive). Carries the worker's own
  `(configured, effective, previous_effective, active_lease_count)`
  quartet, semantically distinct from the aggregated
  `MetricDetailStatus` at tag 16.
- New `lib/src/server.rs::worker_metric_detail_status_content`
  helper that builds the response payload from a
  `(configured, effective, previous_effective, lease_count)` snapshot
  captured BEFORE the `METRICS.borrow_mut` scope ends (so the per-
  request snapshot is consistent with the transition that just
  happened).
- The three ok-paths in the worker's SetMetricDetail arm (clear-Cleared,
  clear-NotFound, apply-Applied) now reply via `WorkerResponse::
  ok_with_content` with the freshly-built payload instead of the
  payload-less `ok`. The `clear-NotFound` path reports
  `previous_effective == effective` (no transition).
- Master-side `SetMetricDetailTask::on_finish` collects the per-worker
  payload from `response.content` and only falls back to skipping the
  worker entry when the response has no payload (e.g. an older
  worker that never went through `ok_with_content`). Removes the
  master-view stand-in noted as a follow-up in
  commit `70cd24af` (`set_metric_detail_request`).
- `command/src/proto/display.rs` adds a silent OK match arm for the
  new variant — the per-worker payload flows master-side and is
  never printed directly on the operator's terminal.
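
The master-side collection step can be sketched as below. The types are simplified stand-ins: detail levels are modelled as `u8`, and `WorkerResponse` carries the payload directly as an `Option` rather than through the real `ResponseContent` oneof.

```rust
use std::collections::BTreeMap;

/// Simplified per-worker payload; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct WorkerMetricDetailStatus {
    configured: u8,
    effective: u8,
    previous_effective: u8,
    active_lease_count: u32,
}

/// Simplified stand-in for a worker reply.
struct WorkerResponse {
    worker_id: u32,
    content: Option<WorkerMetricDetailStatus>,
}

/// Builds the per-worker map from each worker's own payload. A reply
/// without a payload (for example from an older worker) is skipped
/// rather than filled with the master's view.
fn collect_worker_statuses(
    responses: Vec<WorkerResponse>,
) -> BTreeMap<u32, WorkerMetricDetailStatus> {
    responses
        .into_iter()
        .filter_map(|response| {
            let WorkerResponse { worker_id, content } = response;
            content.map(|status| (worker_id, status))
        })
        .collect()
}
```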

Build/clippy clean; 1075/1075 workspace tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
FlorentinDUBOIS added a commit that referenced this pull request May 20, 2026
Strip the in-tree comment cross-references to review tags
(`PR #1256 review M-7`, `simplify A3`, `C1 fix`), round labels
(`round-4 follow-through`), and short-SHA references to other commits
on this same branch. The technical rationale stays in place, inlined
where the cross-reference used to be — committed text reads
self-contained for any contributor without access to the local
review pipeline.

Affected sites:
- `bin/src/command/requests.rs::handle_request` (SetMetricDetail
  match arm)
- `bin/src/command/server.rs::handle_worker_response`
  (METRIC_DETAIL_CHANGED branch)
- `bin/src/ctl/top/cardinality.rs` (DetailGuard.status, short_random_suffix)
- `bin/src/ctl/top/panes/h2.rs` (gauge/count helper import note)
- `bin/src/ctl/top/render.rs` (RenderConfig.lease_status)
- `bin/src/ctl/top/theme.rs::Skin::resolve` (default_with closure
  doc)
- `command/src/state.rs::dispatch_passes_through_set_metric_detail`
- `lib/src/server.rs::worker_metric_detail_status_content`

No behavioural change. Build/clippy clean; 53/53 TUI unit tests pass.

Signed-off-by: Florentin Dubois <florentin.dubois@clever.cloud>
@FlorentinDUBOIS FlorentinDUBOIS deleted the feat/sozu-top branch May 20, 2026 11:04