[6/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3 by 0xsiddharthks · Pull Request #465 · MystenLabs/hashi

0xsiddharthks · 2026-04-17T13:52:31Z

Stacked on #464 → #463 → #423 → #466 → #449.

Summary

On a guardian restart the new session generated fresh ephemeral keys and started with a mock LimiterState (bucket at max capacity, next_seq = 0) plus an empty wid-keyed response cache (PR-3). That combination meant retries of prior-session withdrawals would either be rejected outright (seq mismatch) or double-debit the bucket. This PR closes both gaps.

Log schema

- `WithdrawalLogMessage::Success` carries an optional `limiter_state_post`: the full `LimiterState` snapshot after the withdrawal was applied. `#[serde(default)]` keeps the format backward compatible with logs written before this field existed.
- Withdrawal log Object Lock retention is bumped from 5 min to 1 hour so the KP can read prior-session tails across realistic restart windows without racing retention expiry.

KP-side rehydration (`crates/hashi-monitor/src/kp`)

- `heartbeat_checks::collect_recent_sessions` exposes the per-session heartbeat summary so rehydration can enumerate prior sessions alongside the new live one.
- `kp::rehydrate_limiter_state` scans prior-session withdrawal logs in the recent window and picks the `limiter_state_post` snapshot with the highest seq. It falls back to a fresh max-capacity state when no prior logs exist (first-ever provisioning).
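
The selection rule above can be sketched as follows. This is a minimal illustration, not the PR's code: the `LimiterState` fields, the `SuccessLog` record shape, and the function signature are all assumptions made for the example; only the "highest seq wins, otherwise fresh max-capacity state" behavior comes from the description.

```rust
/// Hypothetical, simplified stand-in for the PR's `LimiterState`.
#[derive(Debug, Clone, PartialEq)]
pub struct LimiterState {
    pub tokens: u64,
    pub next_seq: u64,
}

/// Hypothetical view of one withdrawal success log record.
#[derive(Debug, Clone)]
pub struct SuccessLog {
    pub seq: u64,
    /// `None` for records written before the field existed.
    pub limiter_state_post: Option<LimiterState>,
}

/// Pick the snapshot attached to the highest-seq record that has one.
/// With no usable snapshot, return a fresh state at max capacity.
pub fn rehydrate_limiter_state(logs: &[SuccessLog], max_capacity: u64) -> LimiterState {
    logs.iter()
        // Skip records that carry no snapshot (older log format).
        .filter_map(|log| log.limiter_state_post.as_ref().map(|s| (log.seq, s)))
        .max_by_key(|(seq, _)| *seq)
        .map(|(_, snapshot)| snapshot.clone())
        .unwrap_or(LimiterState {
            tokens: max_capacity,
            next_seq: 0,
        })
}
```

Records without a snapshot are skipped rather than treated as errors, which mirrors the backward-compatible log format described under Log schema.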

Guardian-side rehydration (`crates/hashi-guardian/src/init`)

After `finalize_init` and before marking `provisioner_init_logging_complete`, the guardian:
1. Lists prior-session heartbeats to enumerate prior session ids.
2. Resolves each prior session's attested Ed25519 signing pubkey from `init/{session_id}-oi-attestation-unsigned.json`.
3. Reads the prior session's withdrawal success logs, verifying each against that session's pubkey.
4. Re-signs the unsigned `response` with the CURRENT session's key.
5. Caches the re-signed response by wid.
Reads are tolerant of missing directories; failures log a warning and continue rather than blocking withdrawal serving. Worst case we fall back to the pre-PR-5 behavior of an empty cache.
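
The steps above, including the warn-and-continue behavior, might look roughly like the sketch below. The `PriorLog` and `CachedResponse` types, the function name, and the closure-based `verify` / `sign_current` parameters are hypothetical placeholders; the real guardian verifies and signs with Ed25519 and reads from S3, neither of which is modeled here.

```rust
use std::collections::HashMap;

/// Hypothetical shape of one prior-session withdrawal success log.
pub struct PriorLog {
    pub session_id: String,
    pub wid: String,
    pub response: Vec<u8>,  // unsigned response bytes
    pub signature: Vec<u8>, // prior session's signature over `response`
}

/// Hypothetical cache entry: the response plus the NEW session's signature.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedResponse {
    pub response: Vec<u8>,
    pub signature: Vec<u8>,
}

/// Verify each prior log against its session's pubkey, re-sign with the
/// current key, and cache by wid. Failures are collected as warnings and
/// skipped so that rehydration never blocks serving.
pub fn rehydrate_wid_cache<V, S>(
    logs: &[PriorLog],
    prior_pubkeys: &HashMap<String, Vec<u8>>,
    verify: V,
    sign_current: S,
) -> (HashMap<String, CachedResponse>, Vec<String>)
where
    V: Fn(&[u8], &[u8], &[u8]) -> bool, // (pubkey, message, signature)
    S: Fn(&[u8]) -> Vec<u8>,
{
    let mut cache = HashMap::new();
    let mut warnings = Vec::new();
    for log in logs {
        let Some(pubkey) = prior_pubkeys.get(&log.session_id) else {
            warnings.push(format!("no attested pubkey for session {}", log.session_id));
            continue;
        };
        if !verify(pubkey.as_slice(), log.response.as_slice(), log.signature.as_slice()) {
            warnings.push(format!("signature check failed for wid {}", log.wid));
            continue;
        }
        let signature = sign_current(log.response.as_slice());
        cache.insert(
            log.wid.clone(),
            CachedResponse { response: log.response.clone(), signature },
        );
    }
    (cache, warnings)
}
```

Returning the warnings instead of failing keeps the worst case at an empty cache, which matches the fallback described above.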

Why re-sign on rehydration

The log's `response` field is the unsigned `StandardWithdrawalResponse` (BTC Schnorr sigs for each input). Those sigs are deterministic in the enclave BTC key, which is shared across sessions. Only the Ed25519 envelope (timestamp + signature) is session-specific. Re-signing with the new session's key means clients that `get_guardian_info`-fetched the new pubkey verify cached retries cleanly.
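
To make the split concrete, here is a small sketch of an envelope whose payload is session-independent while the timestamp and signature are not. `SignedEnvelope`, `SessionSigner`, `resign`, and the message layout (payload followed by the little-endian timestamp) are assumptions for illustration, and `ToySigner` is a deliberately fake XOR "signature" that stands in for Ed25519 so the example needs no crypto crate.

```rust
/// Hypothetical envelope around the unsigned withdrawal response.
#[derive(Debug, Clone, PartialEq)]
pub struct SignedEnvelope {
    pub payload: Vec<u8>,   // unsigned response; identical across sessions
    pub timestamp_ms: u64,  // session-specific
    pub signature: Vec<u8>, // session-specific
}

/// Abstraction over the current session's signing key.
pub trait SessionSigner {
    fn sign(&self, message: &[u8]) -> Vec<u8>;
}

/// NOT cryptography: XORs each byte with a one-byte "key" so the example
/// stays dependency-free. A real guardian would use Ed25519 here.
pub struct ToySigner(pub u8);

impl SessionSigner for ToySigner {
    fn sign(&self, message: &[u8]) -> Vec<u8> {
        message.iter().map(|b| b ^ self.0).collect()
    }
}

/// Wrap an unchanged payload in a fresh envelope for the current session.
pub fn resign(payload: &[u8], now_ms: u64, signer: &dyn SessionSigner) -> SignedEnvelope {
    // Assumed message layout: payload bytes followed by the timestamp.
    let mut message = payload.to_vec();
    message.extend_from_slice(&now_ms.to_le_bytes());
    SignedEnvelope {
        payload: payload.to_vec(),
        timestamp_ms: now_ms,
        signature: signer.sign(&message),
    }
}
```

Two sessions re-signing the same payload produce envelopes that differ only in `timestamp_ms` and `signature`, so a client holding the new session's pubkey can verify a cached retry.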

Tests

All existing unit tests still pass (hashi 240/240, hashi-guardian 17/17, hashi-types 47/47, hashi-monitor 11/11 excluding the pre-existing `lookup_btc_confirmation_with_local_regtest` which requires a local regtest node). Rehydration integration tests require real S3 (or MinIO) so they land with the guardian e2e harness (task #10).

Follow-ups

- Hashi leader should re-fetch `guardian_signing_pubkey` on verification failure so it picks up a restarted guardian's new key without a hashi restart. Out of scope for this PR.
- E2E harness that exercises a full guardian restart end-to-end, captured as task #10 / PR-2b.

On a guardian restart the new session generated fresh ephemeral keys and started with a mock LimiterState (bucket at max capacity, next_seq = 0) plus an empty wid-keyed response cache. That combination meant retries of prior-session withdrawals would either be rejected outright or double-debit the bucket. This PR closes both gaps.

Log schema:
- WithdrawalLogMessage::Success now carries an optional limiter_state_post snapshot: the full LimiterState after the withdrawal was applied. #[serde(default)] keeps us compatible with pre-existing log records that don't have the field.
- Withdrawal log Object Lock retention bumped from 5 min to 1 hour so the KP can read prior-session tails across restart windows without racing retention expiry.

KP-side rehydration (hashi-monitor/src/kp):
- heartbeat_checks exposes collect_recent_sessions so callers can enumerate prior sessions in addition to the live one.
- kp::run replaces the mock LimiterState with rehydrate_limiter_state, which scans prior-session withdrawal logs in the recent window and picks the highest-seq limiter_state_post snapshot. Falls back to a fresh max-capacity state when no prior logs exist.

Guardian-side rehydration (hashi-guardian/src/init):
- After finalize_init and before marking provisioner_init_logging_complete, the guardian scans prior-session withdrawal success logs, verifies each against the prior session's attested signing pubkey, extracts the unsigned response, re-signs it with the CURRENT session's key, and caches it by wid. Retries of prior-session withdrawals now hit the cache instead of the consume path.
- Reads are tolerant of missing directories; failures log a warning and continue rather than blocking withdrawal serving.

0xsiddharthks · 2026-04-23T11:14:29Z

Deferred alongside #463. The main restart-safety value is wid-cache rehydration, which is meaningless without the cache itself. limiter_state_post in logs + KP-side rehydrate_limiter_state + Object Lock retention bump (5 min → 1 hour) stand on their own, but are not required for initial signet/devnet bring-up — a restarted dev guardian resetting its bucket is acceptable at this stage.

To re-apply: restore the wid-cache first (re-open #463), then re-apply this branch. See .claude/plans/golden-finding-castle.md. Branch siddharth/guardian-restart-safety preserved locally.

Replaces the soft-reserve round-trip at Step 2 with a local `capacity_at(ts)` check, and moves the hard reserve to post-MPC via `validate_consume` → guardian `StandardWithdrawal` → verify Ed25519 response → `apply_consume`. Any rejection (seq mismatch, rate-limited, unavailable) snaps local state to the guardian and bails so the next leader tick retries cleanly. Serializes hard reserves to concurrency=1 when the guardian is configured so timestamps arrive monotonic; the baseline (no guardian) keeps the configured cap.

- Step 2 in `process_approved_withdrawal_request_batch` skips the iteration when `capacity_at(checkpoint_timestamp_ms/1000)` is below the aggregate external-out amount. No round-trip.
- Step 3 runs `finalize_withdrawal_through_guardian` after MPC: picks `seq` from `LocalLimiter::validate_consume`, fans out `SignGuardianWithdrawalRequest` BLS signatures to the committee (each validator re-fetches the txn from chain and reconstructs the same `StandardWithdrawalRequest` deterministically), forwards the signed request to the guardian, verifies the response envelope, then `apply_consume`.
- New BridgeService RPC `SignGuardianWithdrawalRequest` + `build_guardian_withdrawal_request` / `compute_withdrawal_wid` helpers in `withdrawals.rs`.
- Guardian side: `RateLimiter::consume` now takes a `wid` (unused for now, prepping the idempotency cache in #463).
- E2E test `test_bitcoin_withdrawal_with_guardian_e2e_flow` asserts `local_limiter().snapshot() == guardian.state.limiter_state()` and `next_seq == 1` after a successful withdrawal.

Follow-ups (known gaps):
- Wid-keyed idempotency cache on `consume` (#463): transient RPC failures currently double-debit on retry.
- Guardian restart safety / S3 rehydrate (#465).
- Step 2/Step 3 timestamp unification via a Move-side change.
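
A rough sketch of the local limiter flow described above, assuming a plain token bucket with linear refill. The names `LocalLimiter`, `capacity_at`, `validate_consume`, `apply_consume`, and `next_seq` come from the description; the fields, the refill model, the error type, and `snap_to` are assumptions made for the example.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ConsumeError {
    RateLimited,
    StaleTimestamp,
}

/// Hypothetical local mirror of the guardian's limiter state.
#[derive(Debug, Clone, PartialEq)]
pub struct LocalLimiter {
    pub max_capacity: u64,
    pub refill_per_sec: u64, // assumed linear refill
    pub tokens: u64,
    pub last_ts: u64,        // seconds
    pub next_seq: u64,
}

impl LocalLimiter {
    /// Capacity available at `ts`, capped at `max_capacity`. Read-only,
    /// so the Step 2 check needs no guardian round-trip.
    pub fn capacity_at(&self, ts: u64) -> u64 {
        let elapsed = ts.saturating_sub(self.last_ts);
        self.tokens
            .saturating_add(elapsed.saturating_mul(self.refill_per_sec))
            .min(self.max_capacity)
    }

    /// Check that `amount` fits at `ts` and return the seq to send.
    pub fn validate_consume(&self, amount: u64, ts: u64) -> Result<u64, ConsumeError> {
        if ts < self.last_ts {
            return Err(ConsumeError::StaleTimestamp);
        }
        if self.capacity_at(ts) < amount {
            return Err(ConsumeError::RateLimited);
        }
        Ok(self.next_seq)
    }

    /// Commit the debit once the guardian's response has been verified.
    pub fn apply_consume(&mut self, amount: u64, ts: u64) {
        self.tokens = self.capacity_at(ts).saturating_sub(amount);
        self.last_ts = ts.max(self.last_ts);
        self.next_seq += 1;
    }

    /// On any rejection, adopt the guardian's view and retry on a later tick.
    pub fn snap_to(&mut self, guardian: &LocalLimiter) {
        *self = guardian.clone();
    }
}
```

Keeping `validate_consume` read-only and deferring the mutation to `apply_consume` means a failed guardian call leaves local state untouched until it is explicitly snapped to the guardian's.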

0xsiddharthks requested a review from bmwill as a code owner April 17, 2026 13:52

0xsiddharthks marked this pull request as draft April 17, 2026 14:04

0xsiddharthks mentioned this pull request Apr 17, 2026

[2/n][guardian-integration] in-process e2e harness #466

Merged

0xsiddharthks closed this Apr 17, 2026

0xsiddharthks removed the request for review from bmwill April 23, 2026 09:57

0xsiddharthks reopened this Apr 23, 2026

0xsiddharthks force-pushed the siddharth/guardian-soft-reserve branch from cadedf6 to 791b34a Compare April 23, 2026 10:17

0xsiddharthks force-pushed the siddharth/guardian-restart-safety branch from 74c0717 to cd63866 Compare April 23, 2026 10:19

0xsiddharthks changed the title ~~[5/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3~~ [6/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3 Apr 23, 2026

This was referenced Apr 23, 2026

[5/n][guardian-integration] wire explicit call to guardian during withdrawal flow #423

Open

[4/n][guardian-integration] wid-keyed idempotency cache #463

Closed

0xsiddharthks closed this Apr 23, 2026

0xsiddharthks wants to merge 1 commit into siddharth/guardian-soft-reserve from siddharth/guardian-restart-safety
