[6/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3 #465

Closed
0xsiddharthks wants to merge 1 commit into siddharth/guardian-soft-reserve from siddharth/guardian-restart-safety

0xsiddharthks commented Apr 17, 2026

Stacked on #464, #463, #423, #466, #449.

Summary

On a guardian restart the new session generated fresh ephemeral keys and started with a mock LimiterState (bucket at max capacity, next_seq = 0) plus an empty wid-keyed response cache (PR-3). That combination meant retries of prior-session withdrawals would either be rejected outright (seq mismatch) or double-debit the bucket. This PR closes both gaps.

Log schema

  • WithdrawalLogMessage::Success carries an optional limiter_state_post — the full LimiterState snapshot after the withdrawal was applied. `#[serde(default)]` keeps the format backward compatible with logs written before this field existed.
  • Withdrawal log Object Lock retention bumped from 5 min to 1 hour so the KP can read prior-session tails across realistic restart windows without racing retention expiry.

KP-side rehydration (crates/hashi-monitor/src/kp)

  • `heartbeat_checks::collect_recent_sessions` exposes the per-session heartbeat summary so rehydration can enumerate prior sessions alongside the new live one.
  • `kp::rehydrate_limiter_state` scans prior-session withdrawal logs in the recent window and picks the highest-seq `limiter_state_post` snapshot. Falls back to a fresh max-capacity state when no prior logs exist (first-ever provisioning).
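
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `LimiterState` fields, the `SuccessLog` struct, and the function signature are all assumptions, and the snapshot's own `next_seq` is assumed to be the ordering key.

```rust
/// Hypothetical, simplified limiter snapshot; the real `LimiterState`
/// likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct LimiterState {
    next_seq: u64,
    available: u64, // tokens left in the bucket
}

/// Hypothetical view of one withdrawal success log record.
struct SuccessLog {
    /// `None` for records written before the field existed.
    limiter_state_post: Option<LimiterState>,
}

/// Pick the snapshot with the highest `next_seq` among prior-session logs.
/// With no usable snapshot, fall back to a fresh bucket at max capacity.
fn rehydrate_limiter_state(logs: &[SuccessLog], max_capacity: u64) -> LimiterState {
    logs.iter()
        .filter_map(|log| log.limiter_state_post.as_ref())
        .max_by_key(|state| state.next_seq)
        .cloned()
        .unwrap_or(LimiterState {
            next_seq: 0,
            available: max_capacity,
        })
}
```

Records without a snapshot are skipped rather than treated as errors, which matches the backward-compatible log schema described earlier.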

Guardian-side rehydration (crates/hashi-guardian/src/init)

  • After `finalize_init` and before marking `provisioner_init_logging_complete`, the guardian:
    1. Lists prior-session heartbeats to enumerate prior session ids.
    2. Resolves each prior session's attested Ed25519 signing pubkey from `init/{session_id}-oi-attestation-unsigned.json`.
    3. Reads the prior session's withdrawal success logs, verifying each against that session's pubkey.
    4. Re-signs the unsigned `response` with the CURRENT session's key.
    5. Caches the re-signed response by wid.
  • Reads are tolerant of missing directories; failures log a warning and continue rather than blocking withdrawal serving. Worst case we fall back to the pre-PR-5 behavior of an empty cache.
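
A rough sketch of the guardian-side steps above, under heavy simplification. Every type and function name here is hypothetical, and `toy_sign` / `toy_verify` are string-based stand-ins so the example runs without a crypto crate; they are not Ed25519 and provide no security.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real id and key types.
type SessionId = String;
type Wid = String;

#[derive(Debug, Clone, PartialEq)]
struct SignedResponse {
    response: String,  // unsigned payload, identical across sessions
    signature: String, // session-specific envelope signature
}

struct LoggedSuccess {
    wid: Wid,
    signed: SignedResponse,
}

/// Toy "signature": NOT real cryptography, only here to make the flow runnable.
fn toy_sign(key: &str, payload: &str) -> String {
    format!("{key}:{payload}")
}

fn toy_verify(pubkey: &str, signed: &SignedResponse) -> bool {
    signed.signature == toy_sign(pubkey, &signed.response)
}

/// Rebuild the wid cache from prior sessions. A session whose pubkey could not
/// be resolved, or a record that fails verification, is logged and skipped
/// instead of aborting the whole rehydration.
fn rehydrate_wid_cache(
    prior: &[(SessionId, Option<String>, Vec<LoggedSuccess>)],
    current_key: &str,
) -> HashMap<Wid, SignedResponse> {
    let mut cache = HashMap::new();
    for (session_id, pubkey, logs) in prior {
        let Some(pubkey) = pubkey else {
            eprintln!("warn: no attested pubkey for session {session_id}, skipping");
            continue;
        };
        for log in logs {
            if !toy_verify(pubkey, &log.signed) {
                eprintln!("warn: bad signature for wid {}, skipping", log.wid);
                continue;
            }
            // Keep the payload, replace only the envelope signature.
            let response = log.signed.response.clone();
            let signature = toy_sign(current_key, &response);
            cache.insert(log.wid.clone(), SignedResponse { response, signature });
        }
    }
    cache
}
```

In the worst case every session is skipped and the function returns an empty map, which corresponds to the empty-cache fallback mentioned above.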

Why re-sign on rehydration

The log's `response` field is the unsigned `StandardWithdrawalResponse` (BTC Schnorr sigs for each input). Those sigs are deterministic in the enclave BTC key, which is shared across sessions. Only the Ed25519 envelope (timestamp + signature) is session-specific. Re-signing with the new session's key means clients that `get_guardian_info`-fetched the new pubkey verify cached retries cleanly.

Tests

All existing unit tests still pass (hashi 240/240, hashi-guardian 17/17, hashi-types 47/47, hashi-monitor 11/11 excluding the pre-existing `lookup_btc_confirmation_with_local_regtest` which requires a local regtest node). Rehydration integration tests require real S3 (or MinIO) so they land with the guardian e2e harness (task #10).

Follow-ups

  • Hashi leader should re-fetch `guardian_signing_pubkey` on verification failure so it picks up a restarted guardian's new key without a hashi restart. Out of scope for this PR.
  • E2E harness that exercises a full guardian restart end-to-end, captured as task #10 / PR-2b.

0xsiddharthks requested a review from bmwill as a code owner April 17, 2026 13:52
0xsiddharthks marked this pull request as draft April 17, 2026 14:04
0xsiddharthks removed the request for review from bmwill April 23, 2026 09:57
0xsiddharthks reopened this Apr 23, 2026
0xsiddharthks force-pushed the siddharth/guardian-soft-reserve branch from cadedf6 to 791b34a April 23, 2026 10:17
On a guardian restart the new session generated fresh ephemeral keys
and started with a mock LimiterState (bucket at max capacity, next_seq
= 0) plus an empty wid-keyed response cache. That combination meant
retries of prior-session withdrawals would either be rejected outright
or double-debit the bucket. This PR closes both gaps.

Log schema:
- WithdrawalLogMessage::Success now carries an optional
  limiter_state_post snapshot — the full LimiterState after the
  withdrawal was applied. #[serde(default)] keeps us compatible with
  pre-existing log records that don't have the field.
- Withdrawal log Object Lock retention bumped from 5 min to 1 hour so
  the KP can read prior-session tails across restart windows without
  racing retention expiry.

KP-side rehydration (hashi-monitor/src/kp):
- heartbeat_checks exposes collect_recent_sessions so callers can
  enumerate prior sessions in addition to the live one.
- kp::run replaces the mock LimiterState with rehydrate_limiter_state,
  which scans prior-session withdrawal logs in the recent window and
  picks the highest-seq limiter_state_post snapshot. Falls back to a
  fresh max-capacity state when no prior logs exist.

Guardian-side rehydration (hashi-guardian/src/init):
- After finalize_init and before marking
  provisioner_init_logging_complete, the guardian scans prior-session
  withdrawal success logs, verifies each against the prior session's
  attested signing pubkey, extracts the unsigned response, re-signs it
  with the CURRENT session's key, and caches it by wid. Retries of
  prior-session withdrawals now hit the cache instead of the consume
  path.
- Reads are tolerant of missing directories; failures log a warning
  and continue rather than blocking withdrawal serving.
0xsiddharthks force-pushed the siddharth/guardian-restart-safety branch from 74c0717 to cd63866 April 23, 2026 10:19
0xsiddharthks changed the title from "[5/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3" to "[6/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3" Apr 23, 2026
@0xsiddharthks

Deferred alongside #463. The main restart-safety value is wid-cache rehydration, which is meaningless without the cache itself. limiter_state_post in logs + KP-side rehydrate_limiter_state + Object Lock retention bump (5 min → 1 hour) stand on their own, but are not required for initial signet/devnet bring-up — a restarted dev guardian resetting its bucket is acceptable at this stage.

To re-apply: restore the wid-cache first (re-open #463), then re-apply this branch. See .claude/plans/golden-finding-castle.md. Branch siddharth/guardian-restart-safety preserved locally.

0xsiddharthks added a commit that referenced this pull request Apr 23, 2026
Replaces the soft-reserve round-trip at Step 2 with a local
`capacity_at(ts)` check, and moves the hard reserve to post-MPC via
`validate_consume` → guardian `StandardWithdrawal` → verify Ed25519
response → `apply_consume`. Any rejection (seq mismatch, rate-limited,
unavailable) snaps local state to the guardian and bails so the next
leader tick retries cleanly. Serializes hard reserves to concurrency=1
when the guardian is configured so timestamps arrive monotonic; the
baseline (no guardian) keeps the configured cap.

- Step 2 in `process_approved_withdrawal_request_batch` skips the
  iteration when `capacity_at(checkpoint_timestamp_ms/1000)` is below
  the aggregate external-out amount. No round-trip.
- Step 3 runs `finalize_withdrawal_through_guardian` after MPC: picks
  `seq` from `LocalLimiter::validate_consume`, fans out
  `SignGuardianWithdrawalRequest` BLS signatures to the committee
  (each validator re-fetches the txn from chain and reconstructs the
  same `StandardWithdrawalRequest` deterministically), forwards the
  signed request to the guardian, verifies the response envelope,
  then `apply_consume`.
- New BridgeService RPC `SignGuardianWithdrawalRequest` +
  `build_guardian_withdrawal_request` / `compute_withdrawal_wid`
  helpers in `withdrawals.rs`.
- Guardian side: `RateLimiter::consume` now takes a `wid` (unused
  for now, prepping the idempotency cache in #463).
- E2E test `test_bitcoin_withdrawal_with_guardian_e2e_flow` asserts
  `local_limiter().snapshot() == guardian.state.limiter_state()`
  and `next_seq == 1` after a successful withdrawal.

Follow-ups (known gaps):
- Wid-keyed idempotency cache on `consume` (#463): transient RPC
  failures currently double-debit on retry.
- Guardian restart safety / S3 rehydrate (#465).
- Step 2/Step 3 timestamp unification via a Move-side change.
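
The local limiter flow in this commit message (a read-only `capacity_at` check, then `validate_consume` before the guardian call and `apply_consume` after it) might look roughly like the sketch below. The token-bucket fields, the `Reject` enum, the refill model, and `snap_to` are assumptions for illustration; only the method names `capacity_at`, `validate_consume`, `apply_consume` and the `next_seq` counter come from the text.

```rust
/// Hypothetical token-bucket state; fields other than `next_seq` are assumed.
#[derive(Debug, Clone, PartialEq)]
struct LimiterState {
    next_seq: u64,
    available: u64,
    last_ts: u64, // seconds
}

struct LocalLimiter {
    state: LimiterState,
    max_capacity: u64,
    refill_per_sec: u64,
}

#[derive(Debug, PartialEq)]
enum Reject {
    RateLimited,
    NonMonotonicTimestamp,
}

impl LocalLimiter {
    /// Capacity the bucket would have at `ts`; does not mutate state.
    fn capacity_at(&self, ts: u64) -> u64 {
        let elapsed = ts.saturating_sub(self.state.last_ts);
        self.state
            .available
            .saturating_add(elapsed.saturating_mul(self.refill_per_sec))
            .min(self.max_capacity)
    }

    /// Check a consume and return the seq to send; does not mutate state.
    fn validate_consume(&self, amount: u64, ts: u64) -> Result<u64, Reject> {
        if ts < self.state.last_ts {
            return Err(Reject::NonMonotonicTimestamp);
        }
        if self.capacity_at(ts) < amount {
            return Err(Reject::RateLimited);
        }
        Ok(self.state.next_seq)
    }

    /// Commit the consume once the guardian's response has been verified.
    fn apply_consume(&mut self, amount: u64, ts: u64) {
        self.state.available = self.capacity_at(ts).saturating_sub(amount);
        self.state.last_ts = ts;
        self.state.next_seq += 1;
    }

    /// On any rejection, adopt the guardian's view of the state.
    fn snap_to(&mut self, guardian: LimiterState) {
        self.state = guardian;
    }
}
```

Splitting validation from application keeps the local state untouched until the guardian has accepted the request, so a failed round-trip needs no rollback, only a `snap_to`.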
0xsiddharthks added a commit that referenced this pull request Apr 26, 2026
0xsiddharthks added a commit that referenced this pull request Apr 28, 2026