[6/n][guardian-integration] restart safety: rehydrate limiter + wid cache from S3 #465
Closed
0xsiddharthks wants to merge 1 commit into siddharth/guardian-soft-reserve from
cadedf6 to 791b34a
On a guardian restart the new session generated fresh ephemeral keys and started with a mock LimiterState (bucket at max capacity, next_seq = 0) plus an empty wid-keyed response cache. That combination meant retries of prior-session withdrawals would either be rejected outright or double-debit the bucket. This PR closes both gaps.

Log schema:
- WithdrawalLogMessage::Success now carries an optional limiter_state_post snapshot: the full LimiterState after the withdrawal was applied. #[serde(default)] keeps us compatible with pre-existing log records that don't have the field.
- Withdrawal log Object Lock retention bumped from 5 min to 1 hour so the KP can read prior-session tails across restart windows without racing retention expiry.

KP-side rehydration (hashi-monitor/src/kp):
- heartbeat_checks exposes collect_recent_sessions so callers can enumerate prior sessions in addition to the live one.
- kp::run replaces the mock LimiterState with rehydrate_limiter_state, which scans prior-session withdrawal logs in the recent window and picks the highest-seq limiter_state_post snapshot. Falls back to a fresh max-capacity state when no prior logs exist.

Guardian-side rehydration (hashi-guardian/src/init):
- After finalize_init and before marking provisioner_init_logging_complete, the guardian scans prior-session withdrawal success logs, verifies each against the prior session's attested signing pubkey, extracts the unsigned response, re-signs it with the CURRENT session's key, and caches it by wid. Retries of prior-session withdrawals now hit the cache instead of the consume path.
- Reads are tolerant of missing directories; failures log a warning and continue rather than blocking withdrawal serving.
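The KP-side selection rule described above (take the highest-seq `limiter_state_post` snapshot, otherwise fall back to a fresh max-capacity state) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the struct layouts, the `available` and `last_updated_secs` fields, the `fresh_state` helper, and the choice to compare snapshots by their `next_seq` are all assumptions made for the example.

```rust
/// Simplified stand-in for the real `LimiterState`.
/// Only `next_seq` is named in the PR text; the other fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct LimiterState {
    pub available: u64,
    pub last_updated_secs: u64,
    pub next_seq: u64,
}

/// Simplified stand-in for a withdrawal success log record.
#[derive(Debug, Clone)]
pub struct SuccessRecord {
    /// `None` models older records written before the field existed.
    pub limiter_state_post: Option<LimiterState>,
}

/// Hypothetical helper: the state a brand-new deployment starts from.
pub fn fresh_state(max_capacity: u64) -> LimiterState {
    LimiterState {
        available: max_capacity,
        last_updated_secs: 0,
        next_seq: 0,
    }
}

/// Scan every prior session's records and keep the snapshot with the
/// highest `next_seq`. Records without a snapshot are skipped. If nothing
/// usable is found, fall back to a fresh max-capacity state.
pub fn rehydrate_limiter_state(
    prior_sessions: &[Vec<SuccessRecord>],
    max_capacity: u64,
) -> LimiterState {
    prior_sessions
        .iter()
        .flatten()
        .filter_map(|rec| rec.limiter_state_post.as_ref())
        .max_by_key(|snapshot| snapshot.next_seq)
        .cloned()
        .unwrap_or_else(|| fresh_state(max_capacity))
}
```

Skipping records that lack a snapshot mirrors the backward-compatibility intent of the optional field: old logs do not break rehydration, they simply contribute nothing.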
74c0717 to cd63866
This was referenced Apr 23, 2026
Deferred alongside #463. The main restart-safety value is wid-cache rehydration, which is meaningless without the cache itself. To re-apply: restore the wid-cache first (re-open #463), then re-apply this branch. See |
0xsiddharthks added a commit that referenced this pull request on Apr 23, 2026

Replaces the soft-reserve round-trip at Step 2 with a local `capacity_at(ts)` check, and moves the hard reserve to post-MPC via `validate_consume` → guardian `StandardWithdrawal` → verify Ed25519 response → `apply_consume`. Any rejection (seq mismatch, rate-limited, unavailable) snaps local state to the guardian and bails so the next leader tick retries cleanly. Serializes hard reserves to concurrency=1 when the guardian is configured so timestamps arrive monotonic; the baseline (no guardian) keeps the configured cap.

- Step 2 in `process_approved_withdrawal_request_batch` skips the iteration when `capacity_at(checkpoint_timestamp_ms/1000)` is below the aggregate external-out amount. No round-trip.
- Step 3 runs `finalize_withdrawal_through_guardian` after MPC: picks `seq` from `LocalLimiter::validate_consume`, fans out `SignGuardianWithdrawalRequest` BLS signatures to the committee (each validator re-fetches the txn from chain and reconstructs the same `StandardWithdrawalRequest` deterministically), forwards the signed request to the guardian, verifies the response envelope, then `apply_consume`.
- New BridgeService RPC `SignGuardianWithdrawalRequest` + `build_guardian_withdrawal_request` / `compute_withdrawal_wid` helpers in `withdrawals.rs`.
- Guardian side: `RateLimiter::consume` now takes a `wid` (unused for now, prepping the idempotency cache in #463).
- E2E test `test_bitcoin_withdrawal_with_guardian_e2e_flow` asserts `local_limiter().snapshot() == guardian.state.limiter_state()` and `next_seq == 1` after a successful withdrawal.

Follow-ups (known gaps):
- Wid-keyed idempotency cache on `consume` (#463): transient RPC failures currently double-debit on retry.
- Guardian restart safety / S3 rehydrate (#465).
- Step 2/Step 3 timestamp unification via a Move-side change.
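To make the Step 2 / Step 3 split concrete, here is a rough sketch of a local limiter exposing `capacity_at`, `validate_consume`, and `apply_consume`. Only those three method names and `next_seq` come from the commit message. The linear token-bucket refill, the field names, the `ConsumeError` variants, and the signatures are assumptions for illustration, and the real `LocalLimiter` may differ substantially.

```rust
/// Hypothetical token-bucket limiter with linear refill (an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct LocalLimiter {
    max_capacity: u64,
    refill_per_sec: u64,
    available: u64,
    last_updated_secs: u64,
    next_seq: u64,
}

/// Hypothetical local rejection reasons.
#[derive(Debug, PartialEq)]
pub enum ConsumeError {
    RateLimited,
    StaleTimestamp,
}

impl LocalLimiter {
    pub fn new(max_capacity: u64, refill_per_sec: u64) -> Self {
        Self {
            max_capacity,
            refill_per_sec,
            available: max_capacity,
            last_updated_secs: 0,
            next_seq: 0,
        }
    }

    /// Step 2: read-only projection of capacity at `ts_secs`, no round-trip.
    pub fn capacity_at(&self, ts_secs: u64) -> u64 {
        let elapsed = ts_secs.saturating_sub(self.last_updated_secs);
        self.available
            .saturating_add(elapsed.saturating_mul(self.refill_per_sec))
            .min(self.max_capacity)
    }

    /// Step 3 (before the guardian call): check the debit locally and
    /// return the sequence number to put in the request.
    pub fn validate_consume(&self, amount: u64, ts_secs: u64) -> Result<u64, ConsumeError> {
        if ts_secs < self.last_updated_secs {
            return Err(ConsumeError::StaleTimestamp);
        }
        if self.capacity_at(ts_secs) < amount {
            return Err(ConsumeError::RateLimited);
        }
        Ok(self.next_seq)
    }

    /// Step 3 (after the guardian response verifies): commit the debit.
    pub fn apply_consume(&mut self, amount: u64, ts_secs: u64) {
        self.available = self.capacity_at(ts_secs).saturating_sub(amount);
        self.last_updated_secs = ts_secs.max(self.last_updated_secs);
        self.next_seq += 1;
    }
}
```

Keeping `validate_consume` read-only means a rejected or failed guardian call leaves local state untouched, which fits the "snap to the guardian and retry on the next tick" behavior described above.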
0xsiddharthks added a commit that referenced this pull request on Apr 23, 2026
0xsiddharthks added a commit that referenced this pull request on Apr 26, 2026
0xsiddharthks added a commit that referenced this pull request on Apr 26, 2026
0xsiddharthks added a commit that referenced this pull request on Apr 28, 2026
Summary
On a guardian restart the new session generated fresh ephemeral keys and started with a mock `LimiterState` (bucket at max capacity, `next_seq = 0`) plus an empty wid-keyed response cache (PR-3). That combination meant retries of prior-session withdrawals would either be rejected outright (seq mismatch) or double-debit the bucket. This PR closes both gaps.
Log schema
`WithdrawalLogMessage::Success` carries an optional `limiter_state_post`: the full `LimiterState` snapshot after the withdrawal was applied. `#[serde(default)]` keeps the format backward compatible with logs written before this field existed.
KP-side rehydration (`crates/hashi-monitor/src/kp`)
Guardian-side rehydration (`crates/hashi-guardian/src/init`)
Why re-sign on rehydration
The log's `response` field is the unsigned `StandardWithdrawalResponse` (BTC Schnorr sigs for each input). Those sigs are deterministic given the enclave BTC key, which is shared across sessions. Only the Ed25519 envelope (timestamp + signature) is session-specific. Re-signing with the new session's key means clients that fetched the new pubkey via `get_guardian_info` can verify cached retries cleanly.
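The split between a session-independent payload and a session-specific envelope can be sketched as below. This is an illustration under heavy assumptions: `SessionSigner`, `ToySigner`, `LoggedSuccess`, `SignedResponse`, `envelope_message`, and `rehydrate_wid_cache` are hypothetical names, the log record is assumed to carry a prior-session signature over the payload, and the XOR "signature" is a toy placeholder with no cryptographic value, standing in for Ed25519 only so the example stays dependency-free.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over a session's signing key.
pub trait SessionSigner {
    fn sign(&self, msg: &[u8]) -> Vec<u8>;
    fn verify(&self, msg: &[u8], sig: &[u8]) -> bool;
}

/// Toy signer: XORs each byte with a session byte. NOT cryptographic;
/// a real implementation would use Ed25519.
pub struct ToySigner(pub u8);

impl SessionSigner for ToySigner {
    fn sign(&self, msg: &[u8]) -> Vec<u8> {
        msg.iter().map(|b| b ^ self.0).collect()
    }
    fn verify(&self, msg: &[u8], sig: &[u8]) -> bool {
        self.sign(msg) == sig
    }
}

/// Simplified prior-session success log entry (layout assumed).
pub struct LoggedSuccess {
    pub wid: String,
    pub unsigned_response: Vec<u8>,
    pub signature: Vec<u8>,
}

/// Payload plus the session-specific envelope (timestamp + signature).
pub struct SignedResponse {
    pub unsigned_response: Vec<u8>,
    pub timestamp_secs: u64,
    pub signature: Vec<u8>,
}

/// Bytes covered by the envelope signature: timestamp, then payload.
pub fn envelope_message(timestamp_secs: u64, payload: &[u8]) -> Vec<u8> {
    let mut msg = timestamp_secs.to_be_bytes().to_vec();
    msg.extend_from_slice(payload);
    msg
}

/// Verify each logged record against the prior session, then wrap the
/// unchanged payload in a fresh envelope signed by the current session.
/// Records that fail verification are skipped rather than aborting.
pub fn rehydrate_wid_cache(
    logs: &[LoggedSuccess],
    prior: &dyn SessionSigner,
    current: &dyn SessionSigner,
    now_secs: u64,
) -> HashMap<String, SignedResponse> {
    let mut cache = HashMap::new();
    for rec in logs {
        if !prior.verify(&rec.unsigned_response, &rec.signature) {
            continue;
        }
        let signature = current.sign(&envelope_message(now_secs, &rec.unsigned_response));
        cache.insert(
            rec.wid.clone(),
            SignedResponse {
                unsigned_response: rec.unsigned_response.clone(),
                timestamp_secs: now_secs,
                signature,
            },
        );
    }
    cache
}
```

Because the payload bytes are copied through untouched, a retry served from this cache returns the same BTC signatures as the original response; only the envelope changes.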
Tests
All existing unit tests still pass (hashi 240/240, hashi-guardian 17/17, hashi-types 47/47, hashi-monitor 11/11 excluding the pre-existing `lookup_btc_confirmation_with_local_regtest` which requires a local regtest node). Rehydration integration tests require real S3 (or MinIO) so they land with the guardian e2e harness (task #10).
Follow-ups