fix: reduce post-reallocation receipt rejection window during network subgraph polling gap #973
Open
cargopete wants to merge 2 commits into graphprotocol:main from
Context
This PR addresses a partial contributor to the post-reallocation query outage reported by Ellipfra and others. The full outage lasts around 60 minutes and is primarily caused by gateway-side network subgraph lag - the gateway continues issuing receipts with stale allocation IDs for the duration of its own propagation delay, which is not addressable from indexer-rs, unfortunately.
However, there is a shorter initial window (~0–5 minutes) where indexer-service itself contributes errors due to two gaps in how recently-closed allocations are handled at the service layer. This PR attempts to fix those two gaps.
What indexer-service-rs is doing wrong
Indexer logs show:
This error fires in the minutes immediately following a reallocation, then disappears - while gateway errors persist for the full 60 minutes. The indexer-service errors are caused by two issues:
Issue 1: Attestation signer evicted too eagerly (crates/monitor/src/attestation.rs)
modify_signers unconditionally drops any signer not present in the current allocations map.
Between the on-chain closure and the monitor's next successful poll, the signer is gone. If a receipt arrives during this window, the query is served but attestation_middleware returns attestation: null, producing BadResponse(bad attestation: ...) at the gateway.
Issue 2: Receipt eligibility hard-rejects with no grace period (crates/service/src/tap/checks/allocation_eligible.rs)
AllocationEligible::check does a hard lookup against the same watch channel. If the allocation is transiently absent during the polling gap, the receipt is immediately rejected - no query is processed, no attestation is returned.
The existing recently_closed_allocation_buffer_secs config (default: 3600s) was designed to prevent exactly this, but it is only applied to the network subgraph query. It is never threaded to the signer eviction logic or the eligibility check at the service layer.
Fix
Fix 1 adds an evicted_at: HashMap<Address, Instant> tombstone map to modify_signers. Signers are kept alive for grace_period after first eviction rather than being dropped immediately.
Fix 2 adds a recently_seen: HashMap<Address, Instant> local cache to AllocationEligible. Before hard-rejecting, the check consults this cache. If the allocation was confirmed eligible within grace_period, the receipt is accepted.
Both fixes source grace_period from recently_closed_allocation_buffer_secs - no new config surface. Behaviour beyond the grace period is identical to current.
What this does and doesn't fix
The 55-minute tail of the outage requires the gateway to either sync its network subgraph faster or stop issuing receipts with stale allocation IDs after a reallocation. That is a gateway-side fix.
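For the indexer-side part, the tombstone idea behind Fix 1 could look roughly like the sketch below. This is an illustration only, not the code in this PR: the function name retain_with_grace, the Signer placeholder, the [u8; 20] stand-in for Address, and the explicit now parameter (passed in to keep the logic testable) are all assumptions.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-in for the real address type (assumption, keeps the sketch std-only).
type Address = [u8; 20];

// Hypothetical placeholder for an attestation signer.
#[derive(Debug, Clone, PartialEq)]
struct Signer(String);

/// Keeps signers for active allocations, and keeps signers for allocations
/// that disappeared until `grace_period` has elapsed since they were first
/// seen missing. `evicted_at` is the tombstone map.
fn retain_with_grace(
    signers: &mut HashMap<Address, Signer>,
    evicted_at: &mut HashMap<Address, Instant>,
    active: &[Address],
    now: Instant,
    grace_period: Duration,
) {
    // An allocation that is active again no longer needs a tombstone.
    evicted_at.retain(|id, _| !active.contains(id));

    signers.retain(|id, _| {
        if active.contains(id) {
            return true;
        }
        // Record the first time the allocation was missing; later calls
        // reuse that timestamp, so the grace window is not extended.
        let first_missed = *evicted_at.entry(*id).or_insert(now);
        now.saturating_duration_since(first_missed) < grace_period
    });

    // Forget tombstones whose signer has now been dropped.
    evicted_at.retain(|id, _| signers.contains_key(id));
}
```

Passing now explicitly is a choice made for this sketch so that expiry can be exercised without sleeping; real code might call Instant::now() internally.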
Safety
recently_closed_allocation_buffer_secs semantics - tap-agent already aggregates RAVs for these.
AllocationEligible::new defaults to 3600s. Existing call sites require no changes.
Files changed
crates/monitor/src/attestation.rs
crates/service/src/tap/checks/allocation_eligible.rs
crates/service/src/service/router.rs
crates/service/src/tap.rs
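As a rough illustration of the recently_seen cache described under Fix 2, the following sketch shows one way a grace-period fallback could sit in front of a hard rejection. It is not the implementation in allocation_eligible.rs: the EligibilityCache type, its check signature, the currently_eligible flag (standing in for the watch-channel lookup), and the [u8; 20] Address alias are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-in for the real address type (assumption, keeps the sketch std-only).
type Address = [u8; 20];

// Hypothetical local cache of allocations recently confirmed eligible.
struct EligibilityCache {
    recently_seen: HashMap<Address, Instant>,
    grace_period: Duration,
}

impl EligibilityCache {
    fn new(grace_period: Duration) -> Self {
        Self {
            recently_seen: HashMap::new(),
            grace_period,
        }
    }

    /// Returns true if a receipt for `id` should be accepted.
    /// `currently_eligible` stands in for the result of the live lookup.
    fn check(&mut self, id: Address, currently_eligible: bool, now: Instant) -> bool {
        if currently_eligible {
            // Remember the last time this allocation was confirmed eligible.
            self.recently_seen.insert(id, now);
            return true;
        }
        // Fall back to the cache before rejecting.
        let within_grace = self.recently_seen.get(&id).map_or(false, |seen| {
            now.saturating_duration_since(*seen) < self.grace_period
        });
        if !within_grace {
            // Expired or never seen: drop any stale entry and reject.
            self.recently_seen.remove(&id);
        }
        within_grace
    }
}
```

In this sketch an allocation that was never confirmed eligible is still rejected immediately, which mirrors the statement that behaviour beyond the grace period is identical to current.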