
fix: reduce post-reallocation receipt rejection window during network subgraph polling gap #973

Open
cargopete wants to merge 2 commits into graphprotocol:main from cargopete:fix/post-reallocation-query-outage

cargopete commented Mar 13, 2026

Context

This PR addresses a partial contributor to the post-reallocation query outage reported by Ellipfra and others. The full outage lasts around 60 minutes and is primarily caused by gateway-side network subgraph lag - the gateway continues issuing receipts with stale allocation IDs for the duration of its own propagation delay, which is not addressable from indexer-rs, unfortunately.

However, there is a shorter initial window (~0–5 minutes) where indexer-service itself contributes errors due to two gaps in how recently-closed allocations are handled at the service layer. This PR attempts to fix those two gaps.

What indexer-service-rs is doing wrong

Indexer logs show:

Receipt allocation ID `0xblahblah...` is not eligible for this indexer

This error fires in the minutes immediately following a reallocation, then disappears - while gateway errors persist for the full 60 minutes. The indexer-service errors are caused by two issues:

Issue 1 — Attestation signer evicted too eagerly (crates/monitor/src/attestation.rs)

modify_signers unconditionally drops any signer not present in the current allocations map:

signers.retain(|id, _| allocations.contains_key(id));

Between the on-chain closure and the monitor's next successful poll, the signer is gone. If a receipt arrives during this window, the query is served but attestation_middleware returns attestation: null, producing BadResponse(bad attestation: ...) at the gateway.

Issue 2 — Receipt eligibility hard-rejects with no grace period (crates/service/src/tap/checks/allocation_eligible.rs)

AllocationEligible::check does a hard lookup against the same watch channel. If the allocation is transiently absent during the polling gap, the receipt is immediately rejected - no query is processed, no attestation is returned.

The existing recently_closed_allocation_buffer_secs config (default: 3600s) was designed to prevent exactly this, but it is only applied to the network subgraph query. It is never threaded to the signer eviction logic or the eligibility check at the service layer.

Fix

Fix 1 adds an evicted_at: HashMap<Address, Instant> tombstone map to modify_signers. Signers are kept alive for grace_period after first eviction rather than being dropped immediately.
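As a rough illustration of the tombstone idea in Fix 1, the sketch below keeps a signer for `grace_period` after its allocation first goes missing. It is not the code from this PR: `retain_with_grace`, the `AllocationId` alias (a stand-in for `Address`), the generic signer type `S`, and the explicit `now` parameter are all assumptions made to keep the example self-contained.

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

// Hypothetical stand-in for the `Address` key type used in the real code.
type AllocationId = [u8; 20];

/// Keep signers whose allocation is still active, or whose allocation
/// first went missing less than `grace_period` ago.
/// `S` is a placeholder for the real signer type.
fn retain_with_grace<S>(
    signers: &mut HashMap<AllocationId, S>,
    evicted_at: &mut HashMap<AllocationId, Instant>,
    active: &HashSet<AllocationId>,
    grace_period: Duration,
    now: Instant,
) {
    // An allocation that is active again no longer needs a tombstone.
    evicted_at.retain(|id, _| !active.contains(id));

    signers.retain(|id, _| {
        if active.contains(id) {
            return true;
        }
        // Record the first time the allocation was seen missing, then
        // keep the signer until the grace period has elapsed.
        let first_missing = *evicted_at.entry(*id).or_insert(now);
        now.duration_since(first_missing) < grace_period
    });

    // Drop tombstones whose signer has now been removed.
    evicted_at.retain(|id, _| signers.contains_key(id));
}
```

Passing `now` explicitly is only a convenience for testing the sketch; the tombstone is written once, on the first poll where the allocation is absent, so repeated polls do not extend the grace period.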

Fix 2 adds a recently_seen: HashMap<Address, Instant> local cache to AllocationEligible. Before hard-rejecting, the check consults this cache. If the allocation was confirmed eligible within grace_period, the receipt is accepted.
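The cache lookup in Fix 2 could look roughly like the sketch below. Again, this is an illustration under assumptions rather than the PR's implementation: `EligibilityCache`, `is_eligible`, the `AllocationId` alias, and passing the active set and `now` as arguments are hypothetical simplifications (the real check reads allocations from a watch channel).

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

// Hypothetical stand-in for the `Address` key type used in the real code.
type AllocationId = [u8; 20];

/// Hypothetical helper mirroring the `recently_seen` cache described above.
struct EligibilityCache {
    recently_seen: HashMap<AllocationId, Instant>,
    grace_period: Duration,
}

impl EligibilityCache {
    fn new(grace_period: Duration) -> Self {
        Self {
            recently_seen: HashMap::new(),
            grace_period,
        }
    }

    /// Returns true if the allocation is currently active, or was
    /// confirmed active less than `grace_period` ago.
    fn is_eligible(
        &mut self,
        id: AllocationId,
        active: &HashSet<AllocationId>,
        now: Instant,
    ) -> bool {
        if active.contains(&id) {
            // Confirmed eligible: remember when we last saw it.
            self.recently_seen.insert(id, now);
            return true;
        }
        let grace = self.grace_period;
        let within_grace = self
            .recently_seen
            .get(&id)
            .map_or(false, |&last_seen| now.duration_since(last_seen) < grace);
        if !within_grace {
            // Expired or never seen: fall back to the hard reject.
            self.recently_seen.remove(&id);
        }
        within_grace
    }
}
```

In this sketch a grace-period hit does not refresh the timestamp, so the window is always measured from the last time the allocation was actually confirmed eligible.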

Both fixes source grace_period from recently_closed_allocation_buffer_secs - no new config surface. Behaviour beyond the grace period is identical to current.

What this does and doesn't fix

| Window | Before this PR | After this PR |
| --- | --- | --- |
| Minutes 0–5 (monitor polling gap) | Indexer-service rejects receipts and drops signers | Receipts accepted, responses correctly attested |
| Minutes 5–60 (gateway propagation lag) | Gateway errors persist | Gateway errors persist - not fixed here |

The 55-minute tail of the outage requires the gateway to either sync its network subgraph faster or stop issuing receipts with stale allocation IDs after a reallocation. That is a gateway-side fix.

Safety

  • No payment path changes. Accepting receipts for recently-closed allocations within the grace period is consistent with existing recently_closed_allocation_buffer_secs semantics - tap-agent already aggregates RAVs for these.
  • AttestationSigner construction is deterministic and purely local. Retaining it for the grace period does not affect signing correctness.
  • Hard-reject resumes after grace period expiry, identical to current behaviour.
  • AllocationEligible::new defaults to 3600s. Existing call sites require no changes.

Files changed

  • crates/monitor/src/attestation.rs
  • crates/service/src/tap/checks/allocation_eligible.rs
  • crates/service/src/service/router.rs
  • crates/service/src/tap.rs

cargopete marked this pull request as ready for review March 13, 2026 10:26