
Analyze make chain #1056

Open
zancas wants to merge 15 commits into dev from analyze_make_chain

Conversation

@zancas (Member) commented Apr 25, 2026

Builds on #1055; merge that first.
Fixes: #1038

zancas added 11 commits April 24, 2026 17:31
  Subscribe to the chain-index sync loop's `NamedAtomicStatus` via
  `tokio::sync::watch` so test setup wakes on the first `Ready`
  transition instead of after the next 2 s polling tick. A single
  `sleep(sync_timings.interval)` after rendezvous preserves the
  load-bearing teardown alignment at ~500 ms (down from 2 s); that shim
  disappears when the indexer gains `CancellationToken`-based shutdown.
  `wait_for_indexer_tip` is rewired against the same watch with a
  `tokio::time::timeout` budget.
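
  A minimal sketch of the watch-based rendezvous, assuming a hypothetical
  `Status` enum and helper signature; the real types (`NamedAtomicStatus`,
  the actual `wait_for_indexer_tip`) live in the crate:

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Spawning,
    Ready,
}

async fn wait_for_indexer_tip(
    mut status_rx: watch::Receiver<Status>,
    budget: Duration,
) -> Result<(), tokio::time::error::Elapsed> {
    tokio::time::timeout(budget, async {
        // Resolves on the first Ready transition instead of sleeping
        // through the next fixed polling tick.
        status_rx
            .wait_for(|s| *s == Status::Ready)
            .await
            .expect("status publisher dropped");
    })
    .await
}
```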

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hanges

  Add `BlockchainSource::wait_for_change` (default returns a never-resolving
  future), have the success-path interval sleep `select!` between the timer
  and that future. MockchainSource overrides `wait_for_change` with a
  `Notify` pinged from `mine_blocks`, so the chain-index sync loop wakes
  immediately on a mocked block instead of after the next 500 ms tick.
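
  An illustrative sketch of the shape described above; the trait method,
  fields, and helper names are assumptions modeled on this description,
  not the crate's exact API:

```rust
use std::future::{pending, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

trait BlockchainSource {
    /// Default: a never-resolving future, so real backends pace purely
    /// on the interval timer.
    fn wait_for_change(&self) -> Pin<Box<dyn Future<Output = ()> + Send + '_>> {
        Box::pin(pending::<()>())
    }
}

struct MockchainSource {
    change: Arc<Notify>,
}

impl MockchainSource {
    fn mine_blocks(&self) {
        // ...advance the mock chain, then wake a waiting sync loop.
        self.change.notify_waiters();
    }
}

impl BlockchainSource for MockchainSource {
    fn wait_for_change(&self) -> Pin<Box<dyn Future<Output = ()> + Send + '_>> {
        let change = Arc::clone(&self.change);
        Box::pin(async move { change.notified().await })
    }
}

// The success-path interval sleep becomes a race between the two:
async fn interval_sleep<S: BlockchainSource>(source: &S, interval: Duration) {
    tokio::select! {
        _ = tokio::time::sleep(interval) => {}
        _ = source.wait_for_change() => {}
    }
}
```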

  Per-block test latency drops from ~510 ms (one full `interval`) to ~10 ms
  (sync work only). `sync_blocks_after_startup` ~12 s → ~1–2 s, hitting
  #1039's "under 2 s" target. Real backends keep the default and pace on
  the timer unchanged.

  `NonFinalizedState::initialize` now takes `source.clone()` so the outer
  `source` survives the inner async block's `.await` and is reachable in
  the new `select!` arm.

  Known regression: `get_mempool_stream_for_stale_snapshot` flips from a
  probabilistic flake (#1037) to a deterministic failure. The chain-index
  sync loop now wakes immediately on `mine_blocks`, but the mempool serve
  loop still polls on its 100 ms cadence, so the chain-index/mempool
  tip-skew window is now always observable. The principled fix
  (per-subscriber notify + select! in the mempool loop) is the work
  tracked under #1037 and follows in a TDD'd PR.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion #1

  Rename `get_mempool_stream_for_stale_snapshot` to
  `skew_chain_index_ahead_returns_stale_stream` and add a doc-comment
  identifying it as direction #1 of the #1037 tip-skew suite. The
  companion test for the mempool-ahead direction follows.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  new `mempool_skew` mod (1:1 with `tests/mempool_skew.rs`); add the
  companion direction-#2 test alongside.

  - `mempool_skew::chain_index_ahead_returns_stale_stream` —
    pre-existing direction-#1 spec, deterministically failing post-Option-2
    for the same reason it was a flake before (#1037).
  - `mempool_skew::mempool_ahead_rejects_fresh_snapshot` — new
    direction-#2 spec. Drives the source via `mine_blocks_silent` (no
    chain-index notify) so the mempool's poll observes the new tip first,
    exposing the symmetric false-negative case where the API rejects its
    own freshly-issued snapshot.

  Test-only infrastructure to enable the new test:

  - `MockchainSource::mine_blocks_silent` advances the active height
    without firing the change-notify, the only way to put the chain-index
    *behind* the mempool now that `mine_blocks` wakes the sync loop.
  - `MempoolSubscriber::mempool_tip` (cfg(test)) clones the
    `mempool_chain_tip` watch receiver; forwarder on
    `NodeBackedChainIndexSubscriber::mempool_tip`.
  - `wait_for_mempool_tip_change` helper in `mempool_skew.rs` waits on
    that receiver, gating direction #2 to act exactly in the skew window
    (a minimal sketch follows this list).
  - `mockchain_tests::wait_for_indexer_tip` widened to `pub(super)` so
    the sibling mod can reuse it.
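
  A minimal sketch of the `wait_for_mempool_tip_change` helper's likely
  shape, assuming the tip is published on a tokio `watch` channel; the tip
  type is left generic because the real one is crate-internal:

```rust
use tokio::sync::watch;

async fn wait_for_mempool_tip_change<T>(rx: &mut watch::Receiver<T>) {
    // Mark the current tip as seen, then park until the mempool serve
    // loop publishes a different one.
    rx.borrow_and_update();
    rx.changed().await.expect("mempool tip sender dropped");
}
```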

  Direction #2 is expected to fail until the principled fix lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Add `tips_converge_within_bounded_time_after_mining` — a 20-sample
  property test that mines one block per iteration, awaits the
  chain-index tip via the existing event-driven helper, then samples the
  mempool's tracked tip immediately and counts iterations where the
  mempool still has the pre-mining hash. Asserts `lag_count == 0`.

  Today the chain-index sync loop wakes on `mine_blocks` while the
  mempool serve loop polls on its own 100 ms cadence, so the mempool
  lags in the vast majority of iterations and the test fails
  deterministically. Once the principled fix wakes both subsystems on
  `mine_blocks`, both converge within a single iteration and `lag_count`
  is zero.

  Test-only support:

  - `MockchainSource::active_block_hash` returns the block hash at the
    current active height — needed to compare the mempool's tracked tip
    against the expected post-mining hash.
  - `wait_for_mempool_tip` helper waits for a specific tip hash on the
    mempool's watch (companion to the existing
    `wait_for_mempool_tip_change`); used to resync between iterations so
    the test isn't chasing stacked updates. A sketch follows this list.
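
  A companion sketch under the same assumptions: resolve once the watch
  holds a specific tip, returning immediately if it already does:

```rust
use tokio::sync::watch;

async fn wait_for_mempool_tip<T: PartialEq>(rx: &mut watch::Receiver<T>, want: &T) {
    rx.wait_for(|tip| tip == want)
        .await
        .expect("mempool tip sender dropped");
}
```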

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…shness (#1037 part 1/2)

  `NodeBackedChainIndexSubscriber::get_mempool_stream` previously
  delegated the snapshot-freshness check to `Mempool`, which compared
  the snapshot's tip against the mempool serve loop's
  `mempool_chain_tip`. Because the chain-index sync loop and the mempool
  serve loop poll the source on independent cadences, the two views
  diverge for a window after every block — and the API exposed the
  divergence as caller-visible internal contradictions.

  Move the freshness check into the chain-index subscriber and compare
  against the chain-index's *current* non-finalized tip (synchronous
  `arc-swap` load on `non_finalized_state`, same source of truth as
  `snapshot_nonfinalized_state`). Pass `None` to
  `MempoolSubscriber::get_mempool_stream` so the mempool no longer gates
  on tip — staleness is the chain-index's responsibility.
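
  A sketch of the relocated check; the struct and field names below are
  stand-ins modeled on this description (assuming the `arc-swap` crate):

```rust
use arc_swap::ArcSwap;

type BlockHash = [u8; 32];

struct NonFinalizedState {
    tip_hash: BlockHash,
}

struct ChainIndexSubscriber {
    non_finalized_state: ArcSwap<NonFinalizedState>,
}

impl ChainIndexSubscriber {
    fn snapshot_is_fresh(&self, snapshot_tip: &BlockHash) -> bool {
        // Synchronous load against the same source of truth as
        // snapshot_nonfinalized_state; no await, no mempool view involved.
        self.non_finalized_state.load().tip_hash == *snapshot_tip
    }
}
```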

  This flips the two skew-direction regression tests added under #1037:

  - `mempool_skew::chain_index_ahead_returns_stale_stream` (direction
    #1): FAIL → PASS. After `mine_blocks` + `wait_for_indexer_tip`, the
    chain-index has the new tip while the mempool still has the old one.
    The pre-fix check matched the stale snapshot against the mempool's
    lagging view and accepted it; the new check matches against the
    chain-index's advanced view and correctly returns `None`.

  - `mempool_skew::mempool_ahead_rejects_fresh_snapshot` (direction #2):
    FAIL → PASS. After `mine_blocks_silent` + `wait_for_mempool_tip_change`,
    the mempool has the new tip while the chain-index still has the old
    one. The pre-fix check rejected the freshly-issued snapshot because
    the mempool's tip had moved past it; the new check matches against
    the chain-index's authoritative (lagging) view and correctly returns
    `Some`.

  `mempool_skew::tips_converge_within_bounded_time_after_mining` still
  fails — the mempool serve loop still polls on its own cadence, so it
  lags the chain-index in raw timing. That's the next work and lands in
  part 2/2 (per-subscriber wake on source change via
  `tokio::sync::broadcast`).

  The dead `Some(Err(MempoolError::IncorrectChainTip { .. }))` arm in
  `get_mempool_stream`'s outer match is left in place as defensive code
  — the mempool can no longer return that error from this call site, but
  the arm doesn't cost anything and keeps the match exhaustive against
  future Mempool surface changes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…art 2/2)

    Both the chain-index sync loop and the mempool serve loop pace
    themselves on a fixed-cadence timer; only the chain-index used to wake
    early on source state change (Option-2 `Notify`). After a `mine_blocks`,
    the chain-index would converge in the same iteration but the mempool
    would not pick up the new tip until its 100 ms timer expired. That
    asymmetric lag was the failure mode behind
    `mempool_skew::tips_converge_within_bounded_time_after_mining`.

    Replace `BlockchainSource::wait_for_change` (a single shared
    never-resolves future) with `change_subscribe` returning
    `Option<broadcast::Receiver<()>>`. Each subscriber gets its own
    buffered receiver, so a `mine_blocks` that fires while one subsystem
    is mid-iteration is preserved on its receiver and consumed on the
    next `recv().await` — no missed-wake gap. Push-capable sources (the
    mockchain) override to return `Some`; poll-only sources (real
    validators) keep the `None` default and fall through to the timer.
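
    A sketch of the subscription surface under the same assumed names:

```rust
use tokio::sync::broadcast;

trait BlockchainSource {
    /// Poll-only sources keep this default and pace on the timer.
    fn change_subscribe(&self) -> Option<broadcast::Receiver<()>> {
        None
    }
}

struct MockchainSource {
    change_tx: broadcast::Sender<()>,
}

impl BlockchainSource for MockchainSource {
    fn change_subscribe(&self) -> Option<broadcast::Receiver<()>> {
        // Every subscriber gets its own buffered receiver; an event sent
        // while a subsystem is mid-iteration waits on that receiver until
        // the next recv().await instead of being lost.
        Some(self.change_tx.subscribe())
    }
}
```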

    Fix the actual convergence bug: the post-`send_replace` cool-down at
    the top of the mempool serve loop's tip-change branch was a bare
    `tokio::time::sleep(100ms)` not wired to any wake source. After
    iter-0's mine event the mempool would land in this sleep and stay
    deaf to the broadcast for the next iter-1 mine — so the test saw
    19/20 events lag (iter 0 the only exception, since the mempool
    started in the broadcast-aware bottom `select!` after init). Both
    the cool-down and the inter-iteration sleep now go through the
    shared helper.

    DRY the now-three call sites of the
    `match change_rx { Some => select!(sleep, recv), None => sleep }`
    pattern into `source::wait_or_source_change(change_rx.as_mut(),
    duration)`. The helper is `pub(super)` — only the chain-index sync
    loop and the mempool serve loop need it.
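
    A minimal sketch of the shared helper, matching the signature quoted
    above; the module wrapper mirrors the `source::` call path and the
    rest is assumption:

```rust
mod source {
    use std::time::Duration;
    use tokio::sync::broadcast;

    pub(super) async fn wait_or_source_change(
        change_rx: Option<&mut broadcast::Receiver<()>>,
        duration: Duration,
    ) {
        match change_rx {
            // Push-capable source: wake on whichever fires first. A
            // buffered event resolves recv() immediately, so no wake is
            // lost between iterations.
            Some(rx) => {
                tokio::select! {
                    _ = tokio::time::sleep(duration) => {}
                    _ = rx.recv() => {}
                }
            }
            // Poll-only source: plain timer.
            None => tokio::time::sleep(duration).await,
        }
    }
}
```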

    `mempool_skew::tips_converge_within_bounded_time_after_mining`:
    FAIL (19/20 lags) → PASS (0 lags). The two direction tests
    (`chain_index_ahead_returns_stale_stream`,
    `mempool_ahead_rejects_fresh_snapshot`) keep passing — `mine_blocks_silent`
    still suppresses the broadcast, and the mempool's 100 ms timer alone
    is enough to put it ahead of the chain-index for direction #2.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    `escalates_to_critical_after_persistent_failure` open-coded the exact
    shape of `super::poll::poll_until`: a `loop { sleep; check; assert
    elapsed < deadline }` with a redundant post-loop deadline assert.
    Replace it with a single `poll_until` call — same probe cadence
    (`timings.initial_backoff`), same deadline (`timings.max_backoff_window()
    * 5`), with the deadline panic now handled inside the helper.
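
    One plausible shape for the helper, inferred from this description;
    the real `super::poll::poll_until` may differ in signature:

```rust
use std::time::Duration;
use tokio::time::{sleep, Instant};

async fn poll_until(mut condition: impl FnMut() -> bool, interval: Duration, budget: Duration) {
    let deadline = Instant::now() + budget;
    loop {
        if condition() {
            return;
        }
        // Single deadline check: panic once the budget is exhausted.
        assert!(Instant::now() < deadline, "poll_until: deadline exceeded");
        sleep(interval).await;
    }
}
```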

    Behaviour is preserved with one minor tightening: the original's
    post-loop `elapsed < max_time_to_critical` assert had a flake window
    of up to one `poll_interval` over budget (inner assert passed at
    iter N-1, then sleep + status==CriticalError break at iter N could
    push elapsed past the bound). `poll_until` collapses the two
    deadline checks into one panic on `Instant::now() >= deadline`,
    matching the test's stated intent.

    Brings this test in line with `tip_converges_after_burst_mine` in
    the same module — same helper, same shape. `Instant` and `sleep`
    imports stay; both are still used by
    `survives_transient_source_failure`, which is a soak test (verify
    status stays NOT-CriticalError for 2 s under a transient failure)
    and intentionally does not fit the `poll_until` shape.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Drop initial_backoff 25→10 ms, max_backoff 800→50 ms, and
  max_consecutive_failures 10→5 so escalates_to_critical_after_persistent_failure
  drives its backoff schedule to completion in ~120 ms (4 sleeps:
  10 + 20 + 40 + 50, with the cap clamping the would-be 80 ms doubling)
  instead of ~4 s. The compressed schedule still exercises the same three
  observable behaviours as default(): counter→threshold, exponential
  doubling, and max_backoff clamp.
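
  An illustrative check of the compressed schedule's arithmetic (the
  indexer's real backoff code is not shown here):

```rust
fn backoff_schedule(initial_ms: u64, max_ms: u64, sleeps: u32) -> Vec<u64> {
    let mut delay = initial_ms;
    let mut schedule = Vec::new();
    for _ in 0..sleeps {
        schedule.push(delay);
        delay = (delay * 2).min(max_ms); // exponential doubling, clamped
    }
    schedule
}

// backoff_schedule(10, 50, 4) == [10, 20, 40, 50]; sum = 120 ms, with
// the cap clamping the would-be 80 ms fourth sleep to 50 ms.
```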

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10 ms

  The 100 ms ProptestMockchain delay keeps fs.sync_to_height from
  initializing non_finalized_state before the passthrough_test body
  runs — an invariant the line-109 assertion on StillSyncingFinalizedState
  depends on. 100 ms was a conservative gut number: at that rate,
  sync_to_height(139) takes ~28 s before initializing nfs, orders of
  magnitude more than any test body needs.

  Drop to 10 ms. sync_to_height(139) now takes ~2.8 s before nfs
  initializes, still far longer than any test body (longest is
  passthrough_get_block_range at ~2 s of body work). The invariant is
  preserved with comfortable headroom.

  Direct savings:
  - passthrough_get_block_range: its 10-block serial stream goes from
    10 × 100 ms = 1 s per task → 10 × 10 ms = 100 ms. ~900 ms off wall.
  - All seven passthrough_* tests: Mempool::spawn's three serial source
    probes (get_mempool_txids, get_best_block_hash,
    get_mempool_transactions) drop from 3 × 100 ms to 3 × 10 ms.
    ~270 ms off each.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l::spawn

  Mempool::spawn had three serial startup probes: (1) retry-loop on
  get_mempool_txids waiting for validator mempool online, (2) one-shot
  get_best_block_hash (fatal on error), (3) retry-loop on
  get_mempool_transactions for initial contents.

  (1) and (2) share no data — (2) doesn't depend on mempool liveness, and
  (1) discards its result. Serial ordering added one full source round-trip
  to every NodeBackedChainIndex::new for no semantic reason.

  Run (1) and (2) concurrently via tokio::try_join!. If (2) hard-errors,
  try_join! cancels (1)'s retry-loop and returns the Critical error — same
  overall outcome as before, just without the wait. The 3 s retry backoff
  on (1) is preserved; (2)'s fatal-on-error semantics are preserved.
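
  An illustrative shape of the concurrent gather; the names and types
  below are stand-ins, not the crate's API:

```rust
type BlockHash = [u8; 32];

#[derive(Debug)]
struct Critical(String);

async fn wait_for_mempool_online() -> Result<(), Critical> {
    // (1) retry loop against get_mempool_txids; 3 s backoff preserved.
    Ok(())
}

async fn fetch_best_block_hash() -> Result<BlockHash, Critical> {
    // (2) one-shot probe, fatal on error.
    Ok([0u8; 32])
}

async fn startup_probes() -> Result<BlockHash, Critical> {
    // If (2) errors, try_join! returns immediately and drops (1)'s
    // retry future — same overall outcome as the old serial ordering.
    let ((), best_hash) = tokio::try_join!(wait_for_mempool_online(), fetch_best_block_hash())?;
    Ok(best_hash)
}
```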

  Savings per indexer init:
  - tests (mock source at 10 ms): ~10 ms
  - prod LAN: a few ms
  - prod WAN: 50-200 ms

  Every passthrough_* test, make_chain, and any other test or code path
  that spins up a NodeBackedChainIndex benefits. (3) still runs serially
  after the parallel gather — extracting it for concurrent execution is a
  separate refactor because its helper is wired through the
  partially-built Mempool struct.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zancas zancas requested a review from AloeareV April 25, 2026 03:17
zancas added 4 commits April 24, 2026 20:24
…l::spawn gather

  Follow-up to the previous Mempool::spawn refactor, which parallelized
  the wait-for-online and best-block-hash probes but left the initial
  mempool-snapshot fetch running serially after them. Fold it in so all
  three source-dependent startup calls run concurrently.

  To make it poolable with the other two, extracted
  Mempool::get_mempool_transactions (private method on Mempool<T>) into a
  free fn fetch_mempool_snapshot<T: BlockchainSource>(fetcher: &T) — the
  method only ever touched self.fetcher, so there was no real reason it
  had to be a method. The serve() call site now uses the free fn too.
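
  Extending the earlier sketch's stand-ins (`Critical`, `BlockHash`, the
  two probe fns), the extracted free fn joins the gather so all three
  startup calls overlap:

```rust
struct Snapshot;

trait BlockchainSource {}

async fn fetch_mempool_snapshot<T: BlockchainSource>(_fetcher: &T) -> Result<Snapshot, Critical> {
    Ok(Snapshot)
}

async fn startup_probes_v2<T: BlockchainSource>(
    fetcher: &T,
) -> Result<(BlockHash, Snapshot), Critical> {
    let ((), best_hash, snapshot) = tokio::try_join!(
        wait_for_mempool_online(),
        fetch_best_block_hash(),
        fetch_mempool_snapshot(fetcher),
    )?;
    Ok((best_hash, snapshot))
}
```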

  Two minor behavioural cleanups enabled by this:
  - Initial Mempool status is now constructed as Ready instead of Spawning
    (the snapshot has already been fetched by the time we build the
    struct), matching what the old code did on first-iteration success.
  - Dropped the `state.notify(...)` call that was inside the snapshot
    retry loop — during spawn there are no subscribers attached yet, so
    the notify was unreachable.

  Savings per indexer init:
  - tests (mock source at 10 ms): 30 ms serial → 10 ms parallel, ~20 ms.
  - prod LAN: a few ms.
  - prod WAN: 100–400 ms.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf(test): chain_index::tests::proptest_blockgen::make_chain is slow (~10.3 s)
