
Analyze make chain #1056

Open
zancas wants to merge 15 commits into dev from analyze_make_chain

Conversation

@zancas (Member) commented Apr 25, 2026

Builds on #1055; merge that first.
Fixes: #1038

zancas added 11 commits April 24, 2026 17:31
  Subscribe to the chain-index sync loop's `NamedAtomicStatus` via
  `tokio::sync::watch` so test setup wakes on the first `Ready`
  transition instead of after the next 2 s polling tick. A single
  `sleep(sync_timings.interval)` after rendezvous preserves the
  load-bearing teardown alignment at ~500 ms (down from 2 s); that shim
  disappears when the indexer gains `CancellationToken`-based shutdown.
  `wait_for_indexer_tip` is rewired against the same watch with a
  `tokio::time::timeout` budget.
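
  A minimal sketch of the watch-based rendezvous, assuming a hypothetical
  `Status` enum and helper signature; the real types (`NamedAtomicStatus`,
  the actual `wait_for_indexer_tip`) live in the crate:

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Spawning,
    Ready,
}

async fn wait_for_indexer_tip(
    mut status_rx: watch::Receiver<Status>,
    budget: Duration,
) -> Result<(), tokio::time::error::Elapsed> {
    tokio::time::timeout(budget, async {
        // Resolves on the first Ready transition instead of sleeping
        // through the next fixed polling tick.
        status_rx
            .wait_for(|s| *s == Status::Ready)
            .await
            .expect("status publisher dropped");
    })
    .await
}
```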

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hanges

  Add `BlockchainSource::wait_for_change` (default returns a never-resolving
  future), have the success-path interval sleep `select!` between the timer
  and that future. MockchainSource overrides `wait_for_change` with a
  `Notify` pinged from `mine_blocks`, so the chain-index sync loop wakes
  immediately on a mocked block instead of after the next 500 ms tick.
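
  An illustrative sketch of the shape described above; the trait method,
  fields, and helper names are assumptions modeled on this description,
  not the crate's exact API:

```rust
use std::future::{pending, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

trait BlockchainSource {
    /// Default: a never-resolving future, so real backends pace purely
    /// on the interval timer.
    fn wait_for_change(&self) -> Pin<Box<dyn Future<Output = ()> + Send + '_>> {
        Box::pin(pending::<()>())
    }
}

struct MockchainSource {
    change: Arc<Notify>,
}

impl MockchainSource {
    fn mine_blocks(&self) {
        // ...advance the mock chain, then wake a waiting sync loop.
        self.change.notify_waiters();
    }
}

impl BlockchainSource for MockchainSource {
    fn wait_for_change(&self) -> Pin<Box<dyn Future<Output = ()> + Send + '_>> {
        let change = Arc::clone(&self.change);
        Box::pin(async move { change.notified().await })
    }
}

// The success-path interval sleep becomes a race between the two:
async fn interval_sleep<S: BlockchainSource>(source: &S, interval: Duration) {
    tokio::select! {
        _ = tokio::time::sleep(interval) => {}
        _ = source.wait_for_change() => {}
    }
}
```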

  Per-block test latency drops from ~510 ms (one full `interval`) to ~10 ms
  (sync work only). `sync_blocks_after_startup` ~12 s → ~1–2 s, hitting
  #1039's "under 2 s" target. Real backends keep the default and pace on
  the timer unchanged.

  `NonFinalizedState::initialize` now takes `source.clone()` so the outer
  `source` survives the inner async block's `.await` and is reachable in
  the new `select!` arm.

  Known regression: `get_mempool_stream_for_stale_snapshot` flips from a
  probabilistic flake (#1037) to a deterministic failure. The chain-index
  sync loop now wakes immediately on `mine_blocks`, but the mempool serve
  loop still polls on its 100 ms cadence, so the chain-index/mempool
  tip-skew window is now always observable. The principled fix
  (per-subscriber notify + select! in the mempool loop) is the work
  tracked under #1037 and follows in a TDD'd PR.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion #1

  Rename `get_mempool_stream_for_stale_snapshot` to
  `skew_chain_index_ahead_returns_stale_stream` and add a doc-comment
  identifying it as direction #1 of the #1037 tip-skew suite. The
  companion test for the mempool-ahead direction follows.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  new `mempool_skew` mod (1:1 with `tests/mempool_skew.rs`); add the
  companion direction-#2 test alongside.

  - `mempool_skew::chain_index_ahead_returns_stale_stream` —
    pre-existing direction-#1 spec, deterministically failing post-Option-2
    for the same reason it was a flake before (#1037).
  - `mempool_skew::mempool_ahead_rejects_fresh_snapshot` — new
    direction-#2 spec. Drives the source via `mine_blocks_silent` (no
    chain-index notify) so the mempool's poll observes the new tip first,
    exposing the symmetric false-negative case where the API rejects its
    own freshly-issued snapshot.

  Test-only infrastructure to enable the new test:

  - `MockchainSource::mine_blocks_silent` advances the active height
    without firing the change-notify, the only way to put the chain-index
    *behind* the mempool now that `mine_blocks` wakes the sync loop.
  - `MempoolSubscriber::mempool_tip` (cfg(test)) clones the
    `mempool_chain_tip` watch receiver; forwarder on
    `NodeBackedChainIndexSubscriber::mempool_tip`.
  - `wait_for_mempool_tip_change` helper in `mempool_skew.rs` waits on
    that receiver, gating direction #2 to act exactly in the skew window
    (a minimal sketch follows this list).
  - `mockchain_tests::wait_for_indexer_tip` widened to `pub(super)` so
    the sibling mod can reuse it.
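
  A minimal sketch of the `wait_for_mempool_tip_change` helper's likely
  shape, assuming the tip is published on a tokio `watch` channel; the tip
  type is left generic because the real one is crate-internal:

```rust
use tokio::sync::watch;

async fn wait_for_mempool_tip_change<T>(rx: &mut watch::Receiver<T>) {
    // Mark the current tip as seen, then park until the mempool serve
    // loop publishes a different one.
    rx.borrow_and_update();
    rx.changed().await.expect("mempool tip sender dropped");
}
```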

  Direction #2 is expected to fail until the principled fix lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Add `tips_converge_within_bounded_time_after_mining` — a 20-sample
  property test that mines one block per iteration, awaits the
  chain-index tip via the existing event-driven helper, then samples the
  mempool's tracked tip immediately and counts iterations where the
  mempool still has the pre-mining hash. Asserts `lag_count == 0`.

  Today the chain-index sync loop wakes on `mine_blocks` while the
  mempool serve loop polls on its own 100 ms cadence, so the mempool
  lags in the vast majority of iterations and the test fails
  deterministically. Once the principled fix wakes both subsystems on
  `mine_blocks`, both converge within a single iteration and `lag_count`
  is zero.

  Test-only support:

  - `MockchainSource::active_block_hash` returns the block hash at the
    current active height — needed to compare the mempool's tracked tip
    against the expected post-mining hash.
  - `wait_for_mempool_tip` helper waits for a specific tip hash on the
    mempool's watch (companion to the existing
    `wait_for_mempool_tip_change`); used to resync between iterations so
    the test isn't chasing stacked updates. A sketch follows this list.
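
  A companion sketch under the same assumptions: resolve once the watch
  holds a specific tip, returning immediately if it already does:

```rust
use tokio::sync::watch;

async fn wait_for_mempool_tip<T: PartialEq>(rx: &mut watch::Receiver<T>, want: &T) {
    rx.wait_for(|tip| tip == want)
        .await
        .expect("mempool tip sender dropped");
}
```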

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…shness (#1037 part 1/2)

  `NodeBackedChainIndexSubscriber::get_mempool_stream` previously
  delegated the snapshot-freshness check to `Mempool`, which compared
  the snapshot's tip against the mempool serve loop's
  `mempool_chain_tip`. Because the chain-index sync loop and the mempool
  serve loop poll the source on independent cadences, the two views
  diverge for a window after every block — and the API exposed the
  divergence as caller-visible internal contradictions.

  Move the freshness check into the chain-index subscriber and compare
  against the chain-index's *current* non-finalized tip (synchronous
  `arc-swap` load on `non_finalized_state`, same source of truth as
  `snapshot_nonfinalized_state`). Pass `None` to
  `MempoolSubscriber::get_mempool_stream` so the mempool no longer gates
  on tip — staleness is the chain-index's responsibility.
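
  A sketch of the relocated check; the struct and field names below are
  stand-ins modeled on this description (assuming the `arc-swap` crate):

```rust
use arc_swap::ArcSwap;

type BlockHash = [u8; 32];

struct NonFinalizedState {
    tip_hash: BlockHash,
}

struct ChainIndexSubscriber {
    non_finalized_state: ArcSwap<NonFinalizedState>,
}

impl ChainIndexSubscriber {
    fn snapshot_is_fresh(&self, snapshot_tip: &BlockHash) -> bool {
        // Synchronous load against the same source of truth as
        // snapshot_nonfinalized_state; no await, no mempool view involved.
        self.non_finalized_state.load().tip_hash == *snapshot_tip
    }
}
```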

  This flips the two skew-direction regression tests added under #1037:

  - `mempool_skew::chain_index_ahead_returns_stale_stream` (direction
    #1): FAIL → PASS. After `mine_blocks` + `wait_for_indexer_tip`, the
    chain-index has the new tip while the mempool still has the old one.
    The pre-fix check matched the stale snapshot against the mempool's
    lagging view and accepted it; the new check matches against the
    chain-index's advanced view and correctly returns `None`.

  - `mempool_skew::mempool_ahead_rejects_fresh_snapshot` (direction #2):
    FAIL → PASS. After `mine_blocks_silent` + `wait_for_mempool_tip_change`,
    the mempool has the new tip while the chain-index still has the old
    one. The pre-fix check rejected the freshly-issued snapshot because
    the mempool's tip had moved past it; the new check matches against
    the chain-index's authoritative (lagging) view and correctly returns
    `Some`.

  `mempool_skew::tips_converge_within_bounded_time_after_mining` still
  fails — the mempool serve loop still polls on its own cadence, so it
  lags the chain-index in raw timing. That's the next work and lands in
  part 2/2 (per-subscriber wake on source change via
  `tokio::sync::broadcast`).

  The dead `Some(Err(MempoolError::IncorrectChainTip { .. }))` arm in
  `get_mempool_stream`'s outer match is left in place as defensive code
  — the mempool can no longer return that error from this call site, but
  the arm doesn't cost anything and keeps the match exhaustive against
  future Mempool surface changes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…art 2/2)

    Both the chain-index sync loop and the mempool serve loop pace
    themselves on a fixed-cadence timer; only the chain-index used to wake
    early on source state change (Option-2 `Notify`). After a `mine_blocks`,
    the chain-index would converge in the same iteration but the mempool
    would not pick up the new tip until its 100 ms timer expired. That
    asymmetric lag was the failure mode behind
    `mempool_skew::tips_converge_within_bounded_time_after_mining`.

    Replace `BlockchainSource::wait_for_change` (a single shared
    never-resolves future) with `change_subscribe` returning
    `Option<broadcast::Receiver<()>>`. Each subscriber gets its own
    buffered receiver, so a `mine_blocks` that fires while one subsystem
    is mid-iteration is preserved on its receiver and consumed on the
    next `recv().await` — no missed-wake gap. Push-capable sources (the
    mockchain) override to return `Some`; poll-only sources (real
    validators) keep the `None` default and fall through to the timer.
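
    A sketch of the subscription surface under the same assumed names:

```rust
use tokio::sync::broadcast;

trait BlockchainSource {
    /// Poll-only sources keep this default and pace on the timer.
    fn change_subscribe(&self) -> Option<broadcast::Receiver<()>> {
        None
    }
}

struct MockchainSource {
    change_tx: broadcast::Sender<()>,
}

impl BlockchainSource for MockchainSource {
    fn change_subscribe(&self) -> Option<broadcast::Receiver<()>> {
        // Every subscriber gets its own buffered receiver; an event sent
        // while a subsystem is mid-iteration waits on that receiver until
        // the next recv().await instead of being lost.
        Some(self.change_tx.subscribe())
    }
}
```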

    Fix the actual convergence bug: the post-`send_replace` cool-down at
    the top of the mempool serve loop's tip-change branch was a bare
    `tokio::time::sleep(100ms)` not wired to any wake source. After
    iter-0's mine event the mempool would land in this sleep and stay
    deaf to the broadcast for the next iter-1 mine — so the test saw
    19/20 events lag (iter 0 the only exception, since the mempool
    started in the broadcast-aware bottom `select!` after init). Both
    the cool-down and the inter-iteration sleep now go through the
    shared helper.

    DRY the now-three call sites of the
    `match change_rx { Some => select!(sleep, recv), None => sleep }`
    pattern into `source::wait_or_source_change(change_rx.as_mut(),
    duration)`. The helper is `pub(super)` — only the chain-index sync
    loop and the mempool serve loop need it.
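
    A minimal sketch of the shared helper, matching the signature quoted
    above; the module wrapper mirrors the `source::` call path and the
    rest is assumption:

```rust
mod source {
    use std::time::Duration;
    use tokio::sync::broadcast;

    pub(super) async fn wait_or_source_change(
        change_rx: Option<&mut broadcast::Receiver<()>>,
        duration: Duration,
    ) {
        match change_rx {
            // Push-capable source: wake on whichever fires first. A
            // buffered event resolves recv() immediately, so no wake is
            // lost between iterations.
            Some(rx) => {
                tokio::select! {
                    _ = tokio::time::sleep(duration) => {}
                    _ = rx.recv() => {}
                }
            }
            // Poll-only source: plain timer.
            None => tokio::time::sleep(duration).await,
        }
    }
}
```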

    `mempool_skew::tips_converge_within_bounded_time_after_mining`:
    FAIL (19/20 lags) → PASS (0 lags). The two direction tests
    (`chain_index_ahead_returns_stale_stream`,
    `mempool_ahead_rejects_fresh_snapshot`) keep passing — `mine_blocks_silent`
    still suppresses the broadcast, and the mempool's 100 ms timer alone
    is enough to put it ahead of the chain-index for direction #2.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    `escalates_to_critical_after_persistent_failure` open-coded the exact
    shape of `super::poll::poll_until`: a `loop { sleep; check; assert
    elapsed < deadline }` with a redundant post-loop deadline assert.
    Replace it with a single `poll_until` call — same probe cadence
    (`timings.initial_backoff`), same deadline (`timings.max_backoff_window()
    * 5`), with the deadline panic now handled inside the helper.
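
    One plausible shape for the helper, inferred from this description;
    the real `super::poll::poll_until` may differ in signature:

```rust
use std::time::Duration;
use tokio::time::{sleep, Instant};

async fn poll_until(mut condition: impl FnMut() -> bool, interval: Duration, budget: Duration) {
    let deadline = Instant::now() + budget;
    loop {
        if condition() {
            return;
        }
        // Single deadline check: panic once the budget is exhausted.
        assert!(Instant::now() < deadline, "poll_until: deadline exceeded");
        sleep(interval).await;
    }
}
```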

    Behaviour is preserved with one minor tightening: the original's
    post-loop `elapsed < max_time_to_critical` assert had a flake window
    of up to one `poll_interval` over budget (inner assert passed at
    iter N-1, then sleep + status==CriticalError break at iter N could
    push elapsed past the bound). `poll_until` collapses the two
    deadline checks into one panic on `Instant::now() >= deadline`,
    matching the test's stated intent.

    Brings this test in line with `tip_converges_after_burst_mine` in
    the same module — same helper, same shape. `Instant` and `sleep`
    imports stay; both are still used by
    `survives_transient_source_failure`, which is a soak test (verify
    status stays NOT-CriticalError for 2 s under a transient failure)
    and intentionally does not fit the `poll_until` shape.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Drop initial_backoff 25→10 ms, max_backoff 800→50 ms, and
  max_consecutive_failures 10→5 so escalates_to_critical_after_persistent_failure
  drives its backoff schedule to completion in ~120 ms (4 sleeps:
  10 + 20 + 40 + 50, with the cap clamping the would-be 80 ms doubling)
  instead of ~4 s. The compressed schedule still exercises the same three
  observable behaviours as default(): counter→threshold, exponential
  doubling, and max_backoff clamp.
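
  An illustrative check of the compressed schedule's arithmetic (the
  indexer's real backoff code is not shown here):

```rust
fn backoff_schedule(initial_ms: u64, max_ms: u64, sleeps: u32) -> Vec<u64> {
    let mut delay = initial_ms;
    let mut schedule = Vec::new();
    for _ in 0..sleeps {
        schedule.push(delay);
        delay = (delay * 2).min(max_ms); // exponential doubling, clamped
    }
    schedule
}

// backoff_schedule(10, 50, 4) == [10, 20, 40, 50]; sum = 120 ms, with
// the cap clamping the would-be 80 ms fourth sleep to 50 ms.
```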

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10 ms

  The 100 ms ProptestMockchain delay keeps fs.sync_to_height from
  initializing non_finalized_state before the passthrough_test body
  runs — an invariant the line-109 assertion on StillSyncingFinalizedState
  depends on. 100 ms was a conservative gut number: at that rate,
  sync_to_height(139) takes ~28 s before initializing nfs, orders of
  magnitude more than any test body needs.

  Drop to 10 ms. sync_to_height(139) now takes ~2.8 s before nfs
  initializes, still far longer than any test body (longest is
  passthrough_get_block_range at ~2 s of body work). The invariant is
  preserved with comfortable headroom.

  Direct savings:
  - passthrough_get_block_range: its 10-block serial stream goes from
    10 × 100 ms = 1 s per task → 10 × 10 ms = 100 ms. ~900 ms off wall.
  - All seven passthrough_* tests: Mempool::spawn's three serial source
    probes (get_mempool_txids, get_best_block_hash,
    get_mempool_transactions) drop from 3 × 100 ms to 3 × 10 ms.
    ~270 ms off each.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l::spawn

  Mempool::spawn had three serial startup probes: (1) retry-loop on
  get_mempool_txids waiting for validator mempool online, (2) one-shot
  get_best_block_hash (fatal on error), (3) retry-loop on
  get_mempool_transactions for initial contents.

  (1) and (2) share no data — (2) doesn't depend on mempool liveness, and
  (1) discards its result. Serial ordering added one full source round-trip
  to every NodeBackedChainIndex::new for no semantic reason.

  Run (1) and (2) concurrently via tokio::try_join!. If (2) hard-errors,
  try_join! cancels (1)'s retry-loop and returns the Critical error — same
  overall outcome as before, just without the wait. The 3 s retry backoff
  on (1) is preserved; (2)'s fatal-on-error semantics are preserved.
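
  An illustrative shape of the concurrent gather; the names and types
  below are stand-ins, not the crate's API:

```rust
type BlockHash = [u8; 32];

#[derive(Debug)]
struct Critical(String);

async fn wait_for_mempool_online() -> Result<(), Critical> {
    // (1) retry loop against get_mempool_txids; 3 s backoff preserved.
    Ok(())
}

async fn fetch_best_block_hash() -> Result<BlockHash, Critical> {
    // (2) one-shot probe, fatal on error.
    Ok([0u8; 32])
}

async fn startup_probes() -> Result<BlockHash, Critical> {
    // If (2) errors, try_join! returns immediately and drops (1)'s
    // retry future — same overall outcome as the old serial ordering.
    let ((), best_hash) = tokio::try_join!(wait_for_mempool_online(), fetch_best_block_hash())?;
    Ok(best_hash)
}
```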

  Savings per indexer init:
  - tests (mock source at 10 ms): ~10 ms
  - prod LAN: a few ms
  - prod WAN: 50-200 ms

  Every passthrough_* test, make_chain, and any other test or code path
  that spins up a NodeBackedChainIndex benefits. (3) still runs serially
  after the parallel gather — extracting it for concurrent execution is a
  separate refactor because its helper is wired through the
  partially-built Mempool struct.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zancas zancas requested a review from AloeareV April 25, 2026 03:17
zancas added 4 commits April 24, 2026 20:24
…l::spawn gather

  Follow-up to the previous Mempool::spawn refactor, which parallelized
  the wait-for-online and best-block-hash probes but left the initial
  mempool-snapshot fetch running serially after them. Fold it in so all
  three source-dependent startup calls run concurrently.

  To make it poolable with the other two, extracted
  Mempool::get_mempool_transactions (private method on Mempool<T>) into a
  free fn fetch_mempool_snapshot<T: BlockchainSource>(fetcher: &T) — the
  method only ever touched self.fetcher, so there was no real reason it
  had to be a method. The serve() call site now uses the free fn too.
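
  Extending the earlier sketch's stand-ins (`Critical`, `BlockHash`, the
  two probe fns), the extracted free fn joins the gather so all three
  startup calls overlap:

```rust
struct Snapshot;

trait BlockchainSource {}

async fn fetch_mempool_snapshot<T: BlockchainSource>(_fetcher: &T) -> Result<Snapshot, Critical> {
    Ok(Snapshot)
}

async fn startup_probes_v2<T: BlockchainSource>(
    fetcher: &T,
) -> Result<(BlockHash, Snapshot), Critical> {
    let ((), best_hash, snapshot) = tokio::try_join!(
        wait_for_mempool_online(),
        fetch_best_block_hash(),
        fetch_mempool_snapshot(fetcher),
    )?;
    Ok((best_hash, snapshot))
}
```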

  Two minor behavioural cleanups enabled by this:
  - Initial Mempool status is now constructed as Ready instead of Spawning
    (the snapshot has already been fetched by the time we build the
    struct), matching what the old code did on first-iteration success.
  - Dropped the `state.notify(...)` call that was inside the snapshot
    retry loop — during spawn there are no subscribers attached yet, so
    the notify was unreachable.

  Savings per indexer init:
  - tests (mock source at 10 ms): 30 ms serial → 10 ms parallel, ~20 ms.
  - prod LAN: a few ms.
  - prod WAN: 100–400 ms.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf(test): chain_index::tests::proptest_blockgen::make_chain is slow (~10.3 s)
