
[Feature] Batch Triggers #4190

Open
kaimast wants to merge 24 commits into staging from feat/batch-triggers

Conversation

@kaimast
Contributor

@kaimast kaimast commented Mar 30, 2026

This PR changes the batch creation task to trigger on certain events instead of only periodically. The goal is to avoid cases where a node that is not yet ready to propose waits another full MIN_BATCH_DELAY (1s) instead of waiting only for the relevant conditions. It also ensures that the batch delay timer is reset whenever the node advances.

The most important changes in this PR are in Primary::start_handlers and the new proposal_task.rs module. The README details when the batch creation task is woken up.

Overview of Changes

The major change of this PR is that proposals now occur in a single location -- the proposal task. It is woken up on specific events, such as timer expiration or receiving sufficient certificates for a round.
Unlike before, when it woke up every second and retried independently of round advancement, it now correctly resets its timers after advancing rounds.
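
A minimal sketch of this wake-up pattern (the event names, channel, and try_propose helper below are illustrative, not the actual API of this PR):

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{Instant, sleep_until};

/// Illustrative events that can wake the proposal task early.
enum ProposalEvent {
    /// Sufficient certificates were received for the current round.
    QuorumReached,
    /// The node advanced to a new round.
    RoundAdvanced,
}

const MIN_BATCH_DELAY: Duration = Duration::from_secs(1);

async fn proposal_loop(mut events: mpsc::Receiver<ProposalEvent>) {
    // The delay is anchored to the start of the current round and reset
    // whenever the node advances, instead of ticking independently.
    let mut round_start = Instant::now();
    loop {
        tokio::select! {
            // Wake up once the minimum batch delay for this round elapses.
            _ = sleep_until(round_start + MIN_BATCH_DELAY) => {
                try_propose().await;
                // Restart the delay so this branch does not spin while
                // waiting for the round to advance.
                round_start = Instant::now();
            }
            // Or wake up early on a relevant event.
            Some(event) = events.recv() => match event {
                ProposalEvent::QuorumReached => try_propose().await,
                ProposalEvent::RoundAdvanced => round_start = Instant::now(),
            },
        }
    }
}

async fn try_propose() { /* attempt to cut and broadcast a batch proposal */ }
```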

Minor Changes

  • 6167485 changes some of the time constants to use Duration. This avoids the potential for incorrect conversion of these constants.
  • cefe21b moves the proposal logic to a dedicated proposal_task module. This enables unit tests of its behavior, in particular of when it is woken up.

Testing

End-to-end stress tests passed:
[Screenshot: end-to-end stress test results, March 30, 2026]

@kaimast kaimast force-pushed the feat/batch-triggers branch from 0f2041b to 84e73d5 March 30, 2026 04:30
@kaimast kaimast changed the base branch from staging to chore/update-rand March 30, 2026 22:58
@kaimast kaimast force-pushed the feat/batch-triggers branch from 40ec68f to a51872b March 30, 2026 23:02
@kaimast kaimast force-pushed the chore/update-rand branch from e1694dc to 91692c7 March 30, 2026 23:52
@kaimast kaimast force-pushed the feat/batch-triggers branch from b20642d to a3177c4 March 30, 2026 23:55
Base automatically changed from chore/update-rand to staging March 31, 2026 12:03
@kaimast kaimast marked this pull request as ready for review March 31, 2026 19:39
Collaborator

@vicsn vicsn left a comment

Cool, do you see a lower average blocktime in your stress tests or in the CI devnets?

let mut state = self.sync_state.write();
state.set_sync_height(new_height);
- !state.can_issue_new_block_requests()
+ (state.is_block_synced(), !state.can_issue_new_block_requests())
Collaborator

I may have asked this before, but when would we ever be synced but still able to issue new block requests?

Contributor Author

If you are one block behind you are still considered "synced" but could also request one more block.
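
To make that concrete with hypothetical heights (assuming a sync tolerance of one block): if the local ledger is at height 99 and the best-known peer height is 100, the node is within the tolerance, so is_block_synced() returns true, while block 100 is still outstanding, so can_issue_new_block_requests() also returns true. Once block 100 is applied, the node remains synced and there is nothing left to request.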

Collaborator

Can you write or link some docblocks or a README paragraph giving examples with specific block heights, for either validators or clients, showing how the two concepts evolve?

Collaborator

Ping

Comment thread node/bft/src/lib.rs Outdated
Comment thread node/bft/src/sync/mod.rs
Comment thread node/consensus/src/lib.rs
Comment thread node/metrics/src/names.rs
Comment thread node/bft/src/lib.rs Outdated
pub const MAX_FETCH_TIMEOUT: Duration = Duration::from_secs(3 * MAX_BATCH_DELAY.as_secs());
/// The maximum time allowed for the leader to send their certificate.
/// After this time, the node will consider the leader as failed and try to advance the round without it.
pub const MAX_LEADER_CERTIFICATE_DELAY: Duration = Duration::from_secs(2 * MAX_BATCH_DELAY.as_secs());
Collaborator

I don't believe this value is the same as before. This is now 4 seconds (2 * 2 = 4), whereas it was 5 seconds before (2 * 2500 / 1000 = 5000 / 1000 = 5).
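
A minimal runnable sketch of the truncation, assuming MAX_BATCH_DELAY is 2.5 seconds as the arithmetic above implies:

```rust
use std::time::Duration;

// Assumed value, per the 2 * 2500 / 1000 arithmetic above.
const MAX_BATCH_DELAY: Duration = Duration::from_millis(2500);

// `as_secs()` truncates toward zero, so 2.5s becomes 2s and the derived
// constant silently shrinks from the intended 5s to 4s.
const TRUNCATED: Duration = Duration::from_secs(2 * MAX_BATCH_DELAY.as_secs());

// Deriving via milliseconds keeps the sub-second component.
const EXACT: Duration = Duration::from_millis(2 * MAX_BATCH_DELAY.as_millis() as u64);

fn main() {
    assert_eq!(TRUNCATED, Duration::from_secs(4));
    assert_eq!(EXACT, Duration::from_millis(5000));
}
```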

Contributor Author

Fixed in 942daae. I also added a note to change it to simply = 2 * MAX_BATCH_DELAY once const_trait_impl is stable.

Comment thread node/bft/src/lib.rs Outdated
pub const CREATE_BATCH_INTERVAL: Duration = Duration::from_millis(250);

/// The maximum time to wait before timing out on a fetch.
pub const MAX_FETCH_TIMEOUT: Duration = Duration::from_secs(3 * MAX_BATCH_DELAY.as_secs());
Collaborator

Same truncation issue here

Comment thread node/bft/src/primary.rs
ledger: Arc<dyn LedgerService<N>>,
/// The workers.
workers: Arc<OnceLock<Vec<Worker<N>>>>,

Collaborator

nit: why the new lines here?

Contributor Author

It makes the multi-line doc comments easier to read IMHO, but I can revert it.

Comment thread node/bft/src/primary/proposal_task.rs Outdated
Collaborator

@ljedrz ljedrz left a comment

LGTM in general, though it is a non-negligible increase in the amount of logic; it would be great if in the near future we could generalize the "trigger X if Y or Z happens or a timeout occurs" logic across our use cases.

@kaimast
Contributor Author

kaimast commented Apr 9, 2026

> Do you know what kind of speed-ups we should expect in consensus? Or if there is meaningful impact at all to block time?

I am still trying to figure out a straightforward way of measuring this. My expectation is that the average block time stays about the same but the tail end of block times will be lower.

@vicsn
Collaborator

vicsn commented Apr 14, 2026

Forwarding from @kaimast: this PR introduces a significant improvement in commit latency for light networks (5 validators, weak machines, see image below), and at least no regression for normal prerelease runs.

[Screenshot: commit latency comparison, April 13, 2026]

@vicsn vicsn requested a review from ljedrz April 14, 2026 11:14
Comment thread node/bft/src/primary/proposal_task.rs Outdated
}
}

attempt += 1;
Collaborator

Do we or should we ever decrement attempt?

Collaborator

This is fine; it gets reset with every iteration of the outer loop, once the round advances or a proposal is cut.

Comment thread node/bft/src/primary/proposal_task.rs Outdated
}
};

if attempt > 1 {
Collaborator

Am I reading correctly that when attempt > 1, we first sleep for at least MIN_BATCH_DELAY, and then again for at least CREATE_BATCH_INTERVAL? That seems like too much. Or did you intend for this to be sleep_until(round_start + CREATE_BATCH_INTERVAL) or something?

Maybe you can document the role of attempt better

Contributor Author

The second and subsequent calls to sleep should return immediately, because round_start is only reset when the round number changes.
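
A minimal runnable illustration of that behavior, using tokio's sleep_until with a deadline that stays fixed for the round:

```rust
use std::time::Duration;
use tokio::time::{Instant, sleep_until};

#[tokio::main]
async fn main() {
    // `round_start` is fixed for the lifetime of a round.
    let round_start = Instant::now();
    let min_batch_delay = Duration::from_secs(1);

    // Attempt 1: waits out the remainder of the delay.
    sleep_until(round_start + min_batch_delay).await;

    // Attempt 2 in the same round: the deadline is already in the past,
    // so this returns immediately instead of sleeping again.
    let before = Instant::now();
    sleep_until(round_start + min_batch_delay).await;
    assert!(before.elapsed() < Duration::from_millis(50));
}
```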

Comment thread node/bft/src/primary/proposal_task.rs Outdated
Comment thread node/bft/src/primary.rs Outdated
Comment thread node/consensus/src/lib.rs Outdated
Comment thread node/bft/src/primary/proposal_task.rs Outdated
Collaborator

@ljedrz ljedrz left a comment

Left a new round of comments.

// Final heap use should be close to that after the first connection.
// We allow up to 1 KiB of growth to accommodate small per-task allocations that
// tokio's runtime retains internally across connections (e.g. sharded queue state).
let heap_growth = heap_after_loop.saturating_sub(heap_after_one_conn.unwrap());
Contributor Author

I had this test fail and got the following explanation from Claude:

The ~700-byte heap growth across 9 extra connect/disconnect cycles comes from Data::deserialize() calling tokio::task::spawn_blocking() during signature verification in the handshake (verify_challenge_response). When a signature arrives from the network it is a Data::Buffer(bytes) variant, so snarkvm offloads the cryptographic deserialization to a blocking thread.

Each full connect/disconnect cycle triggers this twice (once on each side of the handshake), so 9 extra cycles = 18 spawn_blocking calls beyond the baseline.

In tokio 1.52.0 (upgraded from 1.51.1 by the cargo update on this branch), PR #7757 replaced the blocking pool's injection queue with a sharded structure to improve scalability. The sharded queue retains a small amount of internal bookkeeping per submitted task even after the task completes — roughly 39 bytes per call × 18 calls ≈ 700 bytes. All of our application-level data structures (peer_pool, resolver, cache, conn.tasks, tcp.tasks) are fully cleaned up after each cycle; the residual allocation is entirely inside tokio's runtime internals.

The 1 KiB assertion bound accounts for this overhead while still being tight enough to catch any real per-connection leak (even accumulating a small struct across 9 connections would far exceed 1 KiB).

@lukasz the fix makes sense to me, but maybe you can also take a quick look.

Collaborator

this is perfectly in line with my past experiences with tokio, where some allocated memory is retained due to plumbing: tokio-rs/tokio#4031

@kaimast
Contributor Author

kaimast commented Apr 15, 2026

I think I addressed all the comments. However, the most recent stress-test run for this branch was not successful, and I need to address that before we can merge the branch.

ljedrz
ljedrz previously approved these changes Apr 16, 2026
Collaborator

@ljedrz ljedrz left a comment

LGTM with the stress-test issue resolved.

raychu86
raychu86 previously approved these changes Apr 21, 2026
@kaimast kaimast dismissed stale reviews from raychu86 and ljedrz via f1af61f April 22, 2026 03:47