[Perf] Dedicated sync streams #4232

Draft
ljedrz wants to merge 4 commits into staging from perf/dedicated_sync_stream2

Conversation

Collaborator

@ljedrz ljedrz commented Apr 24, 2026

This PR is a draft implementation of dedicated sync streams listed in the node hardening issue, as an alternative to the general chunking proposal. Initially, it applies only to the validators, which will benefit from it the most.

The main reasons for the proposal are:

  • increasing the performance of consensus (by reducing Head-of-Line blocking caused by large network messages)
  • reducing DoS surface (by greatly reducing the maximum network message size)

Benefits over chunking:

  • simpler non-sync network plumbing
  • the potential to reduce the maximum network message size to a much greater degree

The general approach is as follows: the Gateway's Event gains 2 new variants, SyncRequest and SyncResponse (the existing BlockRequest and BlockResponse are kept in order to temporarily maintain backward compatibility). When a node receives a SyncRequest, it responds with the address of a dedicated sync stream, along with a short-lived access token that must be used in order to establish the connection. Once the connection is established, the responder sends BlockResponse messages through the dedicated stream, and the existing BlockSync plumbing handles the rest. Once the maximum number of responses per sync stream (if such a limit is desired) has been sent, a new sync stream needs to be opened in order to receive more blocks.
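To make the exchange concrete, here is a minimal, hypothetical sketch of the two new variants and the request/response flow; the field names, the simplified BlockRequest, and the helper function are assumptions for illustration, not the PR's actual definitions:

```rust
use std::net::SocketAddr;

// NOTE: a heavily simplified sketch of the exchange described above; the
// field names and layouts are assumptions, not the PR's real types.

#[derive(Debug, Clone, PartialEq)]
pub struct BlockRequest {
    pub start_height: u32, // inclusive
    pub end_height: u32,   // exclusive
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SyncToken(pub [u8; 32]);

#[derive(Debug, Clone, PartialEq)]
pub struct SyncResponse {
    /// Where the responder's dedicated sync-stream listener can be reached.
    pub listener_addr: SocketAddr,
    /// The short-lived access token required to establish the connection.
    pub token: SyncToken,
}

/// The Gateway's `Event`, extended with the two new variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    BlockRequest(BlockRequest),
    SyncRequest(BlockRequest),
    SyncResponse(SyncResponse),
    // ...the remaining existing variants are elided...
}

/// The responder's side of the exchange: upon a `SyncRequest`, reply with the
/// sync-stream address and a freshly issued access token.
pub fn handle_sync_request(listener_addr: SocketAddr, token: SyncToken) -> Event {
    Event::SyncResponse(SyncResponse { listener_addr, token })
}
```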

The rough list of code changes, enumerated for simpler referencing if need be:

  1. Some of the network messages are moved to snarkos-node-network to avoid circular dependencies (it will also make sense for the currently Gateway-only messages to reside there once this syncing is extended to non-validators).
  2. New structs, SyncToken and SyncResponse, are introduced.
  3. The Event is extended with 2 new variants, SyncRequest (holding a BlockRequest) and SyncResponse (which holds the new SyncResponse struct).
  4. A SyncStreams object is introduced; it is essentially a node that requires no special peer handling or address resolution, and has a trivial handshake. Just like the Gateway, it contains a clone of sync-related Senders, and the LedgerService. Using a node makes stream handling a lot simpler, and - compared to ad-hoc streams - reduces potential NAT/firewall issues (since only a single listener port is involved). The Tcp node is very lightweight, and so are its connections.
  5. The BlockSync plumbing is adjusted to account for the new logic.

The current state of the PR: nodes can successfully establish dedicated sync streams and send/receive blocks, but I'm running into design details of the current BlockSync setup that are incompatible with the new approach; the problematic spots can be seen commented out in block_sync.rs. The issues I've identified thus far are:

  • the sync requests are matched with addresses associated with BlockLocators (as opposed to the dedicated sync streams)
  • the sync responses are currently distributed in a somewhat "fanout" fashion; instead, we should now also be prepared to send many BlockResponse (or even just Block) messages to a single peer via a single stream; the expected block ranges should also be much larger, so as to minimize the number of SyncRequest messages the Gateway needs to handle

Once these are resolved, the related tests will also need some adjustment.
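To illustrate the second point above, a purely hypothetical helper that splits one large requested range into the smaller per-response batches sent over a single stream (the function and the batch-size parameter are assumptions, not part of the PR):

```rust
/// Split a requested block range [start, end) into consecutive batches of at
/// most `batch` blocks each; one large BlockRequest could then be served by
/// many BlockResponse messages sent over a single dedicated sync stream.
fn batch_ranges(start: u32, end: u32, batch: u32) -> Vec<(u32, u32)> {
    assert!(batch > 0, "the batch size must be positive");
    (start..end)
        .step_by(batch as usize)
        .map(|batch_start| (batch_start, (batch_start + batch).min(end)))
        .collect()
}
```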

@kaimast please let me know if you have ideas on how the aforementioned BlockSync integration issues can be solved while maintaining backward compatibility, or suggest how this logic could be delegated elsewhere; feel free to commit to this PR if you'd like to integrate these changes with BlockSync in a way that's aligned with your design and future plans.

Open questions:

  1. Do we want to "cycle" through several streams while syncing? It might be unnecessary, since at that point we're already past twofold authentication (the validator handshake + the access token). Since the sync streams are not used for anything else and are lightweight, I see little harm in it.
  2. Should we limit the requestor to a single sync stream? This is a problem if we wanted to have high syncing performance while cycling streams (as we couldn't begin a new stream before concluding the existing one).
  3. Do we want to limit the number of requested blocks? This needs to be weighed with the desired syncing redundancy factor and the performance implications it has for the providers. Note: dedicated sync streams are by design more robust than singular requests for blocks, so we may not need as much redundancy anymore.
  4. The values for some of the consts.

ljedrz added 3 commits April 24, 2026 15:32
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@ljedrz ljedrz requested review from kaimast and vicsn April 24, 2026 14:47

@vicsn vicsn left a comment


1./2. I would try to optimize for simplicity of implementation. With the foundations in place, we can later balance robustness and performance more formally under various scenarios.

3./4. Can you take a first stab, in a Google doc, at the limits we choose? We could try to allocate, say, roughly 10 GB at any given time for all peers combined, assuming the current average block size if needed.

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SyncToken([u8; 32]);
Collaborator


Can you document what this is and how it works? Are we exposed to worse MITM attacks compared to the Gateway handshake?

Collaborator Author


The token is sent in plaintext and could be observed or tampered with by an inline MITM, yes. However, it is strictly an anti-DoS mechanism to prevent unauthorized resource consumption on the validator, not a cryptographic session key. The integrity of the downloaded blocks is guaranteed by the hashes and signatures, not by the transport layer, so any MITM tampering would be instantly caught and rejected.
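To illustrate the anti-DoS role (and not the PR's actual implementation), here is a sketch of a short-lived, single-use token issuer; the TokenIssuer type, its TTL handling, and the counter-based token bytes are all assumptions, and a real issuer would fill the token from a CSPRNG:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SyncToken([u8; 32]);

/// Hypothetical issuer: tokens are short-lived and consumed on first use, so
/// an observed or replayed token has a narrow window of abuse at most.
pub struct TokenIssuer {
    ttl: Duration,
    issued: HashMap<SyncToken, Instant>,
    counter: u64, // stand-in for a CSPRNG in this sketch
}

impl TokenIssuer {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, issued: HashMap::new(), counter: 0 }
    }

    /// Issue a fresh token and record its creation time.
    pub fn issue(&mut self) -> SyncToken {
        self.counter += 1;
        let mut bytes = [0u8; 32];
        bytes[..8].copy_from_slice(&self.counter.to_le_bytes());
        let token = SyncToken(bytes);
        self.issued.insert(token.clone(), Instant::now());
        token
    }

    /// Accept a token at most once, and only within its TTL.
    pub fn redeem(&mut self, token: &SyncToken) -> bool {
        match self.issued.remove(token) {
            Some(created) => created.elapsed() <= self.ttl,
            None => false,
        }
    }
}
```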

Comment thread: node/sync/src/node.rs
#[async_trait]
impl<N: Network> OnConnect for SyncStreams<N> {
async fn on_connect(&self, peer_addr: SocketAddr) {
// Check if we're the ones who provide the sync.
Collaborator


Can two peers sync from each other concurrently? Should we add a test for these edge cases?

Collaborator Author


This behavior is controlled by the BlockSync logic and remains unchanged. I don't see how such a scenario would be useful, so I'm fairly sure it is disallowed. As for the comment: the check is only required due to the modular nature of the Tcp plumbing. At the point of OnConnect::on_connect we don't readily know which side of the connection we are on (though the Tcp could expose this if needed), so we look up the applicable block request (which we would need anyway) in order to check it.

Comment on lines 118 to +124
  // Retrieve the start (inclusive) and end (exclusive) block height.
- let candidate_start_height = self.first().map(|b| b.height()).unwrap_or(0);
- let candidate_end_height = 1 + self.last().map(|b| b.height()).unwrap_or(0);
+ // let candidate_start_height = self.first().map(|b| b.height()).unwrap_or(0);
+ // let candidate_end_height = 1 + self.last().map(|b| b.height()).unwrap_or(0);
  // Check that the range matches the block request.
- if start_height != candidate_start_height || end_height != candidate_end_height {
-     bail!("Peer '{peer_ip}' sent an invalid block response (range does not match block request)")
- }
+ // if start_height != candidate_start_height || end_height != candidate_end_height {
+ //     bail!("Peer '{peer_ip}' sent an invalid block response (range does not match block request)")
+ // }
Collaborator Author


If we are to reuse the existing BlockRequest, this check is no longer correct, as the request can span a much greater range of blocks than any single response.
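A possible relaxation, sketched under the assumption that a response is validated as a sub-range of the original request rather than an exact match (the function and its signature are hypothetical):

```rust
/// Hypothetical relaxed check: with dedicated sync streams, a single
/// BlockRequest may be served by many smaller responses, so a response is
/// valid if its non-empty range [resp_start, resp_end) lies within the
/// requested range [req_start, req_end).
fn response_within_request(
    req_start: u32, req_end: u32,
    resp_start: u32, resp_end: u32,
) -> bool {
    resp_start < resp_end && resp_start >= req_start && resp_end <= req_end
}
```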

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
