[Feature] Improve (dis-)connect handling between peers#3902

Merged
kaimast merged 17 commits into staging from fix/already-connecting
Jan 26, 2026

@kaimast
Contributor

@kaimast kaimast commented Oct 1, 2025

Fixes #3893. The PR is quite large; below is a high-level overview of the changes.

Better Connection Handling

The main purpose of this PR is to propagate errors during handshakes "up", so the error is logged at the appropriate location. It also enables filtering out benign errors such as "Already connecting".

This is achieved by using a concrete ConnectionError type instead of anyhow::Error. Previously, the code also used Option or booleans to indicate a duplicate connection, which hurt readability. The new disconnect reasons, described below, replace those.
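
As a rough illustration of the idea (not the code from this PR), a concrete error type lets the caller decide how loudly to log each failure. The variant names, the `is_benign` helper, and the `log_level` function below are all hypothetical and only sketch the pattern:

```rust
use std::fmt;
use std::net::SocketAddr;

/// Hypothetical error type; the variants are illustrative and not taken from the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConnectionError {
    AlreadyConnecting(SocketAddr),
    AlreadyConnected(SocketAddr),
    SelfConnect(SocketAddr),
    HandshakeFailed { peer: SocketAddr, reason: String },
}

impl ConnectionError {
    /// Benign errors are expected during normal operation and can be filtered out.
    pub fn is_benign(&self) -> bool {
        matches!(self, Self::AlreadyConnecting(_) | Self::AlreadyConnected(_))
    }
}

impl fmt::Display for ConnectionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::AlreadyConnecting(peer) => write!(f, "already connecting to {}", peer),
            Self::AlreadyConnected(peer) => write!(f, "already connected to {}", peer),
            Self::SelfConnect(peer) => write!(f, "attempted to connect to self at {}", peer),
            Self::HandshakeFailed { peer, reason } => {
                write!(f, "handshake with {} failed: {}", peer, reason)
            }
        }
    }
}

impl std::error::Error for ConnectionError {}

/// The caller that receives the propagated error picks the log level
/// (hypothetical helper; returns `None` when there is nothing to log).
pub fn log_level(result: &Result<(), ConnectionError>) -> Option<&'static str> {
    match result {
        Ok(()) => None,
        Err(err) if err.is_benign() => Some("debug"),
        Err(_) => Some("warn"),
    }
}
```

With a plain `anyhow::Error`, the equivalent filter would have to inspect the error message or downcast, which is what makes the concrete type attractive here.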

New Disconnect Reasons

The PR adds four new disconnect reasons: "already connecting", "already connected", "connecting to myself", and "no untrusted external peers allowed". This avoids the confusing "peer disconnected before sending Message::ChallengeResponse" messages.

We previously added support for unknown disconnect reasons (see #4018), so this can be added without causing problems.
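
A minimal sketch of why an "unknown" fallback makes adding reasons safe: a peer that does not recognize a new code maps it to a catch-all variant instead of failing to decode. The enum name, variant names, and numeric codes below are assumptions for illustration only and do not reflect the actual wire format:

```rust
/// Hypothetical disconnect reasons; the numeric codes are made up for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisconnectReason {
    AlreadyConnecting,
    AlreadyConnected,
    SelfConnect,
    NoUntrustedExternalPeers,
    /// Any code this node does not recognize, e.g. one sent by a newer peer.
    Unknown(u8),
}

impl DisconnectReason {
    /// Decodes a reason, falling back to `Unknown` rather than returning an error.
    pub fn from_code(code: u8) -> Self {
        match code {
            0 => Self::AlreadyConnecting,
            1 => Self::AlreadyConnected,
            2 => Self::SelfConnect,
            3 => Self::NoUntrustedExternalPeers,
            other => Self::Unknown(other),
        }
    }

    /// Encodes a reason; unknown codes are passed through unchanged.
    pub fn to_code(self) -> u8 {
        match self {
            Self::AlreadyConnecting => 0,
            Self::AlreadyConnected => 1,
            Self::SelfConnect => 2,
            Self::NoUntrustedExternalPeers => 3,
            Self::Unknown(code) => code,
        }
    }
}
```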

Significant Reduction of Connection-Related Warnings

Nodes still generated a significant number of warnings even with these changes. One culprit was that the heartbeat logic sometimes reattempted failed connections too frequently. The PR now adds a mandatory 10-second cooldown period between connection attempts.
In the future, we should consider increasing the cooldowns with every failed attempt.
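
The cooldown can be sketched roughly as follows. This is not the PR's implementation; `ConnectAttempts`, `try_begin`, and `CONNECT_COOLDOWN` are hypothetical names, and the only detail taken from the description above is the 10-second interval:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Minimum time between two connection attempts to the same peer.
pub const CONNECT_COOLDOWN: Duration = Duration::from_secs(10);

/// Hypothetical tracker that remembers when each peer was last dialed.
#[derive(Default)]
pub struct ConnectAttempts {
    last_attempt: HashMap<SocketAddr, Instant>,
}

impl ConnectAttempts {
    /// Returns `true` and records the attempt if the peer is outside its cooldown;
    /// returns `false` (without resetting the timer) if the last attempt was too recent.
    pub fn try_begin(&mut self, peer: SocketAddr, now: Instant) -> bool {
        match self.last_attempt.get(&peer) {
            Some(&last) if now.saturating_duration_since(last) < CONNECT_COOLDOWN => false,
            _ => {
                self.last_attempt.insert(peer, now);
                true
            }
        }
    }
}
```

Passing `now` in explicitly keeps the check deterministic in tests. Growing the cooldown with every failed attempt, as suggested above, would mean storing a per-peer failure count alongside the timestamp.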

Additionally, I expanded the notion of a grace period at startup, during which nodes do not log errors such as "no connected validators". Nodes now do not generate warnings about not being fully connected during the first minute.
We already did this for the stake-related checks, but not for other router/gateway errors.
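
One way to picture the grace period, as a sketch under assumptions rather than the actual router/gateway code: record the start time once and gate connectivity warnings on it. `StartupGrace`, `should_warn`, and `STARTUP_GRACE_PERIOD` are hypothetical names; the one-minute duration comes from the description above:

```rust
use std::time::{Duration, Instant};

/// Connectivity warnings are suppressed for this long after startup.
pub const STARTUP_GRACE_PERIOD: Duration = Duration::from_secs(60);

/// Hypothetical helper that remembers when the node started.
pub struct StartupGrace {
    started_at: Instant,
}

impl StartupGrace {
    pub fn new(started_at: Instant) -> Self {
        Self { started_at }
    }

    /// Returns `true` once the grace period has elapsed, i.e. when a warning
    /// such as "no connected validators" should actually be emitted.
    pub fn should_warn(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.started_at) >= STARTUP_GRACE_PERIOD
    }
}
```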

To test these changes, the PR adds a new check to devnet_ci.sh that limits the number of warnings a node is allowed to generate. For now, I limited them to 10 per node.
There are some connection-related errors at startup that we cannot easily remove: nodes all start concurrently and then fail to connect to each other on the first attempt.

@ljedrz
Collaborator

ljedrz commented Oct 1, 2025

Some of the goals of this PR are achieved with #3900 (which is also a bugfix). In addition, I can see that some of the changes here are conflicting with the aforementioned PR, which deduplicates some of the related code.

That being said, the idea to introduce a new, concrete error type is solid, and so is the Result change in the Handshake protocol. Since I've been tinkering with peering quite a lot lately, my recommendation would be as follows:

  1. close this PR for now, as the code it is applied to can still be compacted, making the change surface smaller
  2. simplify/deduplicate the network-related code further (we still have plenty of duplicate functionalities between the Gateway and the Router)
  3. update and extend the connectivity/peering tests to make sure they cover everything we need (it's been a long time since their introduction)
  4. reconsider these changes and apply them in small chunks, without any unrelated adjustments

@vicsn
Collaborator

vicsn commented Oct 1, 2025

Agreed with all of the above except:

> simplify/deduplicate the network-related code further (we still have plenty of duplicate functionalities between the Gateway and the Router)

Please no refactors unless agreed upon and resolving an existing issue.

@ljedrz
Collaborator

ljedrz commented Oct 1, 2025

> Please no refactors unless agreed upon and resolving an existing issue.

Fair enough; I was mostly thinking about the prospect of a new node type (BootstrapClient), and some of the existing issues (like #3888), which are likely to naturally lead to such a result.

@kaimast kaimast force-pushed the fix/already-connecting branch 5 times, most recently from 2540eb9 to a43475f Compare October 1, 2025 21:49
@kaimast kaimast force-pushed the fix/already-connecting branch 2 times, most recently from 6e30d1b to d5b7767 Compare October 15, 2025 04:40
@kaimast kaimast force-pushed the fix/already-connecting branch 2 times, most recently from a44a36e to 3772564 Compare November 6, 2025 02:49
@kaimast kaimast force-pushed the fix/already-connecting branch from 3772564 to 9ca9004 Compare November 13, 2025 23:23
@kaimast kaimast force-pushed the fix/already-connecting branch 2 times, most recently from 9d514fd to cddd288 Compare December 3, 2025 19:37
@kaimast kaimast force-pushed the fix/already-connecting branch 3 times, most recently from 06d9529 to 619e2e5 Compare December 10, 2025 21:00
@kaimast kaimast force-pushed the fix/already-connecting branch from 619e2e5 to 18fc6f4 Compare January 13, 2026 02:13
@kaimast kaimast changed the title [Draft] Improve (dis-)connect handling between peers [Feature] Improve (dis-)connect handling between peers Jan 13, 2026
@kaimast kaimast marked this pull request as ready for review January 13, 2026 23:25
Comment thread .ci/utils.sh Outdated
Comment thread node/network/src/peering.rs Outdated
@kaimast kaimast requested a review from ljedrz January 21, 2026 04:21
@kaimast
Contributor Author

kaimast commented Jan 21, 2026

> Left some comments and suggestions.

Thanks! I think I addressed everything.

Note: Sometimes the devnet test fails due to an error in syncing. This is something I am fixing in another PR and, ideally, should not hold up this one.

Comment thread node/network/src/peering.rs Outdated
Comment thread node/network/src/peering.rs
Comment thread node/network/src/peering.rs
ljedrz
ljedrz previously approved these changes Jan 21, 2026
Collaborator

@ljedrz ljedrz left a comment

LGTM with some final nits 👌.

Comment thread node/network/src/lib.rs Outdated
Comment thread node/tcp/src/tcp.rs
vicsn
vicsn previously approved these changes Jan 22, 2026
Collaborator

@vicsn vicsn left a comment

Pending Lukasz' approval!

ljedrz
ljedrz previously approved these changes Jan 23, 2026
Collaborator

@ljedrz ljedrz left a comment

LGTM

Signed-off-by: Kai Mast <kai@provable.com>
@kaimast kaimast dismissed stale reviews from ljedrz and vicsn via 1e01700 January 23, 2026 22:44
@kaimast kaimast requested review from ljedrz and vicsn January 23, 2026 22:44
@kaimast
Contributor Author

kaimast commented Jan 23, 2026

I resolved a conflict with staging and it dismissed the reviews again. Please re-approve.

@kaimast kaimast merged commit 281f262 into staging Jan 26, 2026
3 of 4 checks passed
@kaimast kaimast deleted the fix/already-connecting branch January 26, 2026 16:30

Development

Successfully merging this pull request may close these issues.

[Bug] Remove/reduce "already connecting to node" warnings

4 participants