[Feature] Improve (dis-)connect handling between peers#3902
[Feature] Improve (dis-)connect handling between peers#3902
Conversation
|
Some of the goals of this PR are achieved with #3900 (which is also a bugfix). In addition, I can see that some of the changes here are conflicting with the aforementioned PR, which deduplicates some of the related code. That being said, the idea to introduce a new, concrete error type is solid, and so is the
|
|
Agreed with all of the above except:
Please no refactors unless agreed upon and resolving an existing issue. |
Fair enough; I was mostly thinking about the prospect of a new node type ( |
2540eb9 to
a43475f
Compare
6e30d1b to
d5b7767
Compare
a44a36e to
3772564
Compare
3772564 to
9ca9004
Compare
9d514fd to
cddd288
Compare
06d9529 to
619e2e5
Compare
619e2e5 to
18fc6f4
Compare
Thanks! I think I addressed everything. Note: Sometimes the devnet test fails due to an error in syncing. This is something I am fixing in another PR and, ideally, should not hold up this one. |
ljedrz
left a comment
There was a problem hiding this comment.
LGTM with some final nits 👌.
vicsn
left a comment
There was a problem hiding this comment.
Pending Lukasz' approval!
Signed-off-by: Kai Mast <kai@provable.com>
|
I resolved a conflict with |
Fixes #3893. The PR is quite large, below is a high-level overview of the changes.
Better Connection Handling
The main purpose of this PR is to propagate errors during handshakes "up", so the error is logged at the appropriate location. It also enables filtering out benign errors such as "Already connecting".
This is achieved by using a concrete
ConnectionErrortype instead ofanyhow::Error. Previously, the code also usedOptionor booleans to indicate a duplicate connection, which did not make the code very readable. To replace those, the new disconnect reasons, mentioned below, are very useful.New Disconnect Reasons
The PR adds a four new disconnect reasons "already connecting", "already connected", "connecting to myself", "no untrusted external peers allowed". This avoids those confusing
peer disconnected before sendingMessage::ChallengeResponse` messages.We previously added support for unknown disconnect reasons (see #4018), so this can be added without causing problems.
Significant Reduction of Connection-Related Warnings
Nodes still generated a significant number of warnings even with these changes. One culprit was that the heartbeat logic sometimes reattempted failed connections too frequently. The PR now adds a mandatory 10-second cooldown period between connection attempts.
In the future, we should consider increasing the cooldowns with every failed attempt.
Additionally, I also expanded the notion of a grace period at startup, where nodes do not log errors, such as "no connected validators". Now, nodes do not generate warnings in the first minute about not being fully connected.
We already did this for the stake-related checks, but not for other router/gateway errors.
To test these changes, the PR adds a new check to
devnet_ci.shthat limits the number of warnings a node is allowed to generate. For now, I limited them to 10 per node.There are some connections related errors at startup that we cannot remove easily; nodes all start concurrently and then fail to connect to each other at the first attempt.