Tune banking_stage receive loop timing #25172
ryleung-solana
left a comment
While this tuning probably helps, I still see an apparent downtrend over time in the TPS, and as you mentioned, the blockhash_too_old errors are still there, so I'm not convinced as yet that this really addresses the root cause of the issue. It's a very strange thing we're seeing since all of these bottlenecks are downstream of quic and sigverify, by which point it shouldn't matter whether the packets arrived via UDP or Quic... the only qualitative difference I can think of is that packets arriving via Quic arrive in small batches, but rebatching them doesn't seem to have done anything...
I don't think there is one root cause for all these issues. We need to remove the wrinkles at different parts of the pipeline. This smooths out the interconnect between the sigverify stage and the banking stage. It does not degrade performance. Definitely need more fixes for
Agreed, and this definitely looks like it could help. Pending the other comment, this looks good.
Codecov Report
@@ Coverage Diff @@
## master #25172 +/- ##
=========================================
- Coverage 82.0% 82.0% -0.1%
=========================================
Files 610 610
Lines 168056 168061 +5
=========================================
- Hits 137972 137967 -5
- Misses 30084 30094 +10
(cherry picked from commit 71dd95e)
trace!("got more packets");
trace!("got more packet batches in banking stage");
let (packets_received, packet_count_overflowed) = num_packets_received
    .overflowing_add(packet_batch.iter().map(|batch| batch.packets.len()).sum());
@pgarg66 (cc: @ryleung-solana ) hmm, how could num_packets_received overflow? it's usize, so effectively u64 in our use case. and it's trusted in that sense that all of it is derived by .sum()-ing .len()s. So, i think oom could definitely happen before we hit overflowing condition here. maybe, could you write a test to demonstrate this edge case?
oh, i know i'm commenting on quite a bit old pr; it's just that I'm creating a pr touching this code. :)
ultimately, I'm wondering whether it's safe to remove this overflow handling or not...
IMHO, it's fine to get rid of this overflowing handling; by definition, there can't be enough elements in the virtual address space to overflow a usize...
It was mostly a precautionary check. In theory it might happen (just looking at the math), but it will probably never happen in practice.
Is this causing any bugs, or are you asking to remove it to simplify the code? In either case, it should be fine to remove it.
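For readers following the thread, here is a minimal sketch of the two counting styles being compared. The `PacketBatch` struct and both function names are hypothetical stand-ins for illustration, not the validator's actual types; only `usize::overflowing_add` is the real standard library call under discussion.

```rust
/// Hypothetical stand-in for a packet batch (not the real validator type).
pub struct PacketBatch {
    pub packets: Vec<u8>,
}

/// Counting with an explicit overflow flag, in the style of the reviewed snippet.
/// `overflowing_add` returns the wrapped sum plus a flag that is true on wraparound.
pub fn count_with_overflow_flag(
    num_packets_received: usize,
    batches: &[PacketBatch],
) -> (usize, bool) {
    num_packets_received.overflowing_add(batches.iter().map(|b| b.packets.len()).sum())
}

/// Simplified counting without the flag. Every term is the `len()` of a live
/// `Vec`, so the total is bounded by what fits in the address space.
pub fn count_plain(num_packets_received: usize, batches: &[PacketBatch]) -> usize {
    num_packets_received + batches.iter().map(|b| b.packets.len()).sum::<usize>()
}
```

The flag only becomes true when the running counter is already within a few packets of `usize::MAX`, which is the situation the reviewers consider unreachable in practice. Note that plain `+` panics on overflow in debug builds, so `saturating_add` would be an alternative if a silent fallback is preferred.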
FINALLY :)
fyi, I created a pr for this: #29715
automerge label removed due to a CI failure
Problem
The banking stage is not able to keep up with the packets sent by the sigverify stage. The problem gets worse for QUIC, since each packet batch contains only one packet. This results in a lot of blockhash_not_found and blockhash_too_old errors.
Summary of Changes
The banking stage uses a different receive timeout value depending on whether packets are already buffered. This PR updates the receive loop's greedy receive logic to keep receiving packet batches until that timeout expires.
The PR also computes the upper bound on received packet batches from how many more can be buffered. This makes the limit dynamic and driven by buffer capacity.
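The two changes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: it uses `std::sync::mpsc` so the example is self-contained, and `PacketBatch`, `receive_until`, and all parameter names are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a packet batch (not the real validator type).
#[derive(Debug)]
pub struct PacketBatch {
    pub packets: Vec<Vec<u8>>,
}

/// Receive packet batches until the timeout expires or the remaining
/// buffer space is used up. Returns the batches and the packet count.
pub fn receive_until(
    receiver: &Receiver<Vec<PacketBatch>>,
    recv_timeout: Duration,
    buffer_capacity: usize,
    buffered_packets: usize,
) -> Result<(Vec<PacketBatch>, usize), RecvTimeoutError> {
    let start = Instant::now();
    // Upper bound derived from how many more packets can be buffered.
    let packet_budget = buffer_capacity.saturating_sub(buffered_packets);

    // Wait for the first message, up to the full timeout. This sketch always
    // accepts one message, even if the budget is zero.
    let mut batches = receiver.recv_timeout(recv_timeout)?;
    let mut num_packets: usize = batches.iter().map(|b| b.packets.len()).sum();

    // Greedily keep receiving while both time and budget remain. The last
    // message accepted may push the count slightly past the budget.
    while num_packets < packet_budget {
        let remaining = recv_timeout.saturating_sub(start.elapsed());
        if remaining.is_zero() {
            break;
        }
        match receiver.recv_timeout(remaining) {
            Ok(more) => {
                num_packets += more.iter().map(|b| b.packets.len()).sum::<usize>();
                batches.extend(more);
            }
            // Timed out or sender disconnected: return what we have.
            Err(_) => break,
        }
    }
    Ok((batches, num_packets))
}
```

A caller would presumably pass a shorter `recv_timeout` when packets are already buffered and a longer one when the buffer is empty, which is the timeout selection the description refers to.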
Tested the PR by running a dev cluster.
The TPS numbers are better:
[Screenshots: TPS, Baseline vs. With this PR]
The blockhash_not_found errors have drastically reduced:
[Screenshots: blockhash_not_found errors, Baseline vs. With this PR]
The count of blockhash_too_old errors hasn't changed, so that needs more analysis.
Fixes #