Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Tune banking_stage receive loop timing#25172

Merged
mergify[bot] merged 1 commit into
solana-labs:masterfrom
pgarg66:banking-stage-recv
May 13, 2022
Merged

Tune banking_stage receive loop timing#25172
mergify[bot] merged 1 commit into
solana-labs:masterfrom
pgarg66:banking-stage-recv

Conversation

@pgarg66
Copy link
Copy Markdown
Contributor

@pgarg66 pgarg66 commented May 12, 2022

Problem

Banking stage is not able to keep up with receiving packets sent by the sigverify stage. The problem gets worse for QUIC, since each packet batch contains only one packet. This results into a lot of blockhash_not_found and blockhash_too_old errors.

Summary of Changes

Banking stage uses different receive timeout value based on whether packets are buffered. This PR updates the receive loop's greedy receive logic to use this timeout to continue to receive more packet batches.

Also, the PR computes the upper bound for receiving packet batches based on how many more can be buffered. This makes it more dynamic, and buffer capacity driven.

Tested the PR by running a dev cluster.

The TPS numbers are better:
Baseline
image

With this PR
image

The blockhash_not_found errors have drastically reduced
Baseline
image

With this PR
image

The count of blockhash_too_old errors hasn't changed. So need more analysis of that.

Fixes #

@pgarg66 pgarg66 requested a review from sakridge May 12, 2022 19:15
Copy link
Copy Markdown
Contributor

@ryleung-solana ryleung-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this tuning probably helps, I still see an apparent downtrend over time in the TPS, and as you mentioned, the blockhash_too_old errors are still there, so I'm not convinced as yet that this really addresses the root cause of the issue. It's a very strange thing we're seeing since all of these bottlenecks are downstream of quic and sigverify, by which point it shouldn't matter whether the packets arrived via UDP or Quic... the only qualitative difference I can think of is that packets arriving via Quic arrive in small batches, but rebatching them doesn't seem to have done anything...

Comment thread core/src/banking_stage.rs
@pgarg66
Copy link
Copy Markdown
Contributor Author

pgarg66 commented May 12, 2022

While this tuning probably helps, I still see an apparent downtrend over time in the TPS, and as you mentioned, the blockhash_too_old errors are still there, so I'm not convinced as yet that this really addresses the root cause of the issue. It's a very strange thing we're seeing since all of these bottlenecks are downstream of quic and sigverify, by which point it shouldn't matter whether the packets arrived via UDP or Quic... the only qualitative difference I can think of is that packets arriving via Quic arrive in small batches, but rebatching them doesn't seem to have done anything...

I don't think there is one root cause to all these issues. We need to remove the wrinkles at different parts of the pipeline. This smoothens out the interconnect between sigverify stage and banking stage. It does not degrade any performance. Definitely need more fixes for blockhash_too_old errors.

@ryleung-solana
Copy link
Copy Markdown
Contributor

While this tuning probably helps, I still see an apparent downtrend over time in the TPS, and as you mentioned, the blockhash_too_old errors are still there, so I'm not convinced as yet that this really addresses the root cause of the issue. It's a very strange thing we're seeing since all of these bottlenecks are downstream of quic and sigverify, by which point it shouldn't matter whether the packets arrived via UDP or Quic... the only qualitative difference I can think of is that packets arriving via Quic arrive in small batches, but rebatching them doesn't seem to have done anything...

I don't think there is one root cause to all these issues. We need to remove the wrinkles at different parts of the pipeline. This smoothens out the interconnect between sigverify stage and banking stage. It does not degrade any performance. Definitely need more fixes for blockhash_not_found errors.

Agreed, and this definitely looks like it could help. Pending the other comment, this looks good.

Comment thread core/src/banking_stage.rs
@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2022

Codecov Report

Merging #25172 (3fe53ca) into master (896729f) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##           master   #25172     +/-   ##
=========================================
- Coverage    82.0%    82.0%   -0.1%     
=========================================
  Files         610      610             
  Lines      168056   168061      +5     
=========================================
- Hits       137972   137967      -5     
- Misses      30084    30094     +10     

@pgarg66 pgarg66 added automerge Merge this Pull Request automatically once CI passes v1.10 labels May 13, 2022
@mergify mergify Bot merged commit 71dd95e into solana-labs:master May 13, 2022
@pgarg66 pgarg66 deleted the banking-stage-recv branch May 13, 2022 03:42
mergify Bot pushed a commit that referenced this pull request May 13, 2022
mergify Bot added a commit that referenced this pull request May 13, 2022
(cherry picked from commit 71dd95e)

Co-authored-by: Pankaj Garg <pankaj@solana.com>
Comment thread core/src/banking_stage.rs
trace!("got more packets");
trace!("got more packet batches in banking stage");
let (packets_received, packet_count_overflowed) = num_packets_received
.overflowing_add(packet_batch.iter().map(|batch| batch.packets.len()).sum());
Copy link
Copy Markdown
Contributor

@ryoqun ryoqun Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pgarg66 (cc: @ryleung-solana ) hmm, how could num_packets_received overflow? it's usize, so effectively u64 in our use case. and it's trusted in that sense that all of it is derived by .sum()-ing .len()s. So, i think oom could definitely happen before we hit overflowing condition here. maybe, could you write a test to demonstrate this edge case?

oh, i know i'm commenting on quite a bit old pr; it's just that I'm creating a pr touching this code. :)

ultimately, I'm wondering whether it's safe to remove this overflow handling or not...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, it's fine to get rid of this overflowing handling; by definition, there can't be enough elements in the virtual address space to overflow a usize...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was mostly precautionary check. In theory it might happen (just looking at the math), but maybe never happen in practice.

Is this causing any bugs, or are you asking to remove it to simplify the code? In either case, it should be fine to remove it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FINALLY :)

fyi, I created a pr for this: #29715

@mergify mergify Bot removed the automerge Merge this Pull Request automatically once CI passes label Dec 9, 2022
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Dec 9, 2022

automerge label removed due to a CI failure

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants