Introduce ledger-tool simulate-block-production#2733
Conversation
f1dbbd9 to
f965399
Compare
There was a problem hiding this comment.
@behzadnouri please let me know if you disagree with this changes in gossip/src/cluster_info.rs as justified by the source code comment.
There was a problem hiding this comment.
I pretty much would rather avoid introducing scenarios wherecluster_info.keypair.pubkey() != contact_info.pubkey.
Can you please provide more context why we need a ClusterInfo with a contact_info which we do not own the keypair? To me this seems pretty error-prone and I would much prefer we try an alternative.
There was a problem hiding this comment.
Agree with @behzadnouri here. It seems the reason we are adding this new inconsistency is because BankingStage takes it as an argument.
It'd be better for us to refactor BankingStage to not use ClusterInfo imo.
We use ClusterInfo for 2 things:
- Creating the
Forwarder- could pass an
Option<Forwarder>as arg toBankingStageinstead - ^ would need some changes to rip out mandatory forwarding in tlmi / voting threads
- alternatively could make forwarder an enum w/ disabled variant or even a trait instead
- could pass an
- Getting validator id for checking if we are leader
- easily can pass
Pubkeyinstead
- easily can pass
There was a problem hiding this comment.
hmm, my dcou based hack is unpopular.. ;) I did my part of little hassle. how about this?: 2b33131bec622c09d4254ee25727b3de764709fd
Can you please provide more context why we need a
ClusterInfowith acontact_infowhich we do not own thekeypair?
It seems the reason we are adding this new inconsistency is because
BankingStagetakes it as an argument.
@apfitzge 's understanding is correct. note that such broken ClusterInfo is only ever created under dcou code-path, though.
Getting validator id for checking if we are leader
* easily can pass `Pubkey` instead
Sadly, this isn't easy because the identity Pubkey can be hot-swapped inside ClusterInfo. That lead me to take this trait direction...:
- alternatively could make forwarder ... a trait instead
Also, note that BroadcastStageType also needs ClusterInfo. Fortunately, it seems the new hacky trait LikeClusterInfo plumbing isn't needed for it.
That said, I wonder this additional production code is worth to maintain, only to support some obscure development ledger-tool subcommand. Anyway, I'm not that opinionated. I just want to merge this pr.
f965399 to
6f75c2f
Compare
|
fyi, not included in this pr. but I now have some fancy charts (salvaged solana-labs#28119) at the development branch. namely, now that we can display the individual tx timings for each scheduler (quick legend: x axis is walltime; y axis is lined up by each threads; green arced arrows are read-lock dependency, pink arced arrows are write-lock dependency) thread-local-multi-iteratoreach banking thread is working as hard as like animals. you can indirectly see batch boundaries. central schedulermuch like to thread-local-multi-iterator, batched transactions show almost no gap (no overhead). while clipped, overall much less chaotic dep graph is observed. lastly, because stickiness of write lock to a particular thread, the 2nd batch is rather large and other threads are idle (see the 2nd pic) unified schedulerread locks are well parallelized. each task execution incurs large overhead, but dep graph resolution is rather timely. note that unified-scheduler is wired to the block production as well at #2325. That's why i have all the charts from the 3 impls... |
There was a problem hiding this comment.
this particular log output are like this:
$ grep -E 'jitter' simulate-mb.2024y08m26d10h44m43s933116486ns
[2024-08-26T10:58:57.989324285Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254383): +360.27µs (sim: 12.00036027s event: 12s)
[2024-08-26T10:58:58.344810251Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254384): -19.615829ms (sim: 12.355846357s event: 12.375462186s)
[2024-08-26T10:58:58.695797971Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254385): -71.903503ms (sim: 12.706834067s event: 12.77873757s)
[2024-08-26T10:58:59.047995512Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254386): -135.656223ms (sim: 13.059031477s event: 13.1946877s)
in short, poh in sim is rather more timely than the actual traced poh recordings. maybe this is due to much reduced sysload.
apfitzge
left a comment
There was a problem hiding this comment.
Looks good for the most part.
Tried to fix up some grammar in the documenting comments, and some suggestions to split up some of the larger functions so its' easier for me to read.
There was a problem hiding this comment.
logging functionality here should really get separated, it is quite long and distracting from the behavior of the loop.
There was a problem hiding this comment.
So when we're no longer leader the process will end.
In the future, could we extend this capability so that we "fast-forward" through our non-leader periods?
That probably adds considerable complexity, but I think would make simming significantly more useful.
AFAICT, as is we must load from snapshot everytime we want to do a sim of 4 slots - which will be very time consuming if I have hundred(s) of leader slots in my trace data that I'd like to simulate.
There was a problem hiding this comment.
In the future, could we extend this capability so that we "fast-forward" through our non-leader periods?
That probably adds considerable complexity
indeed, it's possible. but with considerable complexity. Note that such "fast-forward"-ing needs to reload from snapshot... Just doing it without snapshot reloading would make most txes fail, invalidating the simulation itself.
I have hundred(s) of leader slots in my trace data that I'd like to simulate.
i know this is ideal. but dozen of simulation ledgers each with single round of the 4 leader slots is good enough..
There was a problem hiding this comment.
Think there's some complexity in how we'd need to "fast-forward" through non-leader periods, but don't think it'd require loading from a snapshot if done correctly.
Probably would add even more complexity, but could we not treat the simmed blocks as some sort of duplicate block (or a fork).
After each 4 slot sim, we drop the simmed blocks (duplicates/fork) for the actual blocks which we then replay as normal until we get close to next leader period.
That's what I had in mind for the fast-forward, since I definitely agree that we can't just continue on from the simmed blocks and act like things will just work from there haha.
If we were to do this would probably want to collaborate with ashwin or stevecz on how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.
There was a problem hiding this comment.
oh, that idea sounds nice.
There was a problem hiding this comment.
how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.
i think this just can be done with read-write side blockstore.
There was a problem hiding this comment.
oh, that idea sounds nice.
Cool! I think that'd really improve the usability, but we should definitely leave it for a follow-up PR, this one is big enough as is 😄
There was a problem hiding this comment.
why clone instead of lettting the sender thread take ownership of these batches?
just reviewing on github, so can't see the type - is this Arc-ed, or just a clone of actual packet batch?
There was a problem hiding this comment.
the useful BTreeMap::range() don't allow because it's not range_into or something like that.
However, I noticed that I can use BTreeMap::split_off(): 3a3131e2adfb8d858d64b2143236f7f0e11a9f2a
just reviewing on github, so can't see the type - is this Arc-ed, or just a clone of actual packet batch?
fyi, this is Arc-ed.
There was a problem hiding this comment.
However, I noticed that I can use
BTreeMap::split_off(): 3a3131e
related to the above, i further improved simulation jitter: 5e77bd7fa0e0ec1a453485a7aca69013fc1540b2
There was a problem hiding this comment.
Why not use a VecDeque here? These are, afaik, always in order from the files (assuming we read the files in the correct order, which is easy enough to do).
There was a problem hiding this comment.
added comments: d3bf0d9c60c85acfc0df38d4bda52b0feb98bbc1
There was a problem hiding this comment.
btw, related to this a bit, I noticed we rather should stop using BTreeMap::into_iter(): 5e77bd7fa0e0ec1a453485a7aca69013fc1540b2
There was a problem hiding this comment.
|
@ryoqun The figures in this comment #2733 (comment) raised a few questions for me
I think these questions are outside the scope of this PR - so maybe we do not focus on them here. Instead, I would like to ask about inspection of the simulated blocks: Are simulated blocks saved to blockstore? Is it possible for us to save them to a "separate" blockstore or parameterize it in some way? Ideally we could run simulation for several scheduler implementations, configurations, etc, and then run some analysis on the blocks produced after the fact so that we can compare them all. Basically it'd be really nice to add some block analysis stuff in ledger-tool, and then run that command in a bash loop to get some metrics about block "quality":
|
902dd5f to
2b33131
Compare
There was a problem hiding this comment.
remove this line completely...
There was a problem hiding this comment.
done: 6d012c216d491d71a4f14de24c5e9a669d806e7a
There was a problem hiding this comment.
well, intentionally wrap this with RwLock to mimic real ClusterInfo?
There was a problem hiding this comment.
done: 535f4da8118afbd97e356489b12e81c7b4443ccf
yeah, i forgot to align the thread counts... I'll do in-depth comparison later. unified scheduler takes longer to clear the buffer when the thread count is very low like 4. that said, it scales well to saturate all of worker threads. here's a sample: also, now that unified scheduler is enabled for block verification, i think we should increase banking thread count to like 12-16.
indeed unified scheduler can't get rid of the curse of unbatched overhead, but i think it can be optimized for the greedy-leader-maximizing per-cu rewards. Currently, all non-contentious transactions are directly buffered to crossbeam channels with unbounded buffer depth in the unified scheduler as you know. but, I'm planning to place a priority queue for the freely-reorderable transactions in front of them at the scheduler thread side, while maintaining max of 1.5 * handler_thread_count of tasks are buffered by the crossbeam channels. In this way, higher-paying task reprioritization latency is about 1.5 * avg execution time of single transaction.
👍 anyway, i put some thought above.
yes.
yeah, we can do this easily. |
|
@apfitzge thanks for all the effort of code-reviewing. sans the general clean up of |
dee964a to
284a1b6
Compare
|
fate of 1.4k lines of pr. ;) I was forced to rebase this pr onto #2172 |
* Introduce ledger-tool simulate-block-production
* Move counting code out of time-sensitive loop
* Avoid misleading ::clone() altogether
* Use while instead of loop+break
* Add comment of using BTreeMap
* Reduce simulation jitter due to mem deallocs
* Rename to CostTracker::new_from_parent_limits()
* Make ::load() take a slice
* Clean up retracer code a bit
* Add comment about BaningTracer even inside sim
* Remove redundant dcou dev-dependencies
* Apply suggestions from code review
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
* Fix up and promote to doc comments
* Make warm-up code and doc simpler
* Further clean up timed_batches_to_send
* Fix wrong units...
* Replace new_with_dummy_keypair() with traits
* Tweak --no-block-cost-limits description
* Remove redundant dev-dependencies
* Use RwLock to mimic real ClusterInfo
* Fix typo
* Refactor too long BankingSimulator::start()
* Reduce indent
* Calculate required_duration in advance
* Use correct format specifier instead of cast
* Align formatting by using ::*
* Make envs overridable
* Add comment for SOLANA_VALIDATOR_EXIT_TIMEOUT
* Clarify comment a bit
* Fix typoss
* Fix typos
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
* Use correct variant name: DeserializeError
* Remove SimulatorLoopLogger::new()
* Fix typos more
* Add explicit _batch in field names
* Avoid unneeded events: Vec<_> buffering
* Manually adjust logging code styles
* Align name: spawn_sender_loop/enter_simulator_loop
* Refactor by introducing {Sender,Simulator}Loop
* Fix out-of-sync sim due to timed preprocessing
* Fix too-early base_simulation_time creation
* Don't log confusing info! after leader slots
* Add justification comment of BroadcastStage
* Align timeout values
* Comment about snapshot_slot=50
* Don't squash all errors unconditionally
* Remove repetitive exitence check
* Promote no_block_cost_limits logging level
* Make ci/run-sanity.sh more robust
* Improve wordking of --enable-hash-overrides
* Remove marker-file based abortion mechanism
* Remove needless touch
---------
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>





Problem
Even while solana-labs#29196 has landed and has been enabled for a while, there's no code to actually simulate the block production.
Summary of Changes
Finally, introduce the functionality called
agave-ledger-tool simulate-block-productionwith following flags:On top of it, this pr also makes it possible to replay the simulated blocks later by persisting the shreds into the blockstore and adjusting the replay code-path a bit. Namely, the following flags are added:
(bike-shedding is welcome, btw...)
This pr is extracted from #2325
sample output (with the mainnet-beta ledger)
notice that (bank)
hashes andlast_blockhashes are overidden whileaccount_delta(hashes) are different.While, tx counts largely differ from the actual log, both runs of simulation exhibits similar numbers, indicating rather stability of simulation.
The difference between simulation and reality is due to other general system loads, probably.
simulated log (1):
simulated log (2):
actual log: