Introduce ledger-tool simulate-block-production by ryoqun · Pull Request #2733 · anza-xyz/agave

ryoqun · 2024-08-25T06:36:18Z

Problem

Although solana-labs#29196 has landed and has been enabled for a while, there's no code to actually simulate block production.

Summary of Changes

Finally, introduce the functionality called agave-ledger-tool simulate-block-production with the following flags:

$ agave-ledger-tool simulate-block-production --help
...
OPTIONS:
...
        --first-simulated-slot <SLOT>
            Start simulation at the given slot
        --no-block-cost-limits
            Disable block cost limits effectively by setting them to the max
        --block-production-method <METHOD>
            Switch transaction scheduling method for producing ledger entries [default: central-scheduler] [possible
            values: thread-local-multi-iterator, central-scheduler]

On top of it, this pr also makes it possible to replay the simulated blocks later by persisting the shreds into the blockstore and adjusting the replay code-path a bit. Namely, the following flags are added:

$ agave-ledger-tool verify --help
...
OPTIONS:
...
        --abort-on-invalid-block
            Exits with failed status early as soon as any bad block is detected
        --no-block-cost-limits
            Disable block cost limits effectively by setting them to the max
        --enable-hash-overrides
            Enable override of blockhashes and bank hashes from banking trace event files to correctly verify blocks
            produced by the simulate-block-production subcommand

(bike-shedding is welcome, btw...)

This pr is extracted from #2325

sample output (with the mainnet-beta ledger)

notice that (bank) hashes and last_blockhashes are overridden while accounts_delta (hashes) are different.

While tx counts differ considerably from the actual log, both simulation runs exhibit similar numbers, which indicates that the simulation is rather stable.

The difference between simulation and reality is probably due to other general system load.

simulated log (1):

[2024-08-25T14:58:26.916751963Z INFO  solana_runtime::bank] bank frozen: 282254384 hash: DVkbZUJLMGHVat1GNdT4NT6qebSPH3uxiGa5hU7QSvZU accounts_delta: GajSceKAeDuAhCnRyXbv7qUsiEwdseJ1ebWDF7wyhWSg signature_count: 579 last_blockhash: A1FHFc8grbbS2zyc7a1byuU6VXUQHqsRpdydBhQM5ACx capitalization: 581802336581619111, stats: BankHashStats { num_updated_accounts: 2579, num_removed_accounts: 14, num_lamports_stored: 343411018630348, total_data_len: 9421212, num_executable_accounts: 0 }
[2024-08-25T14:58:27.270576266Z INFO  solana_runtime::bank] bank frozen: 282254385 hash: 4SpPMwTNNAfR6quW73NKqx7CRfbSoGyQMTeARez2Hftw accounts_delta: 4QfWEXTJUFr1kV6ET3tnYENppQv2BeAGNA1grDRpfPv4 signature_count: 1750 last_blockhash: 8FHhFcs4pgXEHvKohswtNPcx9KHvr8k6VEsunsRT9M6t capitalization: 581802336574886442, stats: BankHashStats { num_updated_accounts: 4964, num_removed_accounts: 2, num_lamports_stored: 516037578135707, total_data_len: 17010726, num_executable_accounts: 0 }
[2024-08-25T14:58:27.623465240Z INFO  solana_runtime::bank] bank frozen: 282254386 hash: HMtfZUB9Wgdw12c4adk17cGce13dSeaJgUsTc3mGpDKL accounts_delta: 74GUrNzB2M4ufASPBJdB65ZKPimA84AhPR8u1sjSAA94 signature_count: 1369 last_blockhash: 7xUVmhDbeVPxA4Gu5TKLd2qWU6L47MzRGxfMVwAWSQMg capitalization: 581802336570831246, stats: BankHashStats { num_updated_accounts: 3991, num_removed_accounts: 7, num_lamports_stored: 1485447148098157, total_data_len: 7028676, num_executable_accounts: 0 }
[2024-08-25T14:58:27.978300836Z INFO  solana_runtime::bank] bank frozen: 282254387 hash: 5UgDevAtVTUJZD9cjrAbpUPzdwtFPxxhiQM2sQ5L6EeW accounts_delta: D5eLRBxyYYNy6nGB3TrDtwmx6BkKjvy2diUBvMbM9zfi signature_count: 754 last_blockhash: En8erpgtuHcK3rEQA7i1JFcJB5AUjPpz57gHh5vHUgmr capitalization: 581802336567526926, stats: BankHashStats { num_updated_accounts: 2766, num_removed_accounts: 2, num_lamports_stored: 210579970838403, total_data_len: 9649857, num_executable_accounts: 0 }

simulated log (2):

[2024-08-26T06:21:39.116746518Z INFO  solana_runtime::bank] bank frozen: 282254384 hash: DVkbZUJLMGHVat1GNdT4NT6qebSPH3uxiGa5hU7QSvZU accounts_delta: 3WdvevVPam6vYSY9MRDY25ThdkryjLkmxYi2HLxtPPDT signature_count: 690 last_blockhash: A1FHFc8grbbS2zyc7a1byuU6VXUQHqsRpdydBhQM5ACx capitalization: 581802336580834700, stats: BankHashStats { num_updated_accounts: 2878, num_removed_accounts: 22, num_lamports_stored: 1463388803383965, total_data_len: 15984345, num_executable_accounts: 0 }
[2024-08-26T06:21:39.472078453Z INFO  solana_runtime::bank] bank frozen: 282254385 hash: 4SpPMwTNNAfR6quW73NKqx7CRfbSoGyQMTeARez2Hftw accounts_delta: GVDVyi9BdVjJiXdDUCgrYpBkAJ5ZZSXoKi89z6a8No2w signature_count: 1783 last_blockhash: 8FHhFcs4pgXEHvKohswtNPcx9KHvr8k6VEsunsRT9M6t capitalization: 581802336574005189, stats: BankHashStats { num_updated_accounts: 4915, num_removed_accounts: 1, num_lamports_stored: 1051960783828027, total_data_len: 15620129, num_executable_accounts: 0 }
[2024-08-26T06:21:39.827448111Z INFO  solana_runtime::bank] bank frozen: 282254386 hash: HMtfZUB9Wgdw12c4adk17cGce13dSeaJgUsTc3mGpDKL accounts_delta: JCJB95BFo69Rk3VZRF9GL4P6JYqRWoFKbT4ngmUVWPKj signature_count: 1352 last_blockhash: 7xUVmhDbeVPxA4Gu5TKLd2qWU6L47MzRGxfMVwAWSQMg capitalization: 581802336570362548, stats: BankHashStats { num_updated_accounts: 3872, num_removed_accounts: 2, num_lamports_stored: 319553718238402, total_data_len: 5976282, num_executable_accounts: 0 }
[2024-08-26T06:21:40.178714212Z INFO  solana_runtime::bank] bank frozen: 282254387 hash: CsCFYPXXcc9h6BwCeAcsgbpsbBp9kH7RLovFzGapzkZS accounts_delta: 7tM6MrmVEGMyMtxuycSgdnJDpoVGJfgiRkvcKJRGxzpF signature_count: 721 last_blockhash: B1NvabLf172iriwSxC6pt3fDxRVMgGZYTqWqqA37BAtT capitalization: 581802336566863144, stats: BankHashStats { num_updated_accounts: 2759, num_removed_accounts: 2, num_lamports_stored: 226979096844990, total_data_len: 7714591, num_executable_accounts: 0 }

actual log:

[2024-08-08T05:03:37.833306644Z INFO  solana_runtime::bank] bank frozen: 282254384 hash: DVkbZUJLMGHVat1GNdT4NT6qebSPH3uxiGa5hU7QSvZU accounts_delta: 4eR6tY1RbhjxW56qNnjbfyWgXUR78JggjH5WVo6cBt3v signature_count: 1797 last_blockhash: A1FHFc8grbbS2zyc7a1byuU6VXUQHqsRpdydBhQM5ACx capitalization: 581802336557110608, stats: BankHashStats { num_updated_accounts: 5197, num_removed_accounts: 40, num_lamports_stored: 888031815473795, total_data_len: 16030729, num_executable_accounts: 0 }
[2024-08-08T05:03:38.225493056Z INFO  solana_runtime::bank] bank frozen: 282254385 hash: 4SpPMwTNNAfR6quW73NKqx7CRfbSoGyQMTeARez2Hftw accounts_delta: EhXCEyiWG8RV7zQv2nsxm5dFimKMY4TYuA64y1GgxUcb signature_count: 1664 last_blockhash: 8FHhFcs4pgXEHvKohswtNPcx9KHvr8k6VEsunsRT9M6t capitalization: 581802336533852687, stats: BankHashStats { num_updated_accounts: 5102, num_removed_accounts: 33, num_lamports_stored: 2826520371573347, total_data_len: 23738722, num_executable_accounts: 0 }
[2024-08-08T05:03:38.620202820Z INFO  solana_runtime::bank] bank frozen: 282254386 hash: HMtfZUB9Wgdw12c4adk17cGce13dSeaJgUsTc3mGpDKL accounts_delta: rEFqYUdJvPLeNUF7nDjpxpP7jBJ6SjHmp5XAKq7vpTf signature_count: 1366 last_blockhash: 7xUVmhDbeVPxA4Gu5TKLd2qWU6L47MzRGxfMVwAWSQMg capitalization: 581802336525660471, stats: BankHashStats { num_updated_accounts: 4631, num_removed_accounts: 16, num_lamports_stored: 3357971643149105, total_data_len: 27580121, num_executable_accounts: 1 }
[2024-08-08T05:03:39.028377595Z INFO  solana_runtime::bank] bank frozen: 282254387 hash: ET4eF1A1hQQgGdC2ZKddxLSeSFdMPFqSrFKn5Ly2kLep accounts_delta: BNJuXeiw2AyefF8dc8guaV9FEuxyddUWRzvw8zV7GDj2 signature_count: 1163 last_blockhash: GZE8MzgEgyekrTVrKExNbh2xnoMjM7Bi7PNSb92MXqsF capitalization: 581802336510318235, stats: BankHashStats { num_updated_accounts: 4300, num_removed_accounts: 25, num_lamports_stored: 2674603072772107, total_data_len: 28823937, num_executable_accounts: 0 }

ryoqun · 2024-08-26T06:31:28Z

@behzadnouri please let me know if you disagree with these changes in gossip/src/cluster_info.rs as justified by the source code comment.

I would much rather avoid introducing scenarios where cluster_info.keypair.pubkey() != contact_info.pubkey.

Can you please provide more context why we need a ClusterInfo with a contact_info which we do not own the keypair? To me this seems pretty error-prone and I would much prefer we try an alternative.

Agree with @behzadnouri here. It seems the reason we are adding this new inconsistency is because BankingStage takes it as an argument.

It'd be better for us to refactor BankingStage to not use ClusterInfo imo.

We use ClusterInfo for 2 things:

* Creating the Forwarder
  * could pass an Option<Forwarder> as arg to BankingStage instead
  * ^ would need some changes to rip out mandatory forwarding in tlmi / voting threads
  * alternatively could make forwarder an enum w/ disabled variant or even a trait instead
* Getting validator id for checking if we are leader
  * easily can pass Pubkey instead

hmm, my dcou based hack is unpopular.. ;) I did my part of little hassle. how about this?: 2b33131bec622c09d4254ee25727b3de764709fd

Can you please provide more context why we need a ClusterInfo with a contact_info which we do not own the keypair?

It seems the reason we are adding this new inconsistency is because BankingStage takes it as an argument.

@apfitzge's understanding is correct. note that such a broken ClusterInfo is only ever created under the dcou code-path, though.

Getting validator id for checking if we are leader

* easily can pass `Pubkey` instead

Sadly, this isn't easy because the identity Pubkey can be hot-swapped inside ClusterInfo. That led me to take this trait direction...:

alternatively could make forwarder ... a trait instead

Also, note that BroadcastStageType also needs ClusterInfo. Fortunately, it seems the new hacky trait LikeClusterInfo plumbing isn't needed for it.

That said, I wonder whether this additional production code is worth maintaining, only to support some obscure development ledger-tool subcommand. Anyway, I'm not that opinionated. I just want to merge this pr.
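To make the trait direction concrete, here is a minimal sketch of why passing a plain Pubkey is not enough when the identity can be hot-swapped. Everything except the trait name LikeClusterInfo is an assumption for illustration: the `id()` method, the `SwappableIdentity` type, the `is_leader` helper, and the `[u8; 32]` stand-in for Pubkey are hypothetical and not taken from the PR.

```rust
use std::sync::RwLock;

// Stand-in for the real Pubkey type (assumption for this sketch).
type Pubkey = [u8; 32];

// Hypothetical shape of the trait: consumers ask for the current identity
// each time instead of caching a Pubkey that could go stale.
trait LikeClusterInfo: Send + Sync {
    fn id(&self) -> Pubkey;
}

// A dummy implementation that keeps the identity behind an RwLock, so that
// it can be swapped while readers hold a shared reference.
struct SwappableIdentity {
    id: RwLock<Pubkey>,
}

impl SwappableIdentity {
    fn new(id: Pubkey) -> Self {
        Self { id: RwLock::new(id) }
    }

    // Replaces the identity in place (the hot-swap case).
    fn set_id(&self, new_id: Pubkey) {
        *self.id.write().unwrap() = new_id;
    }
}

impl LikeClusterInfo for SwappableIdentity {
    fn id(&self) -> Pubkey {
        *self.id.read().unwrap()
    }
}

// Leader check that always reads the current identity through the trait.
fn is_leader(info: &impl LikeClusterInfo, slot_leader: &Pubkey) -> bool {
    info.id() == *slot_leader
}
```

Because `is_leader` goes through the trait on every call, a swap of the identity is observed immediately, which a `Pubkey` copied at construction time could not provide.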

ryoqun · 2024-08-26T11:20:11Z

fyi, not included in this pr. but I now have some fancy charts (salvaged solana-labs#28119) at the development branch.

namely, we can now display the individual tx timings for each scheduler

(quick legend: x axis is walltime; y axis is lined up by thread; green arced arrows are read-lock dependencies, pink arced arrows are write-lock dependencies)

thread-local-multi-iterator

each banking thread is working as hard as an animal. you can indirectly see batch boundaries.

central scheduler

much like thread-local-multi-iterator, batched transactions show almost no gap (no overhead). while clipped, an overall much less chaotic dep graph is observed.

lastly, because of the stickiness of a write lock to a particular thread, the 2nd batch is rather large and other threads are idle (see the 2nd pic)

unified scheduler

read locks are well parallelized. each task execution incurs large overhead, but dep graph resolution is rather timely.

note that unified-scheduler is wired to the block production as well at #2325. That's why i have all the charts from the 3 impls...

ryoqun · 2024-08-26T11:43:40Z

this particular log output is like this:

$ grep -E 'jitter' simulate-mb.2024y08m26d10h44m43s933116486ns
[2024-08-26T10:58:57.989324285Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254383): +360.27µs (sim: 12.00036027s event: 12s)
[2024-08-26T10:58:58.344810251Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254384): -19.615829ms (sim: 12.355846357s event: 12.375462186s)
[2024-08-26T10:58:58.695797971Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254385): -71.903503ms (sim: 12.706834067s event: 12.77873757s)
[2024-08-26T10:58:59.047995512Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254386): -135.656223ms (sim: 13.059031477s event: 13.1946877s)

in short, poh in sim is rather more timely than the actual traced poh recordings. maybe this is due to much reduced sysload.
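For reading these lines: the logged numbers are consistent with jitter being the simulated elapsed time minus the traced event elapsed time (e.g. 12.355846357s - 12.375462186s = -19.615829ms). A minimal sketch of that arithmetic follows; the function names are hypothetical and this is not the code in banking_simulation.rs.

```rust
use std::time::Duration;

/// Signed jitter in nanoseconds: positive when the simulation is behind the
/// traced timing, negative when it is ahead of it.
fn jitter_nanos(sim_elapsed: Duration, event_elapsed: Duration) -> i128 {
    sim_elapsed.as_nanos() as i128 - event_elapsed.as_nanos() as i128
}

/// Renders the jitter with an explicit sign, reusing Duration's Debug
/// formatting for the magnitude.
fn format_jitter(sim_elapsed: Duration, event_elapsed: Duration) -> String {
    let nanos = jitter_nanos(sim_elapsed, event_elapsed);
    let sign = if nanos < 0 { '-' } else { '+' };
    // The magnitude is assumed to fit in u64 nanoseconds (about 584 years).
    let magnitude = Duration::from_nanos(nanos.unsigned_abs() as u64);
    format!("{sign}{magnitude:?}")
}
```

A negative value therefore means the simulated poh reached that point earlier than the traced recording did.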

apfitzge

Looks good for the most part.
Tried to fix up some grammar in the documenting comments, and left some suggestions to split up some of the larger functions so it's easier for me to read.

apfitzge · 2024-08-26T14:15:31Z

logging functionality here should really get separated, it is quite long and distracting from the behavior of the loop.

apfitzge · 2024-08-26T14:20:10Z

So when we're no longer leader the process will end.

In the future, could we extend this capability so that we "fast-forward" through our non-leader periods?

That probably adds considerable complexity, but I think would make simming significantly more useful.
AFAICT, as is, we must load from snapshot every time we want to do a sim of 4 slots, which will be very time consuming if I have hundred(s) of leader slots in my trace data that I'd like to simulate.

In the future, could we extend this capability so that we "fast-forward" through our non-leader periods?

That probably adds considerable complexity

indeed, it's possible. but with considerable complexity. Note that such "fast-forward"-ing needs to reload from snapshot... Just doing it without snapshot reloading would make most txes fail, invalidating the simulation itself.

I have hundred(s) of leader slots in my trace data that I'd like to simulate.

i know this is ideal. but a dozen simulation ledgers, each with a single round of the 4 leader slots, is good enough..

Think there's some complexity in how we'd need to "fast-forward" through non-leader periods, but don't think it'd require loading from a snapshot if done correctly.

Probably would add even more complexity, but could we not treat the simmed blocks as some sort of duplicate block (or a fork)?
After each 4 slot sim, we drop the simmed blocks (duplicates/fork) for the actual blocks which we then replay as normal until we get close to next leader period.

That's what I had in mind for the fast-forward, since I definitely agree that we can't just continue on from the simmed blocks and act like things will just work from there haha.
If we were to do this would probably want to collaborate with ashwin or stevecz on how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.

oh, that idea sounds nice.

how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.

i think this just can be done with read-write side blockstore.

oh, that idea sounds nice.

Cool! I think that'd really improve the usability, but we should definitely leave it for a follow-up PR, this one is big enough as is 😄

apfitzge · 2024-08-26T14:24:12Z

why clone instead of letting the sender thread take ownership of these batches?

just reviewing on github, so can't see the type - is this Arc-ed, or just a clone of actual packet batch?

the otherwise useful BTreeMap::range() doesn't allow that, because it only borrows the entries; it's not range_into or something like that.

However, I noticed that I can use BTreeMap::split_off(): 3a3131e2adfb8d858d64b2143236f7f0e11a9f2a

just reviewing on github, so can't see the type - is this Arc-ed, or just a clone of actual packet batch?

fyi, this is Arc-ed.

However, I noticed that I can use BTreeMap::split_off(): 3a3131e

related to the above, i further improved simulation jitter: 5e77bd7fa0e0ec1a453485a7aca69013fc1540b2
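For context, BTreeMap::split_off(&key) moves every entry with a key greater than or equal to `key` into a new map and leaves the smaller keys behind, so ownership can be transferred without cloning. Below is a minimal sketch of using it to take the entries that are already due; the `take_due` helper and the `u64` timestamp key are assumptions for illustration, not the code from the commits above.

```rust
use std::collections::BTreeMap;
use std::mem;

/// Removes and returns every entry whose key is strictly below `cutoff`,
/// leaving the remaining (not yet due) entries in `pending`.
fn take_due<V>(pending: &mut BTreeMap<u64, V>, cutoff: u64) -> BTreeMap<u64, V> {
    // split_off keeps keys < cutoff in `pending` and returns keys >= cutoff.
    let not_yet_due = pending.split_off(&cutoff);
    // Swap the halves so that the caller receives the due entries by value.
    mem::replace(pending, not_yet_due)
}
```

The returned map owns its values, so a sender thread can consume it with into_values() or into_iter() and no clone is needed.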

apfitzge · 2024-08-26T14:28:22Z

Why not use a VecDeque here? These are, afaik, always in order from the files (assuming we read the files in the correct order, which is easy enough to do).

added comments: d3bf0d9c60c85acfc0df38d4bda52b0feb98bbc1

btw, somewhat related to this, I noticed we should rather stop using BTreeMap::into_iter(): 5e77bd7fa0e0ec1a453485a7aca69013fc1540b2

ryoqun · 2024-08-27T03:17:41Z

ref: rust-lang/rustfmt#5920

apfitzge · 2024-08-27T13:43:34Z

@ryoqun The figures in this comment #2733 (comment) raised a few questions for me

How do these schedulers compare if we give an equal number of worker threads to all impls? The unified scheduler seems to have more than double the threads; what if we also give it 4?
It definitely seems the unified scheduler does a better job of parallelism due to the lack of batching, as well as its aggressive approach towards parallelism. I'm curious how this might affect fee-collection done by the leader. Certainly we can process many non-contentious transactions, but those will use our blockspace more quickly and potentially use up blockspace that more valuable (greedy leader maximizing per-cu rewards) transactions that are currently blocked would otherwise get.

I think these questions are outside the scope of this PR - so maybe we do not focus on them here. Instead, I would like to ask about inspection of the simulated blocks:

Are simulated blocks saved to blockstore? Is it possible for us to save them to a "separate" blockstore or parameterize it in some way? Ideally we could run simulation for several scheduler implementations, configurations, etc, and then run some analysis on the blocks produced after the fact so that we can compare them all.

Basically it'd be really nice to add some block analysis stuff in ledger-tool, and then run that command in a bash loop to get some metrics about block "quality":

* CU fullness
* CU max depth
* rewards
* parallelism

ryoqun · 2024-08-28T08:15:40Z

remove this line completely...

done: 6d012c216d491d71a4f14de24c5e9a669d806e7a

ryoqun · 2024-08-28T08:16:57Z

well, intentionally wrap this with RwLock to mimic real ClusterInfo?

done: 535f4da8118afbd97e356489b12e81c7b4443ccf

ryoqun · 2024-08-28T14:20:09Z

@ryoqun The figures in this comment #2733 (comment) raised a few questions for me

How do these schedulers compare if we give an equal number of worker threads to all impls? The unified scheduler seems to have more than double the threads; what if we also give it 4?

yeah, i forgot to align the thread counts... I'll do an in-depth comparison later. unified scheduler takes longer to clear the buffer when the thread count is very low, like 4. that said, it scales well to saturate all of the worker threads. here's a sample:

also, now that unified scheduler is enabled for block verification, i think we should increase banking thread count to like 12-16.

It definitely seems the unified scheduler does a better job of parallelism due to the lack of batching, as well as its aggressive approach towards parallelism. I'm curious how this might affect fee-collection done by the leader. Certainly we can process many non-contentious transactions, but those will use our blockspace more quickly and potentially use up blockspace that more valuable (greedy leader maximizing per-cu rewards) transactions that are currently blocked would otherwise get.

indeed, the unified scheduler can't get rid of the curse of unbatched overhead, but i think it can be optimized for greedy-leader-maximizing per-cu rewards. Currently, all non-contentious transactions are directly buffered to crossbeam channels with unbounded buffer depth in the unified scheduler, as you know. but I'm planning to place a priority queue for the freely-reorderable transactions in front of them at the scheduler thread side, while keeping at most 1.5 * handler_thread_count tasks buffered in the crossbeam channels. In this way, the reprioritization latency for a higher-paying task is about 1.5 * the avg execution time of a single transaction.
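The plan above could look roughly like the following. This is only a sketch under stated assumptions: it uses std's bounded sync_channel in place of crossbeam, rounds 1.5 * handler_thread_count up, and every name here (`Task`, `PrioritizingSender`, `bounded_depth`, `fill`) is hypothetical rather than part of the unified scheduler.

```rust
use std::collections::BinaryHeap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// The derived Ord compares `priority` first, so the max-heap pops the
// highest-paying task first (ties are broken by `id`).
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    priority: u64,
    id: u64,
}

/// Channel depth of 1.5 * handler_thread_count, rounded up.
fn bounded_depth(handler_thread_count: usize) -> usize {
    (handler_thread_count * 3 + 1) / 2
}

/// Scheduler-side priority queue sitting in front of a bounded channel.
struct PrioritizingSender {
    queue: BinaryHeap<Task>,
    sender: SyncSender<Task>,
}

impl PrioritizingSender {
    fn new(handler_thread_count: usize) -> (Self, Receiver<Task>) {
        let (sender, receiver) = sync_channel(bounded_depth(handler_thread_count));
        (Self { queue: BinaryHeap::new(), sender }, receiver)
    }

    /// Buffers a freely-reorderable task until there is room in the channel.
    fn push(&mut self, task: Task) {
        self.queue.push(task);
    }

    /// Moves the highest-priority tasks into the channel until it is full
    /// (or disconnected) and returns how many tasks were sent.
    fn fill(&mut self) -> usize {
        let mut sent = 0;
        while let Some(task) = self.queue.pop() {
            match self.sender.try_send(task) {
                Ok(()) => sent += 1,
                Err(TrySendError::Full(task)) | Err(TrySendError::Disconnected(task)) => {
                    // No room right now: keep the task for a later fill.
                    self.queue.push(task);
                    break;
                }
            }
        }
        sent
    }
}
```

With this shape, a late-arriving high-priority task only waits behind the tasks already sitting in the bounded channel, which is where the estimate of about 1.5 * the average execution time comes from.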

I think these questions are outside the scope of this PR

👍 anyway, i put some thought above.

so maybe we do not focus on them here. Instead, I would like to ask about inspection of the simulated blocks:

Are simulated blocks saved to blockstore?

yes.

Is it possible for us to save them to a "separate" blockstore or parameterize it in some way? Ideally we could run simulation for several scheduler implementations, configurations, etc, and then run some analysis on the blocks produced after the fact so that we can compare them all.

Basically it'd be really nice to add some block analysis stuff in ledger-tool, and then run that command in a bash loop to get some metrics about block "quality":
* CU fullness

* CU max depth

* rewards

* parallelism

yeah, we can do this easily.

ryoqun · 2024-08-28T15:01:10Z

@apfitzge thanks for all the effort of code-reviewing. sans the general clean up of banking_simulation.rs, I think i've addressed all comments. I'm planning to do the clean up tomorrow.

ryoqun · 2024-09-11T02:06:55Z

fate of 1.4k lines of pr. ;)

I was forced to rebase this pr onto #2172

steviez

LGTM and given #2733 (review), think we can push this!

want to merge this

* Introduce ledger-tool simulate-block-production
* Move counting code out of time-sensitive loop
* Avoid misleading ::clone() altogether
* Use while instead of loop+break
* Add comment of using BTreeMap
* Reduce simulation jitter due to mem deallocs
* Rename to CostTracker::new_from_parent_limits()
* Make ::load() take a slice
* Clean up retracer code a bit
* Add comment about BaningTracer even inside sim
* Remove redundant dcou dev-dependencies
* Apply suggestions from code review Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
* Fix up and promote to doc comments
* Make warm-up code and doc simpler
* Further clean up timed_batches_to_send
* Fix wrong units...
* Replace new_with_dummy_keypair() with traits
* Tweak --no-block-cost-limits description
* Remove redundant dev-dependencies
* Use RwLock to mimic real ClusterInfo
* Fix typo
* Refactor too long BankingSimulator::start()
* Reduce indent
* Calculate required_duration in advance
* Use correct format specifier instead of cast
* Align formatting by using ::*
* Make envs overridable
* Add comment for SOLANA_VALIDATOR_EXIT_TIMEOUT
* Clarify comment a bit
* Fix typoss
* Fix typos Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
* Use correct variant name: DeserializeError
* Remove SimulatorLoopLogger::new()
* Fix typos more
* Add explicit _batch in field names
* Avoid unneeded events: Vec<_> buffering
* Manually adjust logging code styles
* Align name: spawn_sender_loop/enter_simulator_loop
* Refactor by introducing {Sender,Simulator}Loop
* Fix out-of-sync sim due to timed preprocessing
* Fix too-early base_simulation_time creation
* Don't log confusing info! after leader slots
* Add justification comment of BroadcastStage
* Align timeout values
* Comment about snapshot_slot=50
* Don't squash all errors unconditionally
* Remove repetitive exitence check
* Promote no_block_cost_limits logging level
* Make ci/run-sanity.sh more robust
* Improve wordking of --enable-hash-overrides
* Remove marker-file based abortion mechanism
* Remove needless touch

---------

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>

ryoqun force-pushed the simulate-block-production branch 8 times, most recently from f1dbbd9 to f965399 · August 26, 2024 06:05