
Builder deposits optimisation #9311

Open
pawanjay176 wants to merge 23 commits into sigp:unstable from pawanjay176:builder-deposits-optimisation


@pawanjay176
Member

Issue Addressed

N/A

Proposed Changes

Adds an OnboardBuildersCache to the beacon chain to pre-verify and cache builder deposits. Caching is important in 2 places:

  1. onboard_builders_from_pending_deposits is a fork transition function that scales with the number of pending deposits. In the worst case, the pending deposits queue can be DoS'd with a large number of 1 ETH deposits to make nodes do more verification work at the fork boundary. Even though the pending_deposits queue is effectively capped by the gas limit, this cache makes even a theoretical attack ineffective by doing the full verification in milliseconds instead of seconds.
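    Some numbers Claude cooked up:

    | Deposits | Capital | Cached           | Batch verify (without cache) |
    |----------|---------|------------------|------------------------------|
    | 10K      | $26M    | ~35ms            | ~520ms                       |
    | 50K      | $130M   | ~175ms           | ~2.6s                        |
    | 96K      | $250M   | 271ms (measured) | ~5s                          |
    | 100K     | $260M   | ~285ms           | ~5.2s                        |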
  2. Post-fork, process_operations may need to verify all builder deposits in the hot path. The engine API currently allows a maximum of 8192 deposit requests to be sent for block production, so in the worst case we may need to verify 8192 signatures during block processing. The deposits we need to process are received in the payload envelope ~6 seconds into the slot, and we process them when a new beacon block that builds on the payload arrives ~3 seconds into the next slot. So we have plenty of time to verify these signatures before we actually need to process them.

The cache is threaded through both per_slot_processing (for the first case) and per_block_processing (for the second case).
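
To illustrate the pre-verify-then-look-up idea described above, here is a minimal sketch. It is not the PR's OnboardBuildersCache: `DepositKey`, `PreVerifiedCache`, and the `verify` closure are hypothetical names, and the closure stands in for the expensive BLS signature check.

```rust
use std::collections::HashMap;

/// Hypothetical key identifying a deposit. Real code would likely derive it
/// from the deposit's pubkey, withdrawal credentials, amount and signature.
pub type DepositKey = [u8; 32];

/// Hypothetical cache mapping a deposit to its pre-computed verification result.
#[derive(Default)]
pub struct PreVerifiedCache {
    results: HashMap<DepositKey, bool>,
}

impl PreVerifiedCache {
    /// Verify deposits ahead of time (off the hot path) and remember the result.
    /// `verify` is an assumption standing in for the signature check.
    pub fn pre_verify<F>(&mut self, keys: &[DepositKey], verify: F)
    where
        F: Fn(&DepositKey) -> bool,
    {
        for key in keys {
            // Deposits that were already verified are not verified again.
            self.results.entry(*key).or_insert_with(|| verify(key));
        }
    }

    /// Hot-path lookup: use the cached result on a hit and fall back to
    /// full verification on a cache miss.
    pub fn is_valid<F>(&self, key: &DepositKey, verify: F) -> bool
    where
        F: Fn(&DepositKey) -> bool,
    {
        match self.results.get(key) {
            Some(valid) => *valid,
            None => verify(key),
        }
    }
}
```

In this sketch, `pre_verify` would be called when deposits first become known, and `is_valid` at the fork transition or during block processing, so the expensive check only runs in the hot path on a cache miss.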

Additional Info

Tested with the following Kurtosis config:

participants_matrix:
  el:
    - el_type: geth
      el_image: ethpandaops/geth:bal-devnet-6
      el_extra_params: ["--rpc.txfeecap=0", "--rpc.gascap=0"]
  cl:
    - cl_type: lighthouse
      cl_image: lighthouse-local:latest
      cl_log_level: debug
      count: 2
    - cl_type: prysm
      cl_image: ethpandaops/prysm-beacon-chain:glamsterdam-devnet-3-deposits
      count: 2
network_params:
  gloas_fork_epoch: 1
  withdrawal_type: "0x01"
  validator_balance: 40000
  gas_limit: 5000000000
  genesis_gaslimit: 5000000000

additional_services:
  - dora
  - assertoor

assertoor_params:
  image: ethpandaops/assertoor:master
  run_stability_check: false
  run_block_proposal_check: false
  tests:
    - file: "https://raw.githubusercontent.com/ethpandaops/assertoor/refs/heads/master/playbooks/gloas-dev/builder-deposit-spam.yaml"
      config:
        batchSize: 256
        pendingBatches: 16 # 8192 deposits per block
        # totalDeposits: 96214
        totalDeposits: 262140
        skipForkActivationCheck: true

dora_params:
  image: ethpandaops/dora:master

@pawanjay176 added the work-in-progress, optimization, and gloas labels May 17, 2026
@pawanjay176 added the ready-for-review label and removed the work-in-progress label May 19, 2026
@pawanjay176 requested a review from eserilev May 19, 2026 00:34
@pawanjay176
Member Author

This is ready for review now.

@mergify

mergify Bot commented May 19, 2026

Some required checks have failed. Could you please take a look @pawanjay176? 🙏

@mergify added the waiting-on-author label and removed the ready-for-review label May 19, 2026
Member

@eserilev left a comment

Looks good on the whole; just a few small suggestions, a question, and a few nits.

Comment on lines +107 to +114
let decompressed = deposit_data
    .par_iter()
    .enumerate()
    .map(|(index, deposit)| {
        deposit_pubkey_signature_message(deposit, spec)
            .map(|(public_key, signature, message)| (index, public_key, signature, message))
    })
    .collect::<Vec<_>>();
Member

I think we should use one of the scoped rayon pools instead of the global pool

Member Author

See below comment

let mut results = vec![false; decompressed.len()];

let batch_results = decompressed
    .par_chunks(DEPOSIT_SIGNATURE_BATCH_SIZE)
Member

I think we should use a scoped rayon pool here as well

Member Author

I think instead of using the scoped rayon pool here, which would involve threading the task executor all the way to state_processing, we can instead spawn the tasks that trigger signature verification with the rayon pool. Implemented in 50cd378

Member Author

The only place where we might call rayon without a scoped pool is in process_deposit_requests when we have cache misses for the signature verification. I think that is okay.
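
For readers following the batching discussion, here is a rough sketch of chunked verification that produces one result per item. It is written sequentially with the standard library only; the PR itself uses rayon's par_chunks. The fallback to per-item checks when a batch fails is an assumption, not something confirmed in this thread, and `BATCH_SIZE`, `verify_batch`, and `verify_one` are hypothetical stand-ins.

```rust
/// Hypothetical batch size. The PR uses a constant named
/// `DEPOSIT_SIGNATURE_BATCH_SIZE` whose value is not shown in this thread.
const BATCH_SIZE: usize = 4;

/// Verify `items` in fixed-size batches and return one result per item.
/// `verify_batch` and `verify_one` stand in for batch and single
/// signature verification.
pub fn verify_in_batches<T, B, S>(items: &[T], verify_batch: B, verify_one: S) -> Vec<bool>
where
    B: Fn(&[T]) -> bool,
    S: Fn(&T) -> bool,
{
    let mut results = vec![false; items.len()];
    for (chunk_index, chunk) in items.chunks(BATCH_SIZE).enumerate() {
        let offset = chunk_index * BATCH_SIZE;
        if verify_batch(chunk) {
            // The whole batch is valid: mark every item in it.
            for slot in &mut results[offset..offset + chunk.len()] {
                *slot = true;
            }
        } else {
            // Assumption: on batch failure, re-check items one by one so a
            // single bad signature does not invalidate its neighbours.
            for (i, item) in chunk.iter().enumerate() {
                results[offset + i] = verify_one(item);
            }
        }
    }
    results
}
```

Because each chunk writes to a disjoint range of `results`, the loop body is independent per chunk, which is what makes a parallel version over chunks straightforward.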

Comment thread consensus/state_processing/src/builder_deposits_cache.rs
Comment thread consensus/state_processing/src/per_block_processing/process_operations.rs Outdated
Comment thread consensus/state_processing/src/per_block_processing/process_operations.rs Outdated
Comment thread consensus/state_processing/src/per_block_processing/process_operations.rs Outdated
/// This can be significantly slower if there are many builder deposits
/// that need to be onboarded at the fork boundary. This variant should be used
/// for tests and other non-production paths.
FullVerification,
Member

it looks like we are only running tests for GloasVerificationContext::FullVerification, would be nice to write tests for the other two variants if possible

Member Author

Added more tests in de89987

Comment thread consensus/state_processing/src/upgrade/gloas.rs
Comment thread consensus/state_processing/src/builder_deposits_cache.rs
@pawanjay176
Member Author

@eserilev Removed the GloasVerificationContext::SkipBuilderOnboarding variant because I don't think it was safe.
partial_state_advance promises to return a valid state, just without the roots calculated, so skipping builder onboarding there feels like violating that contract, and the assumption that the advanced state doesn't need builders could be invalidated by a later refactor. I ended up threading the cache everywhere, which is a little ugly, but I think it's necessary.

@pawanjay176 added the ready-for-review label and removed the waiting-on-author label May 19, 2026
@mergify

mergify Bot commented May 20, 2026

Some required checks have failed. Could you please take a look @pawanjay176? 🙏

@mergify added the waiting-on-author label and removed the ready-for-review label May 20, 2026
// it is `O(n * m)` where `n` is max 8192 and `m` is max 128M.
fn is_pending_validator<E: EthSpec>(
state: &BeaconState<E>,
#[instrument(skip_all, level = "debug")]
Member

this may create a large number of spans, we probably don't need the span per validator?

Comment thread beacon_node/beacon_chain/src/beacon_chain.rs Outdated
@pawanjay176
Member Author

pawanjay176 commented May 22, 2026

Note to reviewer: 70b8594 changes process_deposit_requests_post_gloas significantly. To optimise it, the function now diverges quite a bit from the spec function.

The observation was that with a larger state.builders list, inserting a new builder required a full iteration of the list. This is because builder indices are reusable, and add_builder_to_registry uses get_index_for_new_builder, which iterates through the entire list to check whether any index is available for reuse. With higher builder counts, this becomes significant.

We now cache all reusable indices in a first sweep before reusing anything, which makes the insertion time with a large builder count much more manageable. Again, this scenario is highly unlikely on mainnet.
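
A rough sketch of the single-sweep idea, under simplifying assumptions: `Builder` here is a hypothetical three-field stand-in for the real type, the reuse condition is the one quoted later in this review (withdrawable_epoch <= current_epoch and zero balance), and `collect_reusable_indices` and `add_builders` are illustrative names rather than the PR's functions.

```rust
use std::collections::VecDeque;

/// Simplified, hypothetical builder record (the real type has more fields).
#[derive(Clone, Debug, PartialEq)]
pub struct Builder {
    pub pubkey: u64, // stand-in for a BLS public key
    pub balance: u64,
    pub withdrawable_epoch: u64,
}

/// One sweep over the registry, collecting every index that can be reused,
/// in ascending order.
pub fn collect_reusable_indices(builders: &[Builder], current_epoch: u64) -> VecDeque<usize> {
    builders
        .iter()
        .enumerate()
        .filter(|(_, b)| b.withdrawable_epoch <= current_epoch && b.balance == 0)
        .map(|(index, _)| index)
        .collect()
}

/// Insert new builders, consuming the precomputed reusable slots (lowest
/// index first) before appending to the end of the list.
pub fn add_builders(builders: &mut Vec<Builder>, new: Vec<Builder>, current_epoch: u64) {
    let mut reusable = collect_reusable_indices(builders.as_slice(), current_epoch);
    for builder in new {
        match reusable.pop_front() {
            Some(index) => builders[index] = builder,
            None => builders.push(builder),
        }
    }
}
```

With n existing builders and k insertions, this does one O(n) sweep plus O(k) insertions, instead of a fresh O(n) scan for each inserted builder.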

@pawanjay176 added the ready-for-review label and removed the waiting-on-author label May 22, 2026
@mergify

mergify Bot commented May 22, 2026

Some required checks have failed. Could you please take a look @pawanjay176? 🙏

@mergify added the waiting-on-author label and removed the ready-for-review label May 22, 2026
}

builder_deposit_keys.push(key);
builder_deposits.push(deposit_data);
Member

I thought about whether it's worth de-duplicating here, but it seems like the risk and potential impact are low?

Member Author

Yeah i was considering getting rid of cache_pending_deposits post fork so this could make it easier.
I'm happy to consolidate logic in one place though.

}

/// Transform a `Fulu` state into a `Gloas` state.
#[instrument(skip_all)]
Member

Do you see this span when you test it locally?
I think we might have to rename the advance_head span to lh_advance_head so it gets exported to tempo.

Member Author

I'm happy to just drop it too. I have already tested and benchmarked it with far worse cases than we'll ever see on mainnet, and this happens just once at the fork transition. I think it's fine to remove it.

// perform the signature verification in batches.
// We have until the fork transition for the cache to be used, so we use the low priority pool.
executor.spawn_blocking_with_rayon(
    move || cache.add_new_pending_deposits::<T::EthSpec>(&state, &spec),
Member

Is this pretty much a no-op after the fork?

Member Author

Yeah, it is. I hadn't considered it. We could potentially do this only for pre-Gloas states and then delete it post-Gloas.


for (index, builder) in state_builders.iter().enumerate() {
    builder_index_map.insert(builder.pubkey, index as BuilderIndex);
    if builder.withdrawable_epoch <= current_epoch && builder.balance == 0 {
Member

If the builder tops up in the same block and its balance increases, then we could accidentally make this index reusable, right? Is this possible?

if let Some(builder_index) = builder_index {
    state
        .builders_mut()?
        .get_mut(builder_index as usize)
        .ok_or(BeaconStateError::UnknownBuilder(builder_index))?
        .balance
        .safe_add_assign(deposit_request.amount)?;

Member Author

awesome catch!

Member Author

Fixed in 656e70a and added a test as well. Really great catch. I'm going to try and upstream this to the EF tests
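
The hazard raised in this thread can be sketched as follows. This is only an illustration of one way to avoid it and is not taken from 656e70a, whose contents are not shown here. All types and names are hypothetical simplifications; the key step is dropping an index from the precomputed reusable set once a top-up gives it a non-zero balance.

```rust
use std::collections::{BTreeSet, HashMap};

/// Simplified, hypothetical builder record.
#[derive(Clone, Debug)]
pub struct Builder {
    pub pubkey: u64, // stand-in for a BLS public key
    pub balance: u64,
    pub withdrawable_epoch: u64,
}

/// Simplified deposit: tops up a known pubkey or registers a new builder.
pub struct Deposit {
    pub pubkey: u64,
    pub amount: u64,
}

pub fn process_deposits(builders: &mut Vec<Builder>, deposits: &[Deposit], current_epoch: u64) {
    // First sweep: index builders by pubkey and collect reusable slots.
    let mut index_by_pubkey: HashMap<u64, usize> = HashMap::new();
    let mut reusable: BTreeSet<usize> = BTreeSet::new();
    for (index, b) in builders.iter().enumerate() {
        index_by_pubkey.insert(b.pubkey, index);
        if b.withdrawable_epoch <= current_epoch && b.balance == 0 {
            reusable.insert(index);
        }
    }

    for d in deposits {
        let existing = index_by_pubkey.get(&d.pubkey).copied();
        match existing {
            Some(index) => {
                // Top-up. Real code uses checked arithmetic (`safe_add_assign`).
                builders[index].balance = builders[index].balance.saturating_add(d.amount);
                // The slot may have been flagged reusable in the sweep; once its
                // balance is non-zero it must not be handed to a new builder.
                if builders[index].balance > 0 {
                    reusable.remove(&index);
                }
            }
            None => {
                let new = Builder {
                    pubkey: d.pubkey,
                    balance: d.amount,
                    withdrawable_epoch: u64::MAX,
                };
                let index = match reusable.pop_first() {
                    Some(index) => {
                        // Forget the evicted builder's pubkey before overwriting.
                        index_by_pubkey.remove(&builders[index].pubkey);
                        builders[index] = new;
                        index
                    }
                    None => {
                        builders.push(new);
                        builders.len() - 1
                    }
                };
                index_by_pubkey.insert(d.pubkey, index);
            }
        }
    }
}
```

Without the `reusable.remove` step, a top-up followed by a new-builder deposit in the same block would overwrite the slot that had just been topped up.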

// perform the signature verification in batches.
executor.spawn_blocking_with_rayon(
    move || cache.cache_deposit_requests(&deposits, &spec),
    task_executor::RayonPoolType::HighPriority,
Member

This isn't work in the hot path; however, I think it's fine leaving it high priority, as we want to be ready ASAP in case the payload arrives late in the slot. Is this what you were thinking?

Member Author

Yeah pretty much. Better to have everything verified to reduce cache misses in case of late envelopes

}

/// Helper to create a harness with Fulu genesis and gloas at a later epoch.
async fn get_fulu_harness_with_gloas_scheduled<E: EthSpec>(
Member

nit: move this to the top near the other similar functions

@jimmygchen
Member

Looks like the failing test here revealed a bug, and the invalid deposit got added to pending deposits.

Might want to skip (continue) if the signature is invalid here:

https://github.com/sigp/lighthouse/actions/runs/26260083391/job/77291404904?pr=9311


Labels

gloas, optimization, waiting-on-author

3 participants