Skip to content

feat(ethexe/node-loader): add per-batch watchdog, SIGTERM, heartbeat#5393

Open
grishasobol wants to merge 2 commits intomasterfrom
feat/loader-watchdog-and-shutdown
Open

feat(ethexe/node-loader): add per-batch watchdog, SIGTERM, heartbeat#5393
grishasobol wants to merge 2 commits intomasterfrom
feat/loader-watchdog-and-shutdown

Conversation

@grishasobol
Copy link
Copy Markdown
Member

  • Wrap each batch in a 180s tokio::time::timeout so a hung RPC call drops the worker's batch instead of stalling the pool forever.
  • Listen for SIGTERM in addition to SIGINT so docker stop triggers the same graceful drain + summary as Ctrl+C.
  • Emit an INFO heartbeat every 10s with completed/failed/in_flight counters so a stalled pool is visible without grepping DEBUG.

@semanticdiff-com
Copy link
Copy Markdown

semanticdiff-com Bot commented Apr 29, 2026

Review changes with  SemanticDiff

Changed Files
File Status
  ethexe/node-loader/src/batch.rs  36% smaller
  ethexe/node-loader/src/main.rs  1% smaller

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness and observability of the node loader. By implementing per-batch timeouts, the system can now recover from stalled RPC operations. Additionally, the inclusion of SIGTERM support and periodic progress heartbeats ensures that the loader behaves more predictably in containerized environments and provides better operational feedback.

Highlights

  • Batch Watchdog: Introduced a 180s timeout for each batch to prevent hung RPC calls from stalling the entire worker pool.
  • Graceful Shutdown: Updated signal handling to listen for both SIGINT and SIGTERM, ensuring proper cleanup when using commands like 'docker stop'.
  • Progress Monitoring: Added a 10s heartbeat log that reports completed, failed, and in-flight batch counts to improve visibility into pool status.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@grishasobol grishasobol requested a review from playX18 April 29, 2026 15:41
- Wrap each batch in a 180s tokio::time::timeout so a hung RPC call
  drops the worker's batch instead of stalling the pool forever.
- Listen for SIGTERM in addition to SIGINT so `docker stop` triggers
  the same graceful drain + summary as Ctrl+C.
- Emit an INFO heartbeat every 10s with completed/failed/in_flight
  counters so a stalled pool is visible without grepping DEBUG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@grishasobol grishasobol force-pushed the feat/loader-watchdog-and-shutdown branch from d9ccb13 to 1e28fae Compare April 29, 2026 15:44
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a watchdog timeout for batch execution and a periodic heartbeat logger to monitor pool progress in the node-loader. It also enhances shutdown handling by supporting both SIGINT and SIGTERM signals for graceful termination. Review feedback suggests simplifying the heartbeat interval initialization and refining the placement of signal handling within asynchronous selection blocks.

I am having trouble creating individual review comments. Click here to see my feedback.

ethexe/node-loader/src/batch.rs (372-374)

medium

The heartbeat interval is initialized and then immediately ticked. This is redundant because the interval is already configured to skip missed ticks. The initial tick can be removed to simplify the logic.

let mut heartbeat = tokio::time::interval(HEARTBEAT_INTERVAL);
heartbeat.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

ethexe/node-loader/src/main.rs (447)

medium

The tokio::signal::ctrl_c() call should be placed directly within the tokio::select! branch to ensure it is polled correctly. This avoids potential issues with signal consumption when the future is created outside the select block.

result = tokio::signal::ctrl_c() => Ok("Ctrl+C / SIGINT"),

@grishasobol
Copy link
Copy Markdown
Member Author

@claude fix review and also fix windows problems https://github.com/gear-tech/gear/actions/runs/25118845247/job/73613584380?pr=5393

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

Claude finished @grishasobol's task in 3m 5s —— View job


Fixed Windows CI Failure

  • Read changed files
  • Investigated Windows CI failure — tokio::signal::unix is unavailable on x86_64-pc-windows-msvc
  • Fixed Windows compilation error (gated SIGTERM handling behind #[cfg(unix)])
  • Addressed Gemini review feedback (see notes below)
  • Committed and pushed

Fix applied (ethexe/node-loader/src/main.rs):

Wrapped the tokio::signal::unix import and sigterm listener in #[cfg(unix)], with a #[cfg(not(unix))] fallback that handles only SIGINT via ctrl_c():

async fn wait_for_shutdown_signal() -> std::io::Result<&'static str> {
    #[cfg(unix)]
    {
        use tokio::signal::unix::{SignalKind, signal};
        let mut sigterm = signal(SignalKind::terminate())?;
        tokio::select! {
            result = tokio::signal::ctrl_c() => result.map(|()| "Ctrl+C / SIGINT"),
            _ = sigterm.recv() => Ok("SIGTERM"),
        }
    }
    #[cfg(not(unix))]
    {
        tokio::signal::ctrl_c().await.map(|()| "Ctrl+C / SIGINT")
    }
}

Gemini review notes:

  • Heartbeat initial tick (batch.rs:364): The heartbeat.tick().await is intentionally kept — it consumes the first immediate tick so the first heartbeat fires after HEARTBEAT_INTERVAL (10s), not at startup. MissedTickBehavior::Skip affects missed ticks, not the initial one. Removing it as Gemini suggested would cause a spurious heartbeat before any batch starts.
  • ctrl_c() placement (main.rs:444): Already inline inside the select! branch (result = tokio::signal::ctrl_c()). Gemini's suggestion matches the existing code — no change needed.
    | Branch

tokio::signal::unix is unavailable on Windows (x86_64-pc-windows-msvc),
causing a compilation error. Wrap the SIGTERM branch with #[cfg(unix)]
and add a #[cfg(not(unix))] fallback that handles SIGINT only.

Co-authored-by: Gregory Sobol <grishasobol@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

/// Per-batch watchdog: drop and reschedule a batch if a hung RPC call parks
/// the worker. Generous: code validation alone takes ~14 s for 5 codes.
const BATCH_TIMEOUT: Duration = Duration::from_secs(180);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth making batch timeout somewhat dynamic, but 180s can be fine if batch size is not that large.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants