Skip to content

fix(envoy-client): dont abort drain after 20s#5043

Draft
MasterPtato wants to merge 1 commit into
05-11-fix_wasm_point_wasm-pack_build_to_new_wasm-bindgen_repofrom
05-11-fix_envoy-client_dont_abort_drain_after_20s
Draft

fix(envoy-client): dont abort drain after 20s#5043
MasterPtato wants to merge 1 commit into
05-11-fix_wasm_point_wasm-pack_build_to_new_wasm-bindgen_repofrom
05-11-fix_envoy-client_dont_abort_drain_after_20s

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@MasterPtato MasterPtato changed the base branch from counter-latency/require-engine-ping-health to graphite-base/5043 May 12, 2026 01:21
@MasterPtato MasterPtato force-pushed the 05-11-fix_envoy-client_dont_abort_drain_after_20s branch from 2ad5377 to b28585d Compare May 12, 2026 01:21
@MasterPtato MasterPtato changed the base branch from graphite-base/5043 to 05-11-fix_wasm_point_wasm-pack_build_to_new_wasm-bindgen_repo May 12, 2026 01:21
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

Code Review

Overview

This PR has two goals: (1) replace the hard-coded 20s SHUTDOWN_DRAIN_TIMEOUT in rivetkit-core with the server-provided actor_stop_threshold from protocol metadata, and (2) refactor the kitchen-sink example to use an explicit resolveMode() helper instead of inline env-var checks.

The kitchen-sink refactor is clean and well-structured. The core Rust change, however, contains critical compile errors that need to be fixed before this is ready to merge.


Critical: Compile Errors in registry/mod.rs

Bug 1: SHUTDOWN_DRAIN_TIMEOUT is removed but still referenced

The constant is deleted at the top of the file, but the timeout() call still uses it:

// constant removed ↑
match timeout(
    Duration::from_millis(SHUTDOWN_DRAIN_TIMEOUT as u64),  // ← compile error
    handle.shutdown_and_wait(false),
)

Additionally, SHUTDOWN_DRAIN_TIMEOUT was a Durationas u64 on a Duration doesn't type-check regardless.

Bug 2: stop_threshold is computed but never used

let stop_threshold = handle
    .get_protocol_metadata  // ← missing () — see Bug 3
    .await
    .map(|x| x.actor_stop_threshold)
    .unwrap_or(30 * 60 * 1000);

match timeout(
    Duration::from_millis(SHUTDOWN_DRAIN_TIMEOUT as u64),  // ← should be stop_threshold
    ...
)

The stop_threshold variable is the intended replacement, but the timeout call still uses the deleted constant. The fix is:

match timeout(
    Duration::from_millis(stop_threshold as u64),
    handle.shutdown_and_wait(false),
)

Bug 3: Missing parentheses on async method call

get_protocol_metadata is defined as pub async fn get_protocol_metadata(&self). The code does:

handle.get_protocol_metadata.await   // ← field access, not a method call

Should be:

handle.get_protocol_metadata().await

Minor Issues

resolveMode() default changes implicit behavior

Previously in server.ts, running without any env vars resulted in serverlessMode = false (serverful). With the new code, the same environment defaults to "serverless" mode. This is intentional (the dev script now sets RIVET_KITCHEN_SINK_MODE=serverful explicitly), but it is a behavioral change for any existing local dev setup that relied on the implicit default.

TODO comment is untracked

// TODO: Move into envoy-client since timing out has to do with protocol compliance

This is noted inline but not tracked in .agent/todo/. Consider adding a todo entry so it doesn't get lost.

i64 to u64 cast

actor_stop_threshold is an i64 in the protocol schema. The cast stop_threshold as u64 is fine in practice (threshold is always positive), but wrapping on negative input is silent. A .max(0) guard before the cast would be more explicit.


Positive Changes

  • The mode.ts refactor is clean: explicit union type, validated env var with a useful error message, and centralized mode detection.
  • Removing the hard-coded 20s timeout in favour of the engine-provided actor_stop_threshold (default 30 min) is the right fix — the old timeout was too aggressive and would abort drains before the engine's own lost-timeout fired.
  • The comment on the serverless handler guard in server.ts correctly explains why the serverful path must not wire /api/rivet handlers.
  • Cosmetic config changes (2_000, 5 * 60 * 1000) improve readability.
  • node:22-trixie-slim is a reasonable base image bump.
  • Docstring indentation fix in index.d.ts is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant