
fix(kitchen-sink): correct broken serverless drainGracePeriod default #5006

Open
NathanFlurry wants to merge 1 commit into graphite-base/5006 from kitchen-sink/fix-drain-grace-period

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

drainGracePeriod and requestLifespan both defaulted to 15*60 seconds.
pegboard-outbound computes sleep_until_drain as
request_lifespan.saturating_sub(drain_grace_period), so when both are
equal, drain fires the instant the outbound HTTP request opens. Every
fresh actor allocation then receives GoingAway before reaching Running,
producing guard.actor_ready_timeout failures on default config.

Set drainGracePeriod default to 60s, giving the outbound a 14-minute
working window before drain. Add comment documenting the constraint
that drainGracePeriod must be strictly less than requestLifespan.

Also add an increment-only load-test harness (scripts/load-test.ts +
load-test-counter actor) used to verify the fix. With the fix, a 10-worker,
5-minute run achieves 99.96% success vs 0% on bare defaults.
@railway-app

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5006 environment in rivet-frontend

| Service | Status | Updated (UTC) |
| --- | --- | --- |
| website | 😴 Sleeping | May 9, 2026 at 6:10 am |
| ladle | ❌ Build Failed | May 9, 2026 at 5:59 am |
| mcp-hub | ✅ Success | May 9, 2026 at 5:59 am |
| frontend-cloud | ❌ Build Failed | May 9, 2026 at 5:59 am |
| frontend-inspector | ❌ Build Failed | May 9, 2026 at 5:59 am |
| kitchen-sink | ❌ Build Failed | May 9, 2026 at 5:58 am |

Member Author

NathanFlurry commented May 9, 2026

@claude
Contributor

claude Bot commented May 9, 2026

Code Review

Overview

This PR does two things:

  1. Bug fix: Corrects the drainGracePeriod default in the kitchen-sink serverless pool config from 15 * 60 (equal to requestLifespan) to 60. When both values are equal, sleep_until_drain = request_lifespan - drain_grace_period = 0, causing GoingAway to fire instantly on every fresh actor allocation.
  2. New tooling: Adds a loadTestCounter actor and a load-test.ts script for parallel load testing with metrics scraping.

The Bug Fix (src/index.ts)

The fix is correct and the comment explaining it is well-placed — this is exactly the kind of non-obvious invariant that CLAUDE.md calls for documenting. One minor style note: the comment uses a parenthetical (saturating) inline, which slightly conflicts with the convention of avoiding parenthetical structures. Consider:

// Must be strictly less than requestLifespan. The engine computes
// sleep_until_drain as request_lifespan - drain_grace_period and saturates at
// zero; equal values cause drain to fire the instant a request opens, so every
// fresh actor allocation immediately receives GoingAway.

scripts/load-test.ts

Positives:

  • Staggered worker start (startDelayMs) avoids thundering-herd at t=0. Good.
  • Slow-warn timer with clearTimeout in both success and failure paths is correct.
  • Using performance.now() for duration but Date.now() for the wall-clock timestamp is the right split.
  • describeError checks for structured group/code before falling back to Error.name, matching the canonical RivetError shape.

Issues / suggestions:

  1. Unbounded samples array: All CallSample objects accumulate for the full run. At PARALLELISM=10, INTERVAL_MS=1000, DURATION_MS=300000 that's ~3,000 samples — fine. But if someone runs with a short interval or large parallelism for hours, this leaks. Consider capping to the last N samples or only keeping samples needed for the summary stats (sorted durations).

  2. createClient per call: Intentional for stress-testing connection overhead, but worth a comment so readers don't assume it's a mistake.

  3. stop() is fire-and-forget on the fd: pollMetricsTarget.stop() hooks closeSync into loopPromise.finally(...) but doesn't expose a Promise, so the main function can't await the close. If the process exits before the loop's last tick() awaits fetch, the final scrape may not be flushed. Low severity for a dev tool, but consider returning a Promise<void> from stop() and awaiting it in main().

  4. Missing env-var validation: Only PARALLELISM is validated. Number("abc") evaluates to NaN, which would silently break the timing math. At minimum add guards for DURATION_MS and INTERVAL_MS:

    if (!Number.isFinite(DURATION_MS) || DURATION_MS <= 0) throw new Error("LOAD_TEST_DURATION_MS must be a positive number");
    if (!Number.isFinite(INTERVAL_MS) || INTERVAL_MS <= 0) throw new Error("LOAD_TEST_INTERVAL_MS must be a positive number");
  5. disableMetadataLookup: true is hardcoded: This is likely intentional (avoids the round-trip for pure load testing), but a comment would help. Also consider whether the test should optionally exercise the metadata path.

  6. Unused startedAtMs on the happy-path log: startedAtMs is stored in CallSample but the console log only prints durationMs. That's fine for the summary output, but it means the JSONL metrics file is the only place to correlate wall-clock timing if needed. If offline analysis matters, consider writing the samples to a file too.


src/actors/testing/load-test-counter.ts

Simple and correct. No concerns.


Summary

The drainGracePeriod fix is the meaningful change here and it's correct. The load-test script is well-structured; the issues above are all minor (unbounded array, fire-and-forget fd close, missing env validation). The comment style on the bug fix could be slightly tightened to avoid the parenthetical.

