
fix(kitchen-sink): correct broken serverless drainGracePeriod default #5006

Open
NathanFlurry wants to merge 1 commit into graphite-base/5006 from kitchen-sink/fix-drain-grace-period

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

drainGracePeriod and requestLifespan both defaulted to 15*60 seconds.
pegboard-outbound computes sleep_until_drain as
request_lifespan.saturating_sub(drain_grace_period), so when both are
equal, drain fires the instant the outbound HTTP request opens. Every
fresh actor allocation then receives GoingAway before reaching Running,
producing guard.actor_ready_timeout failures on default config.

Set drainGracePeriod default to 60s, giving the outbound a 14-minute
working window before drain. Add comment documenting the constraint
that drainGracePeriod must be strictly less than requestLifespan.

Also add an increment-only load-test harness (scripts/load-test.ts +
load-test-counter actor) used to verify the fix. With the fix, a 10-worker,
5-minute run achieves 99.96% success vs 0% on bare defaults.
@railway-app

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5006 environment in rivet-frontend

| Service | Status | Updated (UTC) |
| --- | --- | --- |
| website | 😴 Sleeping | May 9, 2026 at 6:10 am |
| ladle | ❌ Build Failed | May 9, 2026 at 5:59 am |
| mcp-hub | ✅ Success | May 9, 2026 at 5:59 am |
| frontend-cloud | ❌ Build Failed | May 9, 2026 at 5:59 am |
| frontend-inspector | ❌ Build Failed | May 9, 2026 at 5:59 am |
| kitchen-sink | ❌ Build Failed | May 9, 2026 at 5:58 am |

Member Author

NathanFlurry commented May 9, 2026

@claude
Contributor

claude Bot commented May 9, 2026

Code Review

Overview

This PR does two things:

  1. Bug fix: Corrects the drainGracePeriod default in the kitchen-sink serverless pool config from 15 * 60 (equal to requestLifespan) to 60. When both values are equal, sleep_until_drain = request_lifespan - drain_grace_period = 0, causing GoingAway to fire instantly on every fresh actor allocation.
  2. New tooling: Adds a loadTestCounter actor and a load-test.ts script for parallel load testing with metrics scraping.

The Bug Fix (src/index.ts)

The fix is correct and the comment explaining it is well-placed — this is exactly the kind of non-obvious invariant that CLAUDE.md calls for documenting. One minor style note: the comment uses a parenthetical (saturating) inline, which slightly conflicts with the convention of avoiding parenthetical structures. Consider:

// Must be strictly less than requestLifespan. The engine computes
// sleep_until_drain as request_lifespan - drain_grace_period and saturates at
// zero; equal values cause drain to fire the instant a request opens, so every
// fresh actor allocation immediately receives GoingAway.

scripts/load-test.ts

Positives:

  • Staggered worker start (startDelayMs) avoids thundering-herd at t=0. Good.
  • Slow-warn timer with clearTimeout in both success and failure paths is correct.
  • Using performance.now() for duration but Date.now() for the wall-clock timestamp is the right split.
  • describeError checks for structured group/code before falling back to Error.name, matching the canonical RivetError shape.

Issues / suggestions:

  1. Unbounded samples array: All CallSample objects accumulate for the full run. At PARALLELISM=10, INTERVAL_MS=1000, DURATION_MS=300000 that's ~3,000 samples — fine. But if someone runs with a short interval or large parallelism for hours, this leaks. Consider capping to the last N samples or only keeping samples needed for the summary stats (sorted durations).

  2. createClient per call: Intentional for stress-testing connection overhead, but worth a comment so readers don't assume it's a mistake.

  3. stop() is fire-and-forget on the fd: pollMetricsTarget.stop() hooks closeSync into loopPromise.finally(...) but doesn't expose a Promise, so the main function can't await the close. If the process exits before the loop's last tick() awaits fetch, the final scrape may not be flushed. Low severity for a dev tool, but consider returning a Promise<void> from stop() and awaiting it in main().

  4. Missing env-var validation: Only PARALLELISM is validated. Number("abc") evaluates to NaN, which would silently break the timing math. At minimum add guards for DURATION_MS and INTERVAL_MS:

    if (!Number.isFinite(DURATION_MS) || DURATION_MS <= 0) throw new Error("LOAD_TEST_DURATION_MS must be a positive number");
    if (!Number.isFinite(INTERVAL_MS) || INTERVAL_MS <= 0) throw new Error("LOAD_TEST_INTERVAL_MS must be a positive number");
  5. disableMetadataLookup: true is hardcoded: This is likely intentional (avoids the round-trip for pure load testing), but a comment would help. Also consider whether the test should optionally exercise the metadata path.

  6. Unused startedAtMs on the happy-path log: startedAtMs is stored in CallSample but the console log only prints durationMs. That's fine for the summary output, but it means the JSONL metrics file is the only place to correlate wall-clock timing if needed. If offline analysis matters, consider writing the samples to a file too.


src/actors/testing/load-test-counter.ts

Simple and correct. No concerns.


Summary

The drainGracePeriod fix is the meaningful change here and it's correct. The load-test script is well-structured; the issues above are all minor (unbounded array, fire-and-forget fd close, missing env validation). The comment style on the bug fix could be slightly tightened to avoid the parenthetical.

