Skip to content

DO NOT MERGE: serverless restart race condition#4997

Draft
NathanFlurry wants to merge 1 commit into
rivetkit-shutdown/exit-pid1-after-drainfrom
05-07-do_not_merge_serverless_restart_race_condition
Draft

DO NOT MERGE: serverless restart race condition#4997
NathanFlurry wants to merge 1 commit into
rivetkit-shutdown/exit-pid1-after-drainfrom
05-07-do_not_merge_serverless_restart_race_condition

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 7, 2026

🚅 Deployed to the rivet-pr-4997 environment in rivet-frontend

Service Status Web Updated (UTC)
website ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
mcp-hub ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
frontend-inspector ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
kitchen-sink ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
frontend-cloud ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
ladle ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 7, 2026 09:39 Destroyed
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

Code Review\n\n> Note: This is a DRAFT / DO NOT MERGE PR.\n\n> Re-reviewed 2026-05-09 -- PR content unchanged since original review on 2026-05-07. All findings below remain applicable.\n\n---\n\n### Overview\n\nThree files added as a local reproduction and investigation harness for a post-engine-restart gateway race condition:\n\n- engine-restart-serverless.md -- investigation document: timing windows, sweep results, root-cause theory\n- engine-restart-serverless.ts -- driver harness: spawns a real engine binary, warms actors, restarts, sweeps post-restart probes\n- engine-restart-serverless-runtime.ts -- serverless runtime fixture: actor with SQLite/KV heartbeat, HTTP health endpoint, WebSocket ping/pong, commit-during-restart action\n\nThe investigation is thorough and the findings are clearly documented.\n\n---\n\n### Issues\n\n#### File placement\n\nThe investigation document engine-restart-serverless.md is committed into tests/driver/. Per CLAUDE.md, research documents go in .agent/research/ and notes go in .agent/notes/. tests/driver/ is the wrong home for a prose investigation that is not a runnable test spec.\n\n#### Inconsistent indentation in engine-restart-serverless.ts\n\nThe main() function has noticeably mixed indentation. Some blocks use one level of tabs, others jump to two or three without a corresponding nesting reason. The ServerlessRuntime.startProcess env object shows it clearly: RIVETKIT_TEST_ENDPOINT, RIVETKIT_HEARTBEAT_MODE, and RIVETKIT_TEST_HOST are each indented an extra level compared to surrounding keys.\n\n#### Unsafe as casts in the runtime fixture\n\nctx.sql as SqliteDatabase and ctx.vars as HeartbeatVars appear across multiple callbacks. Both bypass the type system. A comment explaining why the cast is necessary would help future readers.\n\n#### Module-level HEARTBEAT_MODE captured in pure helper functions\n\nheartbeatSuccessCount and heartbeatErrorCount read the module-level HEARTBEAT_MODE constant instead of accepting it as a parameter. They look like pure functions over HeartbeatStats but are not.\n\n#### Silent catch {} blocks\n\nwaitForHttpOk swallows all fetch errors silently; same for the catch {} inside openWebSocketPingPongs message handler. A one-line comment (e.g. // retry on any network error) would make the intent clear.\n\n#### Long duplicated body-truncation lines\n\nBoth runGatewayHealthDelaySweep and runGatewayWebSocketDelaySweep contain the same inline body-truncation expression. Extracting a truncate(str, limit) helper removes the duplication.\n\n#### console.error(JSON.stringify({...})) in commitDuringEngineRestart\n\nThe fixture otherwise uses logRuntimeEvent for structured JSON logs. This one error path should also use logRuntimeEvent('commitDuringEngineRestartFailed', {...}) for consistency.\n\n#### rawSqlDatabaseProvider stub is unexplained\n\nThe no-op provider at the top of the runtime fixture is used as the db: field on the actor. It is unclear whether this is intentionally a placeholder or a mistake. A comment explaining the intent would help.\n\n---\n\n### Minor notes\n\n- parseDelayList silently drops NaN values. A warning log for invalid entries would help catch typos in env var strings.\n- waitForRunnerConfigReady has a hardcoded 30-second deadline and 250 ms poll interval not exposed as named constants.\n- withTimeout and AbortSignal.timeout are both used for timeouts at different call sites. A comment noting the distinction would prevent confusion.\n\n---\n\n### Summary\n\nThe investigation is solid and the harness does what it needs to do. Main actionable items before broader use or promotion to a non-draft PR:\n\n1. Move the .md to .agent/research/\n2. Fix indentation inconsistencies in main() and ServerlessRuntime.startProcess\n3. Switch the rogue console.error to logRuntimeEvent\n4. Add a comment explaining the rawSqlDatabaseProvider stub

@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4997 May 9, 2026 05:56
@NathanFlurry NathanFlurry force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from aa455c7 to 154af5d Compare May 9, 2026 05:57
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 9, 2026 05:57 Destroyed
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4997 to runtime-options/preserve-bridge-errors May 9, 2026 05:57
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 7d0b73c to d754df5 Compare May 9, 2026 06:24
@NathanFlurry NathanFlurry force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 154af5d to 011798b Compare May 9, 2026 06:24
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 9, 2026 06:24 Destroyed
@NathanFlurry NathanFlurry changed the base branch from runtime-options/preserve-bridge-errors to graphite-base/4997 May 9, 2026 07:33
@NathanFlurry NathanFlurry reopened this May 9, 2026
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 011798b to 673ed87 Compare May 9, 2026 23:19
@MasterPtato MasterPtato force-pushed the graphite-base/4997 branch from d754df5 to 72cbc7c Compare May 9, 2026 23:19
@MasterPtato MasterPtato changed the base branch from graphite-base/4997 to main May 9, 2026 23:19
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 673ed87 to e1d5a39 Compare May 9, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant