Skip to content

fix(rivetkit): require engine ping for health#5022

Draft
NathanFlurry wants to merge 1 commit into
counter-latency/current-actor-metricsfrom
counter-latency/require-engine-ping-health
Draft

fix(rivetkit): require engine ping for health#5022
NathanFlurry wants to merge 1 commit into
counter-latency/current-actor-metricsfrom
counter-latency/require-engine-ping-health

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

Review: fix(rivetkit): require engine ping for health

Overview

This PR tightens the health check semantics for the envoy client: instead of initializing last_ping_ts to the current time (making a fresh envoy appear healthy before the engine ever pings it), the field is now initialized to 0 and is_ping_healthy() returns false until a real ping is received. This prevents a freshly-started or never-connected envoy from falsely reporting healthy to container orchestrators.

The core idea is correct and the bug it fixes is real.


Issues

Inconsistency between native Rust and NAPI health paths (potential bug)

The serverless.rs change and the registry.rs change handle the same scenario - no active envoy in serverless mode - with opposite results:

  • serverless.rs (CoreServerlessRuntime native path): .unwrap_or(true) - no active envoy returns 200 OK
  • registry.rs (NAPI path, RegistryState::Serverless): None => health_response(503, "engine_ping_stale", &version) - no active envoy returns 503

In serverless mode, no active envoy typically means the runtime is idle - no serverless invocation has arrived yet, so no envoy has been created. This is a valid healthy state, distinct from "envoy exists but engine never pinged." The engine_ping_stale status message is also semantically incorrect here; it implies an envoy was started but missed pings, when really no envoy exists at all. A TypeScript-based serverless deployment would start returning 503 on health probes immediately and stay there until its first invocation arrives, which could cause premature container recycling.

The registry.rs None branch should keep health_response(200, "ok", &version) for the serverless-idle case, matching the native path. The ping-sentinel fix to is_ping_healthy() already covers the "envoy started but no ping received" scenario for the Some(envoy) branch.


Positives

  • Correct sentinel choice. Using 0 for "no ping received" is safe: epoch 0 is always outside the 20-second threshold, and the field comment documents the invariant clearly.
  • Test updated correctly. The serverless.rs test now asserts 503 on initial health check, directly validating the new startup behavior.
  • Logic simplification. Renaming envoy_unhealthy to runtime_healthy and flipping branches in serverless.rs makes the intent clearer.
  • Three test fixtures updated. actor.rs and events.rs test contexts both initialize last_ping_ts: 0, keeping fixtures consistent with production construction.

Minor notes

  • The is_ping_healthy() early return for last == 0 could use a named constant (e.g. PING_TS_UNSET) to make the sentinel more self-documenting, though the field-level doc comment does cover it.
  • No test coverage for the registry.rs NAPI health path. Given this is where the inconsistency lives, a test asserting the serverless-idle (no envoy) case returns 200 would prevent future regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant