Skip to content

fix(rivetkit): preserve internal bridge errors#4993

Merged
NathanFlurry merged 1 commit into
mainfrom
runtime-options/preserve-bridge-errors
May 9, 2026
Merged

fix(rivetkit): preserve internal bridge errors#4993
NathanFlurry merged 1 commit into
mainfrom
runtime-options/preserve-bridge-errors

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 7, 2026

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

PR #4993 Review: fix(rivetkit): preserve internal bridge errors

Overview

This PR fixes a significant correctness problem: structured RivetError values thrown by actors were being lost when they crossed the NAPI bridge into Rust, causing all actor errors to appear as opaque internal_error. The fix introduces a new ActorSpecifier type that travels with errors through the whole stack, replaces Box::leak-based schema interning with a RivetErrorKind::Dynamic variant, adds client-boundary sanitization helpers, and bumps the client protocol to v4 to carry actor context back to callers.

The design is sound and the changes are well-scoped. A few items worth discussing below.

Correctness

callback_error loses the JsCallbackUnavailable type for unstructured errors

Previously, a non-bridge, non-closing NAPI error returned JsCallbackUnavailable { callback, reason }.build() — a typed, matchable RivetError. The new code returns anyhow::anyhow!(reason) which is an opaque string. Downstream code that matched on actor.callback_unavailable will stop working, and the callback field (which identified which callback failed) is lost. This looks like a regression. Consider keeping the JsCallbackUnavailable path and only removing the now-redundant pre-return logging.

is_internal_error hardcodes group names

is_internal_error checks (group == "core" || group == "rivetkit") && code == "internal_error". This can silently diverge from public_error_status_code (the authoritative public/private classifier) if a new group is added. Since is_internal_error is only used in ActionDispatchError::from_anyhow to decide whether to preserve the original anyhow message, consider deriving it from is_client_error_public instead: !is_client_error_public(group, code).

actor_specifier() silently drops context when sleep_generation() is None

actor_specifier() returns None if sleep_generation() returns None. If the actor has no sleep generation yet (early startup, misconfigured context), errors propagate without actor context and there is no indication in the log. A tracing::debug! on the None path would make silent failures visible.

Memory Leak Fix (Positive)

The removal of BRIDGE_RIVET_ERROR_SCHEMAS, leak_str, and intern_bridge_rivet_error_schema is correct and aligns with the CLAUDE.md invariant that Box::leak must not appear in per-call paths. RivetErrorKind::Dynamic cleanly avoids any leak.

Logging

A significant number of tracing::warn! / tracing::debug! callsites are removed without equivalent replacements:

  • parse_bridge_rivet_error: removes the "decoded structured bridge error" warn
  • callback_error: removes all tracing for the three callback failure cases
  • encodeNativeCallbackError (TS): removes warn log for bridge encoding

New logging is added in router.ts, client/utils.ts, sqlite.rs, and actor_connect.rs, which partially compensates. But the NAPI layer is now dark for non-bridge callback failures. If a callback dies with a raw string reason, there is no log anywhere. Consider at minimum a tracing::debug! in the anyhow::anyhow!(reason) branch.

Protocol

The v3 to v4 migration looks correct:

  • actor is optional in the BARE schema, so old clients reading v4 messages just ignore it.
  • v3_to_v4 / v4_to_v3 converters are symmetric and present.
  • PROTOCOL_VERSION bumped from 3 to 4 in both Rust and TS.

One minor note: the CBOR manual encoder computes the actor map length with if actor.key.is_some() { 3 } else { 2 }. Correct but fragile if ActorSpecifier gains another optional field later. Consistent with the existing file pattern so not blocking.

Test Coverage

Good additions:

  • preserves_internal_anyhow_message_until_client_boundary
  • masks_private_structured_message_at_client_boundary
  • framework_action_error_response_sanitizes_internal_message
  • framework_anyhow_error_response_sanitizes_internal_message
  • test_internal_error_preserves_actor_specifier
  • test_actor_specifier_extracted_from_middle_of_chain
  • Bridge roundtrip test updated with actor field

Missing coverage:

  • No test for the callback_error path when the NAPI error is neither bridge-structured nor closing. The old JsCallbackUnavailable behavior had no test and the new anyhow::anyhow! path has no test.
  • No test verifying that attach_actor_to_error + RivetError::extract round-trips the actor specifier through the full ActorTask dispatch path end-to-end.

Minor

  • ActionDispatchError::into_anyhow sets both kind.default_message and message to the same sanitized string. Since RivetError::message() prefers self.message when Some, the default_message is unreachable. Not a bug, just dead state.
  • deconstructError signature change (removing logger and extraLog) is a good cleanup; callers that need logging now do it themselves with more context.

Summary

The core design is correct and the memory-leak fix is a clear improvement. The main risk is the callback_error change losing the JsCallbackUnavailable typed error, which may break downstream error matching. The is_internal_error classifier should derive from is_client_error_public to avoid divergence. Logging gaps in the NAPI layer for non-bridge callback failures are worth addressing before merging.

@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 5613480 to 814b9ab Compare May 7, 2026 04:26
@NathanFlurry NathanFlurry force-pushed the 05-06-fix_rivetkit_fix_default_spawn_engine_behavior branch from 7685070 to 077bd4e Compare May 7, 2026 04:26
@NathanFlurry NathanFlurry changed the base branch from 05-06-fix_rivetkit_fix_default_spawn_engine_behavior to graphite-base/4993 May 7, 2026 04:27
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 814b9ab to 811f9c5 Compare May 7, 2026 04:28
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4993 to runtime-options/remove-deprecated-actor-options May 7, 2026 04:28
@NathanFlurry NathanFlurry changed the base branch from runtime-options/remove-deprecated-actor-options to graphite-base/4993 May 7, 2026 04:34
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 811f9c5 to b24f3be Compare May 7, 2026 04:34
@NathanFlurry NathanFlurry force-pushed the graphite-base/4993 branch from eb0a964 to 395aa83 Compare May 7, 2026 04:34
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4993 to main May 7, 2026 04:34
@NathanFlurry NathanFlurry marked this pull request as ready for review May 7, 2026 04:34
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch 2 times, most recently from f5ede57 to ac0098e Compare May 8, 2026 20:40
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4993 May 9, 2026 05:56
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from ac0098e to 7d0b73c Compare May 9, 2026 05:57
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4993 to sqlite/quiet-startup-db-miss May 9, 2026 05:57
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 7d0b73c to d754df5 Compare May 9, 2026 06:24
@NathanFlurry NathanFlurry force-pushed the sqlite/quiet-startup-db-miss branch from e8f7214 to 4e070ea Compare May 9, 2026 06:24
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

PR Review: fix(rivetkit): preserve internal bridge errors

Overview

This is a substantial, well-structured PR that improves error handling across the full Rust/TypeScript stack. The main changes are:

  1. RivetErrorKind enum — replaces the schema: &'static RivetErrorSchema field with a Static variant (existing behavior) and a Dynamic variant (owned strings), eliminating the Box::leak-based BRIDGE_RIVET_ERROR_SCHEMAS intern cache.
  2. ActorSpecifier — new struct carried through anyhow chains and serialized to wire/response, giving clients and logs the actor identity on every error.
  3. Error sanitization boundaryclient_error_message / client_error_metadata functions enforce that internal messages never reach the wire, while ActionDispatchError.message now holds the raw anyhow text for internal logging.
  4. Client protocol v4 — adds actor: optional<ActorSpecifier> to Error and HttpResponseError.
  5. HTTP response headersx-rivet-actor, x-rivet-actor-generation, x-rivet-actor-key on every actor response.
  6. TypeScript parityActorSpecifier type, deconstructError signature cleanup, BARE v4 codec.

Issues

1. Redundant attach_actor_response_headers in action/queue error paths

handle_user_request_fetch converts its errors into Ok(HttpResponse) via framework_anyhow_error_response_with_actormessage_boundary_error_response_with_actor, which already calls attach_actor_response_headers. The outer handle_fetch then unconditionally calls attach_actor_response_headers again on the returned Ok response:

response.map(|mut response| {
    attach_actor_response_headers(&mut response, &actor);
    response
})

Because the header map uses insert (overwrite), this is harmless — the values are identical — but it is unnecessary work on every error response from action/queue handlers. Consider skipping the outer attach_actor_response_headers when the response already has actor headers set, or only adding them at one layer.


2. Incomplete CRLF injection check in attach_actor_response_headers

fn attach_actor_response_headers(response: &mut HttpResponse, actor: &ActorSpecifier) {
    response.headers.insert(HEADER_RIVET_ACTOR.to_owned(), actor.actor_id.clone()); // no check
    ...
    if !key.contains('\r') && !key.contains('\n') { // only key is guarded
        response.headers.insert(HEADER_RIVET_ACTOR_KEY.to_owned(), key.clone());
    }
}

actor_id is inserted without a CRLF guard. Actor IDs are UUIDs today (safe), but the field type is String and the same check applied to key should also apply to actor_id for defense-in-depth. The same gap exists in the TypeScript actorResponseHeaders function.


3. impl_versioned_v3_only macro name is now misleading

The macro impl_versioned_v3_only! now generates V3 and V4 variants and v3_to_v4/v4_to_v3 converters. The name no longer matches its behavior. Consider renaming to impl_versioned_v3_v4_only! or a more generic impl_versioned_from_v3!.


4. Logging removed from callback_error in NAPI (observability regression)

The PR strips several structured log lines from callback_error:

// Removed:
"decoded structured bridge error"
"napi callback failed with structured bridge error"
"napi callback closed without structured bridge error prefix"
"napi callback failed without structured bridge error prefix"

These were the only log points covering JavaScript→Rust callback failures without a structured bridge error. Without them, untagged JS errors (non-RivetError throws) that reach callback_error are now completely silent, making production debugging harder. At minimum the Status::Closing path and the unstructured-error fallback path should retain a tracing::debug!.


5. deconstructError logging gaps on TypeScript side

deconstructError previously logged at every branch. The PR moves logging to call sites, but not all call sites log. For example, in engine/actor-driver.ts:

const { group, code } = deconstructError(error, false);
logger().error({ msg: "failed to bind dynamic hibernatable websocket", ... });

The error details (message, metadata, actor) are not included in the log. Other call sites may have similar gaps.


Observations / Non-blocking

  • ActionDispatchError.message semantics change — the field now holds the raw anyhow message for internal errors ("plain failure") rather than the RivetError default message. This is intentional and the tests are clear, but a doc comment on the field explaining the client_message() vs message split would help future readers.

  • ActorSpecifier implementing std::error::Error — valid and idiomatic for anyhow context, but the Display output ("actor <id> generation <n>") looks like a context annotation rather than an error description. This is fine as anyhow context, just slightly unusual.

  • Version converter chain — the v3→v4 and v4→v3 converters correctly enumerate all fields rather than using the removed transcode_version helper, in compliance with the new CLAUDE.md rule. Good.

  • Memory leak fix — replacing BRIDGE_RIVET_ERROR_SCHEMAS / Box::leak with RivetErrorKind::Dynamic is clean and fixes the CLAUDE.md violation. The CLAUDE.md has been updated accordingly.

  • Test coverage — new tests cover the error sanitization boundary, actor specifier extraction from anyhow chains, serialization, and HTTP header injection. Good coverage of the new behavior.


Summary

The core design is sound and the changes are well-motivated. The two items worth fixing before merge are the CRLF injection gap for actor_id (security) and the NAPI callback logging removal (observability). The redundant header insertion and the macro naming are low-priority cleanup items.

@NathanFlurry NathanFlurry force-pushed the sqlite/quiet-startup-db-miss branch from 4e070ea to 16c6a31 Compare May 9, 2026 07:33
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from d754df5 to 0889a3a Compare May 9, 2026 07:33
Base automatically changed from sqlite/quiet-startup-db-miss to main May 9, 2026 07:50
@NathanFlurry NathanFlurry merged commit 0889a3a into main May 9, 2026
9 of 23 checks passed
@NathanFlurry NathanFlurry deleted the runtime-options/preserve-bridge-errors branch May 9, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant