refactor(metrics): split server metrics crate#5004
Code Review: refactor(metrics): split server metrics crate

Overview

This PR splits rivet-metrics into two crates.
The actor metrics subsystem is redesigned from per-actor Registry instances to a global METRICS static with label-based separation, plus a retention window that cleans up labels left behind by short-lived actors.

Concerns
The old per-actor registry included action_call_total, action_error_total, and action_duration_seconds_sum. These are NOT in the new ActorMetricCollectors struct, and the tests exercising them (dispatch_records_prometheus_action_metrics, dispatch_records_action_error_metrics) were deleted without replacement. If those counters no longer exist, action-level observability is gone. Please confirm this is intentional, or add the corresponding IntCounterVec/HistogramVec entries.
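If the removal was accidental, restoring the metrics would mean re-adding fields like the following to ActorMetricCollectors. This is a minimal std-only sketch (the real crate would use prometheus::IntCounterVec / HistogramVec registered on the global METRICS static; plain atomics stand in here so the example is self-contained, and `record_call` is a hypothetical helper):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Field names taken from the old per-actor registry; types are std
// stand-ins for the prometheus vectors noted in the comments.
struct ActionMetrics {
    action_call_total: AtomicU64,           // IntCounterVec in the real crate
    action_error_total: AtomicU64,          // IntCounterVec
    action_duration_seconds_sum: AtomicU64, // HistogramVec (only the sum shown)
}

impl ActionMetrics {
    fn new() -> Self {
        Self {
            action_call_total: AtomicU64::new(0),
            action_error_total: AtomicU64::new(0),
            action_duration_seconds_sum: AtomicU64::new(0),
        }
    }

    // Record one action dispatch: bump the call counter, the error
    // counter on failure, and accumulate the duration sum.
    fn record_call(&self, duration_secs: u64, errored: bool) {
        self.action_call_total.fetch_add(1, Ordering::Relaxed);
        if errored {
            self.action_error_total.fetch_add(1, Ordering::Relaxed);
        }
        self.action_duration_seconds_sum
            .fetch_add(duration_secs, Ordering::Relaxed);
    }
}
```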
envoy_key is now a label dimension on every metric (around 40 metrics). If envoy_key encodes per-actor-instance information such as the actor ID or socket address, this is a cardinality explosion: each unique tuple (actor_id_gen, actor_key, envoy_key) produces a distinct time series for every metric. Please document what envoy_key contains and verify it has bounded cardinality in production.
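To make the stakes concrete, a back-of-envelope series count (all numbers below are illustrative assumptions, not measurements from this deployment):

```rust
// Total time series = product of label cardinalities times metric count.
// With bounded labels this stays manageable; if envoy_key is
// per-actor-instance, the third factor grows with the live actor count.
fn series_count(actor_id_gens: u64, actor_keys: u64, envoy_keys: u64, metrics: u64) -> u64 {
    actor_id_gens * actor_keys * envoy_keys * metrics
}
```

For example, 2 generations x 100 actor keys x 10 envoy keys across 40 metrics is already 80,000 series; an unbounded envoy_key multiplies that by fleet size.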
retain_sync does a full iteration of RETAINED_ACTORS on every actor creation and destruction. With thousands of short-lived actors this becomes O(actors) work on the hot actor lifecycle path. Consider amortizing cleanup via a background task or a size threshold gate.
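A std-only sketch of the size-threshold gate (RetainedActors is modeled as a plain map here, and SWEEP_THRESHOLD is a hypothetical tuning constant, not from the PR):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const RETENTION: Duration = Duration::from_secs(600);
const SWEEP_THRESHOLD: usize = 4; // hypothetical gate; tune for real workloads

struct RetainedActors {
    entries: HashMap<String, Instant>,
}

impl RetainedActors {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Hot path: record activity, but only pay the O(n) sweep once the map
    // crosses the threshold, instead of on every actor create/destroy.
    fn retain(&mut self, key: &str, now: Instant) {
        self.entries.insert(key.to_string(), now);
        if self.entries.len() >= SWEEP_THRESHOLD {
            self.sweep(now);
        }
    }

    // Drop every entry whose last activity is older than the retention window.
    fn sweep(&mut self, now: Instant) {
        self.entries
            .retain(|_, seen| now.duration_since(*seen) < RETENTION);
    }
}
```

A background task could call `sweep` on an interval instead; the threshold variant avoids needing a separate task at the cost of occasional latency spikes on the inserting caller.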
begin_user_task, end_user_task, observe_shutdown_wait, inc_state_mutation, and all SQLite commit hooks each call record_retained_actor_metrics, which acquires an entry_sync on RETAINED_ACTORS. This adds a HashMap lookup and write to every metric observation in already-busy actor paths. Consider pre-populating all known label combinations at new() instead of tracking lazily.
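The pre-population idea, sketched with std atomics (ActorHotMetrics and its methods are hypothetical; in the real crate `new()` would resolve the prometheus label handles once and store them as fields):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// All label combinations are bound at construction, so the hot-path
// observation methods are plain field accesses with no map lookup or
// RETAINED_ACTORS write per call.
struct ActorHotMetrics {
    user_tasks_begun: AtomicU64,
    state_mutations: AtomicU64,
}

impl ActorHotMetrics {
    fn new(/* actor_id_gen, actor_key, envoy_key resolved here, once */) -> Self {
        Self {
            user_tasks_begun: AtomicU64::new(0),
            state_mutations: AtomicU64::new(0),
        }
    }

    fn begin_user_task(&self) {
        // No lookup: the labeled child was bound in new().
        self.user_tasks_begun.fetch_add(1, Ordering::Relaxed);
    }

    fn inc_state_mutation(&self) {
        self.state_mutations.fetch_add(1, Ordering::Relaxed);
    }
}
```

The retention timestamp would then only need touching on coarse lifecycle events (create/destroy), not on every observation.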
This function detects the error case by inspecting the prometheus Msg string, which is not a stable public API. A version bump could silently break the filter, causing harmless cleanup errors to log at debug level instead of being suppressed. Add a comment pinning the prometheus version these strings were verified against, or open an upstream issue requesting a typed error variant.
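The version-pinning comment could look like this (the constant, function name, and matched substring are all placeholders, not the real strings from the PR):

```rust
// Pin: the substring below was verified against the prometheus crate at
// the version recorded here. Any dependency bump MUST re-verify it, since
// Error::Msg text is an internal detail, not a stable API.
const CHECKED_PROMETHEUS_VERSION: &str = "<fill in verified version>";

// Returns true for cleanup errors that are expected and safe to suppress.
fn is_benign_cleanup_error(msg: &str) -> bool {
    // Hypothetical substring standing in for the real message text.
    msg.contains("duplicate")
}
```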
actor_active_metric_is_retained_after_drop verifies that metrics survive the drop, but there is no test that they are eventually removed after ACTOR_METRIC_RETENTION (10 min). Consider a test using a configurable constant or Instant override to verify that cleanup_expired_actor_metrics actually removes labels.
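One way to make that testable is to pass `now` in explicitly instead of calling Instant::now() inside the cleanup (the signature below is a hypothetical refactor of cleanup_expired_actor_metrics, with the retention store modeled as a Vec):

```rust
use std::time::{Duration, Instant};

// Taking `now` as a parameter lets a test fast-forward past
// ACTOR_METRIC_RETENTION without waiting ten minutes of wall time.
fn cleanup_expired(
    last_seen: &mut Vec<(&'static str, Instant)>,
    now: Instant,
    retention: Duration,
) {
    last_seen.retain(|(_, seen)| now.duration_since(*seen) < retention);
}
```

Production code calls it with `Instant::now()`; the test calls it with `start + retention + epsilon` and asserts the labels are gone.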
In release_actor_metrics, actor_active is set to 0 outside the get_sync block, after inactive is latched. If another thread calls retain_actor_metrics between the lock release and the set(0), the gauge reports 0 for one scrape even though a new reference was just added. Moving the set(0) inside the get_sync block closes this window.
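A minimal sketch of the suggested fix for the release_actor_metrics race, with std types standing in for the dashmap/prometheus ones (the struct and function shapes are assumptions, not the PR's exact code):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;

struct Retained {
    inactive: bool,
}

// Latch `inactive` and zero the gauge under the SAME lock, so a concurrent
// retain_actor_metrics cannot interleave between the latch and the set(0)
// and leave the gauge at 0 right after re-adding a reference.
fn release_actor_metrics(entry: &Mutex<Retained>, actor_active: &AtomicI64) {
    let mut e = entry.lock().unwrap();
    if !e.inactive {
        e.inactive = true;
        actor_active.store(0, Ordering::SeqCst); // still inside the lock
    }
}
```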