Skip to content

refactor(metrics): split server metrics crate#5004

Merged
NathanFlurry merged 1 commit into
mainfrom
metrics/split-server-crate
May 9, 2026
Merged

refactor(metrics): split server metrics crate#5004
NathanFlurry merged 1 commit into
mainfrom
metrics/split-server-crate

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5004 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 9, 2026 at 7:33 am
website 😴 Sleeping (View Logs) Web May 9, 2026 at 3:23 am
frontend-cloud ❌ Build Failed (View Logs) Web May 9, 2026 at 3:14 am
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 3:13 am
frontend-inspector ❌ Build Failed (View Logs) Web May 9, 2026 at 3:13 am
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 3:13 am

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5004 May 9, 2026 03:12 Destroyed
Copy link
Copy Markdown
Member Author

NathanFlurry commented May 9, 2026

@NathanFlurry NathanFlurry force-pushed the metrics/split-server-crate branch from 30f3c06 to 71e98d4 Compare May 9, 2026 03:38
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5004 May 9, 2026 03:38 Destroyed
@NathanFlurry NathanFlurry force-pushed the metrics/split-server-crate branch from 71e98d4 to a2c556b Compare May 9, 2026 03:49
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5004 May 9, 2026 03:49 Destroyed
@NathanFlurry NathanFlurry force-pushed the metrics/split-server-crate branch from a2c556b to f63add8 Compare May 9, 2026 05:29
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5004 May 9, 2026 05:29 Destroyed
@NathanFlurry NathanFlurry force-pushed the metrics/split-server-crate branch from f63add8 to 71e98d4 Compare May 9, 2026 05:57
@NathanFlurry NathanFlurry marked this pull request as ready for review May 9, 2026 05:57
@NathanFlurry NathanFlurry force-pushed the metrics/split-server-crate branch from 71e98d4 to fd0ce61 Compare May 9, 2026 07:33
@NathanFlurry NathanFlurry force-pushed the rivetkit-channels/use-unbounded-channels branch from 6e295e4 to 3388aec Compare May 9, 2026 07:33
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5004 May 9, 2026 07:33 Destroyed
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: refactor(metrics): split server metrics crate

Overview

This PR splits rivet-metrics into two crates:

  • rivet-metrics: lightweight, prometheus registry plus bucket definitions plus re-exports.
  • rivet-metrics-server: heavyweight, HTTP server plus OTel providers plus tracing setup.

The actor metrics subsystem is redesigned from per-actor Registry instances to a global METRICS static with label-based separation, plus a retention window for short-lived actor label cleanup.

Positives

  • Clean separation lets rivetkit-core pull in prometheus without dragging in hyper, OTel, etc.
  • eprintln to tracing::error in OtelProviderGuard::drop is the correct fix per convention.
  • New actor_active gauge gives scrape-time visibility into active-vs-retained actors.
  • Metrics now prefixed with actor_ for clarity on the Prometheus side.
  • Retention/cleanup design via RETAINED_ACTORS is sound: avoids immediate label deletion on drop.

Concerns

  1. Action metrics appear to be missing (likely regression)

The old per-actor registry included action_call_total, action_error_total, and action_duration_seconds_sum. These are NOT in the new ActorMetricCollectors struct, and the tests exercising them (dispatch_records_prometheus_action_metrics, dispatch_records_action_error_metrics) were deleted without replacement. If those counters no longer exist, action-level observability is gone. Please confirm this is intentional or add the IntCounterVec/HistogramVec entries.

  1. High-cardinality label risk

envoy_key is now a label dimension on every metric (around 40 metrics). If envoy_key encodes per-actor-instance information such as the actor ID or socket address, this is a cardinality explosion: each unique tuple (actor_id_gen, actor_key, envoy_key) produces a distinct time series for every metric. Please document what envoy_key contains and verify it has bounded cardinality in production.

  1. cleanup_expired_actor_metrics on every retain/release

retain_sync does a full iteration of RETAINED_ACTORS on every actor creation and destruction. With thousands of short-lived actors this becomes O(actors) work on the hot actor lifecycle path. Consider amortizing cleanup via a background task or a size threshold gate.

  1. record_retained_actor_metrics on every observation

begin_user_task, end_user_task, observe_shutdown_wait, inc_state_mutation, and all SQLite commit hooks each call record_retained_actor_metrics, which acquires an entry_sync on RETAINED_ACTORS. This adds a HashMap lookup and write to every metric observation in already-busy actor paths. Consider pre-populating all known label combinations at new() instead of tracking lazily.

  1. is_missing_labels_error fragile string matching

This function detects the error case by inspecting the prometheus Msg string, which is not a stable public API. A version bump could silently break the filter, causing harmless cleanup errors to log at debug instead of being suppressed. Add a comment pinning the prometheus version these strings were verified against, or open an upstream issue for a typed variant.

  1. Missing retention window expiry test

actor_active_metric_is_retained_after_drop verifies that metrics survive the drop, but there is no test that they are eventually removed after ACTOR_METRIC_RETENTION (10 min). Consider a test using a configurable constant or Instant override to verify that cleanup_expired_actor_metrics actually removes labels.

  1. active_refs window between lock release and gauge update

In release_actor_metrics, actor_active is set to 0 outside the get_sync block, after inactive is latched. If another thread calls retain_actor_metrics between the lock release and the set(0), the gauge shows 0 for one scrape even though a new reference was just added. Moving the set(0) inside the get_sync block closes this window.

Minor

  • actor_create_state_ms and actor_create_vars_ms are inconsistent with the SI convention used elsewhere. Consider renaming to _milliseconds or documenting the unit in the description string.
  • Removed noisy comments are a correct cleanup.
  • Log message updated from using no-op collector to using existing collector is more accurate. Good fix.

Base automatically changed from rivetkit-channels/use-unbounded-channels to main May 9, 2026 07:50
@NathanFlurry NathanFlurry merged commit fd0ce61 into main May 9, 2026
11 of 18 checks passed
@NathanFlurry NathanFlurry deleted the metrics/split-server-crate branch May 9, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant