refactor(metrics): split server metrics crate#5004
Code Review: refactor(metrics): split server metrics crate

Overview

This PR splits rivet-metrics into two crates.
The actor metrics subsystem is redesigned from per-actor Registry instances to a global METRICS static with label-based separation, plus a retention window that cleans up labels left behind by short-lived actors.

Concerns
The old per-actor registry included action_call_total, action_error_total, and action_duration_seconds_sum. These are NOT in the new ActorMetricCollectors struct, and the tests exercising them (dispatch_records_prometheus_action_metrics, dispatch_records_action_error_metrics) were deleted without replacement. If those counters no longer exist, action-level observability is gone. Please confirm this is intentional, or add the corresponding IntCounterVec/HistogramVec entries.
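If the removal was accidental, restoring the metrics would mean re-adding fields like the following to ActorMetricCollectors. This is a minimal std-only sketch (the real crate would use prometheus::IntCounterVec / HistogramVec registered on the global METRICS static; plain atomics stand in here so the example is self-contained, and `record_call` is a hypothetical helper):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Field names taken from the old per-actor registry; types are std
// stand-ins for the prometheus vectors noted in the comments.
struct ActionMetrics {
    action_call_total: AtomicU64,           // IntCounterVec in the real crate
    action_error_total: AtomicU64,          // IntCounterVec
    action_duration_seconds_sum: AtomicU64, // HistogramVec (only the sum shown)
}

impl ActionMetrics {
    fn new() -> Self {
        Self {
            action_call_total: AtomicU64::new(0),
            action_error_total: AtomicU64::new(0),
            action_duration_seconds_sum: AtomicU64::new(0),
        }
    }

    // Record one action dispatch: bump the call counter, the error
    // counter on failure, and accumulate the duration sum.
    fn record_call(&self, duration_secs: u64, errored: bool) {
        self.action_call_total.fetch_add(1, Ordering::Relaxed);
        if errored {
            self.action_error_total.fetch_add(1, Ordering::Relaxed);
        }
        self.action_duration_seconds_sum
            .fetch_add(duration_secs, Ordering::Relaxed);
    }
}
```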
envoy_key is now a label dimension on every metric (around 40 metrics). If envoy_key encodes per-actor-instance information such as the actor ID or socket address, this is a cardinality explosion: each unique tuple (actor_id_gen, actor_key, envoy_key) produces a distinct time series for every metric. Please document what envoy_key contains and verify it has bounded cardinality in production.
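To make the stakes concrete, a back-of-envelope series count (all numbers below are illustrative assumptions, not measurements from this deployment):

```rust
// Total time series = product of label cardinalities times metric count.
// With bounded labels this stays manageable; if envoy_key is
// per-actor-instance, the third factor grows with the live actor count.
fn series_count(actor_id_gens: u64, actor_keys: u64, envoy_keys: u64, metrics: u64) -> u64 {
    actor_id_gens * actor_keys * envoy_keys * metrics
}
```

For example, 2 generations x 100 actor keys x 10 envoy keys across 40 metrics is already 80,000 series; an unbounded envoy_key multiplies that by fleet size.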
retain_sync does a full iteration of RETAINED_ACTORS on every actor creation and destruction. With thousands of short-lived actors this becomes O(actors) work on the hot actor lifecycle path. Consider amortizing cleanup via a background task or a size threshold gate.
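A std-only sketch of the size-threshold gate (RetainedActors is modeled as a plain map here, and SWEEP_THRESHOLD is a hypothetical tuning constant, not from the PR):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const RETENTION: Duration = Duration::from_secs(600);
const SWEEP_THRESHOLD: usize = 4; // hypothetical gate; tune for real workloads

struct RetainedActors {
    entries: HashMap<String, Instant>,
}

impl RetainedActors {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Hot path: record activity, but only pay the O(n) sweep once the map
    // crosses the threshold, instead of on every actor create/destroy.
    fn retain(&mut self, key: &str, now: Instant) {
        self.entries.insert(key.to_string(), now);
        if self.entries.len() >= SWEEP_THRESHOLD {
            self.sweep(now);
        }
    }

    // Drop every entry whose last activity is older than the retention window.
    fn sweep(&mut self, now: Instant) {
        self.entries
            .retain(|_, seen| now.duration_since(*seen) < RETENTION);
    }
}
```

A background task could call `sweep` on an interval instead; the threshold variant avoids needing a separate task at the cost of occasional latency spikes on the inserting caller.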
begin_user_task, end_user_task, observe_shutdown_wait, inc_state_mutation, and all SQLite commit hooks each call record_retained_actor_metrics, which acquires an entry_sync on RETAINED_ACTORS. This adds a HashMap lookup and write to every metric observation in already-busy actor paths. Consider pre-populating all known label combinations at new() instead of tracking lazily.
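The pre-population idea, sketched with std atomics (ActorHotMetrics and its methods are hypothetical; in the real crate `new()` would resolve the prometheus label handles once and store them as fields):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// All label combinations are bound at construction, so the hot-path
// observation methods are plain field accesses with no map lookup or
// RETAINED_ACTORS write per call.
struct ActorHotMetrics {
    user_tasks_begun: AtomicU64,
    state_mutations: AtomicU64,
}

impl ActorHotMetrics {
    fn new(/* actor_id_gen, actor_key, envoy_key resolved here, once */) -> Self {
        Self {
            user_tasks_begun: AtomicU64::new(0),
            state_mutations: AtomicU64::new(0),
        }
    }

    fn begin_user_task(&self) {
        // No lookup: the labeled child was bound in new().
        self.user_tasks_begun.fetch_add(1, Ordering::Relaxed);
    }

    fn inc_state_mutation(&self) {
        self.state_mutations.fetch_add(1, Ordering::Relaxed);
    }
}
```

The retention timestamp would then only need touching on coarse lifecycle events (create/destroy), not on every observation.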
This function detects the error case by inspecting the prometheus Msg string, which is not a stable public API. A version bump could silently break the filter, causing harmless cleanup errors to log at debug level instead of being suppressed. Add a comment pinning the prometheus version these strings were verified against, or open an upstream issue requesting a typed error variant.
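The version-pinning comment could look like this (the constant, function name, and matched substring are all placeholders, not the real strings from the PR):

```rust
// Pin: the substring below was verified against the prometheus crate at
// the version recorded here. Any dependency bump MUST re-verify it, since
// Error::Msg text is an internal detail, not a stable API.
const CHECKED_PROMETHEUS_VERSION: &str = "<fill in verified version>";

// Returns true for cleanup errors that are expected and safe to suppress.
fn is_benign_cleanup_error(msg: &str) -> bool {
    // Hypothetical substring standing in for the real message text.
    msg.contains("duplicate")
}
```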
actor_active_metric_is_retained_after_drop verifies that metrics survive the drop, but there is no test that they are eventually removed after ACTOR_METRIC_RETENTION (10 min). Consider a test using a configurable constant or Instant override to verify that cleanup_expired_actor_metrics actually removes labels.
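One way to make that testable is to pass `now` in explicitly instead of calling Instant::now() inside the cleanup (the signature below is a hypothetical refactor of cleanup_expired_actor_metrics, with the retention store modeled as a Vec):

```rust
use std::time::{Duration, Instant};

// Taking `now` as a parameter lets a test fast-forward past
// ACTOR_METRIC_RETENTION without waiting ten minutes of wall time.
fn cleanup_expired(
    last_seen: &mut Vec<(&'static str, Instant)>,
    now: Instant,
    retention: Duration,
) {
    last_seen.retain(|(_, seen)| now.duration_since(*seen) < retention);
}
```

Production code calls it with `Instant::now()`; the test calls it with `start + retention + epsilon` and asserts the labels are gone.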
In release_actor_metrics, actor_active is set to 0 outside the get_sync block, after inactive is latched. If another thread calls retain_actor_metrics between the lock release and the set(0), the gauge reports 0 for one scrape even though a new reference was just added. Moving the set(0) inside the get_sync block closes this window.
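A minimal sketch of the suggested fix for the release_actor_metrics race, with std types standing in for the dashmap/prometheus ones (the struct and function shapes are assumptions, not the PR's exact code):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;

struct Retained {
    inactive: bool,
}

// Latch `inactive` and zero the gauge under the SAME lock, so a concurrent
// retain_actor_metrics cannot interleave between the latch and the set(0)
// and leave the gauge at 0 right after re-adding a reference.
fn release_actor_metrics(entry: &Mutex<Retained>, actor_active: &AtomicI64) {
    let mut e = entry.lock().unwrap();
    if !e.inactive {
        e.inactive = true;
        actor_active.store(0, Ordering::SeqCst); // still inside the lock
    }
}
```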