feat(rivetkit): add current actor metrics by NathanFlurry · Pull Request #5020 · rivet-dev/rivet

NathanFlurry · 2026-05-11T13:45:35Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NathanFlurry · 2026-05-11T13:46:01Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-05-11T13:54:11Z

Review: feat(rivetkit): add current actor metrics

This is a draft PR -- feedback is intended to catch issues before finalization.

Overview

This PR graduates several no-op actor metric methods into real implementations, adding gauges and counters for queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake counts, shutdown drain counts, and SQLite worker state. The core mechanism is an aggregated gauge pattern: each ActorMetrics instance tracks its current contribution in a per-instance Mutex<ActorMetricState>, computes a signed delta, and applies it to the process-global IntGaugeVec. On drop, clear_aggregated_gauges zeroes every contribution, preventing stale label series from accumulating.

The test in metrics.rs covers the multi-instance aggregation and the drop-cleanup path. The kitchen-sink example cleanly removes the now-deleted registry.diagnostics call.

Issues

1. `user_tasks_active` gauge can go negative (bug)

In end_user_task, the decrement uses i64::saturating_sub(1), which produces -1 when the current stored value is 0:

let next = (*current).saturating_sub(1);  // 0i64.saturating_sub(1) == -1

If end_user_task is ever called without a prior begin_user_task for that kind (e.g. from a panicking task), the gauge goes negative. Fix:

let next = (*current).saturating_sub(1).max(0);

The same defensiveness applies to any gauge that decrements.

2. New strong-reference cycles in change callbacks

AsyncCounter::change_callbacks stores Arc<dyn Fn()> (strong references). The three new callbacks registered in configure_sleep_hooks each capture a cloned ActorContext (Arc<ActorContextInner>), and the counters live inside ActorContextInner.sleep.work. This creates a cycle:

Arc<ActorContextInner>
  -> sleep.work.keep_awake (AsyncCounter)
    -> change_callbacks: Vec<Arc<closure>>
      -> keep_awake_ctx: Arc<ActorContextInner>  <- back to start

The http_request_counter callback already does this (pre-existing), but the PR adds three more cycles. Worth confirming there is a teardown path that clears the callbacks before the ActorContextInner refcount reaches zero, otherwise actors will leak.

3. `set_aggregated_gauge` delta arithmetic

let delta = next.saturating_sub(*current);

For i64 in realistic metric ranges this equals plain subtraction, but saturating_sub on a signed type saturates at i64::MIN (not zero), which is surprising for a delta computation. Plain next - *current is more readable and correct for all in-range values.

Minor observations

WORK_LABELS is defined as a module-level constant but used only for keep_awake_active. Either inline it or leave it as-is.
set_worker_active maps bool to 1/0: aggregates correctly across actors (sum = active worker count per actor name), but a short comment near the metric definition would clarify the intent.
usize_to_i64 saturation at i64::MAX: a debug_assert that the value fits would catch regressions in tests without production overhead.
The examples/kitchen-sink removal is clean and correct.

What is good

clear_aggregated_gauges on drop correctly prevents the Prometheus registry from accumulating stale label-value series for stopped actors.
The test exercises concurrent instances, intermediate updates, and full drop cleanup -- good coverage for the aggregation invariant.
Parking lot mutex usage is justified: guards are never held across .await points, and Drop (sync context) requires a sync lock.
set_inbox_depth string dispatch with _ => unreachable!() is acceptable per project conventions (open string space, not an enum).

Summary

The aggregated-gauge design is sound and the test coverage is solid. The two actionable issues before merge are: (1) the potential negative gauge from end_user_task decrement, and (2) confirming the new change-callback cycles have a teardown path. The delta arithmetic nit is low-priority but worth a one-line fix for clarity.

claude · 2026-05-11T13:54:57Z

PR Review: feat(rivetkit): add current actor metrics

Overview

This PR adds real-time gauge metrics for per-actor state (queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake work, SQLite worker activity). The core design uses a delta-based aggregation pattern via set_aggregated_gauge, which correctly sums contributions from multiple actor instances sharing the same actor_name label in a single Prometheus gauge. Cleanup on Drop via clear_aggregated_gauges prevents gauge accumulation across actor lifetimes.

Correctness

begin_user_task / end_user_task balance

begin_user_task was previously a no-op; it is now meaningful. Before this PR, callers could call end_user_task without a matching begin_user_task without consequence. Now, an unmatched end_user_task will call saturating_sub(1) on a count of 0 — the saturating prevents a negative gauge, but the underlying state in ActorMetricState::user_tasks_active will end up at 0 instead of -1, silently hiding the imbalance. It would be worth auditing all call sites to confirm the calls are balanced.

set_worker_active as a boolean gauge

sqlite_workers_active is set to 0/1 based on a bool. The set_aggregated_gauge delta approach works correctly here (calling set_worker_active(true) twice in a row is idempotent because delta == 0). But the metric name implies a count, so it could be misleading if future code allows multiple workers per actor. Consider a doc comment or gauge description clarifying the 0/1 semantics.

Gauge-vs-counter ordering in Drop

clear_aggregated_gauges (which zeros all state gauges) is called before actor_active_count.dec() and actor_stopped_total.inc(). This means there is a brief window where Prometheus sees all gauges zeroed but the actor still counted as active. This is a minor observability race and not a correctness bug, but worth noting.

Code Quality

Variable naming in configure_sleep_hooks (context.rs)

The function now has both internal_keep_awake_metric_ctx and internal_keep_awake_ctx in scope. The names are technically distinct, but a reader scanning this function will need to keep both in mind. Consider naming the metrics one keep_awake_metrics_ctx or similar to clearly differentiate purpose at first glance.

String-keyed match in set_inbox_depth

fn set_inbox_depth(&self, inbox: &'static str, depth: usize) {
    let current = match inbox {
        "lifecycle" => &mut state.lifecycle_inbox_depth,
        "dispatch" => &mut state.dispatch_inbox_depth,
        "lifecycle_event" => &mut state.lifecycle_event_inbox_depth,
        _ => unreachable!("unknown inbox metric label"),
    };

This is internal-only and the unreachable! guard is acceptable per project conventions. However, since the call sites are all in this same file, three concrete methods mirroring how set_keep_awake_active / set_internal_keep_awake_active are separate would eliminate the match entirely and be more idiomatic. Not a blocker.

keep_awake_active uses WORK_LABELS (two-label gauge), shutdown_tasks_active uses ACTOR_LABELS (one-label gauge)

This asymmetry means querying "total keep-awake work" for an actor requires summing over both kind values. The split is intentional (keep_awake vs. internal_keep_awake are distinct concepts), but it should be documented somewhere so dashboard authors know to sum.

Performance

The set_aggregated_gauge helper acquires a parking_lot::Mutex on every metric update, even for no-ops (delta == 0 skips the gauge write but still locks). For high-throughput actors updating connection counts or inbox depths on every message, this adds a sync lock acquisition per event. Since parking_lot uncontended fast path is ~5 ns, this is unlikely to be a bottleneck, but worth being aware of.

Tests

The new actor_current_gauges_aggregate_by_actor_name test covers the core invariant (two actors with the same name, gauges aggregate; drop one, value adjusts; drop both, value zeroes). This is exactly the right test to have.
Tests share a global METRICS registry via LazyLock. The unique actor name "counter-gauge-aggregate" avoids collision with other tests, but running the test suite in parallel (Rust default) could have other tests accidentally share actor names if naming is not kept unique. This is pre-existing, not introduced here.
No test for begin_user_task / end_user_task imbalance behavior (see correctness note above).

Minor Notes

u64_to_i64 is gated #[cfg(feature = "sqlite-local")] since it is only used there — good.
actor_started_total and actor_stopped_total are clean additions; the start/stop counter pair makes it easy to derive actor churn rate in dashboards.
Removal of the registry.diagnostics() duck-type check from kitchen-sink/server.ts is a good cleanup, replaced by the properly typed registry.routes.* endpoints.

Summary

The aggregation design is sound and the Drop-based cleanup is the right approach. The main thing to verify before merging is that all existing end_user_task call sites have a corresponding begin_user_task call — if there are any paths that only call end_user_task (e.g., fast-fail exits that skip begin), the user task gauge will be silently incorrect.

NathanFlurry mentioned this pull request May 11, 2026

feat(rivetkit): expose low-cardinality metrics routes #5019

Draft

11 tasks

NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from c6e4baf to 93e7a3b Compare May 11, 2026 13:46

feat(rivetkit): add current actor metrics

6d50b18

NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from 93e7a3b to 6d50b18 Compare May 11, 2026 13:54

NathanFlurry mentioned this pull request May 11, 2026

fix(rivetkit): require engine ping for health #5022

Draft

11 tasks

abcxff mentioned this pull request May 11, 2026

temp: patch wasm-pack to point to latest binary #5037

Draft

11 tasks

MasterPtato mentioned this pull request May 12, 2026

fix(envoy-client): dont abort drain after 20s #5043

Draft

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rivetkit): add current actor metrics#5020

feat(rivetkit): add current actor metrics#5020
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics

NathanFlurry commented May 11, 2026

Uh oh!

NathanFlurry commented May 11, 2026 •

edited

Loading

Uh oh!

claude Bot commented May 11, 2026

Uh oh!

claude Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented May 11, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

NathanFlurry commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented May 11, 2026

Overview

Issues

1. user_tasks_active gauge can go negative (bug)

2. New strong-reference cycles in change callbacks

3. set_aggregated_gauge delta arithmetic

Minor observations

What is good

Summary

Uh oh!

claude Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: feat(rivetkit): add current actor metrics

Overview

Correctness

Code Quality

Performance

Tests

Minor Notes

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NathanFlurry commented May 11, 2026 •

edited

Loading

1. `user_tasks_active` gauge can go negative (bug)

3. `set_aggregated_gauge` delta arithmetic

claude Bot commented May 11, 2026 •

edited

Loading