Skip to content

feat(rivetkit): add current actor metrics#5020

Draft
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics
Draft

feat(rivetkit): add current actor metrics#5020
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@NathanFlurry NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from c6e4baf to 93e7a3b Compare May 11, 2026 13:46
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

Review: feat(rivetkit): add current actor metrics

This is a draft PR -- feedback is intended to catch issues before finalization.

Overview

This PR graduates several no-op actor metric methods into real implementations, adding gauges and counters for queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake counts, shutdown drain counts, and SQLite worker state. The core mechanism is an aggregated gauge pattern: each ActorMetrics instance tracks its current contribution in a per-instance Mutex<ActorMetricState>, computes a signed delta, and applies it to the process-global IntGaugeVec. On drop, clear_aggregated_gauges zeroes every contribution, preventing stale label series from accumulating.

The test in metrics.rs covers the multi-instance aggregation and the drop-cleanup path. The kitchen-sink example cleanly removes the now-deleted registry.diagnostics call.


Issues

1. user_tasks_active gauge can go negative (bug)

In end_user_task, the decrement uses i64::saturating_sub(1), which produces -1 when the current stored value is 0:

let next = (*current).saturating_sub(1);  // 0i64.saturating_sub(1) == -1

If end_user_task is ever called without a prior begin_user_task for that kind (e.g. from a panicking task), the gauge goes negative. Fix:

let next = (*current).saturating_sub(1).max(0);

The same defensiveness applies to any gauge that decrements.

2. New strong-reference cycles in change callbacks

AsyncCounter::change_callbacks stores Arc<dyn Fn()> (strong references). The three new callbacks registered in configure_sleep_hooks each capture a cloned ActorContext (Arc<ActorContextInner>), and the counters live inside ActorContextInner.sleep.work. This creates a cycle:

Arc<ActorContextInner>
  -> sleep.work.keep_awake (AsyncCounter)
    -> change_callbacks: Vec<Arc<closure>>
      -> keep_awake_ctx: Arc<ActorContextInner>  <- back to start

The http_request_counter callback already does this (pre-existing), but the PR adds three more cycles. Worth confirming there is a teardown path that clears the callbacks before the ActorContextInner refcount reaches zero, otherwise actors will leak.

3. set_aggregated_gauge delta arithmetic

let delta = next.saturating_sub(*current);

For i64 in realistic metric ranges this equals plain subtraction, but saturating_sub on a signed type saturates at i64::MIN (not zero), which is surprising for a delta computation. Plain next - *current is more readable and correct for all in-range values.


Minor observations

  • WORK_LABELS is defined as a module-level constant but used only for keep_awake_active. Either inline it or leave it as-is.
  • set_worker_active maps bool to 1/0: aggregates correctly across actors (sum = active worker count per actor name), but a short comment near the metric definition would clarify the intent.
  • usize_to_i64 saturation at i64::MAX: a debug_assert that the value fits would catch regressions in tests without production overhead.
  • The examples/kitchen-sink removal is clean and correct.

What is good

  • clear_aggregated_gauges on drop correctly prevents the Prometheus registry from accumulating stale label-value series for stopped actors.
  • The test exercises concurrent instances, intermediate updates, and full drop cleanup -- good coverage for the aggregation invariant.
  • Parking lot mutex usage is justified: guards are never held across .await points, and Drop (sync context) requires a sync lock.
  • set_inbox_depth string dispatch with _ => unreachable!() is acceptable per project conventions (open string space, not an enum).

Summary

The aggregated-gauge design is sound and the test coverage is solid. The two actionable issues before merge are: (1) the potential negative gauge from end_user_task decrement, and (2) confirming the new change-callback cycles have a teardown path. The delta arithmetic nit is low-priority but worth a one-line fix for clarity.

@NathanFlurry NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from 93e7a3b to 6d50b18 Compare May 11, 2026 13:54
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

PR Review: feat(rivetkit): add current actor metrics

Overview

This PR adds real-time gauge metrics for per-actor state (queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake work, SQLite worker activity). The core design uses a delta-based aggregation pattern via set_aggregated_gauge, which correctly sums contributions from multiple actor instances sharing the same actor_name label in a single Prometheus gauge. Cleanup on Drop via clear_aggregated_gauges prevents gauge accumulation across actor lifetimes.


Correctness

begin_user_task / end_user_task balance

begin_user_task was previously a no-op; it is now meaningful. Before this PR, callers could call end_user_task without a matching begin_user_task without consequence. Now, an unmatched end_user_task will call saturating_sub(1) on a count of 0 — the saturating prevents a negative gauge, but the underlying state in ActorMetricState::user_tasks_active will end up at 0 instead of -1, silently hiding the imbalance. It would be worth auditing all call sites to confirm the calls are balanced.

set_worker_active as a boolean gauge

sqlite_workers_active is set to 0/1 based on a bool. The set_aggregated_gauge delta approach works correctly here (calling set_worker_active(true) twice in a row is idempotent because delta == 0). But the metric name implies a count, so it could be misleading if future code allows multiple workers per actor. Consider a doc comment or gauge description clarifying the 0/1 semantics.

Gauge-vs-counter ordering in Drop

clear_aggregated_gauges (which zeros all state gauges) is called before actor_active_count.dec() and actor_stopped_total.inc(). This means there is a brief window where Prometheus sees all gauges zeroed but the actor still counted as active. This is a minor observability race and not a correctness bug, but worth noting.


Code Quality

Variable naming in configure_sleep_hooks (context.rs)

The function now has both internal_keep_awake_metric_ctx and internal_keep_awake_ctx in scope. The names are technically distinct, but a reader scanning this function will need to keep both in mind. Consider naming the metrics one keep_awake_metrics_ctx or similar to clearly differentiate purpose at first glance.

String-keyed match in set_inbox_depth

fn set_inbox_depth(&self, inbox: &'static str, depth: usize) {
    let current = match inbox {
        "lifecycle" => &mut state.lifecycle_inbox_depth,
        "dispatch" => &mut state.dispatch_inbox_depth,
        "lifecycle_event" => &mut state.lifecycle_event_inbox_depth,
        _ => unreachable!("unknown inbox metric label"),
    };

This is internal-only and the unreachable! guard is acceptable per project conventions. However, since the call sites are all in this same file, three concrete methods mirroring how set_keep_awake_active / set_internal_keep_awake_active are separate would eliminate the match entirely and be more idiomatic. Not a blocker.

keep_awake_active uses WORK_LABELS (two-label gauge), shutdown_tasks_active uses ACTOR_LABELS (one-label gauge)

This asymmetry means querying "total keep-awake work" for an actor requires summing over both kind values. The split is intentional (keep_awake vs. internal_keep_awake are distinct concepts), but it should be documented somewhere so dashboard authors know to sum.


Performance

The set_aggregated_gauge helper acquires a parking_lot::Mutex on every metric update, even for no-ops (delta == 0 skips the gauge write but still locks). For high-throughput actors updating connection counts or inbox depths on every message, this adds a sync lock acquisition per event. Since parking_lot uncontended fast path is ~5 ns, this is unlikely to be a bottleneck, but worth being aware of.


Tests

  • The new actor_current_gauges_aggregate_by_actor_name test covers the core invariant (two actors with the same name, gauges aggregate; drop one, value adjusts; drop both, value zeroes). This is exactly the right test to have.
  • Tests share a global METRICS registry via LazyLock. The unique actor name "counter-gauge-aggregate" avoids collision with other tests, but running the test suite in parallel (Rust default) could have other tests accidentally share actor names if naming is not kept unique. This is pre-existing, not introduced here.
  • No test for begin_user_task / end_user_task imbalance behavior (see correctness note above).

Minor Notes

  • u64_to_i64 is gated #[cfg(feature = "sqlite-local")] since it is only used there — good.
  • actor_started_total and actor_stopped_total are clean additions; the start/stop counter pair makes it easy to derive actor churn rate in dashboards.
  • Removal of the registry.diagnostics() duck-type check from kitchen-sink/server.ts is a good cleanup, replaced by the properly typed registry.routes.* endpoints.

Summary

The aggregation design is sound and the Drop-based cleanup is the right approach. The main thing to verify before merging is that all existing end_user_task call sites have a corresponding begin_user_task call — if there are any paths that only call end_user_task (e.g., fast-fail exits that skip begin), the user task gauge will be silently incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant