GREP-498: Controller Reconciliation Prometheus Metrics #499

Open
brluobt wants to merge 3 commits into ai-dynamo:main from brluobt:proposal/controller-reconcile-metrics

Conversation

@brluobt

@brluobt brluobt commented Mar 24, 2026

Summary

Add Prometheus metrics to all three Grove controllers (PodCliqueSet, PodClique, PodCliqueScalingGroup).

Grove currently has zero custom Prometheus metrics. This GREP proposes a metrics framework providing:

  • Reconcile duration, count, error rate, and in-flight tracking via ObservedReconciler wrapper
  • Sub-operation timing within each reconcile loop via MetricsContext
  • 409 Conflict tracking with target-resource attribution for cross-controller contention diagnosis

Tracking issue: Fixes #498

Motivation

During large-scale inference graph deployments (e.g., via NVIDIA Dynamo), reconciliation performance degrades significantly due to high event rates and cross-controller resource contention on PCS/PC objects. Without metrics, diagnosing these issues requires ad-hoc log analysis.

Key Design Decisions

  • ObservedReconciler wrapper: Generic wrapper for all 3 controllers, minimal code change
  • MetricsContext via context.Context: Nil-safe, zero overhead when not present, no interface changes
  • Conflict tracking: grove_operator_status_update_conflict_total with target_kind label for precise contention attribution
  • No new dependencies: Uses existing prometheus/client_golang via controller-runtime
  • No logic changes: Purely additive instrumentation

Test Plan

  • Unit tests for ObservedReconciler, MetricsContext, and error categorization
  • Integration test verifying metrics at /metrics endpoint
  • E2E test at scale verifying sub-operation and conflict metrics


@Ronkahn21
Contributor

Metrics Review: Duplicates of Existing controller-runtime Metrics

I double-checked the proposed metrics against what controller-runtime already exposes out of the box. 4 of the 5 reconcile-level metrics proposed here are duplicates of existing built-in metrics.

Proposed metrics that already exist

| Proposed metric | Already exists as |
| --- | --- |
| `grove_operator_reconcile_duration_seconds` | `controller_runtime_reconcile_time_seconds` |
| `grove_operator_reconcile_total` | `controller_runtime_reconcile_total` |
| `grove_operator_reconcile_inflight` | `controller_runtime_active_workers` |
| `grove_operator_reconcile_requeue_total` | `controller_runtime_reconcile_total{result="requeue"\|"requeue_after"}` |

Evidence from live metrics scrape

# Reconcile counts with result breakdown — already exists
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="error"} 9
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="requeue_after"} 13
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="success"} 19

# Reconcile duration histogram — already exists
controller_runtime_reconcile_time_seconds_sum{controller="podcliqueset-controller"} 411.79
controller_runtime_reconcile_time_seconds_count{controller="podcliqueset-controller"} 41

# In-flight workers — already exists
controller_runtime_active_workers{controller="podcliqueset-controller"} 0
controller_runtime_max_concurrent_reconciles{controller="podcliqueset-controller"} 20

# Workqueue metrics — already exist
workqueue_depth{controller="podcliqueset-controller"} 0
workqueue_adds_total{controller="podcliqueset-controller"} 41
workqueue_retries_total{controller="podcliqueset-controller"} 22
workqueue_queue_duration_seconds_sum{controller="podcliqueset-controller"} 364.76
workqueue_work_duration_seconds_sum{controller="podcliqueset-controller"} 411.79

Could you update the proposal to remove the 4 duplicates above and also drop the namespace label from the remaining metrics? controller-runtime intentionally omits namespace to avoid cardinality issues.

The sub-operation duration, conflict tracking, and error categorization metrics are great additions — would be nice to see the proposal focused on those. Thanks!

@brluobt
Author

brluobt commented Apr 1, 2026

Thanks for the thorough review @Ronkahn21! Updated in the latest commit:

  • Removed the 4 duplicate reconcile-level metrics and the ObservedReconciler wrapper
  • Removed namespace label from all custom metrics
  • Focused the proposal on 3 genuinely new metrics: sub-operation timing, conflict attribution, and error categorization
  • Added a "Relationship to Built-in Metrics" section documenting existing controller_runtime_* metrics

### Non-Goals

* This proposal does NOT duplicate any controller-runtime built-in metrics
* This proposal does NOT add metrics to the Grove scheduler component (only the operator)
Contributor (@shayasoolin)

There's no such thing as "Grove scheduler component". Grove supports different scheduler backends, none of them is part of Grove itself.

Author

Good catch — fixed. The Motivation/Non-Goals now read:

"This proposal does NOT add metrics to Grove's scheduler backends (these are external components Grove supports, not part of Grove itself); it covers only the Grove operator controllers."

(commit 27c59ff, L83 in the revised README)

if stepResult := r.ensureFinalizer(ctx, logger, pclq); ctrlcommon.ShortCircuitReconcileFlow(stepResult) {
return r.recordIncompleteReconcile(ctx, logger, pclq, &stepResult)
}
mc.RecordSubOperation("ensure_finalizer", start)
Contributor (@shayasoolin)

The interface looks good, we'll need to see how it lives together with the existing time measurements for scale-test milestone logging. Worth reviewing the existing code again (it might have changes since this PR was first drafted) and revisit the details.

Author

Good question — added a coexistence check to the Test Plan (commit 27c59ff):

"Verify these metrics do not duplicate or interfere with the scale-test milestone instrumentation in operator/e2e/measurement/. The new metrics expose aggregate per-controller signals via the /metrics endpoint; the measurement package continues to own scale-test milestone logging. Different pipelines, no expected interference."

The two paths address different questions and emit through different sinks:

  • This proposal → fleet-level aggregate metrics on the /metrics endpoint (Prometheus scrape), answering "across all instances and time, how often does X happen?"
  • operator/e2e/measurement/ → scale-test milestone logging for E2E test assertions, answering "did this specific E2E run hit milestone Y within timeout Z?"

The implementation PR will include an integration test asserting both are usable from the same operator binary without collision.


Grove Operator
PCS Controller → creates PC/PCSG, updates PCS status
PC Controller → creates Pods, updates PC status (replicas, readyReplicas)
Contributor (@shayasoolin)

Please use PCLQ instead of PC, across the whole document.

Author

Done — PC → PCLQ across the whole document, with PCS/PCLQ/PCSG formally defined in the new Appendix B terminology table (commit 27c59ff).

@sanjaychatterjee
Collaborator

@Ronkahn21 to review this.

Contributor

@Ronkahn21 Ronkahn21 left a comment


The proposal uses metrics to answer questions that belong to tracing and profiling. "Which sub-operation in this reconcile is slow?" is a tracing question — per-request span data answers it, not an aggregate histogram. The proposed function-name labels (ensure_finalizer, sync_pods, update_observed_generation) bind the metric surface to today's code; refactor any reconciler and the labels break. Metrics should do what metrics do well: aggregate, fleet-level signals that survive refactors and back dashboards and alerts. The needs the GREP describes are real, but they have better answers. Grove's scale-test infrastructure already downloads pprof data after each run, which covers sub-operation latency today. Production tracing can be its own proposal if pprof proves insufficient.

A GREP should describe user-visible changes, not implementation. The current doc embeds about 130 lines of Go — full source for metrics.go and metrics_context.go, per-call-site examples, import lists. None of that belongs in a design doc. Describe what the framework exposes (metric names, labels, recording verbs) and why; leave the wiring for the implementation PR.

Points to address

  • Drop the embedded Go. Replace it with an API surface: metric names, labels, and recording verbs.
  • Drop sub-operation timing, or collapse it to phase=spec|status if there's evidence controller_runtime_reconcile_time_seconds falls short at fleet aggregate. Grove's scale-test pprof already covers per-reconcile latency.
  • Justify each remaining metric against the closest existing built-in. For example, what does grove_operator_status_update_conflict_total{target_kind=...} add over rest_client_requests_total{code="409", url=...}? The burden is on the new metric.
  • Rewrite user stories at fleet aggregate. SRE-on-Grafana scenarios: P95 latency, conflict-rate alert, persistent-error alert. Add a capacity-shaped story (active workers, queue depth) that's answered entirely by built-ins.
  • List tracing and profiling as explicit non-goals. Note that scale-test pprof covers per-reconcile diagnosis today; tracing can be its own GREP if production needs exceed that.
  • Drop the "Grove scheduler component" non-goal (per @shayasoolin). Grove has no scheduler component — backends are pluggable and own their own metrics.

@brluobt
Author

brluobt commented May 9, 2026

Thanks @Ronkahn21 — after thinking it through, I agree with the framing. Sub-operation timing is a tracing concern, not a metrics concern, and function-name labels coupling the metrics surface to today's call graph is exactly the trap you're pointing at.

Plan for the next revision:

  1. Drop the sub-operation timing histogram and MetricsContext from this proposal. Per-reconcile sub-operation visibility moves to the tracing side (OpenTelemetry spans) — either as a follow-up GREP or as a note in Future Work here, depending on what the maintainers prefer.
  2. Keep ObservedReconciler (aggregate reconcile duration / count / error rate / in-flight per controller — no function-name labels).
  3. Keep the 409 conflict tracking metric (grove_operator_status_update_conflict_total with target_kind) — this is the fleet-level signal the original motivation is really about.

@shayasoolin — I owe you responses on the three inline comments (L71 / L329 / L558). I'll address those in the same revision pass, along with rebasing off main.

Targeting early next week for the revised proposal. Thanks both for the careful review.

kangclzjc and others added 3 commits May 9, 2026 16:46
Add a Grove Enhancement Proposal for introducing Prometheus metrics
to all three Grove controllers (PodCliqueSet, PodClique,
PodCliqueScalingGroup). Grove currently has zero custom metrics,
providing no visibility into reconcile duration, sub-operation
latency, error categorization, or cross-controller conflict rates.

Motivated by observed performance degradation during large-scale
Pod creation when Grove is used with upstream operators (e.g.,
NVIDIA Dynamo) that concurrently operate on PCS/PC objects.

Tracking issue: ai-dynamo#498

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>

- Remove 4 duplicate metrics already provided by controller-runtime
  (reconcile_duration, reconcile_total, reconcile_inflight, reconcile_requeue)
- Remove namespace label from all custom metrics to avoid cardinality issues
- Remove ObservedReconciler wrapper (no longer needed)
- Focus proposal on 3 genuinely new metrics:
  1. reconcile_sub_operation_duration_seconds (sub-operation timing)
  2. status_update_conflict_total (409 conflict attribution)
  3. reconcile_errors_total (error categorization by K8s error type)
- Add 'Relationship to Built-in Metrics' section documenting existing
  controller_runtime_* and workqueue_* metrics
- Update PromQL queries to reference built-in metrics by their actual names
- Regenerate TOC via mdtoc

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>

Revise GREP-498 per review feedback from @Ronkahn21 and @shayasoolin:

Methodology corrections (addressing the core architectural feedback):
- Drop sub_operation_duration_seconds histogram: per-request,
  sub-operation visibility is a distributed-tracing concern, not
  a Prometheus-metrics concern. Prometheus aggregates across
  instances and loses per-request context; labeling a histogram
  by internal Go function names also couples the metric surface
  to code structure and breaks silently on refactor.
- Reframe Alt 2 (OpenTelemetry tracing) from rejected to scoped
  out; track per-request span instrumentation as Future Work.
- Remove all embedded Go source (MetricsContext type, helper
  functions, registration snippets, import blocks, File Changes
  Summary). The proposal now describes the metric surface
  (names, labels, semantics, cardinality bound) only; helper
  types and call sites belong to the implementation PR.

Scope reductions:
- Retain two custom metrics:
  * grove_operator_status_update_conflict_total (conflict
    attribution by target resource kind)
  * grove_operator_reconcile_errors_total (error categorization
    by k8serrors.Is* predicates)
- Add explicit Cardinality Bound section (<= 33 new series).
- Add Recording Verbs (Contract Only) section.
- Add Future Work section for the deferred tracing follow-up.

Detail fixes from inline comments:
- Motivation: scheduler-backend wording corrected - Grove
  supports these backends, they are not part of Grove itself.
- Test Plan: add coexistence check against
  operator/e2e/measurement/ scale-test milestone instrumentation.
- Appendix A: normalize terminology - PodClique -> PCLQ across
  document (with PCS/PCLQ/PCSG defined in Appendix B glossary).

Proposal shrinks from 591 lines to 377 lines (-36%).
No implementation code is introduced by this revision; the
follow-up implementation PR(s) will land after GREP approval.

Refs: ai-dynamo#499
Signed-off-by: Bruce Luo <brluo@nvidia.com>
@brluobt brluobt force-pushed the proposal/controller-reconcile-metrics branch from 8dceb53 to 27c59ff Compare May 9, 2026 16:46
@brluobt brluobt requested a review from danbar2 as a code owner May 9, 2026 16:46
@brluobt
Author

brluobt commented May 9, 2026

GREP-498 v2 — revision summary

Thanks @Ronkahn21 and @shayasoolin for the thorough review. The proposal has been substantially reworked; pushed in commit 27c59ff.

Core architectural change

The biggest delta in this revision is a methodology correction suggested by @Ronkahn21's feedback:

Metrics are for aggregate, fleet-level questions. Per-request "which step inside this reconcile is slow?" questions belong to tracing.

This was the right framing, and the original proposal got it wrong by trying to use a sub-operation histogram (sub_operation_duration_seconds{operation="ensure_finalizer", ...}) to answer per-request questions. Aggregate histograms cannot reconstruct a single reconcile's internal timeline, and labeling the metric by internal Go function names couples the metric surface to the code structure and breaks silently on refactor.

Change in v2:

  • Dropped sub_operation_duration_seconds entirely.
  • Reframed Alt 2 (OpenTelemetry tracing) from rejected to scoped out, with per-request span instrumentation tracked under a new Future Work section.
  • The proposal is now strictly scoped to aggregate, fleet-level signals.

Metric surface in v2

Two custom metrics remain, both of which answer genuine gaps that controller-runtime built-ins cannot:

| Metric | Answers |
| --- | --- |
| `grove_operator_status_update_conflict_total{controller, target_kind}` | "Which resource type is the 409 contention hotspot?" |
| `grove_operator_reconcile_errors_total{controller, error_type}` | "What fraction of errors are transient vs. persistent?" |

Cardinality bound: ≤ 33 new series total (3 controllers × 3 target_kinds + 3 controllers × 8 error_types). No namespace, resource-name, function-name, or file-path labels.

Other review items addressed

  1. Embedded Go source removed. The proposal now describes the metric surface only (names, labels, semantics, cardinality, recording verbs contract). Helper types, registration code, and call sites are implementation details of the follow-up PR, not this proposal. (Addresses the "this looks more like an implementation PR than a design doc" signal.)

  2. Relationship to built-in metrics made explicit. New Relationship to Built-in Metrics section + expanded Built-in controller-runtime Metrics Reference table. The proposal explicitly does not duplicate controller_runtime_reconcile_*, workqueue_*, or rest_client_*.

  3. Scheduler-backend wording fixed (@shayasoolin L71): Grove supports scheduler backends; they are not part of Grove. Non-Goals now reflects this.

  4. Coexistence with operator/e2e/measurement/ (@shayasoolin L329): added an explicit Test Plan coexistence check. The two paths serve different questions through different sinks (Prometheus /metrics scrape vs. E2E milestone logging).

  5. Terminology normalized (@shayasoolin L558): PC → PCLQ throughout, with PCS/PCLQ/PCSG defined in new Appendix B.

Stats

  • 591 → 377 lines (-36%).
  • 5 Go code blocks (181 lines) → 0.
  • MetricsContext, RecordSubOperation, sub_operation_duration_seconds — all removed.
  • make update-toc + make verify-toc pass locally.

Requesting re-review. Happy to iterate further if the aggregate-only scoping still misses something.

@brluobt brluobt requested review from Ronkahn21 and shayasoolin May 9, 2026 16:48


Development

Successfully merging this pull request may close these issues.

GREP: Controller Reconciliation Prometheus Metrics

5 participants