GREP-498: Controller Reconciliation Prometheus Metrics #499

Open
brluobt wants to merge 3 commits into ai-dynamo:main from brluobt:proposal/controller-reconcile-metrics

Conversation

@brluobt

@brluobt brluobt commented Mar 24, 2026

Summary

Add Prometheus metrics to all three Grove controllers (PodCliqueSet, PodClique, PodCliqueScalingGroup).

Grove currently has zero custom Prometheus metrics. This GREP proposes a metrics framework providing:

  • Reconcile duration, count, error rate, and in-flight tracking via ObservedReconciler wrapper
  • Sub-operation timing within each reconcile loop via MetricsContext
  • 409 Conflict tracking with target-resource attribution for cross-controller contention diagnosis

Tracking issue: Fixes #498

Motivation

During large-scale inference graph deployments (e.g., via NVIDIA Dynamo), reconciliation performance degrades significantly due to high event rates and cross-controller resource contention on PCS/PC objects. Without metrics, diagnosing these issues requires ad-hoc log analysis.

Key Design Decisions

  • ObservedReconciler wrapper: Generic wrapper for all 3 controllers, minimal code change
  • MetricsContext via context.Context: Nil-safe, zero overhead when not present, no interface changes
  • Conflict tracking: grove_operator_status_update_conflict_total with target_kind label for precise contention attribution
  • No new dependencies: Uses existing prometheus/client_golang via controller-runtime
  • No logic changes: Purely additive instrumentation

Test Plan

  • Unit tests for ObservedReconciler, MetricsContext, and error categorization
  • Integration test verifying metrics at /metrics endpoint
  • E2E test at scale verifying sub-operation and conflict metrics


@Ronkahn21
Contributor

Metrics Review: Duplicates of Existing controller-runtime Metrics

I double-checked the proposed metrics against what controller-runtime already exposes out of the box. 4 of the 5 reconcile-level metrics proposed here are duplicates of existing built-in metrics.

Proposed metrics that already exist

| Proposed metric | Already exists as |
| --- | --- |
| `grove_operator_reconcile_duration_seconds` | `controller_runtime_reconcile_time_seconds` |
| `grove_operator_reconcile_total` | `controller_runtime_reconcile_total` |
| `grove_operator_reconcile_inflight` | `controller_runtime_active_workers` |
| `grove_operator_reconcile_requeue_total` | `controller_runtime_reconcile_total{result="requeue"\|"requeue_after"}` |

Evidence from live metrics scrape

# Reconcile counts with result breakdown — already exists
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="error"} 9
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="requeue_after"} 13
controller_runtime_reconcile_total{controller="podcliqueset-controller",result="success"} 19

# Reconcile duration histogram — already exists
controller_runtime_reconcile_time_seconds_sum{controller="podcliqueset-controller"} 411.79
controller_runtime_reconcile_time_seconds_count{controller="podcliqueset-controller"} 41

# In-flight workers — already exists
controller_runtime_active_workers{controller="podcliqueset-controller"} 0
controller_runtime_max_concurrent_reconciles{controller="podcliqueset-controller"} 20

# Workqueue metrics — already exist
workqueue_depth{controller="podcliqueset-controller"} 0
workqueue_adds_total{controller="podcliqueset-controller"} 41
workqueue_retries_total{controller="podcliqueset-controller"} 22
workqueue_queue_duration_seconds_sum{controller="podcliqueset-controller"} 364.76
workqueue_work_duration_seconds_sum{controller="podcliqueset-controller"} 411.79

Could you update the proposal to remove the 4 duplicates above and also drop the namespace label from the remaining metrics? controller-runtime intentionally omits namespace to avoid cardinality issues.

The sub-operation duration, conflict tracking, and error categorization metrics are great additions — would be nice to see the proposal focused on those. Thanks!

@brluobt
Author

brluobt commented Apr 1, 2026

Thanks for the thorough review @Ronkahn21! Updated in the latest commit:

  • Removed the 4 duplicate reconcile-level metrics and the ObservedReconciler wrapper
  • Removed namespace label from all custom metrics
  • Focused the proposal on 3 genuinely new metrics: sub-operation timing, conflict attribution, and error categorization
  • Added a "Relationship to Built-in Metrics" section documenting existing controller_runtime_* metrics

### Non-Goals

* This proposal does NOT duplicate any controller-runtime built-in metrics
* This proposal does NOT add metrics to the Grove scheduler component (only the operator)
Contributor (@shayasoolin)

There's no such thing as "Grove scheduler component". Grove supports different scheduler backends, none of them is part of Grove itself.

Author

Good catch — fixed. The Motivation/Non-Goals now read:

"This proposal does NOT add metrics to Grove's scheduler backends (these are external components Grove supports, not part of Grove itself); it covers only the Grove operator controllers."

(commit 27c59ff, L83 in the revised README)

if stepResult := r.ensureFinalizer(ctx, logger, pclq); ctrlcommon.ShortCircuitReconcileFlow(stepResult) {
return r.recordIncompleteReconcile(ctx, logger, pclq, &stepResult)
}
mc.RecordSubOperation("ensure_finalizer", start)
Contributor (@shayasoolin)

The interface looks good, we'll need to see how it lives together with the existing time measurements for scale-test milestone logging. Worth reviewing the existing code again (it might have changes since this PR was first drafted) and revisit the details.

Author

Good question — added a coexistence check to the Test Plan (commit 27c59ff):

"Verify these metrics do not duplicate or interfere with the scale-test milestone instrumentation in operator/e2e/measurement/. The new metrics expose aggregate per-controller signals via the /metrics endpoint; the measurement package continues to own scale-test milestone logging. Different pipelines, no expected interference."

The two paths address different questions and emit through different sinks:

  • This proposal → fleet-level aggregate metrics on the /metrics endpoint (Prometheus scrape), answering "across all instances and time, how often does X happen?"
  • operator/e2e/measurement/ → scale-test milestone logging for E2E test assertions, answering "did this specific E2E run hit milestone Y within timeout Z?"

The implementation PR will include an integration test asserting both are usable from the same operator binary without collision.


Grove Operator
PCS Controller → creates PC/PCSG, updates PCS status
PC Controller → creates Pods, updates PC status (replicas, readyReplicas)
Contributor (@shayasoolin)

Please use PCLQ instead of PC, across the whole document.

Author

Done — PC → PCLQ across the whole document, with PCS/PCLQ/PCSG formally defined in the new Appendix B terminology table (commit 27c59ff).

@sanjaychatterjee
Collaborator

@Ronkahn21 to review this.

Contributor

@Ronkahn21 Ronkahn21 left a comment


The proposal uses metrics to answer questions that belong to tracing and profiling. "Which sub-operation in this reconcile is slow?" is a tracing question — per-request span data answers it, not an aggregate histogram. The proposed function-name labels (ensure_finalizer, sync_pods, update_observed_generation) bind the metric surface to today's code; refactor any reconciler and the labels break. Metrics should do what metrics do well: aggregate, fleet-level signals that survive refactors and back dashboards and alerts. The needs the GREP describes are real, but they have better answers. Grove's scale-test infrastructure already downloads pprof data after each run, which covers sub-operation latency today. Production tracing can be its own proposal if pprof proves insufficient.

A GREP should describe user-visible changes, not implementation. The current doc embeds about 130 lines of Go — full source for metrics.go and metrics_context.go, per-call-site examples, import lists. None of that belongs in a design doc. Describe what the framework exposes (metric names, labels, recording verbs) and why; leave the wiring for the implementation PR.

Points to address

  • Drop the embedded Go. Replace it with an API surface: metric names, labels, and recording verbs.
  • Drop sub-operation timing, or collapse it to phase=spec|status if there's evidence controller_runtime_reconcile_time_seconds falls short at fleet aggregate. Grove's scale-test pprof already covers per-reconcile latency.
  • Justify each remaining metric against the closest existing built-in. For example, what does grove_operator_status_update_conflict_total{target_kind=...} add over rest_client_requests_total{code="409", url=...}? The burden is on the new metric.
  • Rewrite user stories at fleet aggregate. SRE-on-Grafana scenarios: P95 latency, conflict-rate alert, persistent-error alert. Add a capacity-shaped story (active workers, queue depth) that's answered entirely by built-ins.
  • List tracing and profiling as explicit non-goals. Note that scale-test pprof covers per-reconcile diagnosis today; tracing can be its own GREP if production needs exceed that.
  • Drop the "Grove scheduler component" non-goal (per @shayasoolin). Grove has no scheduler component — backends are pluggable and own their own metrics.

@brluobt
Author

brluobt commented May 9, 2026

Thanks @Ronkahn21 — after thinking it through, I agree with the framing. Sub-operation timing is a tracing concern, not a metrics concern, and function-name labels coupling the metrics surface to today's call graph is exactly the trap you're pointing at.

Plan for the next revision:

  1. Drop the sub-operation timing histogram and MetricsContext from this proposal. Per-reconcile sub-operation visibility moves to the tracing side (OpenTelemetry spans) — either as a follow-up GREP or as a note in Future Work here, depending on what the maintainers prefer.
  2. Keep ObservedReconciler (aggregate reconcile duration / count / error rate / in-flight per controller — no function-name labels).
  3. Keep the 409 conflict tracking metric (grove_operator_status_update_conflict_total with target_kind) — this is the fleet-level signal the original motivation is really about.

@shayasoolin — I owe you responses on the three inline comments (L71 / L329 / L558). I'll address those in the same revision pass, along with rebasing off main.

Targeting early next week for the revised proposal. Thanks both for the careful review.

kangclzjc and others added 3 commits May 9, 2026 16:46
Add a Grove Enhancement Proposal for introducing Prometheus metrics
to all three Grove controllers (PodCliqueSet, PodClique,
PodCliqueScalingGroup). Grove currently has zero custom metrics,
providing no visibility into reconcile duration, sub-operation
latency, error categorization, or cross-controller conflict rates.

Motivated by observed performance degradation during large-scale
Pod creation when Grove is used with upstream operators (e.g.,
NVIDIA Dynamo) that concurrently operate on PCS/PC objects.

Tracking issue: ai-dynamo#498

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>

- Remove 4 duplicate metrics already provided by controller-runtime
  (reconcile_duration, reconcile_total, reconcile_inflight, reconcile_requeue)
- Remove namespace label from all custom metrics to avoid cardinality issues
- Remove ObservedReconciler wrapper (no longer needed)
- Focus proposal on 3 genuinely new metrics:
  1. reconcile_sub_operation_duration_seconds (sub-operation timing)
  2. status_update_conflict_total (409 conflict attribution)
  3. reconcile_errors_total (error categorization by K8s error type)
- Add 'Relationship to Built-in Metrics' section documenting existing
  controller_runtime_* and workqueue_* metrics
- Update PromQL queries to reference built-in metrics by their actual names
- Regenerate TOC via mdtoc

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>

Revise GREP-498 per review feedback from @Ronkahn21 and @shayasoolin:

Methodology corrections (addressing the core architectural feedback):
- Drop sub_operation_duration_seconds histogram: per-request,
  sub-operation visibility is a distributed-tracing concern, not
  a Prometheus-metrics concern. Prometheus aggregates across
  instances and loses per-request context; labeling a histogram
  by internal Go function names also couples the metric surface
  to code structure and breaks silently on refactor.
- Reframe Alt 2 (OpenTelemetry tracing) from rejected to scoped
  out; track per-request span instrumentation as Future Work.
- Remove all embedded Go source (MetricsContext type, helper
  functions, registration snippets, import blocks, File Changes
  Summary). The proposal now describes the metric surface
  (names, labels, semantics, cardinality bound) only; helper
  types and call sites belong to the implementation PR.

Scope reductions:
- Retain two custom metrics:
  * grove_operator_status_update_conflict_total (conflict
    attribution by target resource kind)
  * grove_operator_reconcile_errors_total (error categorization
    by k8serrors.Is* predicates)
- Add explicit Cardinality Bound section (<= 33 new series).
- Add Recording Verbs (Contract Only) section.
- Add Future Work section for the deferred tracing follow-up.

Detail fixes from inline comments:
- Motivation: scheduler-backend wording corrected - Grove
  supports these backends, they are not part of Grove itself.
- Test Plan: add coexistence check against
  operator/e2e/measurement/ scale-test milestone instrumentation.
- Appendix A: normalize terminology - PodClique -> PCLQ across
  document (with PCS/PCLQ/PCSG defined in Appendix B glossary).

Proposal shrinks from 591 lines to 377 lines (-36%).
No implementation code is introduced by this revision; the
follow-up implementation PR(s) will land after GREP approval.

Refs: ai-dynamo#499
Signed-off-by: Bruce Luo <brluo@nvidia.com>
@brluobt brluobt force-pushed the proposal/controller-reconcile-metrics branch from 8dceb53 to 27c59ff Compare May 9, 2026 16:46
@brluobt brluobt requested a review from danbar2 as a code owner May 9, 2026 16:46
@brluobt
Author

brluobt commented May 9, 2026

GREP-498 v2 — revision summary

Thanks @Ronkahn21 and @shayasoolin for the thorough review. The proposal has been substantially reworked; pushed in commit 27c59ff.

Core architectural change

The biggest delta in this revision is a methodology correction suggested by @Ronkahn21's feedback:

Metrics are for aggregate, fleet-level questions. Per-request "which step inside this reconcile is slow?" questions belong to tracing.

This was the right framing, and the original proposal got it wrong by trying to use a sub-operation histogram (sub_operation_duration_seconds{operation="ensure_finalizer", ...}) to answer per-request questions. Aggregate histograms cannot reconstruct a single reconcile's internal timeline, and labeling the metric by internal Go function names couples the metric surface to the code structure and breaks silently on refactor.

Change in v2:

  • Dropped sub_operation_duration_seconds entirely.
  • Reframed Alt 2 (OpenTelemetry tracing) from rejected to scoped out, with per-request span instrumentation tracked under a new Future Work section.
  • The proposal is now strictly scoped to aggregate, fleet-level signals.

Metric surface in v2

Two custom metrics remain, both of which answer genuine gaps that controller-runtime built-ins cannot:

| Metric | Answers |
| --- | --- |
| `grove_operator_status_update_conflict_total{controller, target_kind}` | "Which resource type is the 409 contention hotspot?" |
| `grove_operator_reconcile_errors_total{controller, error_type}` | "What fraction of errors are transient vs. persistent?" |

Cardinality bound: ≤ 33 new series total (3 controllers × 3 target_kinds + 3 controllers × 8 error_types). No namespace, resource-name, function-name, or file-path labels.

Other review items addressed

  1. Embedded Go source removed. The proposal now describes the metric surface only (names, labels, semantics, cardinality, recording verbs contract). Helper types, registration code, and call sites are implementation details of the follow-up PR, not this proposal. (Addresses the "this looks more like an implementation PR than a design doc" signal.)

  2. Relationship to built-in metrics made explicit. New Relationship to Built-in Metrics section + expanded Built-in controller-runtime Metrics Reference table. The proposal explicitly does not duplicate controller_runtime_reconcile_*, workqueue_*, or rest_client_*.

  3. Scheduler-backend wording fixed (@shayasoolin L71): Grove supports scheduler backends; they are not part of Grove. Non-Goals now reflects this.

  4. Coexistence with operator/e2e/measurement/ (@shayasoolin L329): added an explicit Test Plan coexistence check. The two paths serve different questions through different sinks (Prometheus /metrics scrape vs. E2E milestone logging).

  5. Terminology normalized (@shayasoolin L558): PC → PCLQ throughout, with PCS/PCLQ/PCSG defined in new Appendix B.

Stats

  • 591 → 377 lines (-36%).
  • 5 Go code blocks (181 lines) → 0.
  • MetricsContext, RecordSubOperation, sub_operation_duration_seconds — all removed.
  • make update-toc + make verify-toc pass locally.

Requesting re-review. Happy to iterate further if the aggregate-only scoping still misses something.

@brluobt brluobt requested review from Ronkahn21 and shayasoolin May 9, 2026 16:48


Development

Successfully merging this pull request may close these issues.

GREP: Controller Reconciliation Prometheus Metrics

5 participants