GREP-498: Controller Reconciliation Prometheus Metrics #499
Conversation
**Metrics Review: Duplicates of Existing controller-runtime Metrics**

I double-checked the proposed metrics against what controller-runtime already exposes out of the box. 4 of the 5 reconcile-level metrics proposed here are duplicates of existing built-in metrics.

**Proposed metrics that already exist**

**Evidence from live metrics scrape**

Could you update the proposal to remove the 4 duplicates above and also drop the `namespace` label. The sub-operation duration, conflict tracking, and error categorization metrics are great additions — would be nice to see the proposal focused on those. Thanks!
Thanks for the thorough review @Ronkahn21! Updated in the latest commit:
```markdown
### Non-Goals

* This proposal does NOT duplicate any controller-runtime built-in metrics
* This proposal does NOT add metrics to the Grove scheduler component (only the operator)
```
There's no such thing as "Grove scheduler component". Grove supports different scheduler backends, none of them is part of Grove itself.
Good catch — fixed. The Motivation/Non-Goals now read:
"This proposal does NOT add metrics to Grove's scheduler backends (these are external components Grove supports, not part of Grove itself); it covers only the Grove operator controllers."
(commit 27c59ff, L83 in the revised README)
```go
if stepResult := r.ensureFinalizer(ctx, logger, pclq); ctrlcommon.ShortCircuitReconcileFlow(stepResult) {
	return r.recordIncompleteReconcile(ctx, logger, pclq, &stepResult)
}
mc.RecordSubOperation("ensure_finalizer", start)
```
The interface looks good; we'll need to see how it lives together with the existing time measurements for scale-test milestone logging. Worth reviewing the existing code again (it may have changed since this PR was first drafted) and revisiting the details.
Good question — added a coexistence check to the Test Plan (commit 27c59ff):
"Verify these metrics do not duplicate or interfere with the scale-test milestone instrumentation in
operator/e2e/measurement/. The new metrics expose aggregate per-controller signals via the/metricsendpoint; the measurement package continues to own scale-test milestone logging. Different pipelines, no expected interference."
The two paths address different questions and emit through different sinks:
- This proposal → fleet-level aggregate metrics on the
/metricsendpoint (Prometheus scrape), answering "across all instances and time, how often does X happen?" operator/e2e/measurement/→ scale-test milestone logging for E2E test assertions, answering "did this specific E2E run hit milestone Y within timeout Z?"
The implementation PR will include an integration test asserting both are usable from the same operator binary without collision.
```
Grove Operator
PCS Controller → creates PC/PCSG, updates PCS status
PC Controller → creates Pods, updates PC status (replicas, readyReplicas)
```
Please use PCLQ instead of PC, across the whole document.
Done — PC → PCLQ across the whole document, with PCS/PCLQ/PCSG formally defined in the new Appendix B terminology table (commit 27c59ff).
@Ronkahn21 to review this.
The proposal uses metrics to answer questions that belong to tracing and profiling. "Which sub-operation in this reconcile is slow?" is a tracing question — per-request span data answers it, not an aggregate histogram. The proposed function-name labels (`ensure_finalizer`, `sync_pods`, `update_observed_generation`) bind the metric surface to today's code; refactor any reconciler and the labels break.

Metrics should do what metrics do well: aggregate, fleet-level signals that survive refactors and back dashboards and alerts. The needs the GREP describes are real, but they have better answers. Grove's scale-test infrastructure already downloads pprof data after each run, which covers sub-operation latency today. Production tracing can be its own proposal if pprof proves insufficient.

A GREP should describe user-visible changes, not implementation. The current doc embeds about 130 lines of Go — full source for `metrics.go` and `metrics_context.go`, per-call-site examples, import lists. None of that belongs in a design doc. Describe what the framework exposes (metric names, labels, recording verbs) and why; leave the wiring for the implementation PR.
**Points to address**

- Drop the embedded Go. Replace it with an API surface: metric names, labels, and recording verbs.
- Drop sub-operation timing, or collapse it to `phase=spec|status` if there's evidence `controller_runtime_reconcile_time_seconds` falls short at fleet aggregate. Grove's scale-test pprof already covers per-reconcile latency.
- Justify each remaining metric against the closest existing built-in. For example, what does `grove_operator_status_update_conflict_total{target_kind=...}` add over `rest_client_requests_total{code="409", url=...}`? The burden is on the new metric. (A call-site sketch illustrating the difference follows this list.)
- Rewrite user stories at fleet aggregate. SRE-on-Grafana scenarios: P95 latency, conflict-rate alert, persistent-error alert. Add a capacity-shaped story (active workers, queue depth) that's answered entirely by built-ins.
- List tracing and profiling as explicit non-goals. Note that scale-test pprof covers per-reconcile diagnosis today; tracing can be its own GREP if production needs exceed that.
- Drop the "Grove scheduler component" non-goal (per @shayasoolin). Grove has no scheduler component — backends are pluggable and own their own metrics.
Thanks @Ronkahn21 — after thinking it through, I agree with the framing. Sub-operation timing is a tracing concern, not a metrics concern, and function-name labels that couple the metric surface to today's call graph are exactly the trap you're pointing at. Plan for the next revision:

@shayasoolin — I owe you responses on the three inline comments (L71 / L329 / L558). I'll address those in the same revision pass, along with rebasing. Targeting early next week for the revised proposal. Thanks both for the careful review.
Add a Grove Enhancement Proposal for introducing Prometheus metrics to all three Grove controllers (PodCliqueSet, PodClique, PodCliqueScalingGroup).

Grove currently has zero custom metrics, providing no visibility into reconcile duration, sub-operation latency, error categorization, or cross-controller conflict rates. Motivated by observed performance degradation during large-scale Pod creation when Grove is used with upstream operators (e.g., NVIDIA Dynamo) that concurrently operate on PCS/PC objects.

Tracking issue: ai-dynamo#498

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
- Remove 4 duplicate metrics already provided by controller-runtime (reconcile_duration, reconcile_total, reconcile_inflight, reconcile_requeue)
- Remove namespace label from all custom metrics to avoid cardinality issues
- Remove ObservedReconciler wrapper (no longer needed)
- Focus proposal on 3 genuinely new metrics:
  1. reconcile_sub_operation_duration_seconds (sub-operation timing)
  2. status_update_conflict_total (409 conflict attribution)
  3. reconcile_errors_total (error categorization by K8s error type)
- Add 'Relationship to Built-in Metrics' section documenting existing controller_runtime_* and workqueue_* metrics
- Update PromQL queries to reference built-in metrics by their actual names
- Regenerate TOC via mdtoc

Signed-off-by: brluo <brluobt@gmail.com>
Co-authored-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
Revise GREP-498 per review feedback from @Ronkahn21 and @shayasoolin:

Methodology corrections (addressing the core architectural feedback):
- Drop sub_operation_duration_seconds histogram: per-request, sub-operation visibility is a distributed-tracing concern, not a Prometheus-metrics concern. Prometheus aggregates across instances and loses per-request context; labeling a histogram by internal Go function names also couples the metric surface to code structure and breaks silently on refactor.
- Reframe Alt 2 (OpenTelemetry tracing) from rejected to scoped out; track per-request span instrumentation as Future Work.
- Remove all embedded Go source (MetricsContext type, helper functions, registration snippets, import blocks, File Changes Summary). The proposal now describes the metric surface (names, labels, semantics, cardinality bound) only; helper types and call sites belong to the implementation PR.

Scope reductions:
- Retain two custom metrics:
  * grove_operator_status_update_conflict_total (conflict attribution by target resource kind)
  * grove_operator_reconcile_errors_total (error categorization by k8serrors.Is* predicates)
- Add explicit Cardinality Bound section (<= 33 new series).
- Add Recording Verbs (Contract Only) section.
- Add Future Work section for the deferred tracing follow-up.

Detail fixes from inline comments:
- Motivation: scheduler-backend wording corrected - Grove supports these backends, they are not part of Grove itself.
- Test Plan: add coexistence check against operator/e2e/measurement/ scale-test milestone instrumentation.
- Appendix A: normalize terminology - PodClique -> PCLQ across document (with PCS/PCLQ/PCSG defined in Appendix B glossary).

Proposal shrinks from 591 lines to 377 lines (-36%). No implementation code is introduced by this revision; the follow-up implementation PR(s) will land after GREP approval.

Refs: ai-dynamo#499

Signed-off-by: Bruce Luo <brluo@nvidia.com>
Force-pushed from 8dceb53 to 27c59ff.
**GREP-498 v2 — revision summary**

Thanks @Ronkahn21 and @shayasoolin for the thorough review. The proposal has been substantially reworked; pushed in commit 27c59ff.

**Core architectural change**

The biggest delta in this revision is a methodology correction suggested by @Ronkahn21's feedback: per-request, sub-operation visibility is a tracing concern, not a metrics concern. This was the right framing, and the original proposal got it wrong by trying to use a sub-operation histogram for it.

Change in v2: the sub-operation histogram is dropped, and per-request span instrumentation is tracked as Future Work instead.

**Metric surface in v2**

Two custom metrics remain, both of which answer genuine gaps that the controller-runtime built-ins do not cover:

- `grove_operator_status_update_conflict_total`: conflict attribution by target resource kind
- `grove_operator_reconcile_errors_total`: error categorization by `k8serrors.Is*` predicates (see the categorization sketch below)
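For illustration only, here is a sketch of how the `error_type` label could stay within a small fixed set via `k8serrors.Is*` predicates. The `errorCategory` helper and the specific category names are assumptions, not the proposal's final list; the GREP only budgets the size of the set (eight values per controller).

```go
package controllermetrics // hypothetical package, for illustration only

import k8serrors "k8s.io/apimachinery/pkg/api/errors"

// errorCategory maps a reconcile error onto a small, fixed set of label
// values so the error_type label stays bounded. These eight category names
// are illustrative; only the size of the set is fixed by the proposal.
func errorCategory(err error) string {
	switch {
	case k8serrors.IsConflict(err):
		return "conflict"
	case k8serrors.IsNotFound(err):
		return "not_found"
	case k8serrors.IsAlreadyExists(err):
		return "already_exists"
	case k8serrors.IsForbidden(err):
		return "forbidden"
	case k8serrors.IsInvalid(err):
		return "invalid"
	case k8serrors.IsTimeout(err) || k8serrors.IsServerTimeout(err):
		return "timeout"
	case k8serrors.IsTooManyRequests(err):
		return "throttled"
	default:
		return "other"
	}
}
```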
**Cardinality bound:** ≤ 33 new series total (3 controllers × 3 target_kinds + 3 controllers × 8 error_types). No namespace, resource-name, function-name, or file-path labels.

**Other review items addressed**
**Stats**
Requesting re-review. Happy to iterate further if the aggregate-only scoping still misses something.
Summary
Add Prometheus metrics to all three Grove controllers (PodCliqueSet, PodClique, PodCliqueScalingGroup).
Grove currently has zero custom Prometheus metrics. This GREP proposes a metrics framework providing:
- `ObservedReconciler` wrapper
- `MetricsContext` (a hypothetical sketch of its shape follows below)

Tracking issue: Fixes #498
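For readers who have not opened the proposal doc, a hypothetical sketch of the `MetricsContext` shape, consistent with the `mc.RecordSubOperation("ensure_finalizer", start)` call site quoted in the review thread. The field names and histogram here are assumptions, and the v2 revision of the proposal drops this type entirely:

```go
package controllermetrics // hypothetical package, for illustration only

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// MetricsContext is an assumed shape for the per-reconcile helper named in
// the Summary; the actual definition lives in the (pre-v2) proposal text.
type MetricsContext struct {
	controller string
	subOpHist  *prometheus.HistogramVec // labeled by controller and sub-operation name
}

// RecordSubOperation observes the elapsed time of one named reconcile step,
// matching the mc.RecordSubOperation("ensure_finalizer", start) call site.
func (mc *MetricsContext) RecordSubOperation(name string, start time.Time) {
	mc.subOpHist.WithLabelValues(mc.controller, name).Observe(time.Since(start).Seconds())
}
```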
Motivation
During large-scale inference graph deployments (e.g., via NVIDIA Dynamo), reconciliation performance degrades significantly due to high event rates and cross-controller resource contention on PCS/PC objects. Without metrics, diagnosing these issues requires ad-hoc log analysis.
Key Design Decisions
- `grove_operator_status_update_conflict_total` with `target_kind` label for precise contention attribution

Test Plan