fix(metrics): centralize abnormal_instance observe/clear at reconcile entry#6840
Conversation
… entry Introduce TaskObserveInstance, a common task that mirrors TaskTrack: it refreshes the abnormal_instance gauge when state.Object() is non-nil and clears every series matching (namespace, instance) when the object has been GC'd from the API server. Wire it into every instance builder right after TaskTrack, before the CondObjectHasBeenDeleted IfBreak. With observe + clear in one place: - Force-delete paths that strip finalizers are covered. The watch DELETE event enqueues a reconcile whose TaskContextObject returns NotFound; state.Object() is nil; TaskObserveInstance runs ClearByKey and the IfBreak short-circuits the rest of the pipeline. No extra informer handler needed. - The series drift caused by running CondReady after FinalizerDel in the deletion branch self-heals on the next reconcile (the DELETE event that follows the API-server GC), because that reconcile sees state.Object() == nil and sweeps the (namespace, instance) partial match. Cluster / component / group label churn is also swept. Ship ClearInstanceConditionMetricsByKey on top of prometheus DeletePartialMatch so the clear path does not require the business labels (cluster / component / group) that are no longer readable once the CR is gone. This does not delete the existing ObserveCondition calls inside TaskInstanceConditionSynced / Ready nor the Clear inside TaskInstanceFinalizerDel. Those remain as belt-and-braces; removing them is a follow-up once this path has soaked.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6840 +/- ##
==========================================
+ Coverage 37.23% 37.25% +0.01%
==========================================
Files 391 392 +1
Lines 22382 22413 +31
==========================================
+ Hits 8334 8350 +16
- Misses 14048 14063 +15
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR centralizes tidb_operator_abnormal_instance gauge observation and cleanup into a single reconcile task so the metric is refreshed every reconcile and reliably cleared when the instance CR no longer exists (including force-delete paths that bypass finalizers).
Changes:
- Add
TaskObserveInstanceto observe conditions when the CR exists, or clear metrics by(namespace, instance)when it doesn’t. - Add
ObserveConditionsandClearInstanceConditionMetricsByKeyto the abnormal-instance metrics helper, plus unit tests covering partial-match deletion and drifted labels. - Wire the new task into multiple instance controller runners immediately after
TaskTrack.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/controllers/common/task_observe.go | New reconcile task to observe/clear abnormal_instance metrics at pipeline entry. |
| pkg/controllers/common/task_observe_test.go | Unit tests for observe vs clear behavior of the new task. |
| pkg/metrics/abnormal_instance.go | Add bulk observe helper and partial-match clear helper. |
| pkg/metrics/abnormal_instance_test.go | Add coverage for partial-match sweep (incl. drifted labels) and sibling preservation. |
| pkg/controllers/pd/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tidb/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tikv/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tiflash/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/ticdc/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tso/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tiproxy/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/scheduling/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tikvworker/builder.go | Wire TaskObserveInstance after TaskTrack. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Without component in the partial match, a PD and a TiDB that legitimately share the same (namespace, name) would have their series wiped when either one went away. scope.Component[S]() is a compile-time constant for the reconcile kind, so it is available even after the CR is gone and its labels are no longer readable.
…manager These three instance controllers run TaskInstanceConditionSynced/Ready, so they write to the AbnormalInstance gauge and need the same entry-point observe/clear hook as the other nine.
|
/lgtm |
|
@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: liubog2008 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@liubog2008: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
… entry (#6840) (#6844) * fix(metrics): centralize abnormal_instance observe/clear at reconcile entry Introduce TaskObserveInstance, a common task that mirrors TaskTrack: it refreshes the abnormal_instance gauge when state.Object() is non-nil and clears every series matching (namespace, instance) when the object has been GC'd from the API server. Wire it into every instance builder right after TaskTrack, before the CondObjectHasBeenDeleted IfBreak. With observe + clear in one place: - Force-delete paths that strip finalizers are covered. The watch DELETE event enqueues a reconcile whose TaskContextObject returns NotFound; state.Object() is nil; TaskObserveInstance runs ClearByKey and the IfBreak short-circuits the rest of the pipeline. No extra informer handler needed. - The series drift caused by running CondReady after FinalizerDel in the deletion branch self-heals on the next reconcile (the DELETE event that follows the API-server GC), because that reconcile sees state.Object() == nil and sweeps the (namespace, instance) partial match. Cluster / component / group label churn is also swept. Ship ClearInstanceConditionMetricsByKey on top of prometheus DeletePartialMatch so the clear path does not require the business labels (cluster / component / group) that are no longer readable once the CR is gone. This does not delete the existing ObserveCondition calls inside TaskInstanceConditionSynced / Ready nor the Clear inside TaskInstanceFinalizerDel. Those remain as belt-and-braces; removing them is a follow-up once this path has soaked. * fix(metrics): use the repo-wide license header year * fix(metrics): qualify ClearByKey by component to avoid cross-kind sweep Without component in the partial match, a PD and a TiDB that legitimately share the same (namespace, name) would have their series wiped when either one went away. scope.Component[S]() is a compile-time constant for the reconcile kind, so it is available even after the CR is gone and its labels are no longer readable. * fix(metrics): wire TaskObserveInstance into router/scheduler/resourcemanager These three instance controllers run TaskInstanceConditionSynced/Ready, so they write to the AbnormalInstance gauge and need the same entry-point observe/clear hook as the other nine. --------- Co-authored-by: fgksgf <fgksgf@gmail.com>
Summary
Supersedes #6839. After reviewer feedback that observe and clear should live together rather than being spread across multiple tasks plus an external informer handler, this PR introduces a single
TaskObserveInstanceinpkg/controllers/commonthat mirrors the shape ofTaskTrack.Motivation
On a dev EKS we observed the
tidb_operator_abnormal_instancegauge retainvalue=1for instances whose CR was already gone. Two paths produced it:CondObjectIsDeletingIfBreak block,TaskInstanceFinalizerDelcleared the gauge but theTaskInstanceConditionReadythat followed immediately wrote it back to 1 (pod was gone,PodNotCreated).kubectl patch --type=merge -p '{\"metadata\":{\"finalizers\":null}}', common when unwedging a stuck instance) bypassesTaskInstanceFinalizerDelentirely, so the finalize-time cleanup never runs.Both would generate false-positive
metric == 1 for: 30malerts on non-existent instances and grow label cardinality across each cluster lifecycle.Approach
Move observe and clear to the same task, placed right after
TaskTrackand before theCondObjectHasBeenDeletedIfBreak. A single branch decides whether the object still exists:TaskContextObjectreturnsNotFound;state.Object()is nil; this task runs the clear and the IfBreak short-circuits the rest of the pipeline.Changes
pkg/controllers/common/task_observe.go: newTaskObserveInstance.pkg/controllers/common/task_observe_test.go: unit tests for observe / clear branches.pkg/metrics/abnormal_instance.go: newObserveConditionsconvenience andClearInstanceConditionMetricsByKeyusingprometheus.GaugeVec.DeletePartialMatch(so the clear path does not need the business labels that are no longer readable once the CR is gone). This partial match also sweeps up any series written under a driftedcluster/component/grouplabel, handling that secondary leak source for free.pkg/metrics/abnormal_instance_test.go: covers the partial-match sweep (including drifted labels) and the sibling-instance preservation.TaskTrack.Intentionally not in this PR
ObserveConditioncalls insideTaskInstanceConditionSynced/Readyand theClearInstanceConditionMetricsinsideTaskInstanceFinalizerDelstay as belt-and-braces. Removing them is a follow-up once this path has soaked.CondSynced/Ready/Running+StatusPersisterafterFinalizerDelis likewise untouched. Those writes are redundant but no longer harmful because the very next reconcile (DELETE event) resets them.Test plan
go build ./...go test ./pkg/controllers/... ./pkg/metrics/...- all greengolangci-lint runon modified packages - 0 issueskubectl patch --type=merge -p '{"metadata":{"finalizers":null}}'a TiProxy CR and confirm its series is cleared within a reconcile cycle.