fix(metrics): centralize abnormal_instance observe/clear at reconcile entry #6840

Merged
ti-chi-bot[bot] merged 4 commits into pingcap:main from fgksgf:fix/abnormal-instance-observe-at-entry
Apr 22, 2026

@fgksgf
Member

@fgksgf fgksgf commented Apr 21, 2026

Summary

Supersedes #6839. After reviewer feedback that observe and clear should live together rather than being spread across multiple tasks plus an external informer handler, this PR introduces a single TaskObserveInstance in pkg/controllers/common that mirrors the shape of TaskTrack.

Motivation

On a dev EKS cluster we observed the tidb_operator_abnormal_instance gauge retaining value=1 for instances whose CR was already gone. Two paths produced this:

  1. Inside the CondObjectIsDeleting IfBreak block, TaskInstanceFinalizerDel cleared the gauge but the TaskInstanceConditionReady that followed immediately wrote it back to 1 (pod was gone, PodNotCreated).
  2. Force-delete (kubectl patch --type=merge -p '{"metadata":{"finalizers":null}}', common when unwedging a stuck instance) bypasses TaskInstanceFinalizerDel entirely, so the finalize-time cleanup never runs.

Both paths produce false-positive alerts (metric == 1 for: 30m) on instances that no longer exist, and both grow label cardinality with each cluster lifecycle.
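
The first path can be illustrated with a small stdlib-only model. This is a sketch, not the operator's code: gauge, finalizerDel, conditionReady and deletionBranch are hypothetical stand-ins for the real gauge and tasks, and the instance name is made up.

```go
package main

import "fmt"

// gauge is a toy stand-in for the abnormal_instance gauge, keyed by instance name.
type gauge map[string]float64

// finalizerDel models the finalize-time cleanup: it drops the instance's series.
func finalizerDel(g gauge, instance string) { delete(g, instance) }

// conditionReady models the readiness task: with the pod gone it reports
// the instance as abnormal, which recreates the series.
func conditionReady(g gauge, instance string, podExists bool) {
	if podExists {
		g[instance] = 0
		return
	}
	g[instance] = 1
}

// deletionBranch runs the two tasks in the order described in path 1.
func deletionBranch(g gauge, instance string) {
	finalizerDel(g, instance)
	conditionReady(g, instance, false)
}

func main() {
	g := gauge{"tiproxy-0": 0}
	deletionBranch(g, "tiproxy-0")
	v, ok := g["tiproxy-0"]
	// The series exists again with value 1 although the CR is being removed.
	fmt.Println(v, ok)
}
```

Because the write happens after the clear inside the same reconcile, nothing in that reconcile removes the series again.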

Approach

Move observe and clear to the same task, placed right after TaskTrack and before the CondObjectHasBeenDeleted IfBreak. A single branch decides whether the object still exists:

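obj := state.Object()
if obj == nil {
    // The CR is gone: clear every series recorded for this instance.
    metrics.ClearInstanceConditionMetricsByKey(key.Namespace, key.Name)
    return task.Complete()...
}
// The CR still exists: refresh the gauge from its current status conditions.
metrics.ObserveConditions(obj, coreutil.StatusConditions[S](obj))
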
  • Watch DELETE events (graceful or force) enqueue a reconcile whose TaskContextObject returns NotFound; state.Object() is nil; this task runs the clear and the IfBreak short-circuits the rest of the pipeline.
  • Any intermediate contamination from the existing deletion-branch Cond tasks self-heals on that follow-up reconcile.
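
The two bullets above can be sketched as a self-contained model. It assumes a toy gauge keyed by (namespace, name) and a single Ready condition; key, object, observeInstance and reconcile are hypothetical names for illustration, not the operator's types.

```go
package main

import "fmt"

// key identifies an instance CR.
type key struct{ Namespace, Name string }

// object is a hypothetical stand-in for an instance CR with one condition.
type object struct {
	key
	Ready bool
}

// gauge is a toy stand-in for the abnormal_instance gauge.
type gauge map[key]float64

// observeInstance clears the series when the object is gone and otherwise
// refreshes it from the current condition. It returns true when the rest
// of the pipeline should be skipped.
func observeInstance(g gauge, k key, obj *object) (stop bool) {
	if obj == nil {
		delete(g, k)
		return true
	}
	if obj.Ready {
		g[k] = 0
	} else {
		g[k] = 1
	}
	return false
}

// reconcile runs the entry task first and continues only while the
// object still exists.
func reconcile(g gauge, k key, obj *object) {
	if observeInstance(g, k, obj) {
		return
	}
	// Later tasks may rewrite the series; the next reconcile corrects it.
}

func main() {
	g := gauge{}
	k := key{"ns", "tiproxy-0"}
	reconcile(g, k, &object{key: k, Ready: false}) // CR exists but is not ready
	fmt.Println(len(g))
	reconcile(g, k, nil) // DELETE event: the object is not found
	fmt.Println(len(g))
}
```

Since the clear depends only on the reconcile key, it works the same whether the CR was removed gracefully or by stripping finalizers.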

Changes

  • pkg/controllers/common/task_observe.go: new TaskObserveInstance.
  • pkg/controllers/common/task_observe_test.go: unit tests for observe / clear branches.
  • pkg/metrics/abnormal_instance.go: new ObserveConditions convenience and ClearInstanceConditionMetricsByKey using prometheus.GaugeVec.DeletePartialMatch (so the clear path does not need the business labels that are no longer readable once the CR is gone). This partial match also sweeps up any series written under a drifted cluster / component / group label, handling that secondary leak source for free.
  • pkg/metrics/abnormal_instance_test.go: covers the partial-match sweep (including drifted labels) and the sibling-instance preservation.
  • 9 instance builders (pd/tidb/tikv/tiflash/ticdc/tso/scheduling/tiproxy/tikvworker): one new line each, right after TaskTrack.
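
To make the partial-match sweep concrete, here is a stdlib-only sketch of the semantics the PR relies on: a series is removed when its labels contain every pair in the match set. The store type and deletePartialMatch function are illustrative stand-ins rather than the Prometheus client API, and the label values are invented.

```go
package main

import "fmt"

// labels is one series' label set; store is a toy stand-in for a gauge vector.
type labels map[string]string
type store []labels

// containsAll reports whether series carries every name/value pair in match.
func containsAll(series, match labels) bool {
	for name, want := range match {
		if got, ok := series[name]; !ok || got != want {
			return false
		}
	}
	return true
}

// deletePartialMatch removes every series whose labels contain all pairs in
// match and returns the remaining series plus the number removed.
func deletePartialMatch(s store, match labels) (store, int) {
	var kept store
	for _, series := range s {
		if containsAll(series, match) {
			continue
		}
		kept = append(kept, series)
	}
	return kept, len(s) - len(kept)
}

func main() {
	s := store{
		{"namespace": "ns", "instance": "tikv-0", "cluster": "old"},
		{"namespace": "ns", "instance": "tikv-0", "cluster": "new"}, // drifted label
		{"namespace": "ns", "instance": "tikv-1", "cluster": "new"}, // sibling instance
	}
	s, n := deletePartialMatch(s, labels{"namespace": "ns", "instance": "tikv-0"})
	fmt.Println(n, len(s))
}
```

Matching on a subset of labels is what lets the clear path run without the cluster and group values, which cannot be read once the CR is gone.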

Intentionally not in this PR

  • The existing ObserveCondition calls inside TaskInstanceConditionSynced / Ready and the ClearInstanceConditionMetrics inside TaskInstanceFinalizerDel stay as belt-and-braces. Removing them is a follow-up once this path has soaked.
  • The deletion-branch shape in the seven builders that run CondSynced / Ready / Running + StatusPersister after FinalizerDel is likewise untouched. Those writes are redundant but no longer harmful because the very next reconcile (DELETE event) resets them.

Test plan

  • go build ./...
  • go test ./pkg/controllers/... ./pkg/metrics/... - all green
  • golangci-lint run on modified packages - 0 issues
  • Dev EKS soak: deploy this image, verify current leaked series disappear on operator restart, then kubectl patch --type=merge -p '{"metadata":{"finalizers":null}}' a TiProxy CR and confirm its series is cleared within a reconcile cycle.
  • E2E cases can be added in a follow-up once the shape is accepted.

… entry

Introduce TaskObserveInstance, a common task that mirrors TaskTrack: it
refreshes the abnormal_instance gauge when state.Object() is non-nil and
clears every series matching (namespace, instance) when the object has
been GC'd from the API server. Wire it into every instance builder right
after TaskTrack, before the CondObjectHasBeenDeleted IfBreak.

With observe + clear in one place:

- Force-delete paths that strip finalizers are covered. The watch DELETE
  event enqueues a reconcile whose TaskContextObject returns NotFound;
  state.Object() is nil; TaskObserveInstance runs ClearByKey and the
  IfBreak short-circuits the rest of the pipeline. No extra informer
  handler needed.

- The series drift caused by running CondReady after FinalizerDel in the
  deletion branch self-heals on the next reconcile (the DELETE event
  that follows the API-server GC), because that reconcile sees
  state.Object() == nil and sweeps the (namespace, instance) partial
  match. Cluster / component / group label churn is also swept.

Ship ClearInstanceConditionMetricsByKey on top of prometheus
DeletePartialMatch so the clear path does not require the business
labels (cluster / component / group) that are no longer readable once
the CR is gone.

This does not delete the existing ObserveCondition calls inside
TaskInstanceConditionSynced / Ready nor the Clear inside
TaskInstanceFinalizerDel. Those remain as belt-and-braces; removing
them is a follow-up once this path has soaked.
@ti-chi-bot ti-chi-bot Bot requested a review from shonge April 21, 2026 09:29
@github-actions github-actions Bot added the v2 for operator v2 label Apr 21, 2026
@ti-chi-bot ti-chi-bot Bot added the size/L label Apr 21, 2026
@codecov-commenter

codecov-commenter commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 51.61290% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.25%. Comparing base (d7ab4fa) to head (15fe2a5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6840      +/-   ##
==========================================
+ Coverage   37.23%   37.25%   +0.01%     
==========================================
  Files         391      392       +1     
  Lines       22382    22413      +31     
==========================================
+ Hits         8334     8350      +16     
- Misses      14048    14063      +15     
Flag Coverage Δ
unittest 37.25% <51.61%> (+0.01%) ⬆️

Copilot AI left a comment

Pull request overview

This PR centralizes tidb_operator_abnormal_instance gauge observation and cleanup into a single reconcile task so the metric is refreshed every reconcile and reliably cleared when the instance CR no longer exists (including force-delete paths that bypass finalizers).

Changes:

  • Add TaskObserveInstance to observe conditions when the CR exists, or clear metrics by (namespace, instance) when it doesn’t.
  • Add ObserveConditions and ClearInstanceConditionMetricsByKey to the abnormal-instance metrics helper, plus unit tests covering partial-match deletion and drifted labels.
  • Wire the new task into multiple instance controller runners immediately after TaskTrack.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/controllers/common/task_observe.go | New reconcile task to observe/clear abnormal_instance metrics at pipeline entry. |
| pkg/controllers/common/task_observe_test.go | Unit tests for observe vs clear behavior of the new task. |
| pkg/metrics/abnormal_instance.go | Add bulk observe helper and partial-match clear helper. |
| pkg/metrics/abnormal_instance_test.go | Add coverage for partial-match sweep (incl. drifted labels) and sibling preservation. |
| pkg/controllers/pd/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tidb/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tikv/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tiflash/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/ticdc/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tso/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tiproxy/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/scheduling/builder.go | Wire TaskObserveInstance after TaskTrack. |
| pkg/controllers/tikvworker/builder.go | Wire TaskObserveInstance after TaskTrack. |

Comment thread pkg/metrics/abnormal_instance.go Outdated
Comment thread pkg/controllers/pd/builder.go
fgksgf added 2 commits April 21, 2026 19:48
Without component in the partial match, a PD and a TiDB that legitimately share the same (namespace, name) would have their series wiped when either one went away. scope.Component[S]() is a compile-time constant for the reconcile kind, so it is available even after the CR is gone and its labels are no longer readable.
…manager

These three instance controllers run TaskInstanceConditionSynced/Ready, so they write to the AbnormalInstance gauge and need the same entry-point observe/clear hook as the other nine.
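
The cross-kind sweep that the component-qualifying commit guards against can be sketched as follows. This is a stdlib-only illustration: series and clearByKey are hypothetical, the "pd" and "tidb" component values are assumed, and an empty component string stands for the unqualified match.

```go
package main

import "fmt"

// series is a toy gauge series identified by three of its labels.
type series struct{ Namespace, Instance, Component string }

// clearByKey drops series matching namespace and instance; when component
// is non-empty it must match as well.
func clearByKey(all []series, ns, name, component string) []series {
	var kept []series
	for _, s := range all {
		match := s.Namespace == ns && s.Instance == name &&
			(component == "" || s.Component == component)
		if !match {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// A PD and a TiDB instance that share the same (namespace, name).
	all := []series{{"ns", "db", "pd"}, {"ns", "db", "tidb"}}
	fmt.Println(len(clearByKey(all, "ns", "db", "")))   // unqualified: both kinds are swept
	fmt.Println(len(clearByKey(all, "ns", "db", "pd"))) // qualified: the TiDB series survives
}
```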
@ti-chi-bot ti-chi-bot Bot added size/XL and removed size/L labels Apr 21, 2026
@liubog2008
Member

/lgtm
/cherry-pick release-2.1

@ti-chi-bot
Member

@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you.

Details

In response to this:

/lgtm
/cherry-pick release-2.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot Bot added the lgtm label Apr 22, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-22 01:29:43.734771016 +0000 UTC m=+2129388.940131073: ☑️ agreed by liubog2008.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liubog2008

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 22, 2026
@ti-chi-bot ti-chi-bot Bot merged commit 818d6d3 into pingcap:main Apr 22, 2026
10 checks passed
@ti-chi-bot
Member

@liubog2008: new pull request created to branch release-2.1: #6844.

fgksgf added a commit that referenced this pull request Apr 22, 2026
… entry (#6840) (#6844)

* fix(metrics): centralize abnormal_instance observe/clear at reconcile entry

* fix(metrics): use the repo-wide license header year

* fix(metrics): qualify ClearByKey by component to avoid cross-kind sweep

Without component in the partial match, a PD and a TiDB that legitimately share the same (namespace, name) would have their series wiped when either one went away. scope.Component[S]() is a compile-time constant for the reconcile kind, so it is available even after the CR is gone and its labels are no longer readable.

* fix(metrics): wire TaskObserveInstance into router/scheduler/resourcemanager

These three instance controllers run TaskInstanceConditionSynced/Ready, so they write to the AbnormalInstance gauge and need the same entry-point observe/clear hook as the other nine.

---------

Co-authored-by: fgksgf <fgksgf@gmail.com>

5 participants