feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts by fgksgf · Pull Request #6835 · pingcap/tidb-operator

fgksgf · 2026-04-20T10:30:03Z

Summary

Add a single Prometheus gauge tidb_operator_abnormal_instance that flags managed instances stuck in an abnormal state.

tidb_operator_abnormal_instance{
    namespace, cluster, component, group, instance,
    condition="Synced"|"Ready",
}

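The value is 1 when the named condition is False and 0 otherwise.
The series stays present while the operator manages the instance and is removed when the instance is finalized, so dashboards see no gaps and Prometheus cardinality does not grow across cluster create/destroy cycles.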
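
The value rule above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's code: the `Condition` type and the `abnormalValue` helper are hypothetical names, and treating an Unknown status as 0 is an assumption based on the "0 otherwise" wording.

```go
package main

// Condition is a hypothetical, trimmed-down stand-in for a Kubernetes
// status condition; only the two fields this sketch needs are kept.
type Condition struct {
	Type   string // e.g. "Synced" or "Ready"
	Status string // "True", "False", or "Unknown"
}

// abnormalValue returns the sample value for one condition label:
// 1 when the named condition is present with Status "False", and 0 in
// every other case (True, Unknown, or absent).
func abnormalValue(conds []Condition, condType string) float64 {
	for _, c := range conds {
		if c.Type == condType && c.Status == "False" {
			return 1
		}
	}
	return 0
}
```

Under these assumptions an instance with no conditions yet reports 0 for both labels, so a freshly created instance does not look abnormal before its first status update.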

Recommended alerting form (Prometheus owns the duration logic; the operator only reports the binary state):

- alert: TiKVInstanceAbnormalTooLong
  expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
  for: 30m
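
To show why a binary gauge is enough, here is a simplified model of how a `for:` clause behaves: the alert fires only when the expression has matched at every evaluation for at least the hold duration, and any non-matching evaluation resets the timer. The types and helpers (`sample`, `minutes`, `firing`) are hypothetical; this is a sketch of the semantics, not Prometheus code, and it ignores details such as staleness handling.

```go
package main

import "time"

// sample is one rule evaluation: its offset from the first evaluation
// and whether the gauge expression matched (value == 1) at that time.
type sample struct {
	Offset   time.Duration
	Abnormal bool
}

// minutes is a small helper to build offsets and hold durations.
func minutes(n int) time.Duration {
	return time.Duration(n) * time.Minute
}

// firing reports whether an alert with the given hold duration would be
// firing at the last evaluation in evals.
func firing(evals []sample, hold time.Duration) bool {
	active := false
	var activeSince time.Duration
	for _, e := range evals {
		if !e.Abnormal {
			active = false // any healthy evaluation resets the pending state
			continue
		}
		if !active {
			active = true
			activeSince = e.Offset
		}
	}
	if !active {
		return false
	}
	return evals[len(evals)-1].Offset-activeSince >= hold
}
```

With a 30-minute hold, an instance that is abnormal at 0, 15, and 30 minutes fires, while one that recovers at 15 minutes and turns abnormal again at 30 minutes does not.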

Why two condition values

Synced=False: the operator has more work to do (template mismatch, pod missing, scale-in pending). A persistent Synced=False indicates the operator cannot make the pod converge.
Ready=False: the pod is up but cannot serve (PD unreachable, leader eviction stuck, store offline). A persistent Ready=False indicates a degraded data plane even when the rollout looks fine to the operator.

Both classes of stuck state warrant attention; one metric with a condition label keeps the schema small and leaves room for additional conditions.

Implementation

The gauge is updated inside the existing TaskInstanceConditionSynced and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash / TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager / Router / TiKVWorker all expose both samples automatically.
ClearInstanceConditionMetrics is called from inside the existing TaskInstanceFinalizerDel: when an instance is fully finalized, its series are dropped from the registry. Future instance controllers that use the standard finalize task get this cleanup for free.
Helpers live in pkg/metrics next to the gauge definition; controllers only call into the metrics package.
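
The observe-then-clear lifecycle described above can be sketched with a stdlib-only stand-in for a gauge vector. Everything here is hypothetical (`seriesKey`, `gaugeStore`, `observe`, `clearInstance`); the real helpers in pkg/metrics are not reproduced, and this sketch is not safe for concurrent use.

```go
package main

// seriesKey mirrors the label set from the PR description.
type seriesKey struct {
	Namespace, Cluster, Component, Group, Instance, Condition string
}

// gaugeStore is a hypothetical in-memory stand-in for a gauge vector.
type gaugeStore struct {
	series map[seriesKey]float64
}

func newGaugeStore() *gaugeStore {
	return &gaugeStore{series: make(map[seriesKey]float64)}
}

// observe records the current binary state for one instance and
// condition. The series exists from the first observation onward, even
// while the value is 0.
func (s *gaugeStore) observe(k seriesKey, abnormal bool) {
	v := 0.0
	if abnormal {
		v = 1
	}
	s.series[k] = v
}

// clearInstance drops both condition series of one instance, as a
// finalizer step would, so deleted instances leave nothing behind.
func (s *gaugeStore) clearInstance(k seriesKey) {
	for _, cond := range []string{"Synced", "Ready"} {
		k.Condition = cond // k is a copy; the caller's key is untouched
		delete(s.series, k)
	}
}
```

Keeping the clear step inside the shared finalize path is what keeps cardinality flat: every `observe` for an instance is eventually matched by one `clearInstance`.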

Test plan

make lint: 0 issues
make unit: all packages pass
Unit tests cover the Observe paths for True, False, and absent for both conditions, plus the finalizer-clear path

codecov-commenter · 2026-04-20T10:36:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.23%. Comparing base (1ec456d) to head (6140515).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6835      +/-   ##
==========================================
+ Coverage   37.16%   37.23%   +0.07%     
==========================================
  Files         390      391       +1     
  Lines       22360    22382      +22     
==========================================
+ Hits         8310     8334      +24     
+ Misses      14050    14048       -2

| Flag | Coverage Δ |
|---|---|
| unittest | `37.23% <100.00%> (+0.07%)` ⬆️ |

liubog2008 · 2026-04-21T01:45:37Z

/lgtm

ti-chi-bot · 2026-04-21T01:45:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liubog2008

Needs approval from an approver in each of these files:

~~OWNERS~~ [liubog2008]

ti-chi-bot · 2026-04-21T01:45:45Z

[LGTM Timeline notifier]

Timeline:

2026-04-21 01:45:44.785681512 +0000 UTC m=+2043949.991041569: ☑️ agreed by liubog2008.

liubog2008 · 2026-04-21T01:45:47Z

/cherry-pick release-2.1

ti-chi-bot · 2026-04-21T01:45:49Z

@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you.

ti-chi-bot · 2026-04-21T01:50:35Z

@liubog2008: new pull request created to branch release-2.1: #6837.

ti-chi-bot Bot requested a review from shonge April 20, 2026 10:30

github-actions Bot added the v2 for operator v2 label Apr 20, 2026

ti-chi-bot Bot added the size/XXL label Apr 20, 2026

fgksgf requested a review from Copilot April 20, 2026 10:30

Copilot started reviewing on behalf of fgksgf April 20, 2026 10:31

fgksgf requested a review from Copilot April 20, 2026 12:15

Copilot started reviewing on behalf of fgksgf April 20, 2026 12:16

ti-chi-bot Bot added size/XL and removed size/XXL labels Apr 20, 2026

fgksgf changed the title ~~feat(tikv): add rolling restart duration metric~~ feat(metrics): add per-instance unsynced/unready duration gauges Apr 20, 2026

liubog2008 reviewed Apr 20, 2026

Comment thread pkg/controllers/common/instance_unsynced_metric.go Outdated

fgksgf changed the title ~~feat(metrics): add per-instance unsynced/unready duration gauges~~ feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts Apr 20, 2026

fgksgf force-pushed the feat/rolling-restart-metric branch from 18b9dbc to 5056c48 Compare April 20, 2026 13:48

fgksgf force-pushed the feat/rolling-restart-metric branch from 5056c48 to 6140515 Compare April 20, 2026 13:57

ti-chi-bot Bot added size/L and removed size/XL labels Apr 20, 2026

ti-chi-bot Bot added the lgtm label Apr 21, 2026

ti-chi-bot Bot added the approved label Apr 21, 2026

ti-chi-bot Bot merged commit d7ab4fa into pingcap:main Apr 21, 2026
10 checks passed

ti-chi-bot mentioned this pull request Apr 21, 2026

feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts (#6835) #6837

Merged

3 tasks

fgksgf deleted the feat/rolling-restart-metric branch April 21, 2026 01:53

fgksgf mentioned this pull request Apr 21, 2026

fix(metrics): prevent abnormal_instance gauge leaks on deletion paths #6839

Closed

6 tasks

ti-chi-bot[bot] merged 1 commit into pingcap:main from fgksgf:feat/rolling-restart-metric
