feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts #6835

Merged
ti-chi-bot[bot] merged 1 commit into pingcap:main from fgksgf:feat/rolling-restart-metric on Apr 21, 2026

Conversation

@fgksgf
Member

@fgksgf fgksgf commented Apr 20, 2026

Summary

Add a single Prometheus gauge tidb_operator_abnormal_instance that flags managed instances stuck in an abnormal state.

tidb_operator_abnormal_instance{
    namespace, cluster, component, group, instance,
    condition="Synced"|"Ready",
}
  • 1 when the named condition is False, 0 otherwise.
  • The series stays present for as long as the operator manages the instance and is removed when the instance is finalized.
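
The value rule above can be sketched in a few lines of Go. This is only an illustration, not the code from this PR: `Condition`, `ConditionStatus`, and `abnormalValue` are hypothetical names, and the sketch reads "0 otherwise" literally, so an Unknown or absent condition is assumed to report 0.

```go
package main

import "fmt"

// ConditionStatus mirrors the three states a Kubernetes condition can report.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type   string
	Status ConditionStatus
}

// abnormalValue returns the gauge sample for one condition type:
// 1 only when the condition is present and False, 0 in every other case
// (True, Unknown, or absent). Treating Unknown and absent as 0 is an
// assumption drawn from the "0 otherwise" wording.
func abnormalValue(conds []Condition, condType string) float64 {
	for _, c := range conds {
		if c.Type == condType && c.Status == ConditionFalse {
			return 1
		}
	}
	return 0
}

func main() {
	conds := []Condition{
		{Type: "Synced", Status: ConditionFalse},
		{Type: "Ready", Status: ConditionTrue},
	}
	fmt.Println(abnormalValue(conds, "Synced")) // prints 1
	fmt.Println(abnormalValue(conds, "Ready"))  // prints 0
	fmt.Println(abnormalValue(nil, "Ready"))    // prints 0 (condition absent)
}
```

Reporting 0 rather than dropping the series is what lets a dashboard tell "healthy" apart from "not managed".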

Recommended alerting form (Prometheus owns the duration logic):

- alert: TiKVInstanceAbnormalTooLong
  expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
  for: 30m
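
To make "Prometheus owns the duration logic" concrete, here is a rough Go simulation of how a `for:` clause behaves: the alert fires only once the expression has been true at every evaluation for the whole hold period, and a single healthy evaluation resets the timer. It is a simplified model, not Prometheus code; `sample`, `firingAt`, `oneHourWithRecoveryAt`, the one-minute evaluation interval, and the start date are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one rule evaluation: when it ran and the gauge value it saw.
type sample struct {
	at    time.Time
	value float64
}

// start and holdFor are illustrative; holdFor matches the "for: 30m" above.
var start = time.Date(2026, 4, 20, 0, 0, 0, 0, time.UTC)

const holdFor = 30 * time.Minute

// firingAt returns the first evaluation time at which the alert would fire:
// the gauge must equal 1 at every evaluation for at least `hold`.
func firingAt(samples []sample, hold time.Duration) (time.Time, bool) {
	var activeSince time.Time
	active := false
	for _, s := range samples {
		if s.value != 1 {
			active = false // any healthy evaluation resets the pending timer
			continue
		}
		if !active {
			active, activeSince = true, s.at
		}
		if s.at.Sub(activeSince) >= hold {
			return s.at, true
		}
	}
	return time.Time{}, false
}

// oneHourWithRecoveryAt builds one evaluation per minute for an hour with the
// gauge at 1, except for a single healthy evaluation at the given minute
// (pass -1 for no recovery).
func oneHourWithRecoveryAt(minute int) []sample {
	var out []sample
	for m := 0; m <= 60; m++ {
		v := 1.0
		if m == minute {
			v = 0
		}
		out = append(out, sample{at: start.Add(time.Duration(m) * time.Minute), value: v})
	}
	return out
}

func main() {
	if at, ok := firingAt(oneHourWithRecoveryAt(10), holdFor); ok {
		fmt.Println("fires after", at.Sub(start)) // prints "fires after 41m0s"
	}
}
```

Because the operator exports only the binary state, the threshold can be tuned per alert rule without changing or redeploying the operator.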

Why two condition values

  • Synced=False: the operator has more work to do (template mismatch, pod missing, scale-in pending). A persistent Synced=False indicates the operator cannot make the pod converge.
  • Ready=False: the pod is up but cannot serve (PD unreachable, leader eviction stuck, store offline). A persistent Ready=False indicates a degraded data plane even when the rollout looks fine to the operator.

Both warrant attention; one metric with a condition label keeps the schema small and leaves room for additional conditions.

Implementation

  • The gauge is updated inside the existing TaskInstanceConditionSynced and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash / TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager / Router / TiKVWorker all expose both samples automatically.
  • ClearInstanceConditionMetrics is internalized into the existing TaskInstanceFinalizerDel. Future instance controllers using the standard finalize task get cleanup for free.
  • Helpers live in pkg/metrics next to the gauge definition.
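
The observe/clear lifecycle described above can be sketched with a map-backed stand-in for the labelled gauge. This is not the code in pkg/metrics: `instanceRef`, `seriesKey`, `abnormalGauge`, `observe`, `clearInstance`, and the sample label values are hypothetical, and the real helpers presumably wrap a Prometheus gauge vector rather than a plain map.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceRef identifies one managed instance using the documented labels.
type instanceRef struct {
	Namespace, Cluster, Component, Group, Instance string
}

// seriesKey adds the condition label, giving one series per condition.
type seriesKey struct {
	instanceRef
	Condition string
}

// abnormalGauge is a stdlib-only, hypothetical stand-in for the labelled gauge.
type abnormalGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newAbnormalGauge() *abnormalGauge {
	return &abnormalGauge{series: make(map[seriesKey]float64)}
}

// observe records the state of one condition. In this sketch it plays the
// role of the call made from the Synced and Ready condition tasks.
func (g *abnormalGauge) observe(ref instanceRef, condition string, conditionIsFalse bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v := 0.0
	if conditionIsFalse {
		v = 1
	}
	g.series[seriesKey{ref, condition}] = v
}

// clearInstance drops every series of one instance, whatever the condition.
// In this sketch it plays the role of the cleanup run by the finalizer task.
func (g *abnormalGauge) clearInstance(ref instanceRef) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for k := range g.series {
		if k.instanceRef == ref {
			delete(g.series, k)
		}
	}
}

// value reports the current sample and whether the series exists.
func (g *abnormalGauge) value(ref instanceRef, condition string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.series[seriesKey{ref, condition}]
	return v, ok
}

func main() {
	g := newAbnormalGauge()
	tikv := instanceRef{Namespace: "db", Cluster: "basic", Component: "tikv",
		Group: "basic-tikv", Instance: "basic-tikv-0"}

	g.observe(tikv, "Synced", true) // Synced=False: abnormal
	g.observe(tikv, "Ready", false) // Ready is not False: normal
	v, ok := g.value(tikv, "Synced")
	fmt.Println(v, ok) // prints "1 true"

	g.clearInstance(tikv) // instance finalized: both series disappear
	_, ok = g.value(tikv, "Synced")
	fmt.Println(ok) // prints "false"
}
```

Clearing by instance rather than by individual series means a newly added condition value would need no extra cleanup code.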

Test plan

  • make lint — 0 issues
  • make unit — all packages pass
  • Unit tests cover Observe paths for True / False / absent for both conditions plus the finalizer-clear path

@ti-chi-bot ti-chi-bot Bot requested a review from shonge April 20, 2026 10:30
@github-actions github-actions Bot added the v2 for operator v2 label Apr 20, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XXL label Apr 20, 2026
@fgksgf fgksgf requested a review from Copilot April 20, 2026 10:30
@codecov-commenter
codecov-commenter commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.23%. Comparing base (1ec456d) to head (6140515).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6835      +/-   ##
==========================================
+ Coverage   37.16%   37.23%   +0.07%     
==========================================
  Files         390      391       +1     
  Lines       22360    22382      +22     
==========================================
+ Hits         8310     8334      +24     
+ Misses      14050    14048       -2     
Flag Coverage Δ
unittest 37.23% <100.00%> (+0.07%) ⬆️

This comment was marked as outdated.

@ti-chi-bot ti-chi-bot Bot added size/XL and removed size/XXL labels Apr 20, 2026
@fgksgf fgksgf changed the title from "feat(tikv): add rolling restart duration metric" to "feat(metrics): add per-instance unsynced/unready duration gauges" Apr 20, 2026
Comment thread pkg/controllers/common/instance_unsynced_metric.go Outdated
@fgksgf fgksgf changed the title from "feat(metrics): add per-instance unsynced/unready duration gauges" to "feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts" Apr 20, 2026
@fgksgf fgksgf force-pushed the feat/rolling-restart-metric branch from 18b9dbc to 5056c48 April 20, 2026 13:48
…te alerts

Add a single Prometheus gauge tidb_operator_abnormal_instance whose value
is 1 when the named condition on the instance is False and 0 otherwise,
along with the wiring to keep the series clean across cluster lifecycles.

  tidb_operator_abnormal_instance{
      namespace, cluster, component, group, instance,
      condition="Synced"|"Ready",
  }

The series stays present while the operator manages the instance and is
removed when the instance is finalized, so dashboards see no gaps and
Prometheus cardinality does not grow across cluster create/destroy
cycles.

Recommended alerting form (Prometheus owns the duration logic, the
operator only reports the binary state):

  - alert: TiKVInstanceAbnormalTooLong
    expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
    for:  30m

Why two condition values:
Synced=False means the operator has more work to do (template mismatch,
pod missing, scale-in pending). Persistent Synced=False indicates the
operator cannot make the pod converge. Ready=False means the pod is up
but cannot serve (PD unreachable, leader eviction stuck, store offline).
Persistent Ready=False indicates a degraded data plane even when the
rollout looks fine to the operator. Both classes of stuck state warrant
attention; one metric with a condition label keeps the schema small and
leaves room for additional conditions.

Implementation:
- The gauge is updated inside the existing TaskInstanceConditionSynced
  and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash /
  TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager /
  Router / TiKVWorker all expose both samples automatically.
- ClearInstanceConditionMetrics is internalized into the existing
  TaskInstanceFinalizerDel: when an instance is fully finalized, its
  series are dropped from the registry. Future instance controllers
  using the standard finalize task get this cleanup for free.
- Helpers live in pkg/metrics next to the gauge definition; controllers
  only call into the metrics package.
@fgksgf fgksgf force-pushed the feat/rolling-restart-metric branch from 5056c48 to 6140515 April 20, 2026 13:57
@ti-chi-bot ti-chi-bot Bot added size/L and removed size/XL labels Apr 20, 2026
@liubog2008
Member

/lgtm

@ti-chi-bot ti-chi-bot Bot added the lgtm label Apr 21, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liubog2008

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 21, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-21 01:45:44.785681512 +0000 UTC m=+2043949.991041569: ☑️ agreed by liubog2008.

@liubog2008
Member

/cherry-pick release-2.1

@ti-chi-bot
Member

@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you.

Details

In response to this:

/cherry-pick release-2.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot Bot merged commit d7ab4fa into pingcap:main Apr 21, 2026
10 checks passed
@ti-chi-bot
Member

@liubog2008: new pull request created to branch release-2.1: #6837.

@fgksgf fgksgf deleted the feat/rolling-restart-metric branch April 21, 2026 01:53

5 participants