feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts (#6835) #6837

Merged
fgksgf merged 1 commit into pingcap:release-2.1 from ti-chi-bot:cherry-pick-6835-to-release-2.1
Apr 21, 2026

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #6835

Summary

Add a single Prometheus gauge tidb_operator_abnormal_instance that flags managed instances stuck in an abnormal state.

tidb_operator_abnormal_instance{
    namespace, cluster, component, group, instance,
    condition="Synced"|"Ready",
}
  • 1 when the named condition is False, 0 otherwise.
  • The series stays present while the operator manages the instance and is removed when the instance is finalized.
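
The value rule above can be sketched in a few lines of Go. This is an illustration only, not the operator's code: the ConditionStatus type, its constants, and abnormalValue are hypothetical names, and treating an absent condition as 0 is my reading of "0 otherwise".

```go
package main

import "fmt"

// ConditionStatus is a hypothetical stand-in for the status of one
// condition ("Synced" or "Ready") on a managed instance.
type ConditionStatus string

const (
	StatusTrue   ConditionStatus = "True"
	StatusFalse  ConditionStatus = "False"
	StatusAbsent ConditionStatus = "" // condition not present on the instance
)

// abnormalValue returns the gauge sample for one condition:
// 1 only when the condition is explicitly False, 0 in every other case.
func abnormalValue(s ConditionStatus) float64 {
	if s == StatusFalse {
		return 1
	}
	return 0
}

func main() {
	for _, s := range []ConditionStatus{StatusTrue, StatusFalse, StatusAbsent} {
		fmt.Printf("status=%q value=%v\n", s, abnormalValue(s))
	}
}
```

Because only an explicit False maps to 1, an instance whose conditions have not been written yet does not trip an alert.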

Recommended alerting form (Prometheus owns the duration logic):

- alert: TiKVInstanceAbnormalTooLong
  expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
  for: 30m
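
To make "Prometheus owns the duration logic" concrete, here is a rough Go sketch of how a `for:` hold behaves: the alert fires only once the expression has matched on every evaluation for at least the hold duration, and any non-matching evaluation resets the clock. It is a simplified model under assumed names (sample, everyFiveMinutes, firing, holdFor are all hypothetical), not Prometheus's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// holdFor mirrors the `for: 30m` clause of the alert rule.
const holdFor = 30 * time.Minute

// sample is one rule evaluation: when it ran and whether the
// expression (gauge == 1) matched. Hypothetical type.
type sample struct {
	at       time.Time
	abnormal bool
}

// everyFiveMinutes builds evaluations spaced five minutes apart from a
// fixed start time, one per entry in matched.
func everyFiveMinutes(matched ...bool) []sample {
	start := time.Date(2026, 4, 21, 0, 0, 0, 0, time.UTC)
	out := make([]sample, 0, len(matched))
	for i, m := range matched {
		at := start.Add(time.Duration(i) * 5 * time.Minute)
		out = append(out, sample{at: at, abnormal: m})
	}
	return out
}

// firing reports whether the alert would be firing at the last sample:
// the expression must have matched without interruption for at least hold.
func firing(samples []sample, hold time.Duration) bool {
	active := false
	var activeSince time.Time
	for _, s := range samples {
		if !s.abnormal {
			active = false // a healthy evaluation resets the pending state
			continue
		}
		if !active {
			active, activeSince = true, s.at
		}
	}
	if !active {
		return false
	}
	return samples[len(samples)-1].at.Sub(activeSince) >= hold
}

func main() {
	stuck := everyFiveMinutes(true, true, true, true, true, true, true)
	flapping := everyFiveMinutes(true, true, true, false, true, true, true)
	fmt.Println("stuck fires:", firing(stuck, holdFor))
	fmt.Println("flapping fires:", firing(flapping, holdFor))
}
```

Keeping this logic on the Prometheus side means the operator does not need to track timestamps or thresholds; it only reports the current binary state.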

Why two condition values

  • Synced=False: the operator has more work to do (template mismatch, pod missing, scale-in pending). A persistent Synced=False indicates the operator cannot make the pod converge.
  • Ready=False: the pod is up but cannot serve (PD unreachable, leader eviction stuck, store offline). A persistent Ready=False indicates a degraded data plane even when the rollout looks fine to the operator.

Both warrant attention; one metric with a condition label keeps the schema small and leaves room for additional conditions.

Implementation

  • The gauge is updated inside the existing TaskInstanceConditionSynced and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash / TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager / Router / TiKVWorker all expose both samples automatically.
  • ClearInstanceConditionMetrics is internalized into the existing TaskInstanceFinalizerDel. Future instance controllers using the standard finalize task get cleanup for free.
  • Helpers live in pkg/metrics next to the gauge definition.
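
The observe-then-clear lifecycle described above can be sketched with the standard library alone. This is a hedged illustration, not the code in pkg/metrics: registry, seriesKey, observe, and clearInstance are hypothetical names, a plain map stands in for the real Prometheus gauge vector, and the PR text does not show the actual signatures of the helpers or of ClearInstanceConditionMetrics.

```go
package main

import "fmt"

// seriesKey carries the label set listed in the PR description.
type seriesKey struct {
	Namespace, Cluster, Component, Group, Instance, Condition string
}

// registry is a stdlib-only stand-in for a labelled gauge (hypothetical).
type registry struct {
	series map[seriesKey]float64
}

func newRegistry() *registry {
	return &registry{series: map[seriesKey]float64{}}
}

// observe records the sample for one condition of one instance,
// as the condition tasks are described to do on each reconcile.
func (r *registry) observe(k seriesKey, conditionIsFalse bool) {
	v := 0.0
	if conditionIsFalse {
		v = 1
	}
	r.series[k] = v
}

// clearInstance drops every condition series of one instance, as the
// finalizer task is described to do once the instance is finalized.
func (r *registry) clearInstance(namespace, component, instance string) {
	for k := range r.series {
		if k.Namespace == namespace && k.Component == component && k.Instance == instance {
			delete(r.series, k) // deleting during range is safe in Go
		}
	}
}

func main() {
	r := newRegistry()
	synced := seriesKey{"ns", "demo", "tikv", "tikv", "tikv-abc", "Synced"}
	ready := synced
	ready.Condition = "Ready"

	r.observe(synced, true) // Synced=False, so the sample is 1
	r.observe(ready, false) // Ready is not False, so the sample is 0
	fmt.Println("series while managed:", len(r.series))

	r.clearInstance("ns", "tikv", "tikv-abc")
	fmt.Println("series after finalize:", len(r.series))
}
```

Clearing by instance rather than by individual condition means a condition added to the label set later would be cleaned up without touching the finalizer path.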

Test plan

  • make lint — 0 issues
  • make unit — all packages pass
  • Unit tests cover Observe paths for True / False / absent for both conditions plus the finalizer-clear path

…te alerts

Add a single Prometheus gauge tidb_operator_abnormal_instance whose value
is 1 when the named condition on the instance is False and 0 otherwise,
along with the wiring to keep the series clean across cluster lifecycles.

  tidb_operator_abnormal_instance{
      namespace, cluster, component, group, instance,
      condition="Synced"|"Ready",
  }

The series stays present while the operator manages the instance and is
removed when the instance is finalized, so dashboards see no gaps and
Prometheus cardinality does not grow across cluster create/destroy
cycles.

Recommended alerting form (Prometheus owns the duration logic, the
operator only reports the binary state):

  - alert: TiKVInstanceAbnormalTooLong
    expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
    for:  30m

Why two condition values:
Synced=False means the operator has more work to do (template mismatch,
pod missing, scale-in pending). Persistent Synced=False indicates the
operator cannot make the pod converge. Ready=False means the pod is up
but cannot serve (PD unreachable, leader eviction stuck, store offline).
Persistent Ready=False indicates a degraded data plane even when the
rollout looks fine to the operator. Both classes of stuck state warrant
attention; one metric with a condition label keeps the schema small and
leaves room for additional conditions.

Implementation:
- The gauge is updated inside the existing TaskInstanceConditionSynced
  and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash /
  TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager /
  Router / TiKVWorker all expose both samples automatically.
- ClearInstanceConditionMetrics is internalized into the existing
  TaskInstanceFinalizerDel: when an instance is fully finalized, its
  series are dropped from the registry. Future instance controllers
  using the standard finalize task get this cleanup for free.
- Helpers live in pkg/metrics next to the gauge definition; controllers
  only call into the metrics package.
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from liubog2008. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 April 21, 2026 01:50
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.23%. Comparing base (1e6432e) to head (d56c139).

Additional details and impacted files
@@               Coverage Diff               @@
##           release-2.1    #6837      +/-   ##
===============================================
+ Coverage        37.16%   37.23%   +0.07%     
===============================================
  Files              390      391       +1     
  Lines            22360    22382      +22     
===============================================
+ Hits              8310     8334      +24     
+ Misses           14050    14048       -2     
Flag Coverage Δ
unittest 37.23% <100.00%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


@fgksgf fgksgf merged commit ceace17 into pingcap:release-2.1 Apr 21, 2026
8 of 10 checks passed