feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts (#6835) #6837
Merged
fgksgf merged 1 commit into pingcap:release-2.1 on Apr 21, 2026
Add a single Prometheus gauge tidb_operator_abnormal_instance whose value
is 1 when the named condition on the instance is False and 0 otherwise,
along with the wiring to keep the series clean across cluster lifecycles.
    tidb_operator_abnormal_instance{
      namespace, cluster, component, group, instance,
      condition="Synced"|"Ready",
    }
The series stays present while the operator manages the instance and is
removed when the instance is finalized, so dashboards see no gaps and
Prometheus cardinality does not grow across cluster create/destroy
cycles.
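The value rule above can be sketched in a few lines of Go. This is a minimal, stdlib-only model, not the operator's code: `seriesKey`, `gauge`, `abnormalValue`, and `observe` are hypothetical names, the map stands in for a Prometheus gauge vector, and the label values are made up. It assumes an absent condition (empty string here) falls under "0 otherwise".

```go
package main

import "fmt"

// seriesKey mirrors the label set of tidb_operator_abnormal_instance.
// The struct is a hypothetical stand-in, not a type from the operator.
type seriesKey struct {
	Namespace, Cluster, Component, Group, Instance, Condition string
}

// abnormalValue maps a condition status to the gauge value: 1 only when
// the status is "False"; "True", "Unknown", and an absent condition
// (empty string in this sketch) all map to 0.
func abnormalValue(status string) float64 {
	if status == "False" {
		return 1
	}
	return 0
}

// gauge is a toy in-memory replacement for a Prometheus gauge vector.
type gauge map[seriesKey]float64

// observe records the current sample for one condition of one instance.
// The series is always written, so it stays present while the instance
// is managed, whether the value is 0 or 1.
func (g gauge) observe(k seriesKey, status string) {
	g[k] = abnormalValue(status)
}

func main() {
	g := gauge{}
	k := seriesKey{
		Namespace: "db", Cluster: "basic", Component: "tikv",
		Group: "tikv-group", Instance: "tikv-0", Condition: "Synced",
	}

	g.observe(k, "False")
	fmt.Println(g[k]) // 1

	g.observe(k, "True")
	fmt.Println(g[k]) // 0
}
```

Writing 0 explicitly, rather than deleting the series when the instance is healthy, is what lets dashboards plot a continuous line.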
Recommended alerting form (Prometheus owns the duration logic, the
operator only reports the binary state):
  - alert: TiKVInstanceAbnormalTooLong
    expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
    for: 30m
Why two condition values:
Synced=False means the operator has more work to do (template mismatch,
pod missing, scale-in pending). Persistent Synced=False indicates the
operator cannot make the pod converge.
Ready=False means the pod is up but cannot serve (PD unreachable, leader
eviction stuck, store offline). Persistent Ready=False indicates a
degraded data plane even when the rollout looks fine to the operator.
Both kinds of stuck state warrant attention; one metric with a condition
label keeps the schema small and leaves room for additional conditions.
Implementation:
- The gauge is updated inside the existing TaskInstanceConditionSynced
and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash /
TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager /
Router / TiKVWorker all expose both samples automatically.
- ClearInstanceConditionMetrics is internalized into the existing
TaskInstanceFinalizerDel: when an instance is fully finalized, its
series are dropped from the registry. Future instance controllers
using the standard finalize task get this cleanup for free.
- Helpers live in pkg/metrics next to the gauge definition; controllers
only call into the metrics package.
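The finalizer-time cleanup described in the list above might look roughly like this. The sketch is stdlib-only and makes no claim about the real `ClearInstanceConditionMetrics` signature: `registry`, `seriesKey`, and `clearInstance` are hypothetical, and the map again stands in for the Prometheus registry.

```go
package main

import "fmt"

// seriesKey mirrors the gauge's label set (hypothetical stand-in type).
type seriesKey struct {
	Namespace, Cluster, Component, Group, Instance, Condition string
}

// registry is a toy in-memory replacement for the metrics registry.
type registry map[seriesKey]float64

// clearInstance drops every series that belongs to one instance,
// whatever its condition label, and returns how many were removed.
func (r registry) clearInstance(ns, cluster, component, group, instance string) int {
	removed := 0
	for k := range r {
		if k.Namespace == ns && k.Cluster == cluster &&
			k.Component == component && k.Group == group &&
			k.Instance == instance {
			delete(r, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	r := registry{}
	for _, inst := range []string{"tikv-0", "tikv-1"} {
		for _, cond := range []string{"Synced", "Ready"} {
			r[seriesKey{"db", "basic", "tikv", "tikv-group", inst, cond}] = 0
		}
	}

	// Finalizing tikv-0 removes both of its series and leaves tikv-1 alone.
	removed := r.clearInstance("db", "basic", "tikv", "tikv-group", "tikv-0")
	fmt.Println(removed, len(r)) // 2 2
}
```

Matching on every label except `condition` means a condition value added later would be cleaned up without touching this helper.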
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@             Coverage Diff              @@
    ##           release-2.1    #6837   +/-   ##
    ===============================================
    + Coverage        37.16%   37.23%   +0.07%
    ===============================================
      Files              390      391       +1
      Lines            22360    22382      +22
    ===============================================
    + Hits              8310     8334      +24
    + Misses           14050    14048       -2
This is an automated cherry-pick of #6835
Summary
Add a single Prometheus gauge tidb_operator_abnormal_instance that flags managed instances stuck in an abnormal state. Its value is 1 when the named condition is False, 0 otherwise.

Recommended alerting form (Prometheus owns the duration logic):

    - alert: TiKVInstanceAbnormalTooLong
      expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
      for: 30m

Why two condition values
- Synced=False: operator has more work to do (template mismatch, pod missing, scale-in pending). Persistent value indicates the operator cannot make the pod converge.
- Ready=False: pod is up but cannot serve (PD unreachable, leader eviction stuck, store offline). Persistent value indicates a degraded data plane even when the rollout looks fine.

Both warrant attention; one metric with a condition label keeps the schema small and leaves room for additional conditions.

Implementation
- The gauge is updated inside the existing TaskInstanceConditionSynced and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash / TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager / Router / TiKVWorker all expose both samples automatically.
- ClearInstanceConditionMetrics is internalized into the existing TaskInstanceFinalizerDel. Future instance controllers using the standard finalize task get cleanup for free.
- Helpers live in pkg/metrics next to the gauge definition.

Test plan
- make lint: 0 issues
- make unit: all packages pass
- True / False / absent for both conditions plus the finalizer-clear path