feat(metrics): add per-instance abnormal_instance gauge for stuck-state alerts #6835
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #6835      +/-   ##
==========================================
+ Coverage   37.16%   37.23%   +0.07%
==========================================
  Files         390      391       +1
  Lines       22360    22382      +22
==========================================
+ Hits         8310     8334      +24
+ Misses      14050    14048       -2
18b9dbc to 5056c48
…te alerts
Add a single Prometheus gauge tidb_operator_abnormal_instance whose value
is 1 when the named condition on the instance is False and 0 otherwise,
along with the wiring to keep the series clean across cluster lifecycles.
    tidb_operator_abnormal_instance{
      namespace, cluster, component, group, instance,
      condition="Synced"|"Ready",
    }
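The value rule above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's code: the Condition type and the abnormalValue helper are hypothetical names, and treating an absent or Unknown condition as 0 is an assumption that reads "0 otherwise" literally.

```go
package main

import "fmt"

// ConditionStatus mirrors the three states a Kubernetes-style condition can report.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition is a hypothetical, trimmed-down stand-in for an instance condition.
type Condition struct {
	Type   string // e.g. "Synced" or "Ready"
	Status ConditionStatus
}

// abnormalValue returns the gauge sample for the named condition:
// 1 when the condition is present with status False, 0 in every other case
// (True, Unknown, or absent; the last two are an assumption of this sketch).
func abnormalValue(conds []Condition, name string) float64 {
	for _, c := range conds {
		if c.Type == name && c.Status == ConditionFalse {
			return 1
		}
	}
	return 0
}

func main() {
	conds := []Condition{
		{Type: "Synced", Status: ConditionTrue},
		{Type: "Ready", Status: ConditionFalse},
	}
	// One sample per condition label value.
	for _, name := range []string{"Synced", "Ready"} {
		fmt.Printf("condition=%q value=%v\n", name, abnormalValue(conds, name))
	}
}
```

Keeping the sample strictly binary is what lets the alert rule compare against 1 without any knowledge of why the condition is False.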
The series stays present while the operator manages the instance and is
removed when the instance is finalized, so dashboards see no gaps and
Prometheus cardinality does not grow across cluster create/destroy
cycles.
Recommended alerting form (Prometheus owns the duration logic, the
operator only reports the binary state):
    - alert: TiKVInstanceAbnormalTooLong
      expr: tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1
      for: 30m
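To make "Prometheus owns the duration logic" concrete, the following Go sketch simulates how a `for:` clause behaves: the alert becomes pending at the first evaluation where the expression matches, fires once it has matched continuously for the hold duration, and resets as soon as one evaluation does not match. It is a simplified model, not Prometheus code; the Sample type, the helper names, and the 10-minute evaluation interval are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	step = 10 * time.Minute // assumed rule evaluation interval (illustrative)
	hold = 30 * time.Minute // corresponds to "for: 30m" in the rule above
)

// Sample is one evaluation of the alert expression.
type Sample struct {
	Offset time.Duration // time since the first evaluation
	Active bool          // true when the gauge equals 1 at this evaluation
}

// evaluateEvery builds evenly spaced samples from a list of match results.
func evaluateEvery(interval time.Duration, active ...bool) []Sample {
	samples := make([]Sample, 0, len(active))
	for i, a := range active {
		samples = append(samples, Sample{Offset: time.Duration(i) * interval, Active: a})
	}
	return samples
}

// firingOffset returns the offset of the first evaluation at which the alert
// fires: the expression must have matched at every evaluation for at least
// holdFor. A single non-matching evaluation resets the pending state.
func firingOffset(samples []Sample, holdFor time.Duration) (time.Duration, bool) {
	pending := false
	var since time.Duration
	for _, s := range samples {
		if !s.Active {
			pending = false
			continue
		}
		if !pending {
			pending, since = true, s.Offset
		}
		if s.Offset-since >= holdFor {
			return s.Offset, true
		}
	}
	return 0, false
}

func main() {
	// The instance recovers briefly at the third evaluation, which restarts the clock.
	samples := evaluateEvery(step, true, true, false, true, true, true, true)
	if at, ok := firingOffset(samples, hold); ok {
		fmt.Println("alert fires at", at)
	} else {
		fmt.Println("alert never fires")
	}
}
```

Because the reset happens on the Prometheus side, the operator never has to track how long an instance has been abnormal; it only reports the current 0 or 1.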
Why two condition values:
Synced=False means the operator has more work to do (template mismatch,
pod missing, scale-in pending). Persistent Synced=False indicates the
operator cannot make the pod converge. Ready=False means the pod is up
but cannot serve (PD unreachable, leader eviction stuck, store offline).
Persistent Ready=False indicates a degraded data plane even when the
rollout looks fine to the operator. Both kinds of stuck state warrant
attention; one metric with a condition label keeps the schema small and
leaves room for additional conditions.
Implementation:
- The gauge is updated inside the existing TaskInstanceConditionSynced
and TaskInstanceConditionReady common tasks. PD / TiKV / TiDB / TiFlash /
TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager /
Router / TiKVWorker all expose both samples automatically.
- ClearInstanceConditionMetrics is folded into the existing
TaskInstanceFinalizerDel: when an instance is fully finalized, its
series are dropped from the registry. Future instance controllers
using the standard finalize task get this cleanup for free.
- Helpers live in pkg/metrics next to the gauge definition; controllers
only call into the metrics package.
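The lifecycle described in the list above (set on every reconcile, drop on finalize) can be sketched as follows. This is a standard-library stand-in for a Prometheus gauge vector, not the code in pkg/metrics: instanceGauge, seriesKey, set, clearInstance, and size are hypothetical names, and the real ClearInstanceConditionMetrics signature is not shown in this PR text.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey carries the label set listed in the commit message.
type seriesKey struct {
	Namespace, Cluster, Component, Group, Instance, Condition string
}

// instanceGauge is a map-backed stand-in for a labeled gauge (illustrative only).
type instanceGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newInstanceGauge() *instanceGauge {
	return &instanceGauge{series: make(map[seriesKey]float64)}
}

// set records 1 for an abnormal condition and 0 otherwise, so the series
// stays present for as long as the instance is reconciled.
func (g *instanceGauge) set(k seriesKey, abnormal bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v := 0.0
	if abnormal {
		v = 1
	}
	g.series[k] = v
}

// clearInstance drops every series of one instance, whatever its condition
// label, and returns how many were removed. A finalizer task would call this
// once the instance is fully finalized.
func (g *instanceGauge) clearInstance(owner seriesKey) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	owner.Condition = "" // compare on the instance identity only
	removed := 0
	for k := range g.series {
		id := k
		id.Condition = ""
		if id == owner {
			delete(g.series, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// size reports the number of live series.
func (g *instanceGauge) size() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newInstanceGauge()
	tikv0 := seriesKey{Namespace: "db", Cluster: "basic", Component: "tikv", Group: "tikv", Instance: "tikv-0"}
	for _, cond := range []string{"Synced", "Ready"} {
		k := tikv0
		k.Condition = cond
		g.set(k, cond == "Ready") // pretend Ready=False, Synced=True
	}
	fmt.Println("series while managed:", g.size())
	fmt.Println("series removed on finalize:", g.clearInstance(tikv0))
	fmt.Println("series left:", g.size())
}
```

Clearing by instance identity rather than by individual label set means a newly added condition value is cleaned up without touching the finalizer path.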
5056c48 to 6140515
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: liubog2008
/cherry-pick release-2.1
@liubog2008: once the present PR merges, I will cherry-pick it on top of release-2.1 in the new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
@liubog2008: new pull request created to branch
Summary

Add a single Prometheus gauge `tidb_operator_abnormal_instance` that flags managed instances stuck in an abnormal state. Its value is `1` when the named condition is `False`, `0` otherwise.

Recommended alerting form (Prometheus owns the duration logic): alert on `tidb_operator_abnormal_instance{component="tikv", condition="Synced"} == 1` with `for: 30m`.

Why two condition values

- `Synced=False`: the operator has more work to do (template mismatch, pod missing, scale-in pending). A persistent value indicates the operator cannot make the pod converge.
- `Ready=False`: the pod is up but cannot serve (PD unreachable, leader eviction stuck, store offline). A persistent value indicates a degraded data plane even when the rollout looks fine.

Both warrant attention; one metric with a `condition` label keeps the schema small and leaves room for additional conditions.

Implementation

- The gauge is updated inside the existing `TaskInstanceConditionSynced` and `TaskInstanceConditionReady` common tasks. PD / TiKV / TiDB / TiFlash / TiCDC / TiProxy / Scheduler / Scheduling / TSO / ResourceManager / Router / TiKVWorker all expose both samples automatically.
- `ClearInstanceConditionMetrics` is folded into the existing `TaskInstanceFinalizerDel`. Future instance controllers using the standard finalize task get cleanup for free.
- Helpers live in `pkg/metrics` next to the gauge definition.

Test plan

- `make lint`: 0 issues
- `make unit`: all packages pass
- `True` / `False` / absent for both conditions plus the finalizer-clear path