fix(metrics): prevent abnormal_instance gauge leaks on deletion paths #6839
fgksgf wants to merge 1 commit into pingcap:main
Post-merge of pingcap#6835, the gauge leaked on two paths seen on a dev EKS:

- Seven instance builders (pd/tidb/tiproxy/ticdc/tso/scheduling/tikvworker) still ran TaskInstanceConditionSynced/Ready/Running + TaskStatusPersister inside the CondObjectIsDeleting IfBreak block. TaskInstanceFinalizerDel cleared the gauge, but the CondReady that followed immediately wrote it back to 1 because the pod was already gone (PodNotCreated). Align them with the tikv/tiflash template, whose deletion branch only runs FinalizerDel.
- Force-delete paths that strip finalizers bypass FinalizerDel entirely, so the finalize-time cleanup never runs. Add a watch-level DELETE handler on every instance kind that writes the gauge, clearing the series the moment the API server GCs the object.

Adds two e2e cases on TiKV (graceful delete via group, force delete via finalizer strip) to lock both cleanup paths in.
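The first leak path is an ordering problem, and a toy model can make it concrete. The sketch below is not the operator's code: `gauge`, `instance`, `task`, `reconcile`, and `leaked` are hypothetical stand-ins, and the two task functions only mimic the behavior the commit message attributes to `TaskInstanceFinalizerDel` and `TaskInstanceConditionReady`.

```go
package main

import "fmt"

// gauge is a toy stand-in for the abnormal_instance gauge: series key -> value.
type gauge map[string]float64

// instance is a hypothetical, heavily simplified view of an instance CR.
type instance struct {
	name       string
	deleting   bool // models DeletionTimestamp being set
	podCreated bool
}

// task is one reconcile step; the functions below only mimic the real tasks.
type task func(g gauge, in *instance)

// finalizerDel models TaskInstanceFinalizerDel: it clears the series.
func finalizerDel(g gauge, in *instance) { delete(g, in.name) }

// condReady models TaskInstanceConditionReady: a missing pod counts as
// abnormal, so the series is (re)written with value 1. Writing 0 for a
// healthy pod is an assumption of this sketch.
func condReady(g gauge, in *instance) {
	if !in.podCreated {
		g[in.name] = 1
		return
	}
	g[in.name] = 0
}

// reconcile runs only the deletion branch when the object is deleting,
// which is the effect of an IfBreak block, and the normal tasks otherwise.
func reconcile(g gauge, in *instance, deletion, normal []task) {
	tasks := normal
	if in.deleting {
		tasks = deletion
	}
	for _, t := range tasks {
		t(g, in)
	}
}

// leaked reports whether the series survives a deletion-time reconcile
// that uses the given deletion branch.
func leaked(deletion []task) bool {
	g := gauge{"tidb-0": 1}
	in := &instance{name: "tidb-0", deleting: true, podCreated: false}
	reconcile(g, in, deletion, []task{condReady})
	_, ok := g["tidb-0"]
	return ok
}

func main() {
	fmt.Println("old shape leaks:", leaked([]task{finalizerDel, condReady}))
	fmt.Println("new shape leaks:", leaked([]task{finalizerDel}))
}
```

In this model the old branch shape leaves the series behind because the last writer wins, while the `FinalizerDel`-only shape ends with the series removed.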
[APPROVALNOTIFIER] This PR is NOT APPROVED
Codecov Report

@@            Coverage Diff            @@
## main #6839 +/- ##
==========================================
+ Coverage 37.23% 37.25% +0.01%
==========================================
Files 391 392 +1
Lines 22382 22371 -11
==========================================
Hits 8334 8334
+ Misses 14048 14037 -11
Closing in favor of a smaller refactor: a single `TaskObserveInstance` in `pkg/controllers/common` that handles both observe (when obj exists) and clear (when obj is nil, i.e. watch DELETE), mirroring the shape of `TaskTrack`. Reviewer feedback: observe and clear should live together. New PR coming.
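A minimal sketch of the "observe and clear live together" shape, under stated assumptions: `observeInstance`, `instance`, and `gauge` are hypothetical names for illustration, and the real `TaskObserveInstance` signature is not shown in this thread. The only point is that a nil object (the watch DELETE case) and a live object go through the same function.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of an instance object.
type instance struct {
	ready bool
}

// gauge is a toy stand-in for the abnormal_instance gauge vector.
type gauge map[string]float64

// observeInstance is a hypothetical helper, not the operator's actual API.
// A nil obj models a watch DELETE, so the series is removed; otherwise the
// series is set from the current state of the object.
func observeInstance(g gauge, key string, obj *instance) {
	if obj == nil {
		delete(g, key)
		return
	}
	if obj.ready {
		g[key] = 0
	} else {
		g[key] = 1
	}
}

func main() {
	g := gauge{}

	// Object exists but is not ready: the series is written.
	observeInstance(g, "ns/tikv-0", &instance{ready: false})
	fmt.Println("series after observe:", len(g))

	// Object is gone (watch DELETE): the same function clears the series.
	observeInstance(g, "ns/tikv-0", nil)
	fmt.Println("series after delete:", len(g))
}
```

Keeping both branches in one function means every caller that can write the series can also clear it, which is the property the force-delete path was missing.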
Summary
Follow-up to #6835. On a dev EKS we saw the `tidb_operator_abnormal_instance` gauge retain `value=1` for 8 instances whose CR / Pod / parent Group had all been deleted, which would have produced false-positive `metric == 1 for: 30m` alerts and grown label cardinality across each cluster lifecycle.

Root cause is two independent paths:

1. Seven instance builders keep Cond tasks inside the `CondObjectIsDeleting` IfBreak block (pd/tidb/tiproxy/ticdc/tso/scheduling/tikvworker). `TaskInstanceFinalizerDel` clears the gauge via `ClearInstanceConditionMetrics`, but the `TaskInstanceConditionReady` that follows immediately re-populates it with `value=1` because the pod is gone (`PodNotCreated`). `TaskStatusPersister` at the tail then fails with `NotFound`, confirming the CR is already deleted. `tikv`/`tiflash` already had the correct shape - only `FinalizerDel` in the deletion branch. This PR aligns the remaining seven.
2. Force-delete paths that strip finalizers bypass `FinalizerDel` entirely. `kubectl patch --type=merge -p '{"metadata":{"finalizers":null}}'` is an operationally common action when unwedging a stuck instance (exactly the state this alert exists for), so relying solely on the finalize-time cleanup is insufficient. Add a watch-level `DELETE` handler on every instance kind that writes the gauge, so the series clears the moment the API server GCs the object. Covers force-delete, operator-down-during-delete, and any other path that reaches the informer without going through `FinalizerDel`.

Changes
- `pkg/controllers/{pd,tidb,tiproxy,ticdc,tso,scheduling,tikvworker}/builder.go`: remove `TaskInstanceConditionSynced/Ready/Running` + `TaskStatusPersister` from the `CondObjectIsDeleting` branch.
- `pkg/metrics/cleanup_handler.go`: new `RegisterAbnormalInstanceCleanup` that attaches a `DeleteFunc` on each instance kind's shared informer.
- `cmd/tidb-operator/main.go`: wire the registration into manager setup.
- `pkg/metrics/abnormal_instance.go`: update `ClearInstanceConditionMetrics` doc to describe both cleanup call sites.
- `tests/e2e/utils/metrics/metrics.go`: add generic `FetchOperatorMetric` helper.
- `tests/e2e/metrics/abnormal_instance_leak.go`: two cases on TiKV - graceful delete via the group, and force-delete via finalizer strip after `Delete()` sets `DeletionTimestamp`.

Test plan
- `go build ./...`
- `go vet ./...` on modified packages
- `golangci-lint run` on modified packages - 0 issues
- `pkg/metrics` and controller `tasks` packages
- `Abnormal Instance Gauge Leak` against a kind cluster
- `kubectl patch --type=merge -p '{"metadata":{"finalizers":null}}'` a TiProxy CR and confirm its series is cleared within seconds
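For the "confirm its series is cleared" checks, one possible way to assert absence is to scan the operator's `/metrics` text output for a matching sample. The sketch below is an assumption-laden illustration, not the PR's `FetchOperatorMetric`: `hasSeries` is a hypothetical helper, the label names in `sample` are made up, and the matching is a naive substring check rather than a full exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasSeries reports whether the exposition text contains a sample of the
// given metric whose label set includes every wanted `name="value"` pair.
// It is a deliberately naive substring matcher meant for tests only.
func hasSeries(exposition, metric string, wantLabels ...string) bool {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and # HELP / # TYPE comments.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Only consider labeled samples of the requested metric.
		if !strings.HasPrefix(line, metric+"{") {
			continue
		}
		matched := true
		for _, l := range wantLabels {
			if !strings.Contains(line, l) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// sample is made-up scrape output; the label names are hypothetical.
const sample = `# TYPE tidb_operator_abnormal_instance gauge
tidb_operator_abnormal_instance{name="tikv-0",namespace="demo"} 1
`

func main() {
	const metric = "tidb_operator_abnormal_instance"
	fmt.Println("tikv-0 present:", hasSeries(sample, metric, `name="tikv-0"`))
	fmt.Println("tikv-1 present:", hasSeries(sample, metric, `name="tikv-1"`))
}
```

An e2e case could poll such a check until it returns false after the delete, with a timeout, rather than asserting once.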