Skip to content

fix(metrics): prevent abnormal_instance gauge leaks on deletion paths#6839

Closed
fgksgf wants to merge 1 commit intopingcap:mainfrom
fgksgf:fix/abnormal-instance-metric-leak
Closed

fix(metrics): prevent abnormal_instance gauge leaks on deletion paths#6839
fgksgf wants to merge 1 commit intopingcap:mainfrom
fgksgf:fix/abnormal-instance-metric-leak

Conversation

@fgksgf
Copy link
Copy Markdown
Member

@fgksgf fgksgf commented Apr 21, 2026

Summary

Follow-up to #6835. On a dev EKS we saw the tidb_operator_abnormal_instance gauge retain value=1 for 8 instances whose CR / Pod / parent Group had all been deleted, which would have produced false-positive metric == 1 for: 30m alerts and grown label cardinality across each cluster lifecycle.

Root cause is two independent paths:

  1. Seven instance builders keep Cond tasks inside the CondObjectIsDeleting IfBreak block* (pd/tidb/tiproxy/ticdc/tso/scheduling/tikvworker). TaskInstanceFinalizerDel clears the gauge via ClearInstanceConditionMetrics, but the TaskInstanceConditionReady that follows immediately re-populates it with value=1 because the pod is gone (PodNotCreated). TaskStatusPersister at the tail then fails with NotFound, confirming the CR is already deleted. tikv / tiflash already had the correct shape - only FinalizerDel in the deletion branch. This PR aligns the remaining seven.

  2. Force-delete paths that strip finalizers bypass FinalizerDel entirely. kubectl patch --type=merge -p '{\"metadata\":{\"finalizers\":null}}' is an operationally common action when unwedging a stuck instance (exactly the state this alert exists for), so relying solely on the finalize-time cleanup is insufficient. Add a watch-level DELETE handler on every instance kind that writes the gauge, so the series clears the moment the API server GCs the object. Covers force-delete, operator-down-during-delete, and any other path that reaches the informer without going through FinalizerDel.

Changes

  • pkg/controllers/{pd,tidb,tiproxy,ticdc,tso,scheduling,tikvworker}/builder.go: remove TaskInstanceConditionSynced/Ready/Running + TaskStatusPersister from the CondObjectIsDeleting branch.
  • pkg/metrics/cleanup_handler.go: new RegisterAbnormalInstanceCleanup that attaches a DeleteFunc on each instance kind's shared informer.
  • cmd/tidb-operator/main.go: wire the registration into manager setup.
  • pkg/metrics/abnormal_instance.go: update ClearInstanceConditionMetrics doc to describe both cleanup call sites.
  • tests/e2e/utils/metrics/metrics.go: add generic FetchOperatorMetric helper.
  • tests/e2e/metrics/abnormal_instance_leak.go: two cases on TiKV - graceful delete via the group, and force-delete via finalizer strip after Delete() sets DeletionTimestamp.

Test plan

  • go build ./...
  • go vet ./... on modified packages
  • golangci-lint run on modified packages - 0 issues
  • Unit tests green on pkg/metrics and controller tasks packages
  • E2E Abnormal Instance Gauge Leak against a kind cluster
  • Dev EKS soak: deploy this image, confirm current 8 leaked series disappear after operator restart, then kubectl patch --type=merge -p '{"metadata":{"finalizers":null}}' a TiProxy CR and confirm its series is cleared within seconds

Post-merge of pingcap#6835, the gauge leaked on two paths seen on a dev EKS:

- Seven instance builders (pd/tidb/tiproxy/ticdc/tso/scheduling/tikvworker)
  still ran TaskInstanceConditionSynced/Ready/Running + TaskStatusPersister
  inside the CondObjectIsDeleting IfBreak block. TaskInstanceFinalizerDel
  cleared the gauge, but the CondReady that followed immediately wrote it
  back to 1 because the pod was already gone (PodNotCreated). Align them
  with the tikv/tiflash template, whose deletion branch only runs
  FinalizerDel.

- Force-delete paths that strip finalizers bypass FinalizerDel entirely,
  so the finalize-time cleanup never runs. Add a watch-level DELETE
  handler on every instance kind that writes the gauge, clearing the
  series the moment the API server GCs the object.

Adds two e2e cases on TiKV (graceful delete via group, force delete via
finalizer strip) to lock both cleanup paths in.
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign azurezyq for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot requested a review from shonge April 21, 2026 08:24
@github-actions github-actions Bot added the v2 for operator v2 label Apr 21, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XL label Apr 21, 2026
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.25%. Comparing base (d7ab4fa) to head (0f0eb7a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6839      +/-   ##
==========================================
+ Coverage   37.23%   37.25%   +0.01%     
==========================================
  Files         391      392       +1     
  Lines       22382    22371      -11     
==========================================
  Hits         8334     8334              
+ Misses      14048    14037      -11     
Flag Coverage Δ
unittest 37.25% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fgksgf
Copy link
Copy Markdown
Member Author

fgksgf commented Apr 21, 2026

Closing in favor of a smaller refactor: a single TaskObserveInstance in pkg/controllers/common that handles both observe (when obj exists) and clear (when obj is nil, i.e. watch DELETE), mirroring the shape of TaskTrack. Reviewer feedback: observe and clear should live together. New PR coming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL v2 for operator v2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants