[release-4.21] OCPBUGS-87002: Replace HTTP backend liveness check with admin socket check #786
…check

Use HAProxy admin socket `show version` command for the liveness probe instead of sending an HTTP request to the backend. This directly tests whether the HAProxy process is alive and responsive, rather than testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the liveness probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running.

The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load.

The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the admin socket URL definition to the top of the Run method and reuse it for the Prometheus collector ScrapeURI default, the liveness probe, and the ConfigManager connection info. Remove the hardcoded default from the haproxy metrics package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@openshift-cherrypick-robot: Jira Issue OCPBUGS-67161 has been cloned as Jira Issue OCPBUGS-87002. Will retitle bug to link to clone.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-87002, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug

Requesting review from QA contact:

The bug has been updated to refer to the pull request using the external bug tracker.
|
This is a clean cherry-pick.

/approve

This is a low-medium risk backport that changes the internal implementation of the router controller's liveness check. It has no customer-facing impact except that it reduces the risk that the kubelet terminates and restarts a router pod that is under extreme load.

/label backport-risk-assessed
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah
|
@openshift-cherrypick-robot: all tests passed!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
|
/hold |
|
The team agreed that we can proceed with the backport. |
|
Tested with Cluster bot

➜ oc get clusterversion
➜ oc get route

Send traffic:
➜ oc rsh web-server-rc-74wzz
Summary: Total data: 124476 bytes
Response time histogram: Latency distribution: Details (average, fastest, slowest): Status code distribution: Error distribution:
/ # hey -n 50000 -c 30000 http://service-unsecure-default.apps.ci-ln-3k53cjt-76ef8.aws-4.ci.openshift.org
Summary: Total data: 98762 bytes
Response time histogram: Latency distribution: Details (average, fastest, slowest): Status code distribution: Error distribution:

Checked the logs: the router pod did not reload or restart, and there were no health-check issues.
➜ oc -n openshift-ingress logs router-default-5c8f77f699-v4t5h
➜ Downloads oc -n openshift-ingress get pods

Hence marking as verified.
|
@melvinjoseph86: This PR has been marked as verified by
Merged commit f4c2ba8 into openshift:release-4.21
|
@openshift-cherrypick-robot: Jira Issue Verification Checks: Jira Issue OCPBUGS-87002

Jira Issue OCPBUGS-87002 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
|
@openshift-cherrypick-robot: new pull request created: #789
|
Fix included in release 4.21.0-0.nightly-2026-06-05-205942 |
This is an automated cherry-pick of #737
/assign Miciah
/cherrypick release-4.20