Skip to content

fix(metricsservice): rebuild CA pool on rotation instead of appending#7713

Open
alliasgher wants to merge 3 commits intokedacore:mainfrom
alliasgher:fix-certpool-monotonic-growth
Open

fix(metricsservice): rebuild CA pool on rotation instead of appending#7713
alliasgher wants to merge 3 commits intokedacore:mainfrom
alliasgher:fix-certpool-monotonic-growth

Conversation

@alliasgher
Copy link
Copy Markdown
Contributor

Fixes #7691

On every Kubernetes Secret rotation event the existing code calls certPool.AppendCertsFromPEM on the same long-lived pool. Because x509.CertPool has no mechanism to remove entries, the pool grows unboundedly for the lifetime of the operator pod.

Changes:

  • Introduce a currentPool variable (protected by the existing certMutex) that is replaced wholesale on each rotation instead of appended to.
  • Server side: use GetConfigForClient callback so every new TLS handshake picks up the latest pool without a data race on tls.Config fields.
  • Client side: set InsecureSkipVerify + VerifyPeerCertificate so the fresh pool is read under the lock on each connection (the gosec nolint comment explains why this is safe — real verification happens in VerifyPeerCertificate).

On every Kubernetes Secret rotation event, the old code called
certPool.AppendCertsFromPEM on the same long-lived pool. Because
x509.CertPool has no way to remove entries, the pool grows
unboundedly for the lifetime of the operator pod.

Fix:
- Introduce a currentPool variable (protected by the existing
  certMutex) that is replaced wholesale on each rotation.
- Server side: use GetConfigForClient so every new TLS handshake
  picks up the latest pool without a data race on tls.Config fields.
- Client side: set InsecureSkipVerify+VerifyPeerCertificate so the
  fresh pool is read under the lock on each connection.

Signed-off-by: alliasgher <alliasgher123@gmail.com>
@alliasgher alliasgher requested a review from a team as a code owner May 1, 2026 22:12
@snyk-io
Copy link
Copy Markdown

snyk-io Bot commented May 1, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer as expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team May 1, 2026 22:13
Comment thread pkg/metricsservice/utils/tls.go Fixed
Signed-off-by: alliasgher <alliasgher123@gmail.com>
@JorTurFer
Copy link
Copy Markdown
Member

JorTurFer commented May 3, 2026

/run-e2e internal-
Update: You can check the progress here

Copy link
Copy Markdown
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cloud you update changelog to include this fix?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses the metrics service mTLS certificate-rotation path so CA bundles do not accumulate indefinitely in memory. It updates the TLS credential loader in pkg/metricsservice to rebuild CA pools on secret rotation and to wire the refreshed pool into new server/client connections.

Changes:

  • Rebuild the CA pool from scratch on each Secret rotation instead of appending PEMs to a long-lived x509.CertPool.
  • Add dynamic server-side TLS config selection so new handshakes can observe the latest client CA pool.
  • Add client-side custom certificate verification intended to consult the current CA pool on each connection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/metricsservice/utils/tls.go Outdated
Comment on lines +170 to +177
// VerifyConnection is called after the TLS handshake with the peer's
// verified chain available; we use it to re-check against currentPool so
// that a freshly-rotated CA is honoured without setting InsecureSkipVerify.
config.RootCAs = certPool
config.VerifyConnection = func(cs tls.ConnectionState) error {
certMutex.RLock()
pool := currentPool
certMutex.RUnlock()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 208097f.

Comment thread pkg/metricsservice/utils/tls.go Outdated
Comment on lines +112 to +114
certMutex.Lock()
currentPool = newPool
certMutex.Unlock()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 208097f.

Comment thread pkg/metricsservice/utils/tls.go Outdated
newPool = x509.NewCertPool()
}
if !newPool.AppendCertsFromPEM(pemClientCA) {
log.Error(err, "failed to add client CA's certificate")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 208097f.

@JorTurFer
Copy link
Copy Markdown
Member

duplicated #7700

@JorTurFer JorTurFer added the duplicate This issue or pull request already exists label May 3, 2026
Signed-off-by: alliasgher <alliasgher123@gmail.com>
@keda-automation keda-automation requested a review from a team May 3, 2026 18:11
@alliasgher alliasgher closed this May 3, 2026
@alliasgher alliasgher reopened this May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

duplicate This issue or pull request already exists

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TLS CertPool Monotonic Growth

4 participants