
fix: tolerate transient scaler errors during cache rebuild #7714

Open
aayushbaluni wants to merge 1 commit into kedacore:main from aayushbaluni:fix/7574-scaledjob-transient-ready

Conversation

@aayushbaluni

Problem

During scaler cache rebuild, a transient "redis: client is closed" error incorrectly sets Ready=False on ScaledJob, causing unnecessary condition flapping.

Change

Add tolerance for transient scaler errors during cache rebuild so Ready condition is not set to False for brief mid-rebuild client closures.

Fixes #7574.

Made with Cursor

@aayushbaluni aayushbaluni requested a review from a team as a code owner May 2, 2026 04:27
@snyk-io

snyk-io Bot commented May 2, 2026

Snyk checks have passed. No issues have been found so far.

Scan engine: Open Source Security (Critical 0, High 0, Medium 0, Low 0, Total 0 issues)

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions

github-actions Bot commented May 2, 2026

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry to our changelog in alphabetical order and link the related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • Make sure GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team May 2, 2026 04:27
@JorTurFer
Member

Hello
Thanks for the PR. The current code already tolerates transient errors. The func GetMetricsAndActivityForScaler checks the metric and, on the first failure, drops the cached scaler and recreates it before trying again, without any notification. This happens silently and is intended precisely to tolerate transient errors.

Personally, I'd not add any extra mechanism, but just accept that the upstream has failed twice in a row. Maybe other colleagues have a different opinion; let's give them some time to review the PR.

@rickbrouwer
Member

Agreed with @JorTurFer. Shouldn't this be handled in scalers_cache.go rather than in the callers? The Len = 0 feels more like a race in the cache itself.
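Handling the empty-cache case inside the cache itself, as suggested here, could look roughly like this. The sketch uses hypothetical names; KEDA's real cache in pkg/scaling/cache/scalers_cache.go is considerably more involved, and whether this is where the fix belongs is exactly what is being discussed.

```go
package main

import (
	"fmt"
	"sync"
)

// ScalersCache is a minimal stand-in for KEDA's scalers cache; the entries
// are represented as strings purely for illustration.
type ScalersCache struct {
	mu      sync.RWMutex
	scalers []string
}

// Len reads the scaler count under the read lock, so callers cannot observe
// the cache while a rebuild holds the write lock.
func (c *ScalersCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.scalers)
}

// Replace swaps the cached scalers in one locked step instead of clearing
// first and refilling later, which is the pattern that would expose a
// transient Len = 0 to concurrent readers.
func (c *ScalersCache) Replace(newScalers []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scalers = newScalers
}

func main() {
	c := &ScalersCache{}
	c.Replace([]string{"redis-scaler"})
	fmt.Println(c.Len()) // prints "1"
}
```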

@JorTurFer
Member

The Len = 0 feels more like a race in the cache itself.

This was a race condition that we solved last year IIRC

@aayushbaluni
Author

@JorTurFer @rickbrouwer Thanks for the review and context.

You're right — the existing two-attempt retry in GetMetricsAndActivityForScaler (drop cached scaler → recreate → retry) already covers transient upstream failures. My PR was focused on the Len = 0 scenario, but I understand now that:

  1. The two-strike approach is intentional and sufficient for transient errors.
  2. The Len = 0 race condition was already fixed previously.

Given that, this PR adds unnecessary complexity. I'm happy to close it if you both agree the current behavior is correct as-is. Let me know if there's a different angle you'd like me to explore instead.

@JorTurFer
Member

In which version have you seen Len = 0? I think it's solved already.

@rickbrouwer
Member

rickbrouwer commented May 4, 2026

2.19 is the version reported in the issue. We occasionally see it in 2.19 too; based on a quick analysis, I think it can still happen. The author is welcome to dive into it themselves if they wish. Should they not do so and close the PR, then I will dive in to check whether everything is correct.



Development

Successfully merging this pull request may close these issues.

ScaledJob: transient "redis: client is closed" error during scaler cache rebuild sets Ready=False

3 participants