
fix: tolerate transient scaler errors during cache rebuild #7714

Open
aayushbaluni wants to merge 1 commit into kedacore:main from aayushbaluni:fix/7574-scaledjob-transient-ready

Conversation

@aayushbaluni

Problem

During scaler cache rebuild, a transient "redis: client is closed" error incorrectly sets Ready=False on ScaledJob, causing unnecessary condition flapping.

Change

Add tolerance for transient scaler errors during cache rebuild so Ready condition is not set to False for brief mid-rebuild client closures.

Fixes #7574.

Made with Cursor

@aayushbaluni aayushbaluni requested a review from a team as a code owner May 2, 2026 04:27
@snyk-io

snyk-io Bot commented May 2, 2026

Snyk checks have passed. No issues have been found so far.

Scan engine: Open Source Security (Critical 0, High 0, Medium 0, Low 0, Total 0 issues)

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions

github-actions Bot commented May 2, 2026

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry to our changelog in alphabetical order and link the related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • Make sure GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team May 2, 2026 04:27
@JorTurFer
Member

Hello
Thanks for the PR. The current code already tolerates transient errors. The func GetMetricsAndActivityForScaler checks the metric and, on the first failure, drops the cached scaler and recreates it before trying again, without any notification. This happens silently and is intended precisely to tolerate transient errors.

Personally, I'd not add any extra mechanism, but just accept that the upstream has failed twice in a row. Maybe other colleagues have a different opinion; let's give them some time to review the PR.

@rickbrouwer
Member

Agreed with @JorTurFer. Shouldn't this be handled in scalers_cache.go rather than in the callers? The Len = 0 feels more like a race in the cache itself.
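Handling the empty-cache case inside the cache itself, as suggested here, could look roughly like this. The sketch uses hypothetical names; KEDA's real cache in pkg/scaling/cache/scalers_cache.go is considerably more involved, and whether this is where the fix belongs is exactly what is being discussed.

```go
package main

import (
	"fmt"
	"sync"
)

// ScalersCache is a minimal stand-in for KEDA's scalers cache; the entries
// are represented as strings purely for illustration.
type ScalersCache struct {
	mu      sync.RWMutex
	scalers []string
}

// Len reads the scaler count under the read lock, so callers cannot observe
// the cache while a rebuild holds the write lock.
func (c *ScalersCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.scalers)
}

// Replace swaps the cached scalers in one locked step instead of clearing
// first and refilling later, which is the pattern that would expose a
// transient Len = 0 to concurrent readers.
func (c *ScalersCache) Replace(newScalers []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scalers = newScalers
}

func main() {
	c := &ScalersCache{}
	c.Replace([]string{"redis-scaler"})
	fmt.Println(c.Len()) // prints "1"
}
```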

@JorTurFer
Member

The Len = 0 feels more like a race in the cache itself.

This was a race condition that we solved last year IIRC

@aayushbaluni
Author

@JorTurFer @rickbrouwer Thanks for the review and context.

You're right — the existing two-attempt retry in GetMetricsAndActivityForScaler (drop cached scaler → recreate → retry) already covers transient upstream failures. My PR was focused on the Len = 0 scenario, but I understand now that:

  1. The two-strike approach is intentional and sufficient for transient errors.
  2. The Len = 0 race condition was already fixed previously.

Given that, this PR adds unnecessary complexity. I'm happy to close it if you both agree the current behavior is correct as-is. Let me know if there's a different angle you'd like me to explore instead.

@JorTurFer
Member

In which version have you seen Len = 0? I think it's solved already.

@rickbrouwer
Member

rickbrouwer commented May 4, 2026

2.19 is the version reported in the issue. We occasionally see it in 2.19 too; based on a quick analysis, I think it can still happen. The author is welcome to dive into it themselves if they wish. Should they not do so and close the PR, then I will dive in to check whether everything is correct.



Development

Successfully merging this pull request may close these issues.

ScaledJob: transient "redis: client is closed" error during scaler cache rebuild sets Ready=False

3 participants