
fix(scaling): jitter first tick of scale loop to avoid thundering herd #7676

Open

ggarb wants to merge 1 commit into kedacore:main from ggarb:fix-polling-jitter-6abd2fb3

Conversation


@ggarb ggarb commented Apr 22, 2026

Problem

KEDA spawns one scale loop per ScalableObject (ScaledObject / ScaledJob). Each scale loop uses time.NewTimer(pollingInterval) with no per-object offset (pkg/scaling/scale_handler.go). That means objects whose scale loops are
spawned in a short window — by a bulk creation, a single reconcile pass after an operator restart, or a batch API call — end up polling on the same pollingInterval boundary forever.

For external-metric triggers (Prometheus, custom metrics, etc.) this produces a periodic thundering herd against the upstream metric source and against any client-side infrastructure (DNS, TCP accept queues, per-scaler http.Transport
connection pools) in the path. Under light load the effect is invisible; under even moderate per-cluster scale it becomes a correctness problem.
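
For context, the pre-fix loop shape is roughly the following (a simplified sketch with illustrative names, not the actual scale_handler.go code):

```go
package scaling

import (
	"context"
	"time"
)

// Simplified sketch of the pre-fix loop shape. Every loop started in
// the same window fires on the same pollingInterval boundary forever,
// because the first tick has no per-object offset.
func scaleLoop(ctx context.Context, pollingInterval time.Duration, checkScalers func()) {
	timer := time.NewTimer(pollingInterval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			checkScalers() // polls every trigger for this object
			timer.Reset(pollingInterval)
		}
	}
}
```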

Reproduction

Measured on a 1k ScaledObject Prometheus-trigger load test (bulk-created by kube-burner in a ~33 s window), KEDA 2.19 release image, default pollingInterval: 30s, default KEDA_HTTP_DEFAULT_TIMEOUT=3000:

  • ~50 % of scaler polls fail with context deadline exceeded (Client.Timeout exceeded while awaiting headers).
  • The Prometheus server is idle on the other end: ~29 mCPU, and 100 % of the queries it received completed in <100 ms per its own histogram.
  • KEDA logs ~2500 errors/min; Prom records only a handful of cancelled requests in the same window. Several hundred requests per minute never reach Prom at all.
  • Operator CPU profile shows ~20 % of cycles in net.(*Resolver).goLookupIPCNAMEOrder — DNS re-resolution for every cold connection.
  • Error pattern is distinctly bursty (~500 errors in a 6 s window, ~10 s apart) — aligned tick timers, not tail latency. Raising the timeout to 10 s does not help; timeouts move to 10 s and dial tcp: connection refused errors start
    appearing too.

Fix

Insert a deterministic per-object offset fnv64a(UID) % pollingInterval before the first tick of each scale loop. Subsequent ticks are unchanged. Keyed off the object's UID so the phase is stable across operator restarts — otherwise
every leader-election flip or pod crash would re-align all loops.

Helper lives in pkg/scaling/jitter.go. Call site is a small select against the jitter timer and ctx.Done() before the existing for loop in startScaleLoop.
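
A minimal sketch of the helper and call site as described (names follow the diff excerpt quoted in the review thread below; the committed pkg/scaling/jitter.go is authoritative):

```go
package scaling

import (
	"hash/fnv"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// jitterOffset returns a deterministic offset in [0, pollingInterval)
// derived from an fnv64a hash of the object's UID, so the phase is
// stable across operator restarts. Sketch only; the merged helper may
// differ in detail.
func jitterOffset(uid types.UID, pollingInterval time.Duration) time.Duration {
	if uid == "" || pollingInterval <= 0 {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(uid))
	return time.Duration(h.Sum64() % uint64(pollingInterval))
}
```

The call site is then a single guarded wait before the existing loop (fragment, for illustration):

```go
if offset := jitterOffset(withTriggers.UID, pollingInterval); offset > 0 {
	jitterTimer := time.NewTimer(offset)
	select {
	case <-ctx.Done():
		jitterTimer.Stop()
		return
	case <-jitterTimer.C:
	}
}
```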

Validation

Same cluster, same workload, with this patch. Prometheus' --web.max-connections was raised above its default 512 for these runs — with KEDA's current per-scaler http.Transport (one keepalive per scaler), N > ~500 scalers exceeds
Prom's default connection cap regardless of this fix; once polls are no longer bursting, the connection count stabilizes above the default and Prom starts rejecting new connections. That's a separate efficiency concern (shared
Transport across scalers of the same type) that can be addressed in a follow-up.

1k Prom-trigger:

  • 0 scaler-poll errors sustained over 3+ minutes (was ~2500/min)
  • Operator CPU: 83 → 70 mCPU
  • Operator DNS share of CPU: 20 % → 1.3 % (warm keepalives no longer evicted by bursts)

10k Prom-trigger (same patch):

  • 10 000 / 10 000 ScaledObjects, 0 operator restarts
  • 316 282 Prom queries served, 99.9997 % under 100 ms, 0 scaler-poll errors across ~20 min of wall clock
  • Operator 333 mCPU / 1163 MiB — ~33 mCPU and ~111 MiB per 1 k ScaledObjects, matching the type: cpu sizing baseline within 10 %

Tests

Added pkg/scaling/jitter_test.go covering:

  • Offset is in [0, pollingInterval) across 1000 synthetic UIDs
  • Same UID produces the same offset across repeated calls (stable phase across restarts)
  • Zero or negative interval → 0
  • Empty UID → 0
  • Distribution across 10 000 synthetic UIDs is approximately uniform per-decile (loose binomial tolerance)

go test ./pkg/... passes on this branch.
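
The determinism and range checks above might look like this (a sketch; the committed pkg/scaling/jitter_test.go is authoritative, and the UID is synthetic):

```go
package scaling

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

func TestJitterOffsetProperties(t *testing.T) {
	interval := 30 * time.Second
	uid := types.UID("9f86d081-8c3e-4f0a-b3a1-000000000001") // synthetic UID

	first := jitterOffset(uid, interval)
	if first < 0 || first >= interval {
		t.Fatalf("offset %v outside [0, %v)", first, interval)
	}
	// Same UID must map to the same phase across calls (and restarts).
	if second := jitterOffset(uid, interval); second != first {
		t.Fatalf("offset not deterministic: %v != %v", second, first)
	}
	if got := jitterOffset(uid, 0); got != 0 {
		t.Fatalf("zero interval: want 0, got %v", got)
	}
	if got := jitterOffset(types.UID(""), interval); got != 0 {
		t.Fatalf("empty UID: want 0, got %v", got)
	}
}
```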

Notes

  • Backward-compatible: steady-state polling cadence is unchanged. Only the first tick per scale loop is delayed by up to pollingInterval.
  • Applies to both ScaledObject and ScaledJob (shared startScaleLoop).
  • Does not affect push-scaler paths (startPushScalers is unchanged).

Checklist

  • Tests have been added
  • Changelog has been updated (Fixes → General)
  • Commits are signed (DCO)
  • New scaler — N/A
  • Schema regen — N/A (no scaler schema changes)
  • Helm chart PR — N/A (no deployment manifest changes)
  • Docs PR — N/A (behavior is transparent to users)

@ggarb ggarb requested a review from a team as a code owner April 22, 2026 18:38

snyk-io Bot commented Apr 22, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 22, 2026 18:39
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch 2 times, most recently from ef5a691 to 213347c on April 22, 2026 20:54
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 213347c to 483bc90 on April 23, 2026 15:11
```go
// ~30-second boundary forever, creating a thundering herd against
// external metric sources. Keyed off the object UID so the phase is
// stable across operator restarts.
if offset := jitterOffset(withTriggers.UID, pollingInterval); offset > 0 {
```
Member

@JorTurFer JorTurFer Apr 26, 2026


I like the improvement! The only concern I have is that some specific edge cases can use really long polling intervals. Does it make sense to set a max window? Something like

min(pollingInterval, 1 min)

If the polling interval is short, it will work, but for long polling intervals, the request will be delayed by at most 1 min.
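
In code, that cap might look like this (a hypothetical wrapper, not part of the current diff; assumes Go 1.21's builtin min):

```go
// Hypothetical sketch of the suggested cap: hash into the smaller of
// the polling interval and a fixed one-minute window, so long polling
// intervals delay the first tick by at most one minute.
func cappedJitterOffset(uid types.UID, pollingInterval time.Duration) time.Duration {
	return jitterOffset(uid, min(pollingInterval, time.Minute))
}
```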

@rickbrouwer
Member

I still need to take a good look at the whole thing, but my first question is what is the reason to put this in its own file rather than inline in scale_handler.go?

@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 483bc90 to 3e84673 on May 4, 2026 15:21
@keda-automation keda-automation requested a review from a team May 4, 2026 15:22
Scale loops previously fired their first tick via
time.NewTimer(pollingInterval) with no per-object offset. ScalableObjects
created in a short window (bulk apply, or operator restart re-spawning
existing loops) then polled on the same pollingInterval boundary forever,
creating a periodic thundering herd against external metric sources.

Observed in a load test at 1k Prometheus-trigger ScaledObjects created
by kube-burner in a ~33s window: ~50% of subsequent polls failed with
HTTP client timeouts even at default pollingInterval 30s, while
Prometheus itself was idle (<100ms response, <30 mCPU). Errors came in
bursts of ~500 per 6s window every ~10s -- the signature of aligned
tick timers, not tail latency. Raising KEDA_HTTP_DEFAULT_TIMEOUT from
3s to 10s did not help; the timeouts simply moved to 10s.

This change inserts a deterministic per-object offset (hash(UID) mod
pollingInterval) before the first tick, spreading scale loops spawned
in a batch across the polling interval. Keyed off UID so the phase is
stable across operator restarts -- otherwise every leader-election
flip would re-align all loops.

Subsequent ticks continue at pollingInterval with no change in
semantics. Added helper jitterOffset in pkg/scaling/jitter.go with
unit tests covering determinism, range, zero/empty inputs, and
distribution across 10k synthetic UIDs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Garber <ggarb@netflix.com>
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 3e84673 to 3c447de on May 4, 2026 16:04