
fix(scaling): jitter first tick of scale loop to avoid thundering herd #7676

Open

ggarb wants to merge 1 commit into kedacore:main from ggarb:fix-polling-jitter-6abd2fb3

Conversation


@ggarb ggarb commented Apr 22, 2026

Problem

KEDA spawns one scale loop per ScalableObject (ScaledObject / ScaledJob). Each scale loop uses time.NewTimer(pollingInterval) with no per-object offset (pkg/scaling/scale_handler.go). That means objects whose scale loops are
spawned in a short window — by a bulk creation, a single reconcile pass after an operator restart, or a batch API call — end up polling on the same pollingInterval boundary forever.

For external-metric triggers (Prometheus, custom metrics, etc.) this produces a periodic thundering herd against the upstream metric source and against any client-side infrastructure (DNS, TCP accept queues, per-scaler http.Transport
connection pools) in the path. Under light load the effect is invisible; under even moderate per-cluster scale it becomes a correctness problem.
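
For context, the pre-fix loop shape is roughly the following (a simplified sketch with illustrative names, not the actual scale_handler.go code):

```go
package scaling

import (
	"context"
	"time"
)

// Simplified sketch of the pre-fix loop shape. Every loop started in
// the same window fires on the same pollingInterval boundary forever,
// because the first tick has no per-object offset.
func scaleLoop(ctx context.Context, pollingInterval time.Duration, checkScalers func()) {
	timer := time.NewTimer(pollingInterval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			checkScalers() // polls every trigger for this object
			timer.Reset(pollingInterval)
		}
	}
}
```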

Reproduction

Measured on a 1k ScaledObject Prometheus-trigger load test (bulk-created by kube-burner in a ~33 s window), KEDA 2.19 release image, default pollingInterval: 30s, default KEDA_HTTP_DEFAULT_TIMEOUT=3000:

  • ~50 % of scaler polls fail with context deadline exceeded (Client.Timeout exceeded while awaiting headers).
  • The Prometheus server is idle on the other end: ~29 mCPU, and 100 % of the queries it received completed in <100 ms per its own histogram.
  • KEDA logs ~2500 errors/min; Prom records only a handful of cancelled requests in the same window. Several hundred requests per minute never reach Prom at all.
  • Operator CPU profile shows ~20 % of cycles in net.(*Resolver).goLookupIPCNAMEOrder — DNS re-resolution for every cold connection.
  • Error pattern is distinctly bursty (~500 errors in a 6 s window, ~10 s apart) — aligned tick timers, not tail latency. Raising the timeout to 10 s does not help; timeouts move to 10 s and dial tcp: connection refused errors start
    appearing too.

Fix

Insert a deterministic per-object offset fnv64a(UID) % pollingInterval before the first tick of each scale loop. Subsequent ticks are unchanged. Keyed off the object's UID so the phase is stable across operator restarts — otherwise
every leader-election flip or pod crash would re-align all loops.

Helper lives in pkg/scaling/jitter.go. Call site is a small select against the jitter timer and ctx.Done() before the existing for loop in startScaleLoop.
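
A minimal sketch of the helper and call site as described (names follow the diff excerpt quoted in the review thread below; the committed pkg/scaling/jitter.go is authoritative):

```go
package scaling

import (
	"hash/fnv"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// jitterOffset returns a deterministic offset in [0, pollingInterval)
// derived from an fnv64a hash of the object's UID, so the phase is
// stable across operator restarts. Sketch only; the merged helper may
// differ in detail.
func jitterOffset(uid types.UID, pollingInterval time.Duration) time.Duration {
	if uid == "" || pollingInterval <= 0 {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(uid))
	return time.Duration(h.Sum64() % uint64(pollingInterval))
}
```

The call site is then a single guarded wait before the existing loop (fragment, for illustration):

```go
if offset := jitterOffset(withTriggers.UID, pollingInterval); offset > 0 {
	jitterTimer := time.NewTimer(offset)
	select {
	case <-ctx.Done():
		jitterTimer.Stop()
		return
	case <-jitterTimer.C:
	}
}
```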

Validation

Same cluster, same workload, with this patch. Prometheus' --web.max-connections was raised above its default 512 for these runs — with KEDA's current per-scaler http.Transport (one keepalive per scaler), N > ~500 scalers exceeds
Prom's default connection cap regardless of this fix; once polls are no longer bursting, the connection count stabilizes above the default and Prom starts rejecting new connections. That's a separate efficiency concern (shared
Transport across scalers of the same type) that can be addressed in a follow-up.

1k Prom-trigger:

  • 0 scaler-poll errors sustained over 3+ minutes (was ~2500/min)
  • Operator CPU: 83 → 70 mCPU
  • Operator DNS share of CPU: 20 % → 1.3 % (warm keepalives no longer evicted by bursts)

10k Prom-trigger (same patch):

  • 10 000 / 10 000 ScaledObjects, 0 operator restarts
  • 316 282 Prom queries served, 99.9997 % under 100 ms, 0 scaler-poll errors across ~20 min of wall clock
  • Operator 333 mCPU / 1163 MiB — ~33 mCPU and ~111 MiB per 1 k ScaledObjects, matching the type: cpu sizing baseline within 10 %

Tests

Added pkg/scaling/jitter_test.go covering:

  • Offset is in [0, pollingInterval) across 1000 synthetic UIDs
  • Same UID produces the same offset across repeated calls (stable phase across restarts)
  • Zero or negative interval → 0
  • Empty UID → 0
  • Distribution across 10 000 synthetic UIDs is approximately uniform per-decile (loose binomial tolerance)

go test ./pkg/... passes on this branch.
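
The determinism and range checks above might look like this (a sketch; the committed pkg/scaling/jitter_test.go is authoritative, and the UID is synthetic):

```go
package scaling

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

func TestJitterOffsetProperties(t *testing.T) {
	interval := 30 * time.Second
	uid := types.UID("9f86d081-8c3e-4f0a-b3a1-000000000001") // synthetic UID

	first := jitterOffset(uid, interval)
	if first < 0 || first >= interval {
		t.Fatalf("offset %v outside [0, %v)", first, interval)
	}
	// Same UID must map to the same phase across calls (and restarts).
	if second := jitterOffset(uid, interval); second != first {
		t.Fatalf("offset not deterministic: %v != %v", second, first)
	}
	if got := jitterOffset(uid, 0); got != 0 {
		t.Fatalf("zero interval: want 0, got %v", got)
	}
	if got := jitterOffset(types.UID(""), interval); got != 0 {
		t.Fatalf("empty UID: want 0, got %v", got)
	}
}
```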

Notes

  • Backward-compatible: steady-state polling cadence is unchanged. Only the first tick per scale loop is delayed by up to pollingInterval.
  • Applies to both ScaledObject and ScaledJob (shared startScaleLoop).
  • Does not affect push-scaler paths (startPushScalers is unchanged).

Checklist

  • Tests have been added
  • Changelog has been updated (Fixes → General)
  • Commits are signed (DCO)
  • New scaler — N/A
  • Schema regen — N/A (no scaler schema changes)
  • Helm chart PR — N/A (no deployment manifest changes)
  • Docs PR — N/A (behavior is transparent to users)

@ggarb ggarb requested a review from a team as a code owner April 22, 2026 18:38

snyk-io Bot commented Apr 22, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 22, 2026 18:39
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch 2 times, most recently from ef5a691 to 213347c on April 22, 2026 20:54
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 213347c to 483bc90 on April 23, 2026 15:11
```go
// ~30-second boundary forever, creating a thundering herd against
// external metric sources. Keyed off the object UID so the phase is
// stable across operator restarts.
if offset := jitterOffset(withTriggers.UID, pollingInterval); offset > 0 {
```
Member

@JorTurFer JorTurFer Apr 26, 2026


I like the improvement! The only concern I have is that some specific edge cases can use really long polling intervals. Does it make sense to set a max window? Something like

min(pollingInterval, 1 min)

If the polling interval is short, it will work, but for long polling intervals, the request will be delayed by at most 1 min.
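
In code, that cap might look like this (a hypothetical wrapper, not part of the current diff; assumes Go 1.21's builtin min):

```go
// Hypothetical sketch of the suggested cap: hash into the smaller of
// the polling interval and a fixed one-minute window, so long polling
// intervals delay the first tick by at most one minute.
func cappedJitterOffset(uid types.UID, pollingInterval time.Duration) time.Duration {
	return jitterOffset(uid, min(pollingInterval, time.Minute))
}
```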

@rickbrouwer
Member

I still need to take a good look at the whole thing, but my first question is what is the reason to put this in its own file rather than inline in scale_handler.go?

@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 483bc90 to 3e84673 on May 4, 2026 15:21
@keda-automation keda-automation requested a review from a team May 4, 2026 15:22
Scale loops previously fired their first tick via
time.NewTimer(pollingInterval) with no per-object offset. ScalableObjects
created in a short window (bulk apply, or operator restart re-spawning
existing loops) then polled on the same pollingInterval boundary forever,
creating a periodic thundering herd against external metric sources.

Observed in a load test at 1k Prometheus-trigger ScaledObjects created
by kube-burner in a ~33s window: ~50% of subsequent polls failed with
HTTP client timeouts even at default pollingInterval 30s, while
Prometheus itself was idle (<100ms response, <30 mCPU). Errors came in
bursts of ~500 per 6s window every ~10s -- the signature of aligned
tick timers, not tail latency. Raising KEDA_HTTP_DEFAULT_TIMEOUT from
3s to 10s did not help; the timeouts simply moved to 10s.

This change inserts a deterministic per-object offset (hash(UID) mod
pollingInterval) before the first tick, spreading scale loops spawned
in a batch across the polling interval. Keyed off UID so the phase is
stable across operator restarts -- otherwise every leader-election
flip would re-align all loops.

Subsequent ticks continue at pollingInterval with no change in
semantics. Added helper jitterOffset in pkg/scaling/jitter.go with
unit tests covering determinism, range, zero/empty inputs, and
distribution across 10k synthetic UIDs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Garber <ggarb@netflix.com>
@ggarb ggarb force-pushed the fix-polling-jitter-6abd2fb3 branch from 3e84673 to 3c447de on May 4, 2026 16:04