fix(webhook): use field indexes in verifyScaledObjects to avoid O(N) admission cost#7681
ggarb wants to merge 2 commits into kedacore:main
Conversation
✅ Snyk checks have passed. No issues have been found so far.
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide.
…admission cost

verifyScaledObjects performs two duplicate-conflict checks on every ScaledObject admission: one for duplicate scaleTargetRef and one for duplicate HPA name. Both checks listed ALL ScaledObjects in the namespace via kc.List, which, because controller-runtime's cached client DeepCopies every returned item, allocated O(N) memory per admission.

Measured impact with a heap profile during a 10k SO creation burst: verifyScaledObjects consumed 71% of inuse_space (106 MB at peak). At 60k SOs the list allocates ~900 MB per admission; at a 30/s creation rate that is ~27 GB/s of allocation, which outpaces MADV_FREE and causes the webhook OOMKill loop seen in production-scale tests.

The fix registers two controller-runtime field indexes in SetupWebhookWithManager:

- spec.scaleTargetRef.name → so.Spec.ScaleTargetRef.Name
- spec.hpaName → getHpaName(*so) (computed default or explicit override)

verifyScaledObjects now issues two narrow indexed List calls (each returning 0–1 items in the common case) instead of one full-namespace scan. Per-admission allocation drops from O(N * object_size) to O(1 * object_size) regardless of cluster scale.

Pairs with kedacore#7670 (remove eager MarshalIndent from the same loop) for a complete webhook memory fix at scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Garber <ggarb@netflix.com>
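To make the indexed-lookup pattern concrete, here is a minimal Go sketch of the approach this commit describes: register the two ScaledObject indexes (typically from `SetupWebhookWithManager`) and issue a narrow query. It assumes controller-runtime's `Manager`/`Client` APIs and reuses names from this PR (`scaleTargetRefNameIdx`, `hpaNameIdx`, `getHpaName`); the helper functions `registerScaledObjectIndexes` and `listConflictCandidates` are illustrative, not the PR's actual code.

```go
package v1alpha1

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	scaleTargetRefNameIdx = "spec.scaleTargetRef.name"
	hpaNameIdx            = "spec.hpaName"
)

// registerScaledObjectIndexes would be called during webhook setup, before the
// manager starts serving admission requests.
func registerScaledObjectIndexes(ctx context.Context, mgr ctrl.Manager) error {
	// Index ScaledObjects by the workload they target.
	if err := mgr.GetFieldIndexer().IndexField(ctx, &ScaledObject{}, scaleTargetRefNameIdx,
		func(obj client.Object) []string {
			return []string{obj.(*ScaledObject).Spec.ScaleTargetRef.Name}
		}); err != nil {
		return err
	}
	// Index ScaledObjects by the HPA name they will own (computed default or explicit override).
	return mgr.GetFieldIndexer().IndexField(ctx, &ScaledObject{}, hpaNameIdx,
		func(obj client.Object) []string {
			return []string{getHpaName(*obj.(*ScaledObject))}
		})
}

// listConflictCandidates shows the narrow query: only ScaledObjects whose
// scaleTargetRef.name matches are listed (and DeepCopied) from the cache,
// instead of every ScaledObject in the namespace.
func listConflictCandidates(ctx context.Context, kc client.Client, so *ScaledObject) (*ScaledObjectList, error) {
	candidates := &ScaledObjectList{}
	err := kc.List(ctx, candidates,
		client.InNamespace(so.Namespace),
		client.MatchingFields{scaleTargetRefNameIdx: so.Spec.ScaleTargetRef.Name})
	return candidates, err
}
```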
12a232e to 966b7c5
…rifyHpas

Three changes to reduce webhook memory pressure during SO creation bursts:

1. Replace unconditional json.MarshalIndent calls in ValidateCreate, ValidateUpdate, isRemovingFinalizer, and the verifyHpas loop with structured logr key-value logging. The marshals ran on every admission even when V(1) logging was disabled, generating 60-100 KB of transient garbage per request; at burst=60 this outpaced MADV_FREE and caused repeated OOMKills despite the 20 GiB limit.

2. Replace isRemovingFinalizer's JSON string comparison with reflect.DeepEqual, eliminating two spec marshals per update admission.

3. Add hpaScaleTargetNameIdx field index for HPA objects and switch verifyHpas to an indexed List, reducing it from O(N_hpas) to O(1), the same fix pattern as the A1d verifyScaledObjects change (kedacore#7681).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
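A brief sketch of the logging and comparison changes this commit describes, in the spirit of the diff rather than the exact code: the logger is passed in explicitly here (the webhook's actual package-level logger name is not shown in this excerpt), and `logAdmission`/`specsEqual` are illustrative names.

```go
package v1alpha1

import (
	"reflect"

	"github.com/go-logr/logr"
)

// Structured key-value logging replaces json.MarshalIndent of the whole object:
// no multi-kilobyte indented JSON string is built per admission, and the values
// are only rendered by the sink when V(1) is actually enabled.
func logAdmission(log logr.Logger, so *ScaledObject) {
	log.V(1).Info("validating ScaledObject",
		"namespace", so.Namespace,
		"name", so.Name,
		"scaleTargetRef", so.Spec.ScaleTargetRef.Name)
}

// For isRemovingFinalizer's spec comparison: reflect.DeepEqual replaces
// marshaling both specs to JSON strings and comparing the strings.
func specsEqual(oldSO, newSO *ScaledObject) bool {
	return reflect.DeepEqual(oldSO.Spec, newSO.Spec)
}
```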
verifyHpas issued an unfiltered kc.List over all HPAs in the namespace on every ScaledObject admission. At 60k HPAs this allocates the entire namespace's HPA list per admission, the same O(N) anti-pattern fixed for verifyScaledObjects in the previous commit.

Add hpaScaleTargetNameIdx (spec.scaleTargetRef.name on HPA objects) in SetupWebhookWithManager and switch verifyHpas to an indexed List, narrowing candidates to the 0–1 HPAs that target the same workload.

Also replace the per-HPA json.MarshalIndent debug log with a structured logr call, since the marshal ran unconditionally regardless of log level.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
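The same pattern, sketched for the HPA side of this commit: register an index on HPA objects and list only HPAs targeting the incoming ScaledObject's workload. `hpaScaleTargetNameIdx` matches the constant added in this PR; `registerHpaIndex` and `listHpasForTarget` are illustrative names, not the PR's actual functions.

```go
package v1alpha1

import (
	"context"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const hpaScaleTargetNameIdx = "spec.scaleTargetRef.name"

// registerHpaIndex indexes HPAs by the name of the workload they scale.
func registerHpaIndex(ctx context.Context, mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &autoscalingv2.HorizontalPodAutoscaler{},
		hpaScaleTargetNameIdx, func(obj client.Object) []string {
			hpa := obj.(*autoscalingv2.HorizontalPodAutoscaler)
			return []string{hpa.Spec.ScaleTargetRef.Name}
		})
}

// listHpasForTarget returns only the HPAs that target the same workload as the
// incoming ScaledObject, instead of every HPA in the namespace.
func listHpasForTarget(ctx context.Context, kc client.Client, so *ScaledObject) (*autoscalingv2.HorizontalPodAutoscalerList, error) {
	hpas := &autoscalingv2.HorizontalPodAutoscalerList{}
	err := kc.List(ctx, hpas,
		client.InNamespace(so.Namespace),
		client.MatchingFields{hpaScaleTargetNameIdx: so.Spec.ScaleTargetRef.Name})
	return hpas, err
}
```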
807ea2a to 6e80753
Pull request overview
This PR reduces admission webhook memory/CPU overhead at scale by replacing full-namespace list scans in ScaledObject/HPA validation with controller-runtime cache field-index lookups.
Changes:
- Register cache field indexes for ScaledObject `spec.scaleTargetRef.name`, the computed HPA name, and HPA `spec.scaleTargetRef.name` during webhook setup.
- Update `verifyScaledObjects` and `verifyHpas` to use `client.MatchingFields` queries instead of unfiltered `List` calls.
- Update the changelog to document the performance fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| apis/keda/v1alpha1/scaledobject_webhook.go | Adds field indexes and switches validation list operations to indexed lookups to avoid O(N) DeepCopy cost per admission. |
| CHANGELOG.md | Documents the webhook performance improvement under Fixes → General. |
// Field index names used by verifyScaledObjects to avoid O(N) full-namespace
// list scans on every SO admission. Without these indexes each admission must
// DeepCopy every ScaledObject in the namespace to find conflicts; at 60k SOs
// each admission allocates ~900 MiB, which is why the webhook OOMs under
// creation bursts. The indexes narrow candidates to 0–1 objects.
The comment claims the indexes narrow candidates to “0–1 objects” and describes the lookups as O(1). In practice the ScaledObject index is only on spec.scaleTargetRef.name, so a namespace can legitimately have multiple ScaledObjects targeting different GVKs with the same name (e.g., Deployment "foo" and StatefulSet "foo"), making the lookup O(k) where k is the number of matches for that name. Consider rewording this comment to avoid implying a strict O(1)/0–1 guarantee.
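To illustrate the reviewer's point, here is a small sketch of the kind of post-list filtering implied: after the indexed (name-only) query, the loop still compares the full target reference before reporting a conflict. `checkTargetConflict` is a hypothetical helper, not the PR's actual function, and the conflict semantics are simplified (e.g. it ignores how KEDA defaults an empty Kind).

```go
package v1alpha1

import "fmt"

// checkTargetConflict scans the name-matched candidates returned by the indexed
// List and flags a conflict only when the whole target reference matches, so
// Deployment "foo" and StatefulSet "foo" do not collide.
func checkTargetConflict(incoming *ScaledObject, candidates *ScaledObjectList) error {
	for i := range candidates.Items {
		other := &candidates.Items[i]
		if other.Namespace == incoming.Namespace && other.Name == incoming.Name {
			continue // the object currently being admitted (update case)
		}
		if other.Spec.ScaleTargetRef.Kind == incoming.Spec.ScaleTargetRef.Kind &&
			other.Spec.ScaleTargetRef.APIVersion == incoming.Spec.ScaleTargetRef.APIVersion {
			return fmt.Errorf("workload %q is already managed by ScaledObject %q",
				incoming.Spec.ScaleTargetRef.Name, other.Name)
		}
	}
	return nil
}
```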
JorTurFer
left a comment
wow! I didn't know about this optimisation and it looks really nice!
Could you fix the DCO check?
/run-e2e internal
rickbrouwer
left a comment
Could you add unit tests covering the new indexed lookups? I would really like to see some test cases here.
scaleTargetRefNameIdx = "spec.scaleTargetRef.name"
hpaNameIdx = "spec.hpaName"
// hpaScaleTargetNameIdx indexes HPA objects by spec.scaleTargetRef.name so
// verifyHpas can issue an O(1) lookup instead of listing every HPA in the
// namespace. Index names are scoped per-GVK so reusing the same path string
// as scaleTargetRefNameIdx is safe.
hpaScaleTargetNameIdx = "spec.scaleTargetRef.name"
const scaleTargetRefNameIdx and hpaScaleTargetNameIdx both have the same value
dttung2905
left a comment
Thanks for the PR, great analysis btw 🚀 We still need DCO to be signed. I might have more questions as I go into this in more detail 🙇
ctx := context.Background()

// Check 1: no other SO in this namespace already manages the same workload.
// Use the scaleTargetRef.name index so only SOs targeting the same resource
After listing only by scaleTargetRef.name, the loop still must filter by GVK. Can you add or point to a test that has two targets named foo (e.g. Deployment + StatefulSet) and proves no false positive?
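For what such a test could look like, here is a sketch of the case being asked for, written against the hypothetical `checkTargetConflict` helper sketched earlier in this thread (not the PR's actual validation entry point): two targets named "foo" with different Kinds must not be treated as a conflict.

```go
package v1alpha1

import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSameTargetNameDifferentKindIsNotAConflict(t *testing.T) {
	incoming := &ScaledObject{
		ObjectMeta: metav1.ObjectMeta{Name: "so-statefulset", Namespace: "default"},
		Spec: ScaledObjectSpec{
			ScaleTargetRef: &ScaleTarget{Name: "foo", Kind: "StatefulSet", APIVersion: "apps/v1"},
		},
	}
	existing := ScaledObject{
		ObjectMeta: metav1.ObjectMeta{Name: "so-deployment", Namespace: "default"},
		Spec: ScaledObjectSpec{
			ScaleTargetRef: &ScaleTarget{Name: "foo", Kind: "Deployment", APIVersion: "apps/v1"},
		},
	}
	// Simulates what the indexed query would return for scaleTargetRef.name == "foo".
	candidates := &ScaledObjectList{Items: []ScaledObject{existing}}

	if err := checkTargetConflict(incoming, candidates); err != nil {
		t.Fatalf("expected no conflict for same target name with different Kind, got: %v", err)
	}
}
```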
Problem
`verifyScaledObjects` performs two conflict checks on every ScaledObject admission: one for a duplicate `scaleTargetRef` and one for a duplicate HPA name. Both checks called `kc.List` (the controller-runtime cached client) without any field filter, fetching every ScaledObject in the namespace. Because the cached client DeepCopies every returned item to prevent callers from mutating shared state, this allocates O(N × object_size) per admission.

`verifyHpas` had the same problem: an unfiltered `kc.List` over all HPAs in the namespace on every admission, allocating the entire namespace's HPA list regardless of how many HPAs are actually relevant (typically 0–1).

Measured impact
Heap profile during a 10k ScaledObject creation burst on KEDA 2.19: `verifyScaledObjects` accounted for 71% of inuse_space (106 MB at peak).
RSS during burst reached 748 MiB at ~9k SOs, spiking erratically. At 60k SOs per namespace, each admission allocates ~900 MiB; at 30 admissions/s creation rate this is ~27 GB/s of allocation — enough to overwhelm MADV_FREE and cause the webhook OOMKill loop seen in production-scale tests.
Fix
Register three controller-runtime field indexes in `SetupWebhookWithManager`:

- `spec.scaleTargetRef.name` on ScaledObject: the target workload name
- `spec.hpaName` on ScaledObject: the computed or explicit HPA name (via `getHpaName`)
- `spec.scaleTargetRef.name` on HPA: used by `verifyHpas` to narrow to HPAs targeting the same workload

`verifyScaledObjects` then issues two narrow `client.MatchingFields` queries instead of one full-namespace scan. `verifyHpas` issues one narrow query instead of listing every HPA. In the common case (no duplicates) each query returns 0–1 items; DeepCopy cost collapses from O(N × object_size) to O(1 × object_size) regardless of namespace scale.
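For the second conflict check, a short sketch of the HPA-name query assumed above; `hpaNameIdx` and `getHpaName` come from this PR, while the surrounding function is illustrative.

```go
package v1alpha1

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Only ScaledObjects whose computed/explicit HPA name equals the incoming
// object's HPA name are returned; in the common case this list is empty.
func listHpaNameConflicts(ctx context.Context, kc client.Client, so *ScaledObject) (*ScaledObjectList, error) {
	candidates := &ScaledObjectList{}
	err := kc.List(ctx, candidates,
		client.InNamespace(so.Namespace),
		client.MatchingFields{hpaNameIdx: getHpaName(*so)})
	return candidates, err
}
```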
Validation

`verifyScaledObjects` fix (10k creation burst, before and after):
[before/after heap profiles: `verifyScaledObjects` inuse heap]

`verifyHpas` fix (60k creation burst, before and after, with the `verifyScaledObjects` fix applied):
[before/after heap profiles]
Post-fix, the dominant inuse allocations are `cache.storeIndex.addKeyToIndex` (the indexer populating as SOs arrive), an expected steady-state cost proportional to N.

Extrapolated to 60k SOs: peak ~666 MiB, well within a 2 GiB webhook limit. Pre-fix, 20 GiB was insufficient.
Notes
Pairs with #7670 (removes eager `MarshalIndent` from admission hot paths); together they address the two main webhook memory issues at scale.

Checklist
`KUBEBUILDER_ASSETS` (runs in CI)

Relates to #7670