
fix(webhook): use field indexes in verifyScaledObjects to avoid O(N) admission cost#7681

Open
ggarb wants to merge 2 commits into kedacore:main from ggarb:fix-webhook-verifyScaledObjects-indexer

Conversation


@ggarb ggarb commented Apr 23, 2026

Problem

verifyScaledObjects performs two conflict checks on every ScaledObject admission:

  1. Does any existing SO in this namespace already manage the same scaleTargetRef?
  2. Does any existing SO already own the same HPA name?

Both checks called kc.List (the controller-runtime cached client) without any field filter, fetching every ScaledObject in the namespace. Because the cached client DeepCopies every returned item to prevent callers from mutating shared state, this allocates O(N × object_size) per admission.

verifyHpas had the same problem: an unfiltered kc.List over all HPAs in the namespace on every admission, allocating the entire namespace's HPA list regardless of how many HPAs are actually relevant (typically 0–1).
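For illustration, a minimal sketch of the pre-fix shape of the check (not the exact KEDA source; the helper name is hypothetical, the types are the real v1alpha1 API types):

```go
package v1alpha1

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findScaleTargetConflict is a hypothetical condensation of the pre-fix pattern.
func findScaleTargetConflict(ctx context.Context, kc client.Client, incoming *ScaledObject) (bool, error) {
	var soList ScaledObjectList
	// Unfiltered, namespace-wide List: the cached client DeepCopies every
	// ScaledObject in the namespace into soList before any of them is inspected.
	if err := kc.List(ctx, &soList, client.InNamespace(incoming.Namespace)); err != nil {
		return false, err
	}
	for _, existing := range soList.Items {
		if existing.Name == incoming.Name {
			continue // skip the object being admitted
		}
		// scaleTargetRef is a required field, so it is assumed to be set here.
		if existing.Spec.ScaleTargetRef.Name == incoming.Spec.ScaleTargetRef.Name {
			return true, nil // another SO already manages this workload
		}
	}
	return false, nil
}
```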

Measured impact

Heap profile during 10k ScaledObject creation burst on KEDA 2.19:

github.com/kedacore/keda/v2/apis/keda/v1alpha1.verifyScaledObjects   106 MB  (71 % of total)
  └─ apis/keda/v1alpha1.(*ScaledObject).DeepCopy                     85 MB  (57 %)
     └─ meta/v1.(*ObjectMeta).DeepCopyInto / (*FieldsV1).DeepCopyInto / ...

RSS during burst reached 748 MiB at ~9k SOs, spiking erratically. At 60k SOs per namespace, each admission allocates ~900 MiB; at 30 admissions/s creation rate this is ~27 GB/s of allocation — enough to overwhelm MADV_FREE and cause the webhook OOMKill loop seen in production-scale tests.

Fix

Register three controller-runtime field indexes in SetupWebhookWithManager:

  • spec.scaleTargetRef.name on ScaledObject — the target workload name
  • spec.hpaName on ScaledObject — the computed or explicit HPA name (via getHpaName)
  • spec.scaleTargetRef.name on HPA — used by verifyHpas to narrow to HPAs targeting the same workload

verifyScaledObjects then issues two narrow client.MatchingFields queries instead of one full-namespace scan. verifyHpas issues one narrow query instead of listing every HPA. In the common case (no duplicates) each query returns 0–1 items; DeepCopy cost collapses from O(N × object_size) to O(1 × object_size) regardless of namespace scale.
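A sketch of the change, assuming the registration lives in SetupWebhookWithManager: the constant names follow the ones quoted in review below, while the function names (registerWebhookIndexes, listSameTarget) are illustrative rather than the exact identifiers in the diff; getHpaName is the existing helper mentioned above.

```go
package v1alpha1

import (
	"context"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	scaleTargetRefNameIdx = "spec.scaleTargetRef.name"
	hpaNameIdx            = "spec.hpaName"
	// Index names are scoped per GVK, so the HPA index can reuse the same path string.
	hpaScaleTargetNameIdx = "spec.scaleTargetRef.name"
)

// registerWebhookIndexes (hypothetical name) registers the three field indexes.
func registerWebhookIndexes(ctx context.Context, mgr ctrl.Manager) error {
	indexer := mgr.GetFieldIndexer()
	// ScaledObjects by the workload name they target.
	if err := indexer.IndexField(ctx, &ScaledObject{}, scaleTargetRefNameIdx, func(obj client.Object) []string {
		return []string{obj.(*ScaledObject).Spec.ScaleTargetRef.Name}
	}); err != nil {
		return err
	}
	// ScaledObjects by the HPA name they will own (computed default or explicit override).
	if err := indexer.IndexField(ctx, &ScaledObject{}, hpaNameIdx, func(obj client.Object) []string {
		return []string{getHpaName(*obj.(*ScaledObject))}
	}); err != nil {
		return err
	}
	// HPAs by the workload name they target, for verifyHpas.
	return indexer.IndexField(ctx, &autoscalingv2.HorizontalPodAutoscaler{}, hpaScaleTargetNameIdx, func(obj client.Object) []string {
		return []string{obj.(*autoscalingv2.HorizontalPodAutoscaler).Spec.ScaleTargetRef.Name}
	})
}

// One of the two narrowed checks in verifyScaledObjects: the cached client
// now DeepCopies only the 0-1 ScaledObjects whose index value matches.
func listSameTarget(ctx context.Context, kc client.Client, incoming *ScaledObject) (*ScaledObjectList, error) {
	var soList ScaledObjectList
	err := kc.List(ctx, &soList,
		client.InNamespace(incoming.Namespace),
		client.MatchingFields{scaleTargetRefNameIdx: incoming.Spec.ScaleTargetRef.Name})
	return &soList, err
}
```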

Validation

verifyScaledObjects fix — 10k creation burst, before and after:

| Metric | Before | After |
| --- | --- | --- |
| Peak webhook RSS during burst | 748 MiB | 111 MiB |
| verifyScaledObjects inuse heap | 106 MB (71 %) | 0 % |
| Total inuse heap at 10k SOs | 148 MiB | 56 MiB |
| Growth pattern | Volatile spikes + GC thrash | Smooth ~11 MiB/1k SOs (informer cache only) |

verifyHpas fix — 60k creation burst, before and after (with verifyScaledObjects fix applied):

| Metric | Before | After |
| --- | --- | --- |
| Webhook OOMKills (20 GiB limit) | 12 | 0 |
| Webhook RSS at peak | >20 GiB | ~1 MiB |
| Operator restarts | 0 | 0 |

Post-fix, the dominant inuse allocations are cache.storeIndex.addKeyToIndex (the indexer populating as SOs arrive) — expected steady-state cost, proportional to N.

Extrapolated to 60k SOs: peak ~666 MiB, well within a 2 GiB webhook limit. Pre-fix, 20 GiB was insufficient.

Notes

Checklist

  • Tests have been added (non-envtest unit tests pass; envtest suite validates indexed List but requires KUBEBUILDER_ASSETS — runs in CI)
  • Changelog has been updated (Fixes → General)
  • Commits are signed (DCO)
  • New scaler — N/A
  • Schema regen — N/A
  • Helm chart PR — N/A
  • Docs PR — N/A

Relates to #7670

@ggarb ggarb requested a review from a team as a code owner April 23, 2026 16:24
@keda-automation keda-automation requested a review from a team April 23, 2026 16:24

snyk-io Bot commented Apr 23, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
|  | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

…admission cost

verifyScaledObjects performs two duplicate-conflict checks on every
ScaledObject admission: one for duplicate scaleTargetRef and one for
duplicate HPA name. Both checks listed ALL ScaledObjects in the
namespace via kc.List, which — because controller-runtime's cached
client DeepCopies every returned item — allocated O(N) memory per
admission.

Measured impact with a heap profile during 10k SO creation burst:
verifyScaledObjects consumed 71 % of inuse_space (106 MB at peak).
At 60k SOs the list allocates ~900 MB per admission; at 30/s creation
rate that is ~27 GB/s of allocation, which outpaces MADV_FREE and
causes the webhook OOMKill loop seen in production-scale tests.

The fix registers two controller-runtime field indexes in
SetupWebhookWithManager:

  spec.scaleTargetRef.name  → so.Spec.ScaleTargetRef.Name
  spec.hpaName              → getHpaName(*so)   (computed default or explicit override)

verifyScaledObjects now issues two narrow indexed List calls (each
returning 0–1 items in the common case) instead of one full-namespace
scan. Per-admission allocation drops from O(N * object_size) to
O(1 * object_size) regardless of cluster scale.

Pairs with kedacore#7670 (remove eager MarshalIndent from the same loop) for
a complete webhook memory fix at scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Garber <ggarb@netflix.com>
@ggarb ggarb force-pushed the fix-webhook-verifyScaledObjects-indexer branch from 12a232e to 966b7c5 Compare April 23, 2026 16:26
ggarb added a commit to ggarb/keda that referenced this pull request Apr 23, 2026
…rifyHpas

Three changes to reduce webhook memory pressure during SO creation bursts:

1. Replace unconditional json.MarshalIndent calls in ValidateCreate,
   ValidateUpdate, isRemovingFinalizer, and the verifyHpas loop with
   structured logr key-value logging. The marshals ran on every admission
   even when V(1) logging was disabled, generating 60-100 KB of transient
   garbage per request — at burst=60 this outpaced MADV_FREE and caused
   repeated OOMKills despite the 20 GiB limit.

2. Replace isRemovingFinalizer's JSON string comparison with
   reflect.DeepEqual, eliminating two spec marshals per update admission.

3. Add hpaScaleTargetNameIdx field index for HPA objects and switch
   verifyHpas to an indexed List, reducing it from O(N_hpas) to O(1)
   — same fix pattern as the A1d verifyScaledObjects change (kedacore#7681).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
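In code terms, changes 1 and 2 above boil down to something like the following sketch; logger names, message strings, and helper names are illustrative, not the exact diff.

```go
package v1alpha1

import (
	"encoding/json"
	"reflect"

	"github.com/go-logr/logr"
)

// Change 1, before: MarshalIndent runs on every admission, even when V(1)
// logging is disabled, producing tens of KB of transient garbage per request.
func logIncomingBefore(logger logr.Logger, so *ScaledObject) {
	specBytes, _ := json.MarshalIndent(so.Spec, "", "  ")
	logger.V(1).Info("validating incoming ScaledObject: " + string(specBytes))
}

// Change 1, after: structured key-value logging skips the marshal entirely.
func logIncomingAfter(logger logr.Logger, so *ScaledObject) {
	logger.V(1).Info("validating incoming ScaledObject",
		"name", so.Name, "namespace", so.Namespace,
		"scaleTargetRef", so.Spec.ScaleTargetRef.Name)
}

// Change 2: compare specs directly instead of marshaling both to JSON strings.
func specsEqual(oldSO, newSO *ScaledObject) bool {
	return reflect.DeepEqual(oldSO.Spec, newSO.Spec)
}
```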
verifyHpas issued an unfiltered kc.List over all HPAs in the namespace on
every ScaledObject admission. At 60k HPAs this allocates the entire
namespace's HPA list per admission — the same O(N) anti-pattern fixed for
verifyScaledObjects in the previous commit.

Add hpaScaleTargetNameIdx (spec.scaleTargetRef.name on HPA objects) in
SetupWebhookWithManager and switch verifyHpas to an indexed List, narrowing
candidates to 0–1 HPAs that target the same workload. Also replace the
per-HPA json.MarshalIndent debug log with a structured logr call since the
marshal ran unconditionally regardless of log level.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
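A sketch of the narrowed lookup that replaces the unfiltered HPA list, assuming the HPA index registered in SetupWebhookWithManager; the helper name is illustrative.

```go
package v1alpha1

import (
	"context"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listHpasTargetingWorkload (hypothetical name) is the narrowed verifyHpas lookup.
func listHpasTargetingWorkload(ctx context.Context, kc client.Client, incoming *ScaledObject) (*autoscalingv2.HorizontalPodAutoscalerList, error) {
	var hpaList autoscalingv2.HorizontalPodAutoscalerList
	// Only HPAs whose spec.scaleTargetRef.name matches the incoming SO's
	// target are returned: 0-1 items instead of every HPA in the namespace.
	err := kc.List(ctx, &hpaList,
		client.InNamespace(incoming.Namespace),
		client.MatchingFields{"spec.scaleTargetRef.name": incoming.Spec.ScaleTargetRef.Name})
	return &hpaList, err
}
```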
@ggarb ggarb force-pushed the fix-webhook-verifyScaledObjects-indexer branch from 807ea2a to 6e80753 Compare April 23, 2026 22:25
@JorTurFer JorTurFer requested a review from Copilot April 24, 2026 22:21
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces admission webhook memory/CPU overhead at scale by replacing full-namespace list scans in ScaledObject/HPA validation with controller-runtime cache field-index lookups.

Changes:

  • Register cache field indexes for ScaledObject spec.scaleTargetRef.name, computed HPA name, and HPA spec.scaleTargetRef.name during webhook setup.
  • Update verifyScaledObjects and verifyHpas to use client.MatchingFields queries instead of unfiltered List calls.
  • Update the changelog to document the performance fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
apis/keda/v1alpha1/scaledobject_webhook.go Adds field indexes and switches validation list operations to indexed lookups to avoid O(N) DeepCopy cost per admission.
CHANGELOG.md Documents the webhook performance improvement under Fixes → General.


Comment on lines +55 to +59
// Field index names used by verifyScaledObjects to avoid O(N) full-namespace
// list scans on every SO admission. Without these indexes each admission must
// DeepCopy every ScaledObject in the namespace to find conflicts; at 60k SOs
// each admission allocates ~900 MiB, which is why the webhook OOMs under
// creation bursts. The indexes narrow candidates to 0–1 objects.

Copilot AI Apr 24, 2026


The comment claims the indexes narrow candidates to “0–1 objects” and describes the lookups as O(1). In practice the ScaledObject index is only on spec.scaleTargetRef.name, so a namespace can legitimately have multiple ScaledObjects targeting different GVKs with the same name (e.g., Deployment "foo" and StatefulSet "foo"), making the lookup O(k) where k is the number of matches for that name. Consider rewording this comment to avoid implying a strict O(1)/0–1 guarantee.
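Put concretely, the residual filtering the comment describes would look roughly like the sketch below (a function in package v1alpha1; the helper name and exact comparison are illustrative, and the real code may also normalize defaults such as an empty Kind meaning Deployment):

```go
// conflictsWith is illustrative only: the index narrows candidates by target
// name, so the loop still has to compare kind/apiVersion to rule out cases
// like Deployment "foo" vs StatefulSet "foo".
func conflictsWith(incoming *ScaledObject, candidates []ScaledObject) bool {
	for _, existing := range candidates {
		if existing.Name == incoming.Name {
			continue // the object currently being admitted
		}
		sameKind := existing.Spec.ScaleTargetRef.Kind == incoming.Spec.ScaleTargetRef.Kind
		sameAPIVersion := existing.Spec.ScaleTargetRef.APIVersion == incoming.Spec.ScaleTargetRef.APIVersion
		if sameKind && sameAPIVersion {
			return true // same GVK and same target name: a genuine duplicate
		}
	}
	return false
}
```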

Member

@JorTurFer JorTurFer left a comment


wow! I didn't know about this optimisation and it looks really nice!

Could you fix DCO check?

@JorTurFer
Member

JorTurFer commented Apr 24, 2026

/run-e2e internal
Update: You can check the progress here

@rickbrouwer rickbrouwer added Awaiting/2nd-approval This PR needs one more approval review waiting-author-response All PR's or Issues where we are waiting for a response from the author labels Apr 25, 2026
Member

@rickbrouwer rickbrouwer left a comment


Could you add unit tests covering the new indexed lookups? I would really like to see some test cases here.

Comment on lines +61 to +67
scaleTargetRefNameIdx = "spec.scaleTargetRef.name"
hpaNameIdx = "spec.hpaName"
// hpaScaleTargetNameIdx indexes HPA objects by spec.scaleTargetRef.name so
// verifyHpas can issue an O(1) lookup instead of listing every HPA in the
// namespace. Index names are scoped per-GVK so reusing the same path string
// as scaleTargetRefNameIdx is safe.
hpaScaleTargetNameIdx = "spec.scaleTargetRef.name"
Member


The consts scaleTargetRefNameIdx and hpaScaleTargetNameIdx both have the same value.

Contributor

@dttung2905 dttung2905 left a comment


Thanks for the PR, great analysis btw 🚀 We still need DCO to be signed. I might have more questions as I go into this in more detail 🙇

ctx := context.Background()

// Check 1: no other SO in this namespace already manages the same workload.
// Use the scaleTargetRef.name index so only SOs targeting the same resource
Contributor


After listing only by scaleTargetRef.name, the loop still must filter by GVK. Can you add or point to a test that has two targets named foo (e.g. Deployment + StatefulSet) and proves no false positive?
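For what it's worth, a unit test along these lines would exercise that case; all identifiers are hypothetical, and it assumes a controller-runtime version whose fake client supports field indexes via WithIndex.

```go
package v1alpha1

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

func TestSameTargetNameDifferentKindIsNotAConflict(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := AddToScheme(scheme); err != nil {
		t.Fatal(err)
	}

	// Incoming SO targets Deployment "foo"; an existing SO targets StatefulSet "foo".
	deploySO := &ScaledObject{
		ObjectMeta: metav1.ObjectMeta{Name: "so-deploy", Namespace: "default"},
		Spec:       ScaledObjectSpec{ScaleTargetRef: &ScaleTarget{Name: "foo", Kind: "Deployment"}},
	}
	stsSO := &ScaledObject{
		ObjectMeta: metav1.ObjectMeta{Name: "so-sts", Namespace: "default"},
		Spec:       ScaledObjectSpec{ScaleTargetRef: &ScaleTarget{Name: "foo", Kind: "StatefulSet"}},
	}

	kc := fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(stsSO).
		WithIndex(&ScaledObject{}, "spec.scaleTargetRef.name", func(obj client.Object) []string {
			return []string{obj.(*ScaledObject).Spec.ScaleTargetRef.Name}
		}).
		Build()

	// The indexed List returns the StatefulSet-targeting SO, because the index
	// key is only the target name.
	var got ScaledObjectList
	if err := kc.List(context.Background(), &got,
		client.InNamespace("default"),
		client.MatchingFields{"spec.scaleTargetRef.name": "foo"}); err != nil {
		t.Fatal(err)
	}
	if len(got.Items) != 1 {
		t.Fatalf("expected 1 indexed candidate, got %d", len(got.Items))
	}

	// The validation loop must then compare kinds and report no conflict.
	for _, existing := range got.Items {
		if existing.Spec.ScaleTargetRef.Kind == deploySO.Spec.ScaleTargetRef.Kind {
			t.Fatalf("unexpected conflict with %s: different kinds must not collide", existing.Name)
		}
	}
}
```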
