Skip to content

fix(webhook): eliminate per-admission json.MarshalIndent and index verifyHpas#1

Open
ggarb wants to merge 5 commits intomainfrom
fix-webhook-verifyScaledObjects-indexer
Open

fix(webhook): eliminate per-admission json.MarshalIndent and index verifyHpas#1
ggarb wants to merge 5 commits intomainfrom
fix-webhook-verifyScaledObjects-indexer

Conversation

@ggarb
Copy link
Copy Markdown
Owner

@ggarb ggarb commented Apr 23, 2026

Problem

During a 60k ScaledObject creation burst, the admission webhook OOMKilled 12 times
despite a 20 GiB memory limit. Three root causes:

1. Unconditional json.MarshalIndent on every admission

ValidateCreate, ValidateUpdate, and isRemovingFinalizer each called
json.MarshalIndent to build a debug log string — even when V(1) logging was
disabled. At burst=60 this generates ~60–100 KB of transient garbage per
admission that Go's allocator cannot release via MADV_FREE fast enough under
sustained load, causing RSS to climb until OOMKill.

2. isRemovingFinalizer JSON-string spec comparison

The function marshaled both so.Spec and oldSo.Spec to JSON strings to compare
them, allocating O(spec_size) on every update admission. The correct tool is
reflect.DeepEqual.

3. verifyHpas full-namespace HPA list (O(N))

verifyHpas issued an unfiltered kc.List over all HPAs in the namespace. At 60k
HPAs this allocates the entire namespace's HPA list on every SO admission. Fixed
with a spec.scaleTargetRef.name field index (same pattern as kedacore#7681).

Fix

  • Replace all json.MarshalIndent + fmt.Sprintf log calls with structured
    logr.Info(msg, key, value) — zero allocation when log level is inactive.
  • Replace isRemovingFinalizer JSON string comparison with reflect.DeepEqual.
  • Register a spec.scaleTargetRef.name field index on HPA objects in
    SetupWebhookWithManager and switch verifyHpas to an indexed List.

Test results

Validated on a 60k ScaledObject creation burst (same cluster and methodology as kedacore#7681):

Metric Before (fix-webhook-verifyobj-12a232e2) After (807ea2a)
Webhook OOMKills (20 GiB limit) 12 0
Webhook RSS at peak >20 GiB ~1 MiB
Operator restarts 0 0

This is a companion to kedacore#7681. Together they bring webhook memory during a 60k SO
creation burst from OOMKill territory to effectively zero overhead.

pawan-regoti and others added 5 commits April 20, 2026 10:16
Signed-off-by: Pawan Kumar Regoti <pawanregoti@gmail.com>
Signed-off-by: Pawan Regoti <pawanregoti@gmail.com>
Signed-off-by: Jan Wozniak <wozniak.jan@gmail.com>
)

When KEDA runs in environments where /sys/fs/cgroup/cpu.max is not
readable (e.g. EKS auto-mode, restricted SecurityContexts), maxprocs.Set()
returns a permission error and KEDA crashes on startup.

The Go runtime handles GOMAXPROCS sensibly without explicit configuration,
so this error should not be fatal. Log it as a warning and continue
startup.

Fixes kedacore#7653

Signed-off-by: ManvithaP-hub <62259625+ManvithaP-hub@users.noreply.github.com>
…admission cost

verifyScaledObjects performs two duplicate-conflict checks on every
ScaledObject admission: one for duplicate scaleTargetRef and one for
duplicate HPA name. Both checks listed ALL ScaledObjects in the
namespace via kc.List, which — because controller-runtime's cached
client DeepCopies every returned item — allocated O(N) memory per
admission.

Measured impact with a heap profile during 10k SO creation burst:
verifyScaledObjects consumed 71 % of inuse_space (106 MB at peak).
At 60k SOs the list allocates ~900 MB per admission; at 30/s creation
rate that is ~27 GB/s of allocation, which outpaces MADV_FREE and
causes the webhook OOMKill loop seen in production-scale tests.

The fix registers two controller-runtime field indexes in
SetupWebhookWithManager:

  spec.scaleTargetRef.name  → so.Spec.ScaleTargetRef.Name
  spec.hpaName              → getHpaName(*so)   (computed default or explicit override)

verifyScaledObjects now issues two narrow indexed List calls (each
returning 0–1 items in the common case) instead of one full-namespace
scan. Per-admission allocation drops from O(N * object_size) to
O(1 * object_size) regardless of cluster scale.

Pairs with kedacore#7670 (remove eager MarshalIndent from the same loop) for
a complete webhook memory fix at scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Garber <ggarb@netflix.com>
verifyHpas issued an unfiltered kc.List over all HPAs in the namespace on
every ScaledObject admission. At 60k HPAs this allocates the entire
namespace's HPA list per admission — the same O(N) anti-pattern fixed for
verifyScaledObjects in the previous commit.

Add hpaScaleTargetNameIdx (spec.scaleTargetRef.name on HPA objects) in
SetupWebhookWithManager and switch verifyHpas to an indexed List, narrowing
candidates to 0–1 HPAs that target the same workload. Also replace the
per-HPA json.MarshalIndent debug log with a structured logr call since the
marshal ran unconditionally regardless of log level.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@ggarb ggarb force-pushed the fix-webhook-verifyScaledObjects-indexer branch from 807ea2a to 6e80753 Compare April 23, 2026 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants