
add soak-churn scale test #608

Draft
oleg-kushniriov wants to merge 1 commit into
ai-dynamo:main from
oleg-kushniriov:e2e/scale-soak-churn-test

Conversation

@oleg-kushniriov
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a long-running soak / churn benchmark under operator/e2e/tests/scale/ that boots a small PodCliqueSet (~50 pods) and drives repeated scale-up / scale-down cycles against it over an extended duration. The goal is to surface bugs that single-shot benchmarks cannot see — slow
leaks, monotonically growing status fields, gradually drifting counters, finalizer pile-ups — by accumulating many incremental reconciles against the same controller instance.

The test is opt-in only. It's gated by a separate build tag (//go:build e2e && soak) and is invisible to make run-e2e and make run-scale-test. Default runtime is ~30 minutes, too long for routine CI; it runs on demand when investigating a regression or capturing a
baseline.

Single test, parameterized by env vars

| Parameter | Default | Env override |
| --- | --- | --- |
| Base PCS replicas | 25 (50 pods) | SOAK_BASE |
| Peak PCS replicas | 50 (100 pods) | SOAK_PEAK |
| Cycles | 10 | SOAK_CYCLES |
| Per-cycle hold | 30 s | — |
| Worker nodes | 30 kwok | — |
| Timeout | 60 min | — |


Cycle shape (executed N times)

  1. scale-up-cN — patch spec.replicas base → peak, wait for peak-pods-ready.
  2. hold-peak-cN — 30 s window for watch events to flush.
  3. scale-down-cN — patch spec.replicas peak → base, wait for base-pods-restored (live count drops to ≤ base).
  4. hold-base-cN — 30 s window.

What's in the PR

  • operator/e2e/tests/scale/soak_test.go — Test_SoakChurn, gated by //go:build e2e && soak. Loads env config, drives N cycles, and runs final-check as the last timeline phase.
  • operator/e2e/yaml/soak-churn.yaml — PCS at base size (25 replicas × 2 pods/clique = 50 pods).
  • operator/e2e/measurement/condition/pod.go — adds PodsAtCountCondition (fires when the live pod count drops to ≤ ExpectedCount). Required for the scale-down leg of each cycle, since the existing PodsCreatedCondition only counts pod creations and would fire immediately.
  • operator/Makefile — adds a make run-soak-test target with SOAK_BASE / SOAK_PEAK / SOAK_CYCLES / DIAG_DIR env-var forwarding.
  • operator/e2e/tests/scale/soak_churn.md — design doc with motivation, decisions, and the placement reasoning for the final-check phase.
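The new PodsAtCountCondition amounts to a "fires at or below" check. A minimal sketch of that predicate shape follows; the real type lives in operator/e2e/measurement/condition/pod.go and its actual fields and method set are not shown in this PR description, so everything here is an assumption for illustration:

```go
package main

import "fmt"

// podsAtCount is an illustrative stand-in for PodsAtCountCondition:
// it is met once the live pod count has dropped to ExpectedCount or below,
// which is what the scale-down leg of each cycle needs to wait for.
type podsAtCount struct{ ExpectedCount int }

// Met reports whether the condition fires for the given live pod count.
func (c podsAtCount) Met(livePods int) bool { return livePods <= c.ExpectedCount }

func main() {
	cond := podsAtCount{ExpectedCount: 50}
	fmt.Println(cond.Met(100)) // false: still at peak
	fmt.Println(cond.Met(50))  // true: back at base
}
```

A creation-counting condition cannot express this wait, because the created-pod count never decreases during scale-down.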

How to run

cd operator
make scale-cluster-up E2E_CREATE_FLAGS="--set kwok.nodes=30"

# Quick smoke (~5 min) — 3 cycles
make run-soak-test SOAK_CYCLES=3

# Full default run (~30 min)
make run-soak-test

# Baseline capture for cross-revision comparison
make run-soak-test DIAG_DIR=/tmp/soak-rev-A

make scale-cluster-down

Results land in <DIAG_DIR>/SoakChurn/<runID>/scale-test-results.json with per-phase timings. Diff them across revisions with jq, and diff pprof bundles (when Pyroscope is on) with go tool pprof -base A.pb.gz B.pb.gz. There is no comparison logic in the test itself; that keeps it deterministic and avoids committing baselines to the repo.
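A cross-revision comparison can then be done entirely outside the test. The sketch below diffs per-phase timings from two runs; the exact schema of scale-test-results.json is not shown in the PR, so the flat phase-name-to-milliseconds map (and the inline sample data) is an assumption for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// phaseDurations parses an assumed phase-name -> duration-ms map.
// The real scale-test-results.json schema may differ.
func phaseDurations(raw string) map[string]float64 {
	var m map[string]float64
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		panic(err)
	}
	return m
}

func main() {
	// Hypothetical timings captured from two revisions (e.g. /tmp/soak-rev-A).
	revA := phaseDurations(`{"scale-up-c1": 4200, "scale-down-c1": 3900}`)
	revB := phaseDurations(`{"scale-up-c1": 4350, "scale-down-c1": 5200}`)
	for phase, a := range revA {
		// Positive delta means the phase got slower in revision B.
		fmt.Printf("%s: %+.0f ms\n", phase, revB[phase]-a)
	}
}
```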

Which issue(s) this PR fixes:

Fixes #607

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Successfully merging this pull request may close these issues.

Add long-running soak / churn benchmark to e2e scale tests
