
add soak-churn scale test #608

Draft
oleg-kushniriov wants to merge 1 commit into
ai-dynamo:main from
oleg-kushniriov:e2e/scale-soak-churn-test

Conversation

@oleg-kushniriov
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a long-running soak / churn benchmark under operator/e2e/tests/scale/ that boots a small PodCliqueSet (~50 pods) and drives repeated scale-up / scale-down cycles against it over an extended duration. The goal is to surface bugs that single-shot benchmarks cannot see — slow
leaks, monotonically growing status fields, gradually drifting counters, finalizer pile-ups — by accumulating many incremental reconciles against the same controller instance.

The test is opt-in only. It's gated by a separate build tag (//go:build e2e && soak) and is invisible to make run-e2e and make run-scale-test. Default runtime is ~30 minutes, too long for routine CI; it runs on demand when investigating a regression or capturing a
baseline.

Single test, parameterized by env vars

| Parameter | Default | Env override |
| --- | --- | --- |
| Base PCS replicas | 25 (50 pods) | SOAK_BASE |
| Peak PCS replicas | 50 (100 pods) | SOAK_PEAK |
| Cycles | 10 | SOAK_CYCLES |
| Per-cycle hold | 30 s | — |
| Worker nodes | 30 kwok | — |
| Timeout | 60 min | — |


Cycle shape (executed N times)

  1. scale-up-cN — patch spec.replicas base → peak, wait for peak-pods-ready.
  2. hold-peak-cN — 30 s window for watch events to flush.
  3. scale-down-cN — patch spec.replicas peak → base, wait for base-pods-restored (live count drops to ≤ base).
  4. hold-base-cN — 30 s window.

What's in the PR

  • operator/e2e/tests/scale/soak_test.go — Test_SoakChurn, gated by //go:build e2e && soak. Loads env config, drives N cycles, and runs final-check as the last timeline phase.
  • operator/e2e/yaml/soak-churn.yaml — PCS at base size (25 replicas × 2 pods/clique = 50 pods).
  • operator/e2e/measurement/condition/pod.go — adds PodsAtCountCondition (fires when the live pod count drops to ≤ ExpectedCount). Required for the scale-down leg of each cycle, since the existing PodsCreatedCondition only counts pod creations and would fire immediately.
  • operator/Makefile — adds a make run-soak-test target with SOAK_BASE / SOAK_PEAK / SOAK_CYCLES / DIAG_DIR env-var forwarding.
  • operator/e2e/tests/scale/soak_churn.md — design doc with motivation, decisions, and the placement reasoning for the final-check phase.
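The new PodsAtCountCondition amounts to a "fires at or below" check. A minimal sketch of that predicate shape follows; the real type lives in operator/e2e/measurement/condition/pod.go and its actual fields and method set are not shown in this PR description, so everything here is an assumption for illustration:

```go
package main

import "fmt"

// podsAtCount is an illustrative stand-in for PodsAtCountCondition:
// it is met once the live pod count has dropped to ExpectedCount or below,
// which is what the scale-down leg of each cycle needs to wait for.
type podsAtCount struct{ ExpectedCount int }

// Met reports whether the condition fires for the given live pod count.
func (c podsAtCount) Met(livePods int) bool { return livePods <= c.ExpectedCount }

func main() {
	cond := podsAtCount{ExpectedCount: 50}
	fmt.Println(cond.Met(100)) // false: still at peak
	fmt.Println(cond.Met(50))  // true: back at base
}
```

A creation-counting condition cannot express this wait, because the created-pod count never decreases during scale-down.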

How to run

cd operator
make scale-cluster-up E2E_CREATE_FLAGS="--set kwok.nodes=30"

# Quick smoke (~5 min) — 3 cycles
make run-soak-test SOAK_CYCLES=3

# Full default run (~30 min)
make run-soak-test

# Baseline capture for cross-revision comparison
make run-soak-test DIAG_DIR=/tmp/soak-rev-A

make scale-cluster-down

Results land in <DIAG_DIR>/SoakChurn/<runID>/scale-test-results.json with per-phase timings. Diff them across revisions with jq, and diff pprof bundles (when Pyroscope is on) with go tool pprof -base A.pb.gz B.pb.gz. There is no comparison logic in the test itself; that keeps it deterministic and avoids committing baselines to the repo.
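A cross-revision comparison can then be done entirely outside the test. The sketch below diffs per-phase timings from two runs; the exact schema of scale-test-results.json is not shown in the PR, so the flat phase-name-to-milliseconds map (and the inline sample data) is an assumption for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// phaseDurations parses an assumed phase-name -> duration-ms map.
// The real scale-test-results.json schema may differ.
func phaseDurations(raw string) map[string]float64 {
	var m map[string]float64
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		panic(err)
	}
	return m
}

func main() {
	// Hypothetical timings captured from two revisions (e.g. /tmp/soak-rev-A).
	revA := phaseDurations(`{"scale-up-c1": 4200, "scale-down-c1": 3900}`)
	revB := phaseDurations(`{"scale-up-c1": 4350, "scale-down-c1": 5200}`)
	for phase, a := range revA {
		// Positive delta means the phase got slower in revision B.
		fmt.Printf("%s: %+.0f ms\n", phase, revB[phase]-a)
	}
}
```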

Which issue(s) this PR fixes:

Fixes #607

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Successfully merging this pull request may close these issues.

Add long-running soak / churn benchmark to e2e scale tests
