
added scale tests for scaleUp and scaleDown #606

Draft

oleg-kushniriov wants to merge 1 commit into ai-dynamo:main from oleg-kushniriov:e2e/scale-up-down-tests

Conversation


@oleg-kushniriov oleg-kushniriov commented May 12, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds an e2e scale benchmark suite under operator/e2e/tests/scale/ that measures the marginal cost of growing and shrinking a running PodCliqueSet by patching spec.replicas. Each scenario isolates a single resize event on a steady-state cluster so the timeline captures only
the controller's incremental work, not cold-start setup or teardown.

Grove's existing scale benchmarks (Test_ScaleTest_1000, Test_ScaleTest_5000_Deletion) exercise the full lifecycle (deploy → ready → delete) — they're a good guard against regressions in cold-start and cascade-delete, but they miss the day-2 path that production users hit most
often: changing spec.replicas on an already-running PCS.

Scenarios (six benchmarks + two sanity variants)

Scale-up:

| Variant | Initial replicas | Target replicas | Pods (initial → target) | What it isolates |
|---|---|---|---|---|
| ScaleUp_Tiny | 0 | 5 | 0 → 10 | Sanity — exercises the same code paths as the real benchmarks but finishes in seconds; used to validate cluster/test plumbing on a small dev cluster. |
| ScaleUp_FromZero | 0 | 500 | 0 → 1000 | Cold start: no PCSGs/PodCliques exist yet. |
| ScaleUp_SmallDelta | 500 (1000 pods) | 550 | 1000 → 1100 | Steady-state +10%. Mostly cache-hit reconciles for unchanged children. |
| ScaleUp_LargeDelta | 250 (500 pods) | 500 | 500 → 1000 | Burst 2×. The controller must double the live child set. |

Scale-down:

| Variant | Initial replicas | Target replicas | Pods (initial → target) | What it isolates |
|---|---|---|---|---|
| ScaleDown_Tiny | 5 (10 pods) | 0 | 10 → 0 | Sanity — validates the new PodsAtCountCondition end-to-end on a small dev cluster. |
| ScaleDown_ToZero | 500 (1000 pods) | 0 | 1000 → 0 | Cascade-delete everything from a running steady state. Complements Test_ScaleTest_5000_Deletion at a smaller, finer-grained scale. |
| ScaleDown_SmallDelta | 550 (1100 pods) | 500 | 1100 → 1000 | Steady-state −10%. Partial shrink; most children stay live. Stresses the spec-derived bounded-counter path. |
| ScaleDown_LargeDelta | 500 (1000 pods) | 250 | 1000 → 500 | Burst −50%. The controller must tear down as many replicas as it keeps. |
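The variant rows above map naturally onto table-driven Go tests. A minimal sketch of the scale-up side — the `scaleUpVariant` name and the fixture filenames follow the PR text, but the field names, the `delta` helper, and the exact struct shape are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// scaleUpVariant describes one resize scenario from the table above.
// Field names are assumptions for illustration; only the struct name and
// fixture files are taken from the PR description.
type scaleUpVariant struct {
	name            string
	fixture         string // YAML fixture under operator/e2e/yaml/
	initialReplicas int32
	targetReplicas  int32
	workerNodes     int // 0 means "use the suite default"
}

// delta is the number of replicas the scale phase must add.
func (v scaleUpVariant) delta() int32 {
	return v.targetReplicas - v.initialReplicas
}

var scaleUpVariants = []scaleUpVariant{
	{"Tiny", "scale-up-tiny.yaml", 0, 5, 5},
	{"FromZero", "scale-up-from-zero.yaml", 0, 500, 0},
	{"SmallDelta", "scale-up-small-delta.yaml", 500, 550, 0},
	{"LargeDelta", "scale-up-large-delta.yaml", 250, 500, 0},
}

func main() {
	for _, v := range scaleUpVariants {
		fmt.Printf("%s: %d -> %d replicas (delta %d)\n",
			v.name, v.initialReplicas, v.targetReplicas, v.delta())
	}
}
```

Each variant would then be fed through the shared `runScaleUpTest` helper so the deploy/scale/delete phases stay identical across scenarios.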

What's in the PR

  • operator/e2e/tests/scale/scale_up_test.go — Test_ScaleUp_Tiny, Test_ScaleUp_FromZero, Test_ScaleUp_SmallDelta, Test_ScaleUp_LargeDelta, plus the shared runScaleUpTest helper and scaleUpVariant struct.
  • operator/e2e/tests/scale/scale_down_test.go — Test_ScaleDown_Tiny, Test_ScaleDown_ToZero, Test_ScaleDown_SmallDelta, Test_ScaleDown_LargeDelta, plus the shared runScaleDownTest helper and scaleDownVariant struct.
  • Eight YAML fixtures under operator/e2e/yaml/: scale-up-{tiny,from-zero,small-delta,large-delta}.yaml, scale-down-{tiny,to-zero,small-delta,large-delta}.yaml. Each encodes the initial replica count; the test patches spec.replicas to the target.
  • New milestone condition PodsAtCountCondition in operator/e2e/measurement/condition/pod.go — fires when the live pod count drops to ≤ target. Required for scale-down because the existing PodsCreatedCondition is ≥-only and would fire immediately when the starting count already
    exceeds the target.
  • workerNodes override on scaleUpVariant / scaleDownVariant so the tiny sanity tests can run on smaller dev clusters (5 nodes) while the real benchmarks keep the 30-node default.
  • make run-scale-test TEST_PATTERN=<pattern> support — mirrors the existing run-e2e convention, no new targets.
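The core of the new condition is a simple threshold predicate. A minimal sketch of the idea (an illustrative standalone function, not the PR's actual implementation, which plugs into the measurement/condition machinery):

```go
package main

import "fmt"

// podsAtOrBelowTarget mirrors the idea behind PodsAtCountCondition: the
// milestone fires once the live pod count has reached the target or dropped
// below it. Using <= instead of == means a poll that lands mid-cascade-delete,
// when the count has already fallen under the target, still triggers the
// milestone instead of missing it between two samples.
func podsAtOrBelowTarget(livePods, target int) bool {
	return livePods <= target
}

func main() {
	fmt.Println(podsAtOrBelowTarget(1100, 1000)) // still shrinking: false
	fmt.Println(podsAtOrBelowTarget(1000, 1000)) // exactly at target: true
	fmt.Println(podsAtOrBelowTarget(998, 1000))  // already below target: true
}
```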

Test shape (every scenario)

  1. deploy — apply YAML at initial replica count; wait for initial-pods-created + initial-pods-ready (skipped when initial=0).
  2. scale-up / scale-down — patch spec.replicas to target; this is the phase pprof/metrics windows isolate. Up-side milestones: all-pods-created + all-pods-ready. Down-side milestone: pods-at-target (the new condition).
  3. delete — tear down the PCS, milestone pcs-deleted.

All scale tests reuse the existing runScaleTest scaffolding in scale_test.go, so output format (stdout summary + scale-test-results.json), pprof capture, and Grove metadata export match the rest of the scale suite.

How to run

cd operator
make scale-cluster-up E2E_CREATE_FLAGS="--set kwok.nodes=30"

# Sanity first (~3 s each)
make run-scale-test TEST_PATTERN='Test_Scale(Up|Down)_Tiny'

# Then the real benchmarks
make run-scale-test TEST_PATTERN=Test_ScaleUp
make run-scale-test TEST_PATTERN=Test_ScaleDown

# Or individual variants
make run-scale-test TEST_PATTERN=Test_ScaleUp_LargeDelta
make run-scale-test TEST_PATTERN=Test_ScaleDown_SmallDelta

make scale-cluster-down

Which issue(s) this PR fixes:

Fixes #604

Special notes for your reviewer:

  • The two Test_*_Tiny variants are intentionally part of the suite, not debug-only — they're sanity checks for the test plumbing itself (cluster, KWOK stages, new PodsAtCountCondition) and complete in seconds. Useful when iterating on the scale fixtures or after cluster
    changes.
  • PodsAtCountCondition uses ≤ rather than == so a transient overshoot during cascade-delete between two consecutive polls doesn't make the milestone miss its trigger.
  • scaleDownWorkerNodes and scaleUpWorkerNodes are intentionally set to 30 (vs. defaultScaleWorkerNodes = 100) — these tests are about controller throughput on the day-2 path, not scheduler capacity. ~1100 KWOK pods on 30 nodes is ~37 pods/node, well under the 110-pod kubelet
    default.
  • The Makefile change adds TEST_PATTERN to run-scale-test only; the previous default behavior (run everything in tests/scale/) is unchanged.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:



copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

oleg-kushniriov force-pushed the e2e/scale-up-down-tests branch from 16deff0 to 71d2dc5 on May 14, 2026 15:45


Successfully merging this pull request may close these issues.

Add ScaleUp / ScaleDown benchmark suite to e2e scale tests