
added scale tests for scaleUp and scaleDown #606

Draft

oleg-kushniriov wants to merge 1 commit into ai-dynamo:main from oleg-kushniriov:e2e/scale-up-down-tests

Conversation


@oleg-kushniriov oleg-kushniriov commented May 12, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds an e2e scale benchmark suite under operator/e2e/tests/scale/ that measures the marginal cost of growing and shrinking a running PodCliqueSet by patching spec.replicas. Each scenario isolates a single resize event on a steady-state cluster so the timeline captures only
the controller's incremental work, not cold-start setup or teardown.

Grove's existing scale benchmarks (Test_ScaleTest_1000, Test_ScaleTest_5000_Deletion) exercise the full lifecycle (deploy → ready → delete) — they're a good guard against regressions in cold-start and cascade-delete, but they miss the day-2 path that production users hit most
often: changing spec.replicas on an already-running PCS.

Scenarios (six benchmarks + two sanity variants)

Scale-up:

| Variant | Initial replicas | Target replicas | Pods (initial → target) | What it isolates |
|---|---|---|---|---|
| ScaleUp_Tiny | 0 | 5 | 0 → 10 | Sanity — exercises the same code paths as the real benchmarks but finishes in seconds; used to validate cluster/test plumbing on a small dev cluster. |
| ScaleUp_FromZero | 0 | 500 | 0 → 1000 | Cold start: no PCSGs/PodCliques exist yet. |
| ScaleUp_SmallDelta | 500 (1000 pods) | 550 | 1000 → 1100 | Steady-state +10%. Mostly cache-hit reconciles for unchanged children. |
| ScaleUp_LargeDelta | 250 (500 pods) | 500 | 500 → 1000 | Burst 2×. The controller must double the live child set. |

Scale-down:

| Variant | Initial replicas | Target replicas | Pods (initial → target) | What it isolates |
|---|---|---|---|---|
| ScaleDown_Tiny | 5 (10 pods) | 0 | 10 → 0 | Sanity — validates the new PodsAtCountCondition end-to-end on a small dev cluster. |
| ScaleDown_ToZero | 500 (1000 pods) | 0 | 1000 → 0 | Cascade-delete everything from a running steady state. Complements Test_ScaleTest_5000_Deletion at a smaller, finer-grained scale. |
| ScaleDown_SmallDelta | 550 (1100 pods) | 500 | 1100 → 1000 | Steady-state −10%. Partial shrink; most children stay live. Stresses the spec-derived bounded-counter path. |
| ScaleDown_LargeDelta | 500 (1000 pods) | 250 | 1000 → 500 | Burst −50%. The controller must tear down as many replicas as it keeps. |
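The variant rows above map naturally onto table-driven Go tests. A minimal sketch of the scale-up side — the `scaleUpVariant` name and the fixture filenames follow the PR text, but the field names, the `delta` helper, and the exact struct shape are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// scaleUpVariant describes one resize scenario from the table above.
// Field names are assumptions for illustration; only the struct name and
// fixture files are taken from the PR description.
type scaleUpVariant struct {
	name            string
	fixture         string // YAML fixture under operator/e2e/yaml/
	initialReplicas int32
	targetReplicas  int32
	workerNodes     int // 0 means "use the suite default"
}

// delta is the number of replicas the scale phase must add.
func (v scaleUpVariant) delta() int32 {
	return v.targetReplicas - v.initialReplicas
}

var scaleUpVariants = []scaleUpVariant{
	{"Tiny", "scale-up-tiny.yaml", 0, 5, 5},
	{"FromZero", "scale-up-from-zero.yaml", 0, 500, 0},
	{"SmallDelta", "scale-up-small-delta.yaml", 500, 550, 0},
	{"LargeDelta", "scale-up-large-delta.yaml", 250, 500, 0},
}

func main() {
	for _, v := range scaleUpVariants {
		fmt.Printf("%s: %d -> %d replicas (delta %d)\n",
			v.name, v.initialReplicas, v.targetReplicas, v.delta())
	}
}
```

Each variant would then be fed through the shared `runScaleUpTest` helper so the deploy/scale/delete phases stay identical across scenarios.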

What's in the PR

  • operator/e2e/tests/scale/scale_up_test.go — Test_ScaleUp_Tiny, Test_ScaleUp_FromZero, Test_ScaleUp_SmallDelta, Test_ScaleUp_LargeDelta, plus the shared runScaleUpTest helper and scaleUpVariant struct.
  • operator/e2e/tests/scale/scale_down_test.go — Test_ScaleDown_Tiny, Test_ScaleDown_ToZero, Test_ScaleDown_SmallDelta, Test_ScaleDown_LargeDelta, plus the shared runScaleDownTest helper and scaleDownVariant struct.
  • Eight YAML fixtures under operator/e2e/yaml/: scale-up-{tiny,from-zero,small-delta,large-delta}.yaml, scale-down-{tiny,to-zero,small-delta,large-delta}.yaml. Each encodes the initial replica count; the test patches spec.replicas to the target.
  • New milestone condition PodsAtCountCondition in operator/e2e/measurement/condition/pod.go — fires when the live pod count drops to ≤ target. Required for scale-down because the existing PodsCreatedCondition is ≥-only and would fire immediately when the starting count already
    exceeds the target.
  • workerNodes override on scaleUpVariant / scaleDownVariant so the tiny sanity tests can run on smaller dev clusters (5 nodes) while the real benchmarks keep the 30-node default.
  • make run-scale-test TEST_PATTERN=<pattern> support — mirrors the existing run-e2e convention, no new targets.
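The core of the new condition is a simple threshold predicate. A minimal sketch of the idea (an illustrative standalone function, not the PR's actual implementation, which plugs into the measurement/condition machinery):

```go
package main

import "fmt"

// podsAtOrBelowTarget mirrors the idea behind PodsAtCountCondition: the
// milestone fires once the live pod count has reached the target or dropped
// below it. Using <= instead of == means a poll that lands mid-cascade-delete,
// when the count has already fallen under the target, still triggers the
// milestone instead of missing it between two samples.
func podsAtOrBelowTarget(livePods, target int) bool {
	return livePods <= target
}

func main() {
	fmt.Println(podsAtOrBelowTarget(1100, 1000)) // still shrinking: false
	fmt.Println(podsAtOrBelowTarget(1000, 1000)) // exactly at target: true
	fmt.Println(podsAtOrBelowTarget(998, 1000))  // already below target: true
}
```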

Test shape (every scenario)

  1. deploy — apply YAML at initial replica count; wait for initial-pods-created + initial-pods-ready (skipped when initial=0).
  2. scale-up / scale-down — patch spec.replicas to target; this is the phase pprof/metrics windows isolate. Up-side milestones: all-pods-created + all-pods-ready. Down-side milestone: pods-at-target (the new condition).
  3. delete — tear down the PCS, milestone pcs-deleted.

All scale tests reuse the existing runScaleTest scaffolding in scale_test.go, so output format (stdout summary + scale-test-results.json), pprof capture, and Grove metadata export match the rest of the scale suite.

How to run

cd operator
make scale-cluster-up E2E_CREATE_FLAGS="--set kwok.nodes=30"

# Sanity first (~3 s each)
make run-scale-test TEST_PATTERN='Test_Scale(Up|Down)_Tiny'

# Then the real benchmarks
make run-scale-test TEST_PATTERN=Test_ScaleUp
make run-scale-test TEST_PATTERN=Test_ScaleDown

# Or individual variants
make run-scale-test TEST_PATTERN=Test_ScaleUp_LargeDelta
make run-scale-test TEST_PATTERN=Test_ScaleDown_SmallDelta

make scale-cluster-down

Which issue(s) this PR fixes:

Fixes #604

Special notes for your reviewer:

  • The two Test_*_Tiny variants are intentionally part of the suite, not debug-only — they're sanity checks for the test plumbing itself (cluster, KWOK stages, new PodsAtCountCondition) and complete in seconds. Useful when iterating on the scale fixtures or after cluster
    changes.
  • PodsAtCountCondition uses ≤ rather than == so a transient overshoot during cascade-delete between two consecutive polls doesn't make the milestone miss its trigger.
  • scaleDownWorkerNodes and scaleUpWorkerNodes are intentionally set to 30 (vs. defaultScaleWorkerNodes = 100) — these tests are about controller throughput on the day-2 path, not scheduler capacity. ~1100 KWOK pods on 30 nodes is ~37 pods/node, well under the 110-pod kubelet
    default.
  • The Makefile change adds TEST_PATTERN to run-scale-test only; the previous default behavior (run everything in tests/scale/) is unchanged.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:



copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

oleg-kushniriov force-pushed the e2e/scale-up-down-tests branch from 16deff0 to 71d2dc5 on May 14, 2026 15:45


Successfully merging this pull request may close these issues.

Add ScaleUp / ScaleDown benchmark suite to e2e scale tests