18 changes: 18 additions & 0 deletions operator/Makefile
@@ -222,6 +222,24 @@ run-scale-test:
 Running scale">
	@echo "> Running scale tests..."
	@cd e2e && go test -count=1 -tags=e2e ./tests/scale/... -v -timeout 40m

# Run the soak / churn benchmark against an existing cluster.
# Gated behind the `soak` build tag so it does NOT run as part of
# run-e2e or run-scale-test. Long-running (~30 min default).
#
# Usage: make run-soak-test [DIAG_DIR=<path>] [DIAG_MODE=<mode>] \
# [SOAK_BASE=<n>] [SOAK_PEAK=<n>] [SOAK_CYCLES=<n>]
# Examples:
# make run-soak-test # Defaults: 25 → 50 PCS, 10 cycles
# make run-soak-test SOAK_CYCLES=3 # Shorter run for quick validation
# make run-soak-test DIAG_DIR=/tmp/soak-baseline # Capture artifacts for later comparison
.PHONY: run-soak-test
run-soak-test: export GROVE_E2E_DIAG_DIR = $(DIAG_DIR)
run-soak-test: export GROVE_E2E_DIAG_MODE = $(DIAG_MODE)
run-soak-test:
 Running soak">
	@echo "> Running soak / churn benchmark (long-running, on-demand)..."
	@cd e2e && SOAK_BASE=$(SOAK_BASE) SOAK_PEAK=$(SOAK_PEAK) SOAK_CYCLES=$(SOAK_CYCLES) \
		go test -count=1 -tags='e2e soak' ./tests/scale/... -v -timeout 60m -run Test_SoakChurn

# Make targets for local development and testing
# -------------------------------------------------------------
# Starts a local k8s cluster using kind.
35 changes: 35 additions & 0 deletions operator/e2e/measurement/condition/pod.go
@@ -135,3 +135,38 @@ func (c *PodsReadyCondition) Met(ctx context.Context) (bool, error) {
func (c *PodsReadyCondition) Progress(_ context.Context) string {
	return fmt.Sprintf("%d/%d pods ready", c.lastReady, c.ExpectedCount)
}

// PodsAtCountCondition fires when the live pod count drops to ExpectedCount or
// below. Intended for scale-down milestones where PodsCreatedCondition (≥-only)
// would fire immediately because the starting count already exceeds the target.
// Using ≤ rather than == makes the condition robust to a transient overshoot
// during cascade-delete where two consecutive polls might skip the exact target.
type PodsAtCountCondition struct {
	Client        client.Client
	Namespace     string
	LabelSelector string
	ExpectedCount int
	lastCount     int
	sel           parsedSelector
}

// Met returns true once the live pod count is ≤ ExpectedCount.
func (c *PodsAtCountCondition) Met(ctx context.Context) (bool, error) {
	if c.ExpectedCount < 0 {
		return false, errors.New("expected count cannot be negative")
	}

	c.sel.init(c.LabelSelector)
	pods, err := listPods(ctx, c.Client, c.Namespace, &c.sel)
	if err != nil {
		return false, err
	}

	c.lastCount = len(pods)
	return c.lastCount <= c.ExpectedCount, nil
}

// Progress returns a human-readable progress string.
func (c *PodsAtCountCondition) Progress(_ context.Context) string {
	return fmt.Sprintf("%d pods (target ≤%d)", c.lastCount, c.ExpectedCount)
}
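
For orientation, a minimal sketch of how the new condition might be wired into a scale-down milestone. Only the field names come from the type above; the package qualifier, client variable, namespace, and label are assumptions:

```go
// Hypothetical usage; field names mirror PodsAtCountCondition above, the
// rest (namespace, label, how the condition is polled) is illustrative only.
cond := &condition.PodsAtCountCondition{
	Client:        k8sClient,             // controller-runtime client from the test harness
	Namespace:     "grove-soak",          // assumed test namespace
	LabelSelector: "app=soak-churn",      // assumed fixture label
	ExpectedCount: base * podsPerReplica, // scale-down target, e.g. 25 * 2
}
// The milestone loop would poll Met until it reports true, logging Progress.
if met, err := cond.Met(ctx); err == nil && met {
	// scale-down milestone reached
}
```
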
175 changes: 175 additions & 0 deletions operator/e2e/tests/scale/soak_churn.md
@@ -0,0 +1,175 @@
# Soak / Churn Benchmark

## GitHub Issue

Use the **Enhancement** template (`.github/ISSUE_TEMPLATE/ENHANCEMENT.yaml`).
The fields below are ready to paste into the form.

**Title**

> Add long-running soak / churn benchmark to e2e scale tests

**Labels**

`enhancement`, `area/testing`, `area/operator`

---

### What would you like to be added?

Add a long-running benchmark under `operator/e2e/tests/scale/` that
boots a small `PodCliqueSet` (~100 pods) and then drives repeated
scale-up / scale-down cycles against it over an extended duration.
The goal is to surface bugs that only appear after many incremental
reconciles — leaks, monotonically growing slices, gradually drifting
counters, finalizer pile-ups — that single-shot benchmarks cannot see.

**Scope (single test, parameterized):**

| Parameter | Default | Env override | Rationale |
|---|---|---|---|
| Base PCS replicas | 25 (50 pods) | `SOAK_BASE` | Small enough to fit dev clusters; large enough to exercise multi-child reconciles. |
| Peak PCS replicas | 50 (100 pods) | `SOAK_PEAK` | 2x amplitude per cycle. |
| Cycles | 10 | `SOAK_CYCLES` | Enough per-cycle iterations to surface slow drift; bounded total runtime. |
| Per-cycle hold | ~30 s | — | Lets watch events flush and pprof capture overlap consecutive no-op reconciles. |
| Worker nodes | 30 kwok nodes | — | Same fixture as scale-up/down tests. |
| Timeout | 45 min | — | Soft cap; expected runtime ~25–30 min. |
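
As an illustration of the env overrides, a minimal sketch; the helper `envInt` is hypothetical and not an existing function in the suite:

```go
// envInt reads an integer override from the environment (needs "os" and
// "strconv"), falling back to def when the variable is unset or not a number.
// Hypothetical helper, named here only for illustration.
func envInt(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// Inside the test, defaults mirror the table above:
//   base   := envInt("SOAK_BASE", 25)
//   peak   := envInt("SOAK_PEAK", 50)
//   cycles := envInt("SOAK_CYCLES", 10)
```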

**Deliverables:**

- `operator/e2e/tests/scale/soak_test.go`.
- `operator/e2e/yaml/soak-churn.yaml` (PCS sized at base; ~50 pods).
- Reuse `runScaleTest`, `ScalePCS`, `PodsCreatedCondition`,
`TimerCondition`. The same scale-down milestone gap noted in the
  scale-up-down issue applies here (`PodsAtCountCondition` is a prerequisite).
- Final-check phase asserting no leaks/orphans/unbounded counters at
the end of the run.

**Cycle shape (executed N times):**

1. **scale-up** — patch `spec.replicas` base → peak. Wait until live
pod count reaches `peak * podsPerReplica`.
2. **hold-peak** — 30 s window.
3. **scale-down** — patch `spec.replicas` peak → base. Wait until live
pod count reaches `base * podsPerReplica`.
4. **hold-base** — 30 s window.
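
Roughly, one cycle amounts to the Go sketch below. `scaleTo`, `waitFor`, `podsAtLeast`, and `podsAtMost` are hypothetical stand-ins for `ScalePCS` and the pod-count conditions, and the real test would express the holds as `TimerCondition` phases rather than sleeps:

```go
// One churn cycle (sketch only; podsPerReplica is 2 for this fixture).
for c := 0; c < cycles; c++ {
	scaleTo(ctx, pcs, peak)                        // 1. scale-up: base -> peak
	waitFor(ctx, podsAtLeast(peak*podsPerReplica)) // cf. PodsCreatedCondition
	time.Sleep(30 * time.Second)                   // 2. hold-peak

	scaleTo(ctx, pcs, base)                        // 3. scale-down: peak -> base
	waitFor(ctx, podsAtMost(base*podsPerReplica))  // cf. PodsAtCountCondition
	time.Sleep(30 * time.Second)                   // 4. hold-base
}
```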

**Final-check assertions:**

- Live pod count equals base target (no leaked pods).
- No `PodCliqueScalingGroup` or `PodClique` exists outside the expected
child set.
- `status.updateProgress` (if present) has bounded counters:
`UpdatedPodCliquesCount ≤ TotalPodCliquesCount`.
- `status.lastErrors` is empty.
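
A minimal sketch of the leaked-pod check, assuming the usual `corev1` and controller-runtime `client` imports; the namespace and label selector are placeholders, and the status-field assertions are omitted because the PCS types are not shown here:

```go
// Final check: no leaked pods after the last scale-down (sketch only).
var pods corev1.PodList
if err := cl.List(ctx, &pods,
	client.InNamespace(ns),                     // test namespace (placeholder)
	client.MatchingLabels{"app": "soak-churn"}, // fixture label (placeholder)
); err != nil {
	t.Fatalf("listing pods: %v", err)
}
if got, want := len(pods.Items), base*podsPerReplica; got != want {
	t.Fatalf("leaked pods: got %d live pods, want %d", got, want)
}
```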

---

### Why is this needed?

Today's scale benchmarks run for minutes and measure a single event
(deploy, resize, delete). They are good at catching gross regressions
in per-event throughput, but they pass cleanly on a controller that is
slowly leaking heap, accumulating stale entries in an informer index,
or letting one status field drift by an item per reconcile.

Issues we have personally been bitten by, and that a soak/churn test
**would catch**:

- **#567 — unbounded slices in status.** The original
`UpdatedPodCliques` / `UpdatedPodCliqueScalingGroups` slices grew
with every reconcile. A single-shot benchmark looked fine; a soak
test would have flagged it within an hour.
- **Finalizer / cascade-delete drift.** Repeated scale-down cycles
exercise the partial-delete path many times. Any reconcile that
forgets to clear an expectation, or that re-emits a finalizer, will
accumulate across cycles.
- **Watch-event amplification.** Status mutators that fire a Patch on
byte-identical status (no real change) waste a write per reconcile
and trigger a cascade of watch events. The equality-short-circuit
guards added in #567 prevent this, but only a churn test exercises
the steady-state-Patch-suppression path enough to verify it.
- **Heap growth without a leak.** Even without an obvious bug, churn
reveals whether the controller's working-set memory plateaus or
keeps climbing. A pprof window over the second half of the run is
the cheapest way to spot a slow trend.

---

## Design Detail

The sections below are for the implementation PR — they are not part of
the GitHub issue body.

### Goals

- Drive ~10 scale-up / scale-down cycles against a single PCS over a
duration long enough to expose slow leaks (target: 30–60 min).
- Run on the smaller dev fixture (≤ 30 worker nodes, ~100 pods peak).
The point is reconcile churn, not absolute capacity.
- Reuse the existing `runScaleTest` scaffolding and pprof hook so the
output format and capture mechanism match the rest of the suite.
- Assert at the end that the PCS reaches its final steady state
cleanly — no orphaned children, counters bounded, no unexpected
`lastErrors`.

### Non-Goals

- Multi-PCS / multi-tenant soak. Single-PCS churn is enough to surface
controller-internal leaks; multi-tenant belongs in a separate test.
- Failure-injection (operator restarts, API server flakes). The soak
test is for steady-state churn; chaos is out of scope.
- Comparison against historical runs. The output is captured in the
same JSON format as other scale tests, but no regression gate is
defined here.

### Test Shape

Three top-level phases:

1. **deploy** — apply the YAML at base; wait for `pods-ready`.
2. **churn-cycles** — loop N times executing the cycle shape above.
Each cycle adds milestones to the timeline so the per-cycle cost is
visible in the output.
3. **soak-final-check** — assertions listed in the issue body; if any
fails the test fails.

There is no explicit `delete` phase — the test fixture cleanup tears
the PCS down at the end. (If we want a delete-time measurement, add
it; otherwise keep the scope tight.)

### Implementation Notes

- New file: `operator/e2e/tests/scale/soak_test.go`.
- New YAML fixture: `operator/e2e/yaml/soak-churn.yaml`. PCS sized at
the base (25 replicas, 50 pods) so the deploy phase is short and the
churn phases dominate the timeline.
- Per-cycle phases can be added programmatically in a loop inside the
test's `addPhases` callback — each iteration calls
`tracker.AddPhase` four times with unique names suffixed by the
  cycle index (e.g. `scale-up-c3`); a sketch follows this list.
- Reuses existing `ScalePCS`, `PodsCreatedCondition`, and
`TimerCondition`. **Same caveat as the scale-down doc:** the
scale-down legs need a "live pod count drops to N" condition that
`PodsCreatedCondition` (which is monotonic-up) does not provide.
- The pprof hook in `setupPprofHook` is already per-run; one capture
spans the whole soak. If we want per-cycle pprof captures, that's a
follow-up.
- Make cycle count and amplitude env-tunable (`SOAK_CYCLES`,
`SOAK_BASE`, `SOAK_PEAK`) so the same test can run at small scale in
CI and at larger scale on dev clusters without a new test function.
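
A sketch of the per-cycle phase loop described above. The exact `tracker.AddPhase` signature is not shown here, so the arguments (a phase name plus a completion condition) are assumptions, and `podsAtLeast`, `podsAtMost`, and `timer` stand in for constructors of `PodsCreatedCondition`, `PodsAtCountCondition`, and `TimerCondition`:

```go
// Assumed AddPhase(name string, cond Condition) shape; names carry the cycle
// index so each leg shows up separately in the timeline output.
for c := 1; c <= cycles; c++ {
	tracker.AddPhase(fmt.Sprintf("scale-up-c%d", c), podsAtLeast(peak*podsPerReplica))
	tracker.AddPhase(fmt.Sprintf("hold-peak-c%d", c), timer(30*time.Second))
	tracker.AddPhase(fmt.Sprintf("scale-down-c%d", c), podsAtMost(base*podsPerReplica))
	tracker.AddPhase(fmt.Sprintf("hold-base-c%d", c), timer(30*time.Second))
}
```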

### Open Questions

- Should the soak test be tagged with a build tag stricter than `e2e`
(e.g. `e2e_soak`) so it isn't pulled into the default suite by
accident? The 30-min default runtime is too long for routine CI.
- Where should the final-check assertions live? Adding them as
measurement conditions keeps the timeline coherent; embedding them
as `t.Fatalf` checks after `tracker.Wait()` is simpler but breaks
the "everything is a phase" pattern.
- Do we want a heap-comparison assertion (e.g. compare pprof heap
samples between cycle 1 and cycle N, fail if delta exceeds a
threshold)? Useful but stateful — defer until the first real
regression motivates it.