test(e2e): support multiple scheduler backends (draft) by brluobt · Pull Request #595 · ai-dynamo/grove

brluobt · 2026-05-09T07:11:30Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.

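Three-tier test classification (with runtime gating via RequireCapability):

- Agnostic (CM = cert_management, CRD = crd_installer): primary backend only, since these suites have no scheduler-specific code path.
- Sensitive (RU = rolling_updates, OD = ondelete_updates, SO = startup_ordering): every enabled backend, since behavior may diverge subtly across implementations.
- Capability-gated (GS = gang_scheduling, TAS = topology_aware_scheduling, AutoMNNVL = auto_mnnvl): only backends that declare the capability; RequireCapability(t, ...) auto-skips otherwise.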
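To make the three tiers concrete, here is a minimal Go sketch of how a suite's tier could map to the set of backends it runs on. It is illustrative only: the `Tier`, `Backend`, and `BackendsFor` names are hypothetical and not taken from the PR, and the assumption that KAI declares `GangScheduling` is inferred from the description rather than read from the code.

```go
package main

import "fmt"

// Tier mirrors the three-tier classification described above.
type Tier int

const (
	Agnostic Tier = iota
	Sensitive
	CapabilityGated
)

// Backend is a hypothetical description of one enabled scheduler backend.
type Backend struct {
	Name         string
	Primary      bool
	Capabilities map[string]bool
}

// BackendsFor returns the names of the backends a suite should run on.
// The required capability is only consulted for capability-gated suites.
func BackendsFor(tier Tier, required string, enabled []Backend) []string {
	var out []string
	for _, b := range enabled {
		switch tier {
		case Agnostic:
			// Agnostic suites run once, on the primary backend.
			if b.Primary {
				out = append(out, b.Name)
			}
		case Sensitive:
			// Sensitive suites run on every enabled backend.
			out = append(out, b.Name)
		case CapabilityGated:
			// Reading a nil map yields false, so backends without a
			// capability table are simply excluded.
			if b.Capabilities[required] {
				out = append(out, b.Name)
			}
		}
	}
	return out
}

func main() {
	enabled := []Backend{
		// Assumption: KAI declares GangScheduling; default-scheduler does not.
		{Name: "kai-scheduler", Primary: true, Capabilities: map[string]bool{"GangScheduling": true}},
		{Name: "default-scheduler"},
	}
	fmt.Println("agnostic:", BackendsFor(Agnostic, "", enabled))
	fmt.Println("sensitive:", BackendsFor(Sensitive, "", enabled))
	fmt.Println("gang-gated:", BackendsFor(CapabilityGated, "GangScheduling", enabled))
}
```

In the PR itself the selection is spread across the CI matrix and the runtime gate rather than a single function; the sketch only collapses the rule into one place for readability.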

Six configuration layers touched (purely additive: no Makefile, test-runner, or Helm-rendering changes):

- Workload YAMLs (×22): drop the hardcoded schedulerName: kai-scheduler; the operator's PreparePod() injects the scheduler name from defaultProfileName. The same workload YAML now runs unmodified on every backend.
- Skaffold profiles: rename topology-test → e2e-kai; add e2e-default-scheduler.
- infra-manager presets: new hack/e2e-default-scheduler.yaml overlay; KAI image prepull gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
- Capability discovery: operator/e2e/tests/{capabilities.go, capability_discovery.go}; a cross-check unit test in capabilities_test.go fails the build if the hardcoded backend→capability table drifts from the actual Go interface assertions.
- Test gates: RequireCapability(t, GangScheduling) etc. added to the GS and TAS suites.
- CI matrix: new create_flags field threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS. default-scheduler rows: rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler. e2e-skip mirrors them for branch protection.
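The drift check mentioned under capability discovery can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of capabilities_test.go: the interfaces `GangScheduler` and `TopologyAware`, the backend types, and the `Drift` helper are all hypothetical stand-ins, and which capabilities KAI actually implements is assumed here.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical capability interfaces; the real ones live in the operator.
type GangScheduler interface{ SupportsGang() }
type TopologyAware interface{ SupportsTopology() }

// kaiBackend is assumed to implement both capability interfaces.
type kaiBackend struct{}

func (kaiBackend) SupportsGang()     {}
func (kaiBackend) SupportsTopology() {}

// kubeBackend (default-scheduler) is assumed to implement neither.
type kubeBackend struct{}

// declared is the hand-maintained backend -> capability table.
var declared = map[string][]string{
	"kai-scheduler":     {"GangScheduling", "TopologyAwareScheduling"},
	"default-scheduler": {},
}

// derive computes capabilities from Go interface assertions.
func derive(b any) []string {
	caps := []string{}
	if _, ok := b.(GangScheduler); ok {
		caps = append(caps, "GangScheduling")
	}
	if _, ok := b.(TopologyAware); ok {
		caps = append(caps, "TopologyAwareScheduling")
	}
	return caps
}

// Drift reports every backend whose table row disagrees with its interfaces.
func Drift(registered map[string]any) []string {
	var problems []string
	for name, b := range registered {
		fromIfaces := derive(b)
		fromTable := append([]string(nil), declared[name]...)
		sort.Strings(fromIfaces)
		sort.Strings(fromTable)
		if fmt.Sprint(fromIfaces) != fmt.Sprint(fromTable) {
			problems = append(problems,
				fmt.Sprintf("%s: table=%v interfaces=%v", name, fromTable, fromIfaces))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	registered := map[string]any{
		"kai-scheduler":     kaiBackend{},
		"default-scheduler": kubeBackend{},
	}
	fmt.Println("drift:", Drift(registered))
}
```

A real unit test would call something like `Drift` and fail via `t.Fatalf` when the result is non-empty, so a backend that gains or loses an interface without a matching table update breaks the build.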

Which issue(s) this PR fixes:

Refs #594

(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)

Special notes for your reviewer:

Two commits in this PR:

1. `test(e2e): support multiple scheduler backends`: cherry-picked from @kangclzjc's branch, with the message prefix updated from `GREP:` to `test(e2e):`. Author preserved as Kang.
2. `test(e2e): polish multi-backend matrix naming and infra preset`: two cosmetic items:
   - CI matrix: rename _default → _default-scheduler (full backend name in the GHA UI; matches configv1alpha1.SchedulerNameKube value).
   - hack/e2e-default-scheduler.yaml: replace dangling reference to a non-existent design doc with a concrete activation pointer.

Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain Co-authored-by: Kang Zhang <kangz@nvidia.com> and both Signed-off-by lines.

Out of scope for this PR (deferred to follow-ups, tracked by #594):

- Path-aware PR matrix: today the matrix runs all configured rows when operator/** or .github/** changes. Refining to backend-aware filters (scheduler/kai/** → KAI rows only, etc.) is a separate PR.
- Nightly workflow: exhaustive (suite × capable backend) matrix on schedule.
- Scheduler version coverage: the capability table currently assumes a single pinned version per backend; multi-version coverage is a known gap to address later.

L20 validation ✅ — lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:

Run 1 — KAI baseline (cert_management + crd_installer): PASS
Run 2 — default-scheduler new path (rolling_updates): PASS

Full evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot · 2026-05-09T07:11:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

brluobt · 2026-05-09T08:47:05Z

L20 validation: ✅ all PASS

Lightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes).

Results

| Run | Backend | Tests | Result |
| --- | --- | --- | --- |
| 1 | KAI (primary) | `Test_CM1_CertManagementRoundTrip`, `Test_CRD_Installer_AllCRDsExist`, `Test_CRD_Installer_InitContainerCompleted`, `Test_CRD_Installer_Idempotent` | ✅ PASS |
| 2 | default-scheduler (new framework path) | All 20 `Test_RU*` (rolling_updates) | ✅ PASS |

What each run proves

Run 1 (KAI agnostic): removing hardcoded schedulerName: kai-scheduler from the 22 workload YAMLs and relying on PreparePod() to inject from defaultProfileName does not regress the KAI baseline. cert_management exercises the webhook TLS round-trip with workload deployment; crd_installer exercises the init-container path. Both pass end-to-end on KAI.
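The injection behavior that Run 1 relies on can be sketched in a few lines of Go. This is not the operator's PreparePod() implementation: `PodSpec` is a stand-in for the corresponding field of the Kubernetes pod spec, `injectSchedulerName` is a hypothetical helper, and the choice to leave an explicitly set scheduler name untouched is an assumption, since the PR text only says the name is injected from defaultProfileName.

```go
package main

import "fmt"

// PodSpec is a stand-in carrying only the field relevant to this sketch.
type PodSpec struct {
	SchedulerName string
}

// injectSchedulerName fills SchedulerName from the operator's default profile
// when the workload YAML left it empty.
// Assumption: an explicitly set name is preserved rather than overwritten.
func injectSchedulerName(spec *PodSpec, defaultProfileName string) {
	if spec.SchedulerName == "" {
		spec.SchedulerName = defaultProfileName
	}
}

func main() {
	// The same (scheduler-less) workload, prepared under two operator configs.
	for _, profile := range []string{"kai-scheduler", "default-scheduler"} {
		spec := PodSpec{}
		injectSchedulerName(&spec, profile)
		fmt.Printf("defaultProfileName=%s -> schedulerName=%s\n", profile, spec.SchedulerName)
	}
}
```

This is why one workload YAML can serve every backend: the scheduler choice moves out of the 22 manifests and into the operator configuration.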

Run 2 (default-scheduler) exercises the new framework end to end:

1. Skaffold profile e2e-default-scheduler deploys the operator with defaultProfileName: default-scheduler.
2. infra-manager preset hack/e2e-default-scheduler.yaml skips KAI installation; KAI image prepull is correctly gated off via cfg.scheduler.kai.enabled.
3. DiscoverCapabilities() reads the live OperatorConfiguration and resolves the active backend.
4. PreparePod() injects default-scheduler into pod specs from the unmodified workload YAMLs.
5. All 20 rolling_updates tests pass against default-scheduler, confirming that sensitive-tier behavior is sound on the new backend.
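The discovery and gating steps in this run could look roughly like the sketch below. It is an approximation under several assumptions: `OperatorConfig` models only the one field used here, the capability table contents are assumed, and `RequireCapability` takes the discovered capability set as an explicit argument, whereas the PR's gate is called as RequireCapability(t, GangScheduling). The `skipper` interface is a hypothetical subset of `*testing.T` so the example runs outside `go test`.

```go
package main

import "fmt"

// Capability names a scheduler feature a test may depend on.
type Capability string

const (
	GangScheduling          Capability = "GangScheduling"
	TopologyAwareScheduling Capability = "TopologyAwareScheduling"
)

// OperatorConfig stands in for the live OperatorConfiguration.
type OperatorConfig struct {
	DefaultProfileName string
}

// backendCaps is the hardcoded backend -> capability table (contents assumed).
var backendCaps = map[string][]Capability{
	"kai-scheduler":     {GangScheduling, TopologyAwareScheduling},
	"default-scheduler": {},
}

// DiscoverCapabilities resolves the active backend from the config and
// returns its capabilities as a set.
func DiscoverCapabilities(cfg OperatorConfig) map[Capability]bool {
	active := make(map[Capability]bool)
	for _, c := range backendCaps[cfg.DefaultProfileName] {
		active[c] = true
	}
	return active
}

// skipper is the subset of *testing.T that the gate needs.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// RequireCapability skips the calling test when the capability is missing.
func RequireCapability(t skipper, active map[Capability]bool, c Capability) {
	t.Helper()
	if !active[c] {
		t.Skipf("active backend does not provide %s", c)
	}
}

// recorder is a fake skipper. Unlike *testing.T, its Skipf only records the
// message and does not stop the caller.
type recorder struct{ skipped string }

func (r *recorder) Helper() {}
func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = fmt.Sprintf(format, args...)
}

func main() {
	for _, name := range []string{"kai-scheduler", "default-scheduler"} {
		r := &recorder{}
		active := DiscoverCapabilities(OperatorConfig{DefaultProfileName: name})
		RequireCapability(r, active, GangScheduling)
		fmt.Printf("%s: skipped=%q\n", name, r.skipped)
	}
}
```

Because `*testing.T` already provides `Helper` and `Skipf`, a real test could pass `t` directly to a gate shaped like this one.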

Notes

- The first attempt at Run 1 failed at make run-e2e with /bin/sh: syntax error near unexpected token '('. The root cause was unrelated to the framework: grove's run-e2e-full Makefile target expands $(TEST_PATTERN) unquoted, so the original pattern ^(Test_CM|Test_CRD_Installer) was reparsed by sh as a subshell. Re-ran with the equivalent prefix-only pattern ^Test_C (only Test_CM* and Test_CRD_Installer* start with Test_C in operator/e2e/tests/); all 4 tests passed.
- Both runs created and tore down their own k3d cluster (shared-e2e-test-cluster), leaving no leftover state.
- This was lightweight validation by design: it does not exercise gang_scheduling, topology_aware_scheduling, auto_mnnvl, or resource_sharing (capability-gated, expected to skip on default-scheduler via RequireCapability). The CI matrix in this PR will exercise those on KAI.
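The claim in the first note, that ^Test_C selects the same tests as ^(Test_CM|Test_CRD_Installer), can be sanity-checked with Go's regexp package, which is also what `go test -run` uses for matching. The sketch below is only a check over a sample list: the four Test_C* names come from the results table, while the Test_RU and Test_GS entries are hypothetical placeholders added to show non-matching names. It does not reproduce the shell quoting problem itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleNames mixes the four validated tests with placeholder names.
var sampleNames = []string{
	"Test_CM1_CertManagementRoundTrip",
	"Test_CRD_Installer_AllCRDsExist",
	"Test_CRD_Installer_InitContainerCompleted",
	"Test_CRD_Installer_Idempotent",
	"Test_RU_Placeholder", // hypothetical name, for contrast only
	"Test_GS_Placeholder", // hypothetical name, for contrast only
}

// selectTests returns the names matched by pattern, in input order.
func selectTests(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	original := selectTests(`^(Test_CM|Test_CRD_Installer)`, sampleNames)
	prefixOnly := selectTests(`^Test_C`, sampleNames)
	fmt.Println("original pattern:", original)
	fmt.Println("prefix-only pattern:", prefixOnly)
}
```

The equivalence holds only as long as no other test in operator/e2e/tests/ starts with Test_C, which is the condition the note states.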

Restructures the E2E suite from KAI-only to N-backend ready, landing KAI (primary) and default-scheduler. KAI rows are functionally unchanged; default-scheduler rows light up new sensitive coverage (RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai. Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files); the operator's PreparePod() assigns from defaultProfileName. The new hack/e2e-default-scheduler.yaml infra-manager preset disables KAI install/queues, and _run_prepull's KAI image list is now gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go auto-skips capability-gated tests when the active backend does not provide the required capability. DiscoverCapabilities reads the live OperatorConfiguration via grove/config and joins it with a hardcoded backend->interface table; a unit test cross-checks the table against actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates, ondelete_updates, startup_ordering) via a new create_flags field that threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>

Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default` to `*_default-scheduler` (rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler) in both the e2e and e2e-skip matrices. The longer suffix matches the actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler") and removes ambiguity in the GitHub Actions UI when scanning a long matrix at a glance.
2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a non-existent design proposal with a concrete activation pointer (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>

kangclzjc · 2026-05-18T01:14:40Z

+          # --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) ---
          - test_name: gang_scheduling
            test_pattern: "^Test_GS"
+            create_flags: ""


If this is empty, can we just not set this parameter?

kangclzjc · 2026-05-18T01:16:32Z

+          # of whether E2E ran or was skipped.
          - test_name: gang_scheduling
          - test_name: rolling_updates
+          - test_name: ondelete_updates


Is this because we didn't run the ondelete_updates test before?

This was referenced May 9, 2026

[E2E] Multi-backend E2E test framework #594

Open

feat(e2e): path-aware PR matrix selector + nightly workflow (draft) #600

Draft

kangclzjc and others added 2 commits May 12, 2026 07:47

brluobt force-pushed the e2e-multi-scheduler-backend branch from e49cea1 to 6aababc Compare May 12, 2026 07:47

kangclzjc reviewed May 18, 2026

brluobt wants to merge 2 commits into ai-dynamo:main from brluobt:e2e-multi-scheduler-backend
