test(e2e): support multiple scheduler backends (draft) #595

Draft
brluobt wants to merge 2 commits into ai-dynamo:main from brluobt:e2e-multi-scheduler-backend

brluobt commented May 9, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.

Three-tier test classification (per RequireCapability runtime gating):

  • Agnostic (CM, CRD): primary backend only — no scheduler-specific code path.
  • Sensitive (RU, OD, SO): every enabled backend — behavior may diverge subtly across implementations.
  • Capability-gated (GS, TAS, AutoMNNVL): backends that declare the capability — RequireCapability(t, ...) auto-skips otherwise.
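For reviewers who prefer code to prose, the three tiers can be read as a selection rule over the enabled backends. The following is only a minimal sketch: `Tier`, `Backend`, and `backendsFor` are hypothetical names that do not appear in this PR, and the assumption that KAI declares `GangScheduling` is for illustration.

```go
package main

import "fmt"

// Tier is a hypothetical label for the three test classes described above.
type Tier int

const (
	Agnostic        Tier = iota // no scheduler-specific code path
	Sensitive                   // behavior may diverge across backends
	CapabilityGated             // needs a capability the backend declares
)

// Backend is an illustrative stand-in for one enabled scheduler backend.
type Backend struct {
	Name         string
	Primary      bool
	Capabilities map[string]bool
}

// backendsFor returns the names of the backends a suite should run on.
// required is only consulted for capability-gated suites.
func backendsFor(tier Tier, required string, enabled []Backend) []string {
	var out []string
	for _, b := range enabled {
		switch tier {
		case Agnostic:
			if b.Primary {
				out = append(out, b.Name)
			}
		case Sensitive:
			out = append(out, b.Name)
		case CapabilityGated:
			// Lookup on a nil map is safe in Go and yields false.
			if b.Capabilities[required] {
				out = append(out, b.Name)
			}
		}
	}
	return out
}

func main() {
	// Assumed setup: KAI is primary and declares GangScheduling.
	enabled := []Backend{
		{Name: "kai-scheduler", Primary: true, Capabilities: map[string]bool{"GangScheduling": true}},
		{Name: "default-scheduler"},
	}
	fmt.Println("agnostic:", backendsFor(Agnostic, "", enabled))
	fmt.Println("sensitive:", backendsFor(Sensitive, "", enabled))
	fmt.Println("gated (GangScheduling):", backendsFor(CapabilityGated, "GangScheduling", enabled))
}
```

In the PR itself the capability-gated case is decided at runtime by RequireCapability rather than by a static selector like this one; the sketch only shows the intended outcome per tier.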

Six configuration layers touched (purely additive — no Makefile/test-runner/Helm-rendering changes):

  1. Workload YAMLs (×22): drop schedulerName: kai-scheduler; operator's PreparePod() injects from defaultProfileName. Same workload YAML now runs unmodified on every backend.
  2. Skaffold profiles: rename topology-test → e2e-kai; add e2e-default-scheduler.
  3. infra-manager presets: new hack/e2e-default-scheduler.yaml overlay; KAI image prepull gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
  4. Capability discovery: operator/e2e/tests/{capabilities.go, capability_discovery.go}; cross-check unit test in capabilities_test.go fails the build if the hardcoded backend→capability table drifts from actual Go interface assertions.
  5. Test gates: RequireCapability(t, GangScheduling) etc. added to GS and TAS suites.
  6. CI matrix: new create_flags field threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS. default-scheduler rows: rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler. e2e-skip mirrors for branch protection.
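Layers 4 and 5 above are easier to review with a concrete shape in mind. The sketch below is an assumption-laden illustration, not the code in capabilities.go: the interface names (`GangScheduler`, `TopologyScheduler`), the `TopologyAware` capability name, the `drift` helper, and the extra `active` parameter on `requireCapability` are all hypothetical. It shows a hardcoded table being cross-checked against Go interface assertions, and a gate that skips when the active backend lacks a capability.

```go
package main

import (
	"fmt"
	"sort"
)

// Capability names a scheduler feature a test may require (names illustrative).
type Capability string

const (
	GangScheduling Capability = "GangScheduling"
	TopologyAware  Capability = "TopologyAware"
)

// Hypothetical optional interfaces; the real backend interfaces may differ.
type GangScheduler interface{ ScheduleGang() }
type TopologyScheduler interface{ ScheduleTopology() }

type kaiBackend struct{}

func (kaiBackend) ScheduleGang()     {}
func (kaiBackend) ScheduleTopology() {}

type defaultBackend struct{}

// capabilityTable is the hand-maintained backend -> capability table.
var capabilityTable = map[string][]Capability{
	"kai-scheduler":     {GangScheduling, TopologyAware},
	"default-scheduler": {},
}

// derive reports which capabilities a backend value provides, judged purely
// by the optional interfaces it satisfies.
func derive(b any) map[Capability]bool {
	caps := map[Capability]bool{}
	if _, ok := b.(GangScheduler); ok {
		caps[GangScheduling] = true
	}
	if _, ok := b.(TopologyScheduler); ok {
		caps[TopologyAware] = true
	}
	return caps
}

// drift returns the sorted names of backends whose table row disagrees with
// the interface assertions; a unit test would fail when it is non-empty.
func drift(registered map[string]any) []string {
	var bad []string
	for name, b := range registered {
		want := derive(b)
		row := map[Capability]bool{}
		for _, c := range capabilityTable[name] {
			row[c] = true
		}
		same := len(row) == len(want)
		for c := range row {
			same = same && want[c]
		}
		if !same {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

// skipper is the subset of *testing.T the gate needs.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// requireCapability skips the test when the active backend's row lacks want.
func requireCapability(t skipper, active string, want Capability) {
	t.Helper()
	for _, c := range capabilityTable[active] {
		if c == want {
			return
		}
	}
	t.Skipf("backend %q does not declare capability %s", active, want)
}

// recorder is a fake skipper so the sketch runs outside `go test`.
type recorder struct{ skipped bool }

func (r *recorder) Helper()              {}
func (r *recorder) Skipf(string, ...any) { r.skipped = true }

func main() {
	registered := map[string]any{"kai-scheduler": kaiBackend{}, "default-scheduler": defaultBackend{}}
	fmt.Println("drift:", drift(registered))
	r := &recorder{}
	requireCapability(r, "default-scheduler", GangScheduling)
	fmt.Println("skipped on default-scheduler:", r.skipped)
}
```

Passing `*testing.T` as the `skipper` works because it provides both `Helper` and `Skipf`; with a real `*testing.T`, `Skipf` also stops the test, which the `recorder` fake does not emulate.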

Which issue(s) this PR fixes:

Refs #594

(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)

Special notes for your reviewer:

Two commits in this PR:

  1. test(e2e): support multiple scheduler backends — cherry-picked from @kangclzjc's branch, message prefix updated from GREP: to test(e2e):. Author preserved as Kang.
  2. test(e2e): polish multi-backend matrix naming and infra preset — two cosmetic items:
    • CI matrix: rename *_default → *_default-scheduler (full backend name in the GHA UI; matches configv1alpha1.SchedulerNameKube value).
    • hack/e2e-default-scheduler.yaml: replace dangling reference to a non-existent design doc with a concrete activation pointer.

Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain Co-authored-by: Kang Zhang <kangz@nvidia.com> and both Signed-off-by lines.

Out of scope for this PR (deferred to follow-ups, tracked by #594):

  • Path-aware PR matrix: today the matrix runs all configured rows when operator/** or .github/** changes. Refining to backend-aware filters (scheduler/kai/** → KAI rows only, etc.) is a separate PR.
  • Nightly workflow: exhaustive (suite × capable backend) matrix on schedule.
  • Scheduler version coverage: the capability table currently assumes a single pinned version per backend; multi-version coverage is a known gap to address later.

L20 validation ✅ — lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:

  • Run 1 — KAI baseline (cert_management + crd_installer): PASS
  • Run 2 — default-scheduler new path (rolling_updates): PASS

Full evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE


copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


brluobt commented May 9, 2026

L20 validation: ✅ all PASS

Lightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes).

Results

| Run | Backend | Tests | Result |
|-----|---------|-------|--------|
| 1 | KAI (primary) | Test_CM1_CertManagementRoundTrip, Test_CRD_Installer_AllCRDsExist, Test_CRD_Installer_InitContainerCompleted, Test_CRD_Installer_Idempotent | ✅ PASS |
| 2 | default-scheduler (new framework path) | All 20 Test_RU* (rolling_updates) | ✅ PASS |

What each run proves

Run 1 (KAI agnostic): removing hardcoded schedulerName: kai-scheduler from the 22 workload YAMLs and relying on PreparePod() to inject from defaultProfileName does not regress the KAI baseline. cert_management exercises the webhook TLS round-trip with workload deployment; crd_installer exercises the init-container path. Both pass end-to-end on KAI.

Run 2 (default-scheduler): the new framework end-to-end:

  • Skaffold profile e2e-default-scheduler deploys the operator with defaultProfileName: default-scheduler.
  • infra-manager preset hack/e2e-default-scheduler.yaml skips KAI installation; KAI image prepull is correctly gated off via cfg.scheduler.kai.enabled.
  • DiscoverCapabilities() reads the live OperatorConfiguration and resolves the active backend.
  • PreparePod() injects default-scheduler into pod specs from the unmodified workload YAMLs.
  • All 20 rolling_updates tests pass against default-scheduler — confirming sensitive-tier behavior is sound on the new backend.
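The injection step in the flow above is what lets one workload YAML serve every backend. As a rough sketch of the idea only: `PodSpec`, `OperatorConfig`, and `injectScheduler` below are simplified, hypothetical stand-ins rather than the operator's PreparePod() code, and leaving an explicitly set schedulerName untouched is an assumption of the sketch.

```go
package main

import "fmt"

// PodSpec is a minimal stand-in for the one pod-spec field involved here.
type PodSpec struct {
	SchedulerName string
}

// OperatorConfig is a hypothetical slice of the operator configuration.
type OperatorConfig struct {
	DefaultProfileName string
}

// injectScheduler sets schedulerName from the default profile when the
// workload did not pin one. Preserving an explicit value is an assumption
// of this sketch, not confirmed operator behavior.
func injectScheduler(cfg OperatorConfig, spec *PodSpec) {
	if spec.SchedulerName == "" {
		spec.SchedulerName = cfg.DefaultProfileName
	}
}

func main() {
	workload := PodSpec{} // workload YAML with no schedulerName
	for _, profile := range []string{"kai-scheduler", "default-scheduler"} {
		spec := workload // copy, so each profile starts from the same YAML
		injectScheduler(OperatorConfig{DefaultProfileName: profile}, &spec)
		fmt.Printf("profile=%s -> schedulerName=%s\n", profile, spec.SchedulerName)
	}
}
```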

Notes

  • Run 1 first attempt failed at make run-e2e with /bin/sh: syntax error near unexpected token '('. Root cause was unrelated to the framework: grove's run-e2e-full Makefile target expands $(TEST_PATTERN) unquoted, so the original pattern ^(Test_CM|Test_CRD_Installer) was reparsed by sh as a subshell. Re-ran with the equivalent prefix-only pattern ^Test_C (only Test_CM* and Test_CRD_Installer* start with Test_C in operator/e2e/tests/); all 4 tests passed.
  • Both runs created and tore down their own k3d cluster (shared-e2e-test-cluster), no leftover state.
  • This was lightweight validation by design — it does not exercise gang_scheduling, topology_aware_scheduling, auto_mnnvl, or resource_sharing (capability-gated, expected to skip on default-scheduler via RequireCapability). The CI matrix in this PR will exercise those on KAI.
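On the first note: the claim that ^Test_C is equivalent to ^(Test_CM|Test_CRD_Installer) holds only for a given set of test names, so it is worth checking mechanically. The snippet below is a small sketch using Go's regexp package against the four Run 1 test names from the table; `selectionsAgree` and the two `*_Example` names are hypothetical and exist only to show where the patterns would diverge.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The pattern that sh rejected when expanded unquoted.
	original = regexp.MustCompile(`^(Test_CM|Test_CRD_Installer)`)
	// The prefix-only pattern used for the re-run.
	fallback = regexp.MustCompile(`^Test_C`)
)

// selectionsAgree reports whether both patterns pick exactly the same names.
func selectionsAgree(names []string) bool {
	for _, n := range names {
		if original.MatchString(n) != fallback.MatchString(n) {
			return false
		}
	}
	return true
}

func main() {
	names := []string{
		"Test_CM1_CertManagementRoundTrip",
		"Test_CRD_Installer_AllCRDsExist",
		"Test_CRD_Installer_InitContainerCompleted",
		"Test_CRD_Installer_Idempotent",
		"Test_RU_Example", // hypothetical name that neither pattern should match
	}
	fmt.Println("agree on listed names:", selectionsAgree(names))

	// A hypothetical future test starting with Test_C would break the equivalence.
	withNew := append(names, "Test_Cleanup_Example")
	fmt.Println("agree after adding Test_Cleanup_Example:", selectionsAgree(withNew))
}
```

The equivalence is therefore a property of the current contents of operator/e2e/tests/, not of the patterns themselves; quoting the pattern in the Makefile target would remove the need for the workaround.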

kangclzjc and others added 2 commits May 12, 2026 07:47

test(e2e): support multiple scheduler backends

Restructures the E2E suite from KAI-only to N-backend ready, landing
KAI (primary) and default-scheduler. KAI rows are functionally
unchanged; default-scheduler rows light up new sensitive coverage
(RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai.
Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files);
the operator's PreparePod() assigns from defaultProfileName. The new
hack/e2e-default-scheduler.yaml infra-manager preset disables KAI
install/queues, and _run_prepull's KAI image list is now gated on
cfg.scheduler.kai.enabled so default-scheduler rows skip the unused
pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go
auto-skips capability-gated tests when the active backend does not
provide the required capability. DiscoverCapabilities reads the live
OperatorConfiguration via grove/config and joins it with a hardcoded
backend->interface table; a unit test cross-checks the table against
actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates,
ondelete_updates, startup_ordering) via a new create_flags field that
threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs
in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>

test(e2e): polish multi-backend matrix naming and infra preset

Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default`
   to `*_default-scheduler` (rolling_updates_default-scheduler,
   ondelete_updates_default-scheduler, startup_ordering_default-scheduler)
   in both the e2e and e2e-skip matrices. The longer suffix matches the
   actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler")
   and removes ambiguity in the GitHub Actions UI when scanning a long
   matrix at a glance.

2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a
   non-existent design proposal with a concrete activation pointer
   (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>

brluobt force-pushed the e2e-multi-scheduler-backend branch from e49cea1 to 6aababc on May 12, 2026 07:47

# --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) ---
- test_name: gang_scheduling
  test_pattern: "^Test_GS"
  create_flags: ""
Contributor


If this is empty, can we just not set this parameter?

# of whether E2E ran or was skipped.
- test_name: gang_scheduling
- test_name: rolling_updates
- test_name: ondelete_updates
Contributor


Is this because we didn't run the ondelete_updates tests before?


Development

Successfully merging this pull request may close these issues.

[E2E] Multi-backend E2E test framework
