test(e2e): support multiple scheduler backends (draft) #595
Draft
brluobt wants to merge 2 commits into
Author
L20 validation: ✅ all PASS
Lightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes).
Results
What each run proves
Run 1 (KAI agnostic): removing hardcoded
Run 2 (default-scheduler): the new framework end-to-end:
Notes
This was referenced May 9, 2026
Restructures the E2E suite from KAI-only to N-backend ready, landing KAI (primary) and default-scheduler. KAI rows are functionally unchanged; default-scheduler rows light up new sensitive coverage (RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai. Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files); the operator's PreparePod() assigns from defaultProfileName. The new hack/e2e-default-scheduler.yaml infra-manager preset disables KAI install/queues, and _run_prepull's KAI image list is now gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go auto-skips capability-gated tests when the active backend does not provide the required capability. DiscoverCapabilities reads the live OperatorConfiguration via grove/config and joins it with a hardcoded backend->interface table; a unit test cross-checks the table against actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates, ondelete_updates, startup_ordering) via a new create_flags field that threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>
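The capability gate described in this commit message can be pictured with a minimal Go sketch. This is not the code in operator/e2e/tests/capabilities.go: the `Capability` type, the `TopologyAware` constant, the `skipper` interface, the `recorder` fake, and the explicit `active` backend argument are assumptions made for illustration (the PR's gate is called as RequireCapability(t, ...) and discovers the active backend from the live OperatorConfiguration). The table contents follow the description only loosely: KAI provides gang scheduling, default-scheduler provides none of the gated capabilities.

```go
package main

import "fmt"

// Capability names a scheduler feature a test may depend on.
// GangScheduling appears in the PR; TopologyAware is a hypothetical name.
type Capability string

const (
	GangScheduling Capability = "GangScheduling"
	TopologyAware  Capability = "TopologyAware"
)

// backendCapabilities is an illustrative hardcoded backend -> capability table.
var backendCapabilities = map[string]map[Capability]bool{
	"kai-scheduler":     {GangScheduling: true, TopologyAware: true},
	"default-scheduler": {},
}

// skipper is the subset of *testing.T the gate needs (assumption: lets the
// sketch run outside `go test`). *testing.T satisfies it.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// RequireCapability skips the calling test when the active backend lacks
// any of the required capabilities. Unknown backends have no capabilities.
func RequireCapability(t skipper, active string, caps ...Capability) {
	t.Helper()
	for _, c := range caps {
		if !backendCapabilities[active][c] {
			// With a real *testing.T, Skipf also stops the test goroutine.
			t.Skipf("backend %q does not provide capability %q", active, c)
			return
		}
	}
}

// recorder is a fake skipper that records whether a skip was requested.
type recorder struct {
	skipped bool
	reason  string
}

func (r *recorder) Helper() {}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = true
	r.reason = fmt.Sprintf(format, args...)
}

func main() {
	for _, backend := range []string{"kai-scheduler", "default-scheduler"} {
		r := &recorder{}
		RequireCapability(r, backend, GangScheduling)
		fmt.Printf("%s: skipped=%v %s\n", backend, r.skipped, r.reason)
	}
}
```

In a real test the call would sit at the top of the test function, so a capability-gated test reports SKIP rather than FAIL on a backend that cannot satisfy it.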
Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default` to `*_default-scheduler` (rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler) in both the e2e and e2e-skip matrices. The longer suffix matches the actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler") and removes ambiguity in the GitHub Actions UI when scanning a long matrix at a glance.

2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a non-existent design proposal with a concrete activation pointer (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>
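As a rough illustration of how a per-row create_flags value reaches infra-manager.py setup, the sketch below splits an E2E_CREATE_FLAGS-style string into arguments. The `setupArgs` helper is hypothetical and written in Go only for illustration; the actual threading happens in the CI workflow and test tooling, which this PR page does not show. An empty value adds no arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// setupArgs builds the argument list for `infra-manager.py setup` from a
// create_flags string such as "-f hack/e2e-default-scheduler.yaml".
// Hypothetical helper: an empty or whitespace-only string adds nothing.
func setupArgs(createFlags string) []string {
	args := []string{"infra-manager.py", "setup"}
	return append(args, strings.Fields(createFlags)...)
}

func main() {
	fmt.Println(setupArgs(""))
	fmt.Println(setupArgs("-f hack/e2e-default-scheduler.yaml"))
}
```

Because an empty string contributes no arguments, a matrix row could in principle leave the field unset as long as the consumer treats a missing value as empty.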
e49cea1 to
6aababc
kangclzjc
reviewed
May 18, 2026
# --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) ---
- test_name: gang_scheduling
  test_pattern: "^Test_GS"
  create_flags: ""
Contributor
If this is empty, can we just not set this parameter?
kangclzjc
reviewed
May 18, 2026
# of whether E2E ran or was skipped.
- test_name: gang_scheduling
- test_name: rolling_updates
- test_name: ondelete_updates
Contributor
Is this because we didn't run the ondelete_updates test before?
What type of PR is this?
/kind feature
What this PR does / why we need it:
Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.
Three-tier test classification (per RequireCapability runtime gating):
- RequireCapability(t, ...) auto-skips otherwise.
Six configuration layers touched (purely additive; no Makefile/test-runner/Helm-rendering changes):
- schedulerName: kai-scheduler; operator's PreparePod() injects from defaultProfileName. Same workload YAML now runs unmodified on every backend.
- topology-test → e2e-kai; add e2e-default-scheduler.
- hack/e2e-default-scheduler.yaml overlay; KAI image prepull gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
- operator/e2e/tests/{capabilities.go, capability_discovery.go}; cross-check unit test in capabilities_test.go fails the build if the hardcoded backend→capability table drifts from actual Go interface assertions.
- RequireCapability(t, GangScheduling) etc. added to GS and TAS suites.
- create_flags field threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS. default-scheduler rows: rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler. e2e-skip mirrors for branch protection.
Which issue(s) this PR fixes:
Refs #594
(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)
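The cross-check between the hardcoded backend→capability table and actual Go interface assertions, mentioned in the description above, might look roughly like the following sketch. Everything here is assumed for illustration: the `GangScheduler` interface, its `ScheduleGang` method, the `kaiBackend` and `kubeBackend` types, and the `findDrift` helper are not taken from capabilities_test.go. The idea is only that a type assertion reveals what each registered backend really implements, and any disagreement with the table is reported.

```go
package main

import (
	"fmt"
	"sort"
)

// GangScheduler is a hypothetical capability interface a backend may implement.
type GangScheduler interface {
	ScheduleGang(group string) error
}

// kaiBackend implements GangScheduler; kubeBackend does not (illustrative types).
type kaiBackend struct{}

func (kaiBackend) ScheduleGang(string) error { return nil }

type kubeBackend struct{}

// findDrift returns, in sorted order, the backends whose table entry disagrees
// with what a Go type assertion says they implement.
func findDrift(backends map[string]any, table map[string]bool) []string {
	var drifted []string
	for name, b := range backends {
		_, implements := b.(GangScheduler)
		if implements != table[name] {
			drifted = append(drifted, name)
		}
	}
	sort.Strings(drifted)
	return drifted
}

func main() {
	backends := map[string]any{
		"kai-scheduler":     kaiBackend{},
		"default-scheduler": kubeBackend{},
	}
	accurate := map[string]bool{"kai-scheduler": true, "default-scheduler": false}
	stale := map[string]bool{"kai-scheduler": true, "default-scheduler": true}

	fmt.Println(findDrift(backends, accurate)) // no drift expected
	fmt.Println(findDrift(backends, stale))    // default-scheduler is misdeclared
}
```

A unit test built on this pattern would fail whenever the returned list is non-empty, which is how a stale table entry could be caught before it silently changes which tests are skipped.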
Special notes for your reviewer:
Two commits in this PR:
1. `test(e2e): support multiple scheduler backends`: cherry-picked from @kangclzjc's branch, message prefix updated from `GREP:` to `test(e2e):`. Author preserved as Kang.
2. `test(e2e): polish multi-backend matrix naming and infra preset`: two cosmetic items:
   - _default → _default-scheduler (full backend name in the GHA UI; matches configv1alpha1.SchedulerNameKube value).
   - hack/e2e-default-scheduler.yaml: replace dangling reference to a non-existent design doc with a concrete activation pointer.
Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain Co-authored-by: Kang Zhang <kangz@nvidia.com> and both Signed-off-by lines.
Out of scope for this PR (deferred to follow-ups, tracked by #594):
- operator/** or .github/** changes. Refining to backend-aware filters (scheduler/kai/** → KAI rows only, etc.) is a separate PR.
L20 validation ✅: lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:
- cert_management + crd_installer): PASS
- rolling_updates): PASS
Full evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: