feat(e2e): path-aware PR matrix selector + nightly workflow (draft) #600
Draft
brluobt wants to merge 4 commits into
Restructures the E2E suite from KAI-only to N-backend ready, landing KAI (primary) and default-scheduler. KAI rows are functionally unchanged; default-scheduler rows add new coverage for the sensitive suites (RU/OD/SO: rolling updates, OnDelete updates, startup ordering).
Skaffold adds an e2e-default-scheduler profile alongside e2e-kai. Workload YAMLs lose the hardcoded schedulerName: kai-scheduler (22 files); the operator's PreparePod() assigns it from defaultProfileName. The new hack/e2e-default-scheduler.yaml infra-manager preset disables KAI install/queues, and _run_prepull's KAI image list is now gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go auto-skips capability-gated tests when the active backend does not provide the required capability. DiscoverCapabilities reads the live OperatorConfiguration via grove/config and joins it with a hardcoded backend->interface table; a unit test cross-checks the table against actual Go interface assertions for every registered backend.
The CI matrix grows three default-scheduler rows (rolling_updates, ondelete_updates, startup_ordering) via a new create_flags field that threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs in lockstep.
Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>
Two small polish items on top of the previous commit:
1. CI matrix: rename the three default-scheduler rows from `*_default` to `*_default-scheduler` (rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler) in both the e2e and e2e-skip matrices. The longer suffix matches the actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler") and removes ambiguity in the GitHub Actions UI when scanning a long matrix at a glance.
2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a non-existent design proposal with a concrete activation pointer (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).
Signed-off-by: Bruce Luo <brluobt@gmail.com>
Adds hack/e2e-select/, a Python selector that computes the e2e matrix
based on:
- --mode pr|nightly
- --changed-files: list of changed paths (one per line; '-' for stdin)
- --labels: comma-separated PR labels
- --draft: PR draft state
It emits a JSON object with both ``run`` (the matrix that should execute)
and ``skip`` (the complement; the e2e-skip mirror emits these as
synthetic passes so branch-protection check names stay stable). The two
sets are disjoint and their union is the full matrix.
Selection logic:
- nightly : run = ALL_ROWS
- pr + run-e2e label : run = ALL_ROWS (safety escape)
- pr + draft + no run-e2e label : run = [] (draft policy)
- pr otherwise : path filter → affected backends
→ matching rows
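The selection logic above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the `Row` type, the `select` signature, and the KAI row name are hypothetical, and the path filter is assumed to have already produced `affected_backends`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    test_name: str
    backend: str


# Abbreviated stand-in for the full matrix; the KAI row name is hypothetical.
ALL_ROWS = [
    Row("rolling_updates", "kai-scheduler"),
    Row("rolling_updates_default-scheduler", "default-scheduler"),
    Row("startup_ordering_default-scheduler", "default-scheduler"),
]


def select(mode, labels, draft, affected_backends):
    """Return the run/skip split for one selector invocation."""
    if mode == "nightly":
        run = list(ALL_ROWS)          # nightly ignores paths and labels
    elif "run-e2e" in labels:
        run = list(ALL_ROWS)          # safety escape, also wins over draft
    elif draft:
        run = []                      # draft policy
    else:
        run = [r for r in ALL_ROWS if r.backend in affected_backends]
    # skip is the complement, so run and skip are disjoint and cover ALL_ROWS.
    skip = [r for r in ALL_ROWS if r not in run]
    return {"run": run, "skip": skip}
```

Computing `skip` as the complement of `run` is what keeps the two sets disjoint with a union equal to the full matrix, whichever branch fired.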
Path rules (declaration-order, first match per file):
- docs/**, *.md, ... : no e2e
- operator/internal/scheduler/kai/** : kai-scheduler rows only
- operator/internal/scheduler/kube/**: default-scheduler rows only
- operator/internal/scheduler/** : shared framework → all backends
- operator/api/**, operator/charts/**: all backends + agnostic
- operator/e2e/**, .github/**, hack/**: all backends + agnostic
- operator/** : fallback → all backends
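A sketch of how declaration-order, first-match-per-file rules could be evaluated with fnmatch. The `PATH_RULES` table and `affected_backends` helper are hypothetical names; the sketch covers only some of the rules listed above, ignores the "agnostic" rows, and assumes a file matching no rule contributes nothing.

```python
from fnmatch import fnmatchcase

ALL_BACKENDS = frozenset({"kai-scheduler", "default-scheduler"})

# Order matters: the first matching pattern decides for a given file.
PATH_RULES = [
    ("docs/**", frozenset()),                      # no e2e
    ("*.md", frozenset()),                         # no e2e
    ("operator/internal/scheduler/kai/**", frozenset({"kai-scheduler"})),
    ("operator/internal/scheduler/kube/**", frozenset({"default-scheduler"})),
    ("operator/internal/scheduler/**", ALL_BACKENDS),  # shared framework
    ("operator/**", ALL_BACKENDS),                     # fallback
]


def affected_backends(changed_files):
    """Union of the backends selected by the first matching rule per file."""
    affected = set()
    for path in changed_files:
        for pattern, backends in PATH_RULES:
            # fnmatch's "*" also matches "/", so "dir/**" covers nested paths.
            if fnmatchcase(path, pattern):
                affected |= backends
                break  # first match wins; later rules are not consulted
    return affected
```

Because the backend-specific rules sit above the generic `operator/internal/scheduler/**` rule, a change under `kai/` never reaches the shared rule, which is why a new backend's rule has to be declared above it.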
Adding a new backend: append rows to ALL_ROWS in main.py and (if the
backend has a scheduler/<name>/ subdir) a path rule above the generic
scheduler/** shared rule. testdata/ samples cover the existing cases
(kai-only, kube-only, shared, docs-only, run-e2e label, draft, nightly).
The unittest suite cross-checks generated output against the committed
golden samples (E2E_SELECT_REGENERATE=1 to update).
Run from repo root:
python3 hack/e2e-select/tests/test_selector.py
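The golden-sample cross-check could look something like the sketch below. Only the E2E_SELECT_REGENERATE=1 switch comes from the description above; the `check_golden` helper and the JSON rendering choices are assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def check_golden(actual, golden_path, regenerate=None):
    """Compare selector output with a committed golden file.

    With regeneration enabled the golden file is rewritten instead of
    compared, mirroring the E2E_SELECT_REGENERATE=1 workflow.
    """
    if regenerate is None:
        regenerate = os.environ.get("E2E_SELECT_REGENERATE") == "1"
    # Stable rendering so key order and whitespace never cause false diffs.
    rendered = json.dumps(actual, indent=2, sort_keys=True) + "\n"
    golden_path = Path(golden_path)
    if regenerate:
        golden_path.write_text(rendered, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == rendered
```

A unittest case would then assert `check_golden(output, sample_path)` for each testdata sample.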
Signed-off-by: Bruce Luo <brluobt@gmail.com>
Builds on the selector landed in the previous commit.
build-check-test.yaml
---------------------
Replaces the static ``changes`` job and hardcoded matrix with an
``e2e-select`` job that runs the Python selector and emits matrix JSON
plus per-row run/skip bookkeeping. The ``e2e`` job consumes
``fromJSON(needs.e2e-select.outputs.run)`` as its matrix; ``e2e-skip``
consumes ``.skip``. Draft policy (a non-labelled draft PR runs no real
e2e) is enforced inside the selector, so the workflow ``if:`` conditions
collapse to ``has_run == 'true'`` / ``has_skip == 'true'``.
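One way the selector side of this hand-off might look, as a sketch only. The output names ``run``, ``skip``, ``has_run`` and ``has_skip`` come from the description above; the helper names and the ``{"include": [...]}`` wrapper are assumptions about the matrix shape, not confirmed by the workflow.

```python
import json


def format_outputs(run_rows, skip_rows):
    """Build the step outputs consumed by the e2e and e2e-skip jobs."""
    return {
        # Assumed shape: a matrix object with one "include" entry per row.
        "run": json.dumps({"include": run_rows}),
        "skip": json.dumps({"include": skip_rows}),
        # Guards so a job with no rows can be skipped outright.
        "has_run": "true" if run_rows else "false",
        "has_skip": "true" if skip_rows else "false",
    }


def write_outputs(outputs, path):
    """Append name=value lines to the file named by $GITHUB_OUTPUT."""
    with open(path, "a", encoding="utf-8") as fh:
        for key, value in outputs.items():
            # json.dumps emits a single line, so the simple form is enough.
            fh.write(f"{key}={value}\n")
```

In a workflow step this would be called as `write_outputs(format_outputs(run, skip), os.environ["GITHUB_OUTPUT"])` after importing `os`.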
Effect for N=2 backends today:
- Touching operator/internal/scheduler/kai/** → 9 KAI rows run,
3 default-scheduler rows fall to e2e-skip.
- Touching operator/internal/scheduler/kube/** → reverse.
- Touching operator/internal/scheduler/types.go (shared) → all 12 run.
- Docs-only PR → all 12 fall to e2e-skip.
- Draft PR (no run-e2e label) → all 12 fall to e2e-skip.
- Any PR with run-e2e label → all 12 run (safety escape).
Required branch-protection check names (``E2E - <test_name>``) stay
constant: every PR resolves all 12 names via either e2e or e2e-skip.
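The check-name invariant can be expressed as a small assertion helper. This is an illustrative sketch: `check_names` and `covers_all_checks` are hypothetical names, and rows are reduced to their test names.

```python
def check_names(test_names):
    """Map row test names to the required check names."""
    return {f"E2E - {name}" for name in test_names}


def covers_all_checks(all_names, run_names, skip_names):
    """True when every required check is reported by exactly one job."""
    run_checks = check_names(run_names)
    skip_checks = check_names(skip_names)
    return (
        run_checks.isdisjoint(skip_checks)                        # never both
        and (run_checks | skip_checks) == check_names(all_names)  # never neither
    )
```

If either condition fails, a required check would be reported twice or not at all, and branch protection would block or mislead on that PR.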
e2e-nightly.yaml (new)
----------------------
Daily exhaustive matrix. Same selector with ``--mode nightly`` (path
filter and labels ignored). Adds:
- schedule: cron "0 7 * * *" (07:00 UTC; adjust to maintainer pref)
- workflow_dispatch for manual triggering
- per-row diagnostic artifact upload on failure (14-day retention)
- aggregate job-summary report on completion
Failure routing (auto-open issue / Slack) intentionally deferred to a
follow-up PR — keeping the first weeks of nightly silent while the
matrix is stabilising, to avoid spam.
Signed-off-by: Bruce Luo <brluobt@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Companion to #595. Adds the two follow-ups #595 deferred to issue #594 that are coupled with N>2 backend scaling:
- Path-aware PR matrix: `hack/e2e-select/main.py`, a small Python selector reading changed paths + PR labels + draft state, emits the GHA matrix as JSON. The PR workflow's `e2e` job consumes `.run`; `e2e-skip` consumes the complement (`.skip`) so branch-protection check names stay constant.
- Nightly workflow: `.github/workflows/e2e-nightly.yaml`, runs the exhaustive (suite × backend) matrix daily at 07:00 UTC via the same selector with `--mode nightly`.
The third follow-up listed on #594 (scheduler version coverage) is intentionally not in this PR. It is a version-dimension concern, orthogonal to backend count N, and the contract for it (which versions to test, capability-table version-awareness, runtime version discovery) is unspecified. It stays on #594 as a known gap.
Selection behavior (N=2 today)
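- `operator/internal/scheduler/kai/**` → 9 KAI rows run; 3 default-scheduler rows fall to e2e-skip
- `operator/internal/scheduler/kube/**` → 3 default-scheduler rows run; 9 KAI rows fall to e2e-skip
- `operator/internal/scheduler/types.go` (shared) → all 12 run
- `operator/api/**`, `operator/charts/**` → all backends + agnostic
- `operator/e2e/**`, `.github/**`, `hack/**` → all backends + agnostic
- `docs/**`, `*.md` only → all 12 fall to e2e-skip
- Draft PR without `run-e2e` label → all 12 fall to e2e-skip
- Any PR with `run-e2e` label → all 12 run
Adding Volcano / Koordinator in a future PR is a one-line addition to `ALL_ROWS` (and, if they get their own `scheduler/<name>/` subdir, one new path rule). No workflow YAML edits are needed for further backends.
Which issue(s) this PR fixes: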
Refs #594
(Does not resolve #594. The third follow-up — version coverage — remains tracked on the issue.)
Special notes for your reviewer:
Stacked on #595. This PR includes #595's commits as its foundation; please focus review on the two new commits in this PR:
- `361a179`: feat(e2e): add path-aware test matrix selector (`hack/e2e-select/`: selector, 22 unit tests, 7 golden testdata samples)
- `125b674`: feat(e2e): wire PR matrix to selector + add nightly workflow (`build-check-test.yaml` rewrite + new `e2e-nightly.yaml`)
The GitHub PR base is `main` because the `e2e-multi-scheduler-backend` branch used to develop this work lives on a fork and cannot be selected as an upstream base. Once #595 lands, this PR will rebase onto `main` cleanly; if #595 is squash-merged, the rebase will be `git rebase --onto main <old-base>` followed by a force-push.
Why Python (not shell, not Go). ~150-line CI util; shell hits an early quoting wall (JSON, label parsing, fnmatch). `operator/hack/infra_manager/orchestrator.py` is the existing Python precedent in grove. Go would mean a new module just for CI, or coupling the selector into the operator package (worse).
Why a selector instead of multiple jobs. GHA matrix entries cannot have a per-row `if:`. Splitting into `e2e-kai` / `e2e-kube` / ... per-backend jobs duplicates the entire step list per backend and requires a parallel e2e-skip mirror per backend. With a dynamic matrix via `fromJSON`, one e2e job + one e2e-skip job stays intact regardless of N.
Out of scope for this PR (tracked on #594):
Merge strategy
Please prefer rebase-merge over squash so the commit boundary between #595's foundation and the two follow-up commits in this PR is preserved.
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: