feat(e2e): path-aware PR matrix selector + nightly workflow (draft) #600

Draft
brluobt wants to merge 4 commits into ai-dynamo:main from brluobt:e2e-path-filter-nightly


@brluobt brluobt commented May 11, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Companion to #595. Adds the two follow-ups #595 deferred to issue #594 that are coupled with N>2 backend scaling:

  1. Path-aware PR matrix: hack/e2e-select/main.py, a small Python selector that reads changed paths, PR labels, and draft state and emits the GHA matrix as JSON. The PR workflow's e2e job consumes .run; e2e-skip consumes the complement (.skip) so branch-protection check names stay constant.

  2. Nightly workflow: .github/workflows/e2e-nightly.yaml, which runs the exhaustive (suite × backend) matrix daily at 07:00 UTC via the same selector with --mode nightly.

The third follow-up listed on #594 (scheduler version coverage) is intentionally not in this PR — it is a version-dimension concern, orthogonal to backend count N, and the contract for it (which versions to test, capability-table version-awareness, runtime version discovery) is unspecified. It stays on #594 as a known gap.

Selection behavior (N=2 today)

| Changed path | PR rows that run | Rows in e2e-skip mirror |
| --- | --- | --- |
| operator/internal/scheduler/kai/** | 9 KAI rows | 3 default-scheduler rows |
| operator/internal/scheduler/kube/** | 3 default-scheduler rows | 9 KAI rows |
| operator/internal/scheduler/types.go (shared) | all 12 | none |
| operator/api/**, operator/charts/** | all 12 | none |
| operator/e2e/**, .github/**, hack/** | all 12 | none |
| docs/**, *.md only | none | all 12 |
| any path + run-e2e label | all 12 (safety escape) | none |
| draft PR + no run-e2e label | none | all 12 |

Adding Volcano / Koordinator in a future PR is a one-line addition to ALL_ROWS (and, if they get their own scheduler/<name>/ subdir, one new path rule). No workflow YAML edits needed for further backends.
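
As a rough illustration of why a new backend is a table edit rather than a workflow edit, here is a minimal sketch of a row table. The `Row` dataclass, its field layout, the KAI row name, and the `volcano` row are assumptions made for illustration; only `ALL_ROWS`, `create_flags`, the three default-scheduler row names, and the preset path come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    test_name: str          # surfaces as the "E2E - <test_name>" check
    backend: str            # scheduler backend this row exercises
    create_flags: str = ""  # extra infra-manager flags, e.g. a preset file


DEFAULT_FLAGS = "-f hack/e2e-default-scheduler.yaml"

# Illustrative subset only; the real table has 12 rows today.
ALL_ROWS = [
    Row("rolling_updates", "kai-scheduler"),  # hypothetical KAI row name
    Row("rolling_updates_default-scheduler", "default-scheduler", DEFAULT_FLAGS),
    Row("ondelete_updates_default-scheduler", "default-scheduler", DEFAULT_FLAGS),
    Row("startup_ordering_default-scheduler", "default-scheduler", DEFAULT_FLAGS),
]


def rows_for(backends, rows=ALL_ROWS):
    """Return the rows whose backend is in the affected set."""
    return [r for r in rows if r.backend in backends]


# Hypothetical future backend: one appended entry, nothing else changes.
EXTENDED = ALL_ROWS + [Row("rolling_updates_volcano", "volcano")]
```

Because the workflow only sees the JSON derived from this table, the appended row would reach both the e2e and e2e-skip jobs without any YAML change.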

Which issue(s) this PR fixes:

Refs #594

(Does not resolve #594. The third follow-up — version coverage — remains tracked on the issue.)

Special notes for your reviewer:

Stacked on #595. This PR includes #595's commits as its foundation; please focus review on the two new commits in this PR:

  • 361a179 feat(e2e): add path-aware test matrix selector (hack/e2e-select/: selector, 22 unit tests, 7 golden testdata samples)
  • 125b674 feat(e2e): wire PR matrix to selector + add nightly workflow (build-check-test.yaml rewrite + new e2e-nightly.yaml)

The GitHub PR base is main because the e2e-multi-scheduler-backend branch used to develop this work lives on a fork and cannot be selected as an upstream base. Once #595 lands, this PR will rebase onto main cleanly; if #595 is squash-merged, the rebase will be git rebase --onto main <old-base> followed by a force-push.

Why Python (not shell, not Go). ~150-line CI util; shell hits an early quoting wall (JSON, label parsing, fnmatch). operator/hack/infra_manager/orchestrator.py is the existing Python precedent in grove. Go would mean a new module just for CI, or coupling the selector into the operator package (worse).

Why a selector instead of multiple jobs. GHA matrix entries cannot have per-row if:. Splitting into e2e-kai / e2e-kube / ... per-backend jobs duplicates the entire step list per backend and requires a parallel e2e-skip mirror per backend. With dynamic matrix via fromJSON, one e2e job + one e2e-skip job stays intact regardless of N.

Out of scope for this PR (tracked on #594):

  • Auto-open / Slack-notify on nightly failure. This first cut uploads per-row diagnostic artifacts and writes a job-summary report only — keeps the first weeks of nightlies silent while the matrix is stabilising.
  • Scheduler version coverage. See note above.

Merge strategy

Please prefer rebase-merge over squash so the commit boundary between #595's foundation and the two follow-up commits in this PR is preserved.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

kangclzjc and others added 4 commits May 9, 2026 00:08
Restructures the E2E suite from KAI-only to N-backend ready, landing
KAI (primary) and default-scheduler. KAI rows are functionally
unchanged; default-scheduler rows light up new sensitive coverage
(RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai.
Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files);
the operator's PreparePod() assigns from defaultProfileName. The new
hack/e2e-default-scheduler.yaml infra-manager preset disables KAI
install/queues, and _run_prepull's KAI image list is now gated on
cfg.scheduler.kai.enabled so default-scheduler rows skip the unused
pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go
auto-skips capability-gated tests when the active backend does not
provide the required capability. DiscoverCapabilities reads the live
OperatorConfiguration via grove/config and joins it with a hardcoded
backend->interface table; a unit test cross-checks the table against
actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates,
ondelete_updates, startup_ordering) via a new create_flags field that
threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs
in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>
Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default`
   to `*_default-scheduler` (rolling_updates_default-scheduler,
   ondelete_updates_default-scheduler, startup_ordering_default-scheduler)
   in both the e2e and e2e-skip matrices. The longer suffix matches the
   actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler")
   and removes ambiguity in the GitHub Actions UI when scanning a long
   matrix at a glance.

2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a
   non-existent design proposal with a concrete activation pointer
   (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>
Adds hack/e2e-select/, a Python selector that computes the e2e matrix
based on:

  - --mode pr|nightly
  - --changed-files: list of changed paths (one per line; '-' for stdin)
  - --labels: comma-separated PR labels
  - --draft: PR draft state
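
For orientation, the four inputs above could be parsed roughly as follows. This is a sketch, not the selector's actual code: the helper names are hypothetical, and treating --draft as a true/false string (rather than a bare flag) is an assumption.

```python
import argparse
import sys


def build_parser():
    p = argparse.ArgumentParser(prog="e2e-select")
    p.add_argument("--mode", choices=["pr", "nightly"], required=True)
    # Path to a file with one changed path per line, or "-" for stdin.
    p.add_argument("--changed-files", default=None)
    p.add_argument("--labels", default="")
    # Assumption: draft state arrives as the string "true" or "false".
    p.add_argument("--draft", choices=["true", "false"], default="false")
    return p


def read_changed_files(arg, stdin=None):
    """Return the non-empty lines of the file named by arg ("-" = stdin)."""
    if arg is None:
        return []
    if arg == "-":
        text = (stdin or sys.stdin).read()
    else:
        with open(arg, encoding="utf-8") as fh:
            text = fh.read()
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_labels(raw):
    """Split a comma-separated label string, dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```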

It emits a JSON object with both ``run`` (the matrix that should execute)
and ``skip`` (the complement; the e2e-skip mirror emits these as
synthetic passes so branch-protection check names stay stable). The two
sets are disjoint and their union is the full matrix.
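
The disjoint-and-complete property stated above can be sketched as a partition over the full row list. Rows are reduced to plain name strings here for brevity, and both function names are hypothetical.

```python
def partition(all_rows, selected):
    """Split all_rows into run/skip, preserving declaration order."""
    run = [r for r in all_rows if r in selected]
    skip = [r for r in all_rows if r not in selected]
    return {"run": run, "skip": skip}


def holds_invariants(all_rows, result):
    """True when run and skip are disjoint and together cover every row."""
    run, skip = set(result["run"]), set(result["skip"])
    return run.isdisjoint(skip) and (run | skip) == set(all_rows)
```

Deriving skip from the same list as run, instead of computing it separately, is what keeps the two sets from drifting apart.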

Selection logic:

  - nightly                          : run = ALL_ROWS
  - pr + run-e2e label               : run = ALL_ROWS (safety escape)
  - pr + draft + no run-e2e label    : run = []  (draft policy)
  - pr otherwise                     : path filter → affected backends
                                        → matching rows
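
The four cases above form a short decision ladder. The sketch below assumes the path filter has already produced its candidate rows; the function name and signature are hypothetical.

```python
def select_run(mode, labels, draft, all_rows, path_filtered_rows):
    """Return the rows that should really execute.

    path_filtered_rows is whatever the path rules selected; it is only
    consulted for a non-draft PR without the run-e2e label.
    """
    if mode == "nightly":
        return list(all_rows)            # exhaustive, inputs ignored
    if mode != "pr":
        raise ValueError(f"unknown mode: {mode!r}")
    if "run-e2e" in labels:
        return list(all_rows)            # safety escape beats draft state
    if draft:
        return []                        # draft policy
    wanted = set(path_filtered_rows)
    return [r for r in all_rows if r in wanted]
```

Checking the label before the draft flag matters: a draft PR carrying run-e2e still runs everything.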

Path rules (declaration-order, first match per file):

  - docs/**, *.md, ...               : no e2e
  - operator/internal/scheduler/kai/** : kai-scheduler rows only
  - operator/internal/scheduler/kube/**: default-scheduler rows only
  - operator/internal/scheduler/**     : shared framework → all backends
  - operator/api/**, operator/charts/**: all backends + agnostic
  - operator/e2e/**, .github/**, hack/**: all backends + agnostic
  - operator/**                        : fallback → all backends
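
A minimal sketch of first-match path rules, assuming fnmatch-style matching (in Python's fnmatch, `*` also matches `/`, so `**` behaves like `*`). The rule table mirrors the list above but ignores the "agnostic" row distinction, and returning no backends for a path that matches no rule is an assumption of this sketch.

```python
from fnmatch import fnmatchcase

KAI = "kai-scheduler"
KUBE = "default-scheduler"
ALL = frozenset({KAI, KUBE})
NONE = frozenset()

# Declaration order matters: the first matching pattern wins per file.
RULES = [
    ("docs/**", NONE),
    ("*.md", NONE),
    ("operator/internal/scheduler/kai/**", frozenset({KAI})),
    ("operator/internal/scheduler/kube/**", frozenset({KUBE})),
    ("operator/internal/scheduler/**", ALL),
    ("operator/api/**", ALL),
    ("operator/charts/**", ALL),
    ("operator/e2e/**", ALL),
    (".github/**", ALL),
    ("hack/**", ALL),
    ("operator/**", ALL),
]


def backends_for_file(path):
    for pattern, backends in RULES:
        if fnmatchcase(path, pattern):
            return backends
    return NONE  # assumption: unmatched paths trigger nothing


def affected_backends(paths):
    """Union of the per-file results across the whole change set."""
    out = set()
    for path in paths:
        out |= backends_for_file(path)
    return out
```

Placing the backend-specific rules above the generic scheduler/** rule is what lets a KAI-only change avoid waking the default-scheduler rows.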

Adding a new backend: append rows to ALL_ROWS in main.py and (if the
backend has a scheduler/<name>/ subdir) a path rule above the generic
scheduler/** shared rule. testdata/ samples cover the existing cases
(kai-only, kube-only, shared, docs-only, run-e2e label, draft, nightly).
The unittest suite cross-checks generated output against the committed
golden samples (E2E_SELECT_REGENERATE=1 to update).
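
The golden-sample check with a regenerate switch could look roughly like this. Only the E2E_SELECT_REGENERATE variable name comes from this commit; the helper names, the JSON formatting, and passing the environment as a parameter are assumptions for the sketch.

```python
import json
import os
from pathlib import Path


def render(selection):
    """Stable text form so golden files diff cleanly."""
    return json.dumps(selection, indent=2, sort_keys=True) + "\n"


def matches_golden(selection, golden_path, env=os.environ):
    """Compare against the golden file, or rewrite it when regenerating."""
    text = render(selection)
    path = Path(golden_path)
    if env.get("E2E_SELECT_REGENERATE") == "1":
        path.write_text(text, encoding="utf-8")
        return True
    return path.read_text(encoding="utf-8") == text
```

A unittest case would then wrap the call in `self.assertTrue(matches_golden(...))` for each testdata sample.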

Run from repo root:

    python3 hack/e2e-select/tests/test_selector.py

Signed-off-by: Bruce Luo <brluobt@gmail.com>
Builds on the selector landed in the previous commit.

build-check-test.yaml
---------------------
Replaces the static ``changes`` job and hardcoded matrix with an
``e2e-select`` job that runs the Python selector and emits matrix JSON
plus per-row run/skip bookkeeping. The ``e2e`` job consumes
``fromJSON(needs.e2e-select.outputs.run)`` as its matrix; ``e2e-skip``
consumes ``.skip``. Draft policy (a non-labelled draft PR runs no real
e2e) is enforced inside the selector, so the workflow ``if:`` conditions
collapse to ``has_run == 'true'`` / ``has_skip == 'true'``.
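
To make the bookkeeping concrete, here is a sketch of how a selector step might publish its results as job outputs. Writing `name=value` lines to the file named by GITHUB_OUTPUT is the standard GitHub Actions mechanism; the `{"include": [...]}` matrix shape and the helper names are assumptions, while the output names run, skip, has_run, and has_skip follow the description above.

```python
import json


def render_outputs(run_rows, skip_rows):
    """Build the name=value lines for the job outputs (one per line)."""
    lines = [
        "run=" + json.dumps({"include": run_rows}),
        "skip=" + json.dumps({"include": skip_rows}),
        # Guards for the downstream jobs' if: conditions.
        "has_run=" + ("true" if run_rows else "false"),
        "has_skip=" + ("true" if skip_rows else "false"),
    ]
    return "\n".join(lines) + "\n"


def write_outputs(output_path, run_rows, skip_rows):
    """Append to the outputs file, e.g. os.environ["GITHUB_OUTPUT"]."""
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(render_outputs(run_rows, skip_rows))
```

The boolean flags let each downstream job be skipped outright when its half of the partition is empty, instead of being handed an empty matrix.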

Effect for N=2 backends today:

  - Touching operator/internal/scheduler/kai/** → 9 KAI rows run,
    3 default-scheduler rows fall to e2e-skip.
  - Touching operator/internal/scheduler/kube/** → reverse.
  - Touching operator/internal/scheduler/types.go (shared) → all 12 run.
  - Docs-only PR → all 12 fall to e2e-skip.
  - Draft PR (no run-e2e label) → all 12 fall to e2e-skip.
  - Any PR with run-e2e label → all 12 run (safety escape).

Required branch-protection check names (``E2E - <test_name>``) stay
constant: every PR resolves all 12 names via either e2e or e2e-skip.

e2e-nightly.yaml (new)
----------------------
Daily exhaustive matrix. Same selector with ``--mode nightly`` (path
filter and labels ignored). Adds:

  - schedule: cron "0 7 * * *" (07:00 UTC; adjust to maintainer pref)
  - workflow_dispatch for manual triggering
  - per-row diagnostic artifact upload on failure (14-day retention)
  - aggregate job-summary report on completion
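
An aggregate report of this kind is typically Markdown appended to the file named by GITHUB_STEP_SUMMARY. The sketch below assumes per-row results are available as a name-to-status mapping; the function names and report layout are hypothetical, not taken from the workflow.

```python
def render_summary(results):
    """Render a Markdown report from {row_name: "success" | "failure"}."""
    failed = [name for name, status in results.items() if status != "success"]
    lines = [
        "## E2E nightly",
        "",
        f"{len(results) - len(failed)} of {len(results)} rows passed.",
        "",
        "| Row | Result |",
        "| --- | --- |",
    ]
    lines += [f"| {name} | {results[name]} |" for name in sorted(results)]
    return "\n".join(lines) + "\n"


def append_summary(summary_path, results):
    """Append the report, e.g. to os.environ["GITHUB_STEP_SUMMARY"]."""
    with open(summary_path, "a", encoding="utf-8") as fh:
        fh.write(render_summary(results))
```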

Failure routing (auto-open issue / Slack) intentionally deferred to a
follow-up PR — keeping the first weeks of nightly silent while the
matrix is stabilising, to avoid spam.

Signed-off-by: Bruce Luo <brluobt@gmail.com>

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
