
docs(grep): add GREP-531 Workload API gang scheduling for default-scheduler #1

Closed

yankay wants to merge 1 commit into main from grep/531-kube-scheduler-workload-gang

docs(grep): add GREP-531 Workload API gang scheduling for default-scheduler#1
yankay wants to merge 1 commit into
mainfrom
grep/531-kube-scheduler-workload-gang

Conversation

@yankay (Owner) commented May 12, 2026

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Adds GREP-531: Workload API Gang Scheduling for default-scheduler Backend under docs/proposals/531-kube-scheduler-workload-gang/.

The proposal lands upstream scheduling.k8s.io Workload / PodGroup gang admission inside Grove's existing default-scheduler backend (introduced in GREP-375). Highlights:

  • Opt-in only via the existing KubeSchedulerConfig.GangScheduling field. No new backend, no framework change, no new user-facing API.
  • API version strategy: target v1alpha2 semantics (KEP-5832, aligning with LWS KEP-666); v1alpha1 only as a temporary fallback.
  • Forward-compatible with hierarchical gang (KEP-6012 CompositePodGroup) — Grove's PCS → PCSG → role hierarchy maps onto a native upstream hierarchical gang tree, the primary differentiator from LWS.
  • Escape hatch for external owners (Kueue, cross-PCS gang controllers) to take over Workload lifecycle without new API surface.
  • API discovery via cached RESTMapper so a missing upstream API yields a clear admission error instead of silent degradation (see the sketch after this list).
  • Satisfies the Beta criterion already recorded in GREP-375 ("Kube backend support for advanced community features (e.g. Workload API for gang scheduling) as they become available in Kubernetes.").
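
To illustrate the discovery bullet above, a minimal Go sketch, assuming a cached RESTMapper (e.g. from controller-runtime); `workloadAPIAvailable` and the error wording are illustrative, not from the proposal, and the group/kind targets the upstream alpha API:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// workloadAPIAvailable asks the (cached) RESTMapper whether the upstream
// Workload API is served, so gang admission can fail with an actionable
// error instead of degrading silently.
func workloadAPIAvailable(mapper meta.RESTMapper) error {
	gk := schema.GroupKind{Group: "scheduling.k8s.io", Kind: "Workload"}
	if _, err := mapper.RESTMapping(gk); err != nil {
		if meta.IsNoMatchError(err) {
			return fmt.Errorf("gang scheduling requested but %s is not served; enable the upstream feature gates", gk)
		}
		return err // transient discovery failure; callers may retry
	}
	return nil
}
```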

This is a docs/-only change (one new file). Implementation will follow in a separate PR; existing draft #532 predates this GREP and will be revisited against the design here.

Which issue(s) this PR fixes:

Refs ai-dynamo#531
Umbrella: ai-dynamo#395

Special notes for your reviewer:

Does this PR introduce an API change?

```release-note
NONE
```

Additional documentation e.g., enhancement proposals, usage docs, etc.:

```docs
docs/proposals/531-kube-scheduler-workload-gang/README.md
```

@yankay closed this May 12, 2026
@yankay reopened this May 12, 2026

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62ed9d1a79


Comment on lines +46 to +47
workloads. Upstream `scheduling.k8s.io` (alpha in 1.34+) closes that
gap, and Grove's `PCS → PCSG → role` hierarchy is a natural consumer

P2: Correct the Kubernetes version prerequisite

This states the upstream Workload/Gang Scheduling APIs are available in Kubernetes 1.34+, but the current Kubernetes docs list both the Workload API and Gang Scheduling as Kubernetes v1.35 [alpha]. Since Grove's local e2e defaults are still 1.34.x, this would send implementers or users toward a cluster version where the API cannot be enabled or discovered.


Comment on lines +229 to +230
The upstream `WorkloadScheduling` / `GenericWorkload` feature gate on
`kube-apiserver` and `kube-scheduler` is an install-time prerequisite

P2: Require the actual gang scheduling feature gates

For clusters following these prerequisites, WorkloadScheduling is not a Kubernetes feature gate, and enabling only GenericWorkload is not enough for all-or-nothing admission. The official feature gate reference lists GenericWorkload and GangScheduling, and the gang scheduling docs describe the GangScheduling scheduler plugin as the component that enforces the policy, so the user docs/implementation plan should name GangScheduling instead of WorkloadScheduling and include it for the scheduler.
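
Concretely, the corrected install-time prerequisite would look like the following (gate names taken from this comment; availability depends on the cluster version):

```
kube-apiserver:  --feature-gates=GenericWorkload=true
kube-scheduler:  --feature-gates=GenericWorkload=true,GangScheduling=true
```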



@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for multiple ClusterTopology resources and refines the topology-aware scheduling model. Key changes include moving ClusterTopology validation to a new validating webhook, implementing a dedicated reconciler for topology drift detection, and refactoring MNNVL group resolution to follow a hierarchical pattern (clique -> scaling group -> set). The deletion logic for PodCliqueSet and its children has been optimized to leverage Kubernetes garbage collection, and performance improvements were made to the controller's lookup logic using sets and maps. The feedback below suggests improving context propagation in validation handlers and flags a potential performance bottleneck in the PodCliqueSet watch registration logic.

Individual review comments could not be created; the feedback is inlined below.

operator/internal/webhook/admission/pcs/validation/podcliqueset.go (694)

medium

Update this method to accept a context.Context parameter. This allows the internal call to validateTopologyConstraintsOnCreate to use the actual request context instead of context.Background().

```go
func (v *pcsValidator) validateTopologyConstraintsUpdate(ctx context.Context, oldPCS *grovecorev1alpha1.PodCliqueSet) field.ErrorList {
```

operator/internal/webhook/admission/pcs/validation/podcliqueset.go (706)

medium

Pass the received ctx instead of context.Background() to ensure consistent context propagation during the legacy repair validation path.

```go
return v.validateTopologyConstraintsOnCreate(ctx)
```
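
Taken together, the two suggestions thread the request context through the update path. A minimal sketch, assuming the signatures quoted above; `legacyRepairNeeded` is a hypothetical stand-in for the surrounding guard logic:

```go
func (v *pcsValidator) validateTopologyConstraintsUpdate(ctx context.Context, oldPCS *grovecorev1alpha1.PodCliqueSet) field.ErrorList {
	// Legacy repair path: re-run the create-time validation with the
	// request ctx so deadlines and cancellation propagate, rather than
	// being dropped via context.Background().
	if legacyRepairNeeded(oldPCS) { // hypothetical guard
		return v.validateTopologyConstraintsOnCreate(ctx)
	}
	return nil
}
```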

operator/internal/controller/podcliqueset/register.go (105)

medium

This List operation fetches all PodCliqueSet resources in the cluster whenever any ClusterTopology is updated. While ClusterTopology changes are infrequent, this could become a performance bottleneck in clusters with a very large number of PodCliqueSet objects. Consider using a field selector or an index if the number of workloads scales significantly.
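
A controller-runtime field index would avoid the cluster-wide List. A sketch under the assumption that PodCliqueSet carries a hypothetical `spec.clusterTopologyName` reference:

```go
// At manager setup: index PodCliqueSets by the ClusterTopology they reference.
if err := mgr.GetFieldIndexer().IndexField(ctx, &grovecorev1alpha1.PodCliqueSet{},
	"spec.clusterTopologyName", func(obj client.Object) []string {
		pcs := obj.(*grovecorev1alpha1.PodCliqueSet)
		return []string{pcs.Spec.ClusterTopologyName} // hypothetical field
	}); err != nil {
	return err
}

// In the watch handler: list only the PodCliqueSets that reference the
// changed ClusterTopology instead of every PodCliqueSet in the cluster.
var affected grovecorev1alpha1.PodCliqueSetList
if err := c.List(ctx, &affected,
	client.MatchingFields{"spec.clusterTopologyName": topology.Name}); err != nil {
	return err
}
```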

…eduler backend

Introduces GREP-531: Workload API Gang Scheduling for the default-scheduler
backend. Key design decisions:

- Phase 1 (Kubernetes 1.35, v1alpha1): flat Workload/PodGroup mapping,
  Pod.Spec.WorkloadRef membership, immutable gang shape with recreate-workload
  semantics.
- Phase 2 (Kubernetes 1.36+, conditional on KEP-5832): decoupled standalone
  PodGroup, Pod.Spec.SchedulingGroup membership.
- Phase 3 (Kubernetes 1.37+): CompositePodGroup hierarchical gang, TAS.

Also renames the directory from 531-kube-scheduler-workload-gang to
531-default-scheduler-workload-gang to match the backend name.

Relates to ai-dynamo#531, ai-dynamo#395

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
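
To make the Phase 1 flat mapping concrete, a sketch using unstructured objects (field names such as `podGroups` and `minCount` paraphrase the upstream alpha API and should be verified against the `scheduling.k8s.io` version actually served; `flatWorkload` is an illustrative name):

```go
package workload

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// flatWorkload builds a Phase 1 style Workload: a single gang-scoped
// PodGroup covering the whole PodCliqueSet. Unstructured avoids pinning
// to typed clients while the upstream API is still alpha.
func flatWorkload(pcsName, namespace string, gangSize int64) (*unstructured.Unstructured, error) {
	w := &unstructured.Unstructured{}
	w.SetAPIVersion("scheduling.k8s.io/v1alpha1")
	w.SetKind("Workload")
	w.SetNamespace(namespace)
	w.SetName(pcsName)
	err := unstructured.SetNestedSlice(w.Object, []interface{}{
		map[string]interface{}{
			"name": "gang", // one flat group in Phase 1
			"policy": map[string]interface{}{
				// all-or-nothing: schedulable only if minCount pods fit together
				"gang": map[string]interface{}{"minCount": gangSize},
			},
		},
	}, "spec", "podGroups")
	return w, err
}
```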
@yankay force-pushed the grep/531-kube-scheduler-workload-gang branch from 4f33495 to 5a8035a on May 12, 2026 08:54
@yankay closed this May 12, 2026