feat: Merge Decouple PodGroup KEP (#5832) into Gang Scheduling (#4671) #5980
helayoty wants to merge 1 commit into kubernetes:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: helayoty. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/assign @wojtek-t
| owning-sig: sig-scheduling | ||
| participating-sigs: | ||
| - sig-apps | ||
| - sig-api-machinery |
This was feedback during the PodGroup KEP review, since it's a new API.
No - api-machinery does not own APIs.
api-machinery owns the machinery for exposing APIs, which we don't influence or change with this KEP.
| see-also: | ||
| - "/keps/sig-scheduling/583-coscheduling" | ||
| - "/keps/sig-scheduling/5832-decouple-podgroup-api" |
Let's not replace, but add a new entry.
Sorry, but aren't we going to remove the 5832-decouple-podgroup-api?
Good question - I thought we should mark it as abandoned?
But maybe we should indeed remove it.
I'm all for removing (or archiving) it, and moving it to the `replaces` section, similar to coscheduling.
| The longer version of this design describing the whole thought process of choosing the | ||
| above described approach can be found in the [extended proposal] document. | ||
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? |
I just moved it down with other references.
| - `Workload` represents long-lived configuration-intent, whereas `PodGroups` represent transient units of scheduling. | ||
| Tying runtime execution units to the persistent definition object violates separation of concerns. | ||
| - Lifecycle coupling prevents standalone `PodGroup` objects from owning other resources (e.g., ResourceClaims) |
This sentence is a bit confusing. Maybe:
Decoupling the lifecycles allows standalone PodGroup objects to own resources (e.g., ResourceClaims). This enables garbage collection to be scoped to specific scheduling units, rather than tying it to the entire Workload or individual Pods."
?
The text was copied from the original (approved) KEP so I avoided commenting on things that were simply copied.
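The suggested wording above leans on standard ownerReferences-based garbage collection. As a hedged sketch of that mechanism (simplified stand-in structs, not the real `metav1.OwnerReference` types or the actual GC controller), a dependent such as a `PodGroup` becomes eligible for deletion once all of its listed owners are gone:

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
	UID  string
}

// Object is a simplified stand-in for any owned API object (PodGroup, ResourceClaim, ...).
type Object struct {
	Kind   string
	Name   string
	Owners []OwnerReference
}

// orphaned reports whether every owner listed on obj is absent from the set of
// live UIDs - loosely mimicking how the garbage collector decides a dependent
// can be cleaned up, scoped to the specific owning object rather than the Pods.
func orphaned(obj Object, live map[string]bool) bool {
	for _, o := range obj.Owners {
		if live[o.UID] {
			return false // at least one owner still exists
		}
	}
	return len(obj.Owners) > 0 // objects without owners are never GC'd this way
}

func main() {
	live := map[string]bool{"workload-uid-1": true}
	pg := Object{Kind: "PodGroup", Name: "pg-0",
		Owners: []OwnerReference{{Kind: "Workload", Name: "w", UID: "workload-uid-1"}}}
	fmt.Println(orphaned(pg, live)) // false: the owning Workload still exists
	delete(live, "workload-uid-1")
	fmt.Println(orphaned(pg, live)) // true: eligible for garbage collection
}
```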
| - `Workload` becomes a scheduling policy object that defines scheduling constraints and requirements. | ||
| - `PodGroupTemplate` provides the blueprint for runtime `PodGroup` creation. | ||
| - `PodGroup` is a controller-owned runtime object with its own lifecycle that represents a single scheduling unit. |
It doesn't have to be controller owned. So maybe: PodGroup is a standalone runtime object with its own lifecycle - typically managed by a controller - that represents a single scheduling unit. ?
| - Introduce a concept of a `PodGroup` positioned as runtime counterparts for the Workload | ||
| - Ensure that decoupled model of `Workload` and `PodGroup` provide clear responsibility split, improved scalability and simplified lifecycle management | ||
| - Enhance status ownership by making `PodGroup` status track podGroup-level runtime state | ||
| - Ensure proper ownership of `PodGroup` objects via controller `ownerReferences` |
Is this really a goal? It looks like a means to achieve something else. What about saying here:
Enable automatic lifecycle management and resource cleanup for PodGroup objects through integration with Kubernetes garbage collection.
?
We own that - let's update it.
| - Ensure proper ownership of `PodGroup` objects via controller `ownerReferences` | ||
| - Ensuring that we can extend `Workload` API in backward compatible way toward north-star API | ||
| - Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs | ||
| - Simplify integration with `Workload` API and true workload[^6] controllers to make `Workload` API |
I'm not sure if 'Simplify integration' is the right goal here. Since this KEP is introducing the Workload API, there isn't an existing integration to simplify yet. This feels like a bit of an 'inception'.
The previous version ('Ensuring that Workload API will be usable...') seemed more accurate, as it describes a key property of the new API (its universality). If we want to emphasize ease of use, maybe something like: 'Ensure the Workload & PodGroup API provides a consistent and accessible integration path for both built-in and third-party controllers.'?
helayoty force-pushed from 06d9e7a to 8e14592
Signed-off-by: helayoty <heelayot@microsoft.com>
helayoty force-pushed from 8e14592 to 2ad2005
| // - Status=False: Scheduling failed (i.e., timeout, unschedulable, etc.). | ||
| // | ||
| // Known reasons for PodGroupScheduled condition: | ||
| // - "Scheduled": All required pods have been successfully scheduled. |
Given you touch it already, let's unify with what was actually implemented:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/scheduling/v1alpha2/types.go#L512-L522
| // Known condition types: | ||
| // - "PodGroupScheduled": Indicates whether the scheduling requirement has been satisfied. | ||
| // - "DisruptionTarget": Indicates whether the PodGroup is about to be terminated | ||
| // due to disruption such as preemption. |
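The condition semantics quoted above can be illustrated with plain Go structs. This is a minimal, self-contained sketch - the real types live in `k8s.io/api/scheduling/v1alpha2` and use `metav1.Condition` - showing how a consumer would check the documented condition types:

```go
package main

import "fmt"

// Condition mirrors the shape of metav1.Condition, for illustration only.
type Condition struct {
	Type   string // e.g. "PodGroupScheduled", "DisruptionTarget"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "Scheduled"
}

// isConditionTrue reports whether a condition of the given type is present
// and True, analogous to meta.IsStatusConditionTrue in apimachinery.
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false // absent conditions are treated as not-True
}

func main() {
	conds := []Condition{
		{Type: "PodGroupScheduled", Status: "True", Reason: "Scheduled"},
	}
	fmt.Println(isConditionTrue(conds, "PodGroupScheduled")) // true
	fmt.Println(isConditionTrue(conds, "DisruptionTarget"))  // false
}
```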
/assign @macsko |
| [^7]: DNS subdomain is a naming convention defined in [RFC 1123](https://tools.ietf.org/html/rfc1123) that Kubernetes uses for most resource names. | ||
| lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit? |
What is this? (lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit?)
| [#5501]: https://github.com/kubernetes/enhancements/pull/5501 | ||
| [#5501]: https://github.com/kubernetes/enhancements/pull/5501 | ||
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? | ||
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? | ||
| [KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871 | ||
| [KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871 | ||
| [kubernetes.io]: https://kubernetes.io/ | ||
| [kubernetes.io]: https://kubernetes.io/ | ||
| [kubernetes/enhancements]: https://git.k8s.io/enhancements | ||
| [kubernetes/enhancements]: https://git.k8s.io/enhancements |
Deduplicate:
| [#5501]: https://github.com/kubernetes/enhancements/pull/5501 | |
| [#5501]: https://github.com/kubernetes/enhancements/pull/5501 | |
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? | |
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? | |
| [KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871 | |
| [KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871 | |
| [kubernetes.io]: https://kubernetes.io/ | |
| [kubernetes.io]: https://kubernetes.io/ | |
| [kubernetes/enhancements]: https://git.k8s.io/enhancements | |
| [kubernetes/enhancements]: https://git.k8s.io/enhancements | |
| [#5501]: https://github.com/kubernetes/enhancements/pull/5501 | |
| [extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit? | |
| [KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871 | |
| [kubernetes.io]: https://kubernetes.io/ | |
| [kubernetes/enhancements]: https://git.k8s.io/enhancements |
| #### Story 5: Controller Scalability | ||
| As a workload controller author, I want `PodGroup` status to be stored in a separate object, so that per-replica scheduling updates do not require read-modify-write operations on a large, shared `Workload` object, which would otherwise create scalability and contention issues at scale. |
nit, but helpful for the future:
Could you split the long lines of text into multiple smaller ones? It simplifies review and further modifications.
| **- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created. | ||
| **- Update PodGroup Status**: The kube-scheduler sets `PodGroupScheduled=True` after the group passed the Permit phase. If `PodGroup` is unschedulable, the scheduler sets `PodGroupScheduled=False` whenever a gang is conditionally accepted (waiting for preemption) or rejected. |
Why is this under GangScheduling plugin hooks? It's implemented by the scheduling cycle
| **- WaitOnPermit**: used as a barrier to wait for the pods to be assigned to the nodes before | ||
| initiating potential preemptions and their bindings. The extension waits for all pods in the `PodGroup` to reach permit stage by using each pod's `schedulingGroup.podGroupName` to identify the `PodGroup` that the pod belongs to. | ||
| **- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created. |
And that an unscheduled pod is added
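The registration being discussed (a `PodGroup` object being created, plus an unscheduled Pod being added) might look roughly like the following. The types here are simplified stand-ins for the scheduler framework's `ClusterEventWithHint` machinery, not the actual implementation:

```go
package main

import "fmt"

// ActionType is a simplified stand-in for the framework's event action bitmask.
type ActionType int

const (
	Add ActionType = 1 << iota
	Update
)

// ClusterEvent is a simplified stand-in for the framework's cluster event type.
type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

// eventsToRegister sketches which cluster events should re-enqueue pods
// waiting on gang scheduling: a new PodGroup object being created, and an
// unscheduled Pod being added (per the review comment above).
func eventsToRegister() []ClusterEvent {
	return []ClusterEvent{
		{Resource: "PodGroup", ActionType: Add},
		{Resource: "Pod", ActionType: Add}, // unscheduled pod joining the group
	}
}

func main() {
	for _, ev := range eventsToRegister() {
		fmt.Printf("%s action=%d\n", ev.Resource, ev.ActionType)
	}
}
```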
| [X] I/we understand the owners of the involved components may require updates to | ||
| existing tests to make this code solid enough prior to committing the changes necessary | ||
| to implement this enhancement. | ||
| We will add basic API tests for the new `Workload` and `PodGroup` APIs, that will later be | ||
| promoted to conformance. These tests will cover `PodGroup` creation, | ||
| validation, status updates, and lifecycle management. | ||
| More tests will be added for beta release. |
Do we really need more e2e tests for beta?
| @@ -1436,17 +1616,26 @@ Watching for workloads: | |||
| - estimated throughput: < XX/s | |||
| - originating component: kube-scheduler, kube-controller-manager (GC) | |||
kube-scheduler is no longer watching for Workloads:
| - originating component: kube-scheduler, kube-controller-manager (GC) | |
| - originating component: kube-controller-manager (GC) |
| - 2025-09: Initial KEP-4671 proposal. | ||
| - 2026-01-23: KEP-5832 created for PodGroup API alpha release. | ||
| - 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha | ||
| to finalize the architecture. | ||
| - 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release. | ||
| - 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP. |
I think we can keep the months only:
| - 2025-09: Initial KEP-4671 proposal. | |
| - 2026-01-23: KEP-5832 created for PodGroup API alpha release. | |
| - 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha | |
| to finalize the architecture. | |
| - 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release. | |
| - 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP. | |
| - 2025-09: Initial KEP-4671 proposal. | |
| - 2026-01: KEP-5832 created for PodGroup API alpha release. | |
| - 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha | |
| to finalize the architecture. | |
| - 2026-02: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release. | |
| - 2026-03: KEP-5832 merged into KEP-4671 as a single consolidated KEP. |
Consolidates KEP-5832 (Decouple PodGroup API) into KEP-4671 (Gang Scheduling) so that the decoupled PodGroup design lives in a single, self-contained KEP rather than being split across two documents.
KEP-5832 was created as a companion to KEP-4671 to detail the PodGroup decoupling design. Maintaining two separate KEPs for what is effectively one feature creates confusion for reviewers and implementers. Merging them produces a single authoritative document that is easier to review, approve, and track through the enhancement process.
/sig scheduling
/area workload-aware