[GREP] scheduler plugin - KAI scheduler #553
Signed-off-by: Daisy <daiguo@nvidia.com>
@daisy-ycguo Could you please create an issue and link it in this PR?
| ## Summary | ||
| This proposal adds a dedicated KAI scheduler backend to Grove's Scheduler Backend Framework so Grove can natively create, update, and delete KAI PodGroup resources for Grove PodGang workloads. This proposal is intentionally limited to PodGroup creation and management; it does not include topology-aware scheduling support or KAI Topology synchronization from Grove ClusterTopology. The change improves maintainability, clarifies ownership boundaries, and enables predictable KAI-specific lifecycle handling for PodGang workloads without relying on legacy external PodGroup management. |
Could you please check whether Grove has already migrated topology scheduling support into the repo, as in this PR?
| - Redesigning the Scheduler Backend Framework introduced by GREP-375. | ||
| - Introducing new user-facing scheduling APIs in PodCliqueSet or PodGang for this phase. | ||
| - Covering support for all third-party schedulers; this proposal only scopes KAI backend behavior. |
Yeah, this is what this GREP does.
| - Defining advanced KAI-only scheduling semantics beyond existing PodGang intent. | ||
| - Replacing or deprecating non-KAI backends. |
Same here: obviously this KAI backend shouldn't affect other backends.
| - Requiring PodGang status-only updates to trigger backend reconciliation. The current backend controller reacts to create, delete, and generation-changing updates. | ||
| - Creating, updating, or deleting KAI Topology resources from Grove `ClusterTopology`. | ||
| - Defining topology-aware scheduling behavior for KAI. That functionality is out of scope for this proposal and should be covered separately. |
If we have a KAI backend, then we should still support all KAI functionality, which means Grove still supports topology-aware scheduling, correct?
Topology-aware scheduling is out of scope for this GREP, but it is handled in a separate backend file, operator/internal/scheduler/kai/topology.go. I think topology-aware scheduling should still be supported, but that needs to be verified in tests.
| - All PodClique templates in a PodCliqueSet must resolve to the same scheduler backend. | ||
| - PodGang resources created by Grove carry the resolved scheduler name in the `grove.io/scheduler-name` label. | ||
| - The PodGang backend controller resolves the backend from that label and falls back to the default backend only if the label is absent or invalid. | ||
| - A PodCliqueSet that references an enabled non-default scheduler is admitted; a PodCliqueSet that references a non-enabled scheduler is rejected. |
The scheduler resolution already seems to be done in Grove now.
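The resolution and admission rules quoted in this thread can be sketched in Go roughly as follows. This is only an illustration of the stated behavior, not Grove's actual code: `backendRegistry`, `resolveBackend`, and `admitPodCliqueSet` are hypothetical names, "invalid" is assumed to mean "not an enabled backend", and an empty scheduler name in a template is assumed to resolve to the default backend. Only the label key `grove.io/scheduler-name` comes from the proposal text.

```go
package main

import "fmt"

// Label key taken from the proposal text.
const schedulerNameLabel = "grove.io/scheduler-name"

// backendRegistry is a hypothetical stand-in for the set of enabled
// scheduler backends plus the configured default backend.
type backendRegistry struct {
	defaultName string
	enabled     map[string]bool
}

// resolveBackend picks the backend for a PodGang from its labels.
// It falls back to the default backend when the label is absent or
// names a backend that is not enabled (our reading of "invalid").
func (r backendRegistry) resolveBackend(labels map[string]string) string {
	name, ok := labels[schedulerNameLabel]
	if !ok || !r.enabled[name] {
		return r.defaultName
	}
	return name
}

// admitPodCliqueSet mirrors the admission rules: every PodClique template
// must resolve to the same scheduler, and that scheduler must be enabled.
// Assumption: an empty scheduler name resolves to the default backend.
func (r backendRegistry) admitPodCliqueSet(templateSchedulers []string) error {
	resolved := ""
	for i, s := range templateSchedulers {
		if s == "" {
			s = r.defaultName
		}
		if !r.enabled[s] {
			return fmt.Errorf("scheduler %q is not enabled", s)
		}
		if i > 0 && s != resolved {
			return fmt.Errorf("templates resolve to different schedulers: %q and %q", resolved, s)
		}
		resolved = s
	}
	return nil
}
```

With a registry whose default is, say, `default-scheduler` and whose enabled set also contains `kai-scheduler` (both names illustrative), a PodGang labeled `grove.io/scheduler-name: kai-scheduler` resolves to the KAI backend, while a missing or unknown label value resolves to the default.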
| **Mitigation**: | ||
| - Mark backend-managed PodGang objects with an explicit ignore annotation consumed by legacy KAI paths. |
This means this backend also relies on kai-scheduler. Could you please also track the PR status? Maybe we should ping the KAI folks to merge this PR.
This PR, "Allow Grove plugin to ignore PodGangs with specific annotation", is still in draft status. @sanjaychatterjee please let us know if you have any plans to push that PR to be merged.
@daisy-ycguo We should use the interface from this PR: kai-scheduler/KAI-Scheduler#1552
That means adding the annotation `kai.scheduler/skip-podgrouper` on the PodGang. This needs the next version of KAI, which should be released soon.
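A minimal Go sketch of what marking a backend-managed PodGang could look like. The annotation key is the one quoted above; the value `"true"` and the helper name `markSkipPodGrouper` are assumptions for illustration, and the exact value KAI expects should be checked against kai-scheduler/KAI-Scheduler#1552.

```go
package main

// Annotation key quoted in the review thread.
const skipPodGrouperAnnotation = "kai.scheduler/skip-podgrouper"

// markSkipPodGrouper returns a copy of the given annotations with the skip
// annotation set, leaving the input map untouched.
// Assumption: "true" is used as the value; the real expected value is not
// stated in this thread.
func markSkipPodGrouper(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[skipPodGrouperAnnotation] = "true"
	return out
}
```

In a real controller the result would be assigned to the PodGang's metadata annotations before create or update. Copying the map rather than mutating it keeps the helper safe to call on objects read from a shared cache.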
Adds GREP-525 proposal for KAI scheduler plugin.
Fixes #525