[GREP] scheduler plugin - KAI scheduler #553
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,244 @@ | ||
| # GREP-525: KAI Scheduler Backend for Scheduler Backend Framework | ||
|
|
||
| <!-- toc --> | ||
| - [Summary](#summary) | ||
| - [Motivation](#motivation) | ||
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [Proposal](#proposal) | ||
| - [User Stories](#user-stories) | ||
| - [Story 1: Platform Operator Enables KAI Backend](#story-1-platform-operator-enables-kai-backend) | ||
| - [Story 2: Workload Owner Uses KAI Scheduler](#story-2-workload-owner-uses-kai-scheduler) | ||
| - [Limitations/Risks & Mitigations](#limitationsrisks--mitigations) | ||
| - [Risk: Duplicate Ownership During Migration](#risk-duplicate-ownership-during-migration) | ||
| - [Design Details](#design-details) | ||
| - [Architecture Overview](#architecture-overview) | ||
| - [Backend Lifecycle Contract](#backend-lifecycle-contract) | ||
| - [Operator Configuration and Scheduler Resolution](#operator-configuration-and-scheduler-resolution) | ||
| - [KAI Backend Responsibilities](#kai-backend-responsibilities) | ||
| - [PodGang to PodGroup Mapping](#podgang-to-podgroup-mapping) | ||
| - [PodGroup Update Semantics](#podgroup-update-semantics) | ||
| - [Reconciliation Flow](#reconciliation-flow) | ||
| - [API and Registration Requirements](#api-and-registration-requirements) | ||
| - [RBAC Matrix](#rbac-matrix) | ||
| - [Test Plan](#test-plan) | ||
| - [Unit Tests](#unit-tests) | ||
| - [E2E Tests](#e2e-tests) | ||
| - [Graduation Criteria](#graduation-criteria) | ||
| - [Alpha](#alpha) | ||
| - [Beta](#beta) | ||
| - [GA](#ga) | ||
| - [Appendix](#appendix) | ||
| <!-- /toc --> | ||
|
|
||
| ## Summary | ||
|
|
||
| This proposal adds a dedicated KAI scheduler backend to Grove's Scheduler Backend Framework so Grove can natively create, update, and delete KAI PodGroup resources for Grove PodGang workloads. This proposal is intentionally limited to PodGroup creation and management; it does not include topology-aware scheduling support or KAI Topology synchronization from Grove ClusterTopology. The change improves maintainability, clarifies ownership boundaries, and enables predictable KAI-specific lifecycle handling for PodGang workloads without relying on legacy external PodGroup management. | ||
|
|
||
| ## Motivation | ||
|
|
||
| GREP-375 introduced a generic Scheduler Backend Framework, but the KAI integration still needs a concrete backend implementation pattern and operational contract for production use. Without this backend, KAI support depends on legacy behavior that can cause ambiguous ownership of PodGroup resources and complicate migration as Grove evolves. | ||
|
|
||
| Defining a KAI backend proposal is important because it: | ||
|
|
||
| - Transitions KAI support into Grove's standardized backend lifecycle. | ||
| - Makes PodGang-to-KAI resource reconciliation explicit and auditable. | ||
| - Reduces risk of duplicate resource management during migration. | ||
| - Preserves KAI runtime-owned state during Grove-driven reconciliation. | ||
| - Aligns KAI support with future multi-backend extensibility goals. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Define the KAI backend behavior under the Scheduler Backend Framework lifecycle. | ||
| - Define how `kai-scheduler` is enabled, selected, and resolved from `OperatorConfiguration`, PodClique templates, Pods, and PodGangs. | ||
| - Define `PreparePod` behavior so Pods are scheduled by KAI consistently with Grove's scheduling gate flow. | ||
| - Specify PodGang to KAI PodGroup translation and reconciliation responsibilities. | ||
| - Define deletion-time cleanup behavior for KAI-owned scheduling resources. | ||
| - Document migration-safe coexistence with legacy KAI integration paths. | ||
| - Clarify required RBAC, scheme registration, and dependency/version expectations for KAI resources. | ||
| - Establish test expectations for pod preparation, PodGroup sync, and delete paths. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - Redesigning the Scheduler Backend Framework introduced by GREP-375. | ||
| - Introducing new user-facing scheduling APIs in PodCliqueSet or PodGang for this phase. | ||
| - Covering support for all third-party schedulers; this proposal covers only KAI backend behavior. | ||
|
Contributor
Yeah, this is what this GREP does. |
||
| - Defining advanced KAI-only scheduling semantics beyond existing PodGang intent. | ||
| - Replacing or deprecating non-KAI backends. | ||
|
Contributor
Same here; obviously this KAI backend shouldn't affect other backends. |
||
| - Requiring PodGang status-only updates to trigger backend reconciliation. The current backend controller reacts to create, delete, and generation-changing updates. | ||
| - Creating, updating, or deleting KAI Topology resources from Grove `ClusterTopology`. | ||
| - Defining topology-aware scheduling behavior for KAI. That functionality is out of scope for this proposal and should be covered separately. | ||
|
Contributor
If we have a KAI backend, then we should still support all KAI functionality, which means Grove still supports topology-aware scheduling, correct?
Author
Topology-aware scheduling is out of scope for this GREP; it lives in a separate backend file, operator/internal/scheduler/kai/topology.go. I think topology-aware scheduling should be supported, but that needs to be verified in tests. |
||
|
|
||
| ## Proposal | ||
|
|
||
| Grove will ship a built-in `kai-scheduler` backend that implements the Scheduler Backend Framework lifecycle hooks needed to manage KAI PodGroups. The backend is responsible for converting Grove PodGang intent to KAI PodGroup resources, preparing Pods to use KAI, participating in admission validation, and keeping KAI PodGroups in sync with Grove lifecycle events. | ||
|
|
||
| This proposal only covers KAI PodGroup creation and management. It does not propose any KAI Topology creation/update flow, does not add startup-time topology synchronization, and does not define topology-aware scheduling behavior. | ||
|
|
||
| At a high level, the proposal introduces: | ||
|
|
||
| 1. **KAI backend ownership model**: The Grove backend controller is the single owner of KAI PodGroup reconciliation for PodGang resources that select `kai-scheduler`. | ||
| 2. **Deterministic lifecycle behavior**: backend initialization happens during operator startup, `PreparePod` sets the scheduler name during Pod construction, `SyncPodGang` handles create/update reconciliation, and `OnPodGangDelete` handles cleanup. | ||
| 3. **Migration-safe coexistence**: PodGang resources managed via this backend are annotated so legacy KAI paths can ignore them during migration windows. | ||
| 4. **Operator readiness requirements**: KAI PodGroup API types are registered in Grove scheme and RBAC allows backend operations on KAI PodGroups. | ||
| 5. **Update safety**: Grove preserves fields that KAI runtime components own so backend reconciliation does not erase scheduler decisions or mutable runtime state. | ||
|
|
||
| ### User Stories | ||
|
|
||
| #### Story 1: Platform Operator Enables KAI Backend | ||
|
|
||
| As a platform operator, I want Grove to manage KAI scheduling resources through its backend framework so that KAI integration follows a consistent operator lifecycle and is easier to operate and troubleshoot. | ||
|
|
||
| #### Story 2: Workload Owner Uses KAI Scheduler | ||
|
|
||
| As a workload owner, I want my PodGang workloads targeting KAI to automatically produce and maintain the required KAI PodGroup resources so that gang scheduling intent is enforced without manual intervention. | ||
|
|
||
| ### Limitations/Risks & Mitigations | ||
|
|
||
| #### Risk: Duplicate Ownership During Migration | ||
|
|
||
| If both the legacy integration and the new backend reconcile the same PodGang intent, they may issue conflicting writes to the same KAI PodGroup. | ||
|
|
||
| **Mitigation**: | ||
|
|
||
| - Mark backend-managed PodGang objects with an explicit ignore annotation consumed by legacy KAI paths. | ||
|
Contributor
This means this backend also relies on a change in kai-scheduler. Could you please also track that PR's status? Maybe we should ping the KAI maintainers to merge it.
Author
The PR "Allow Grove plugin to ignore PodGangs with specific annotation" is still in draft status. @sanjaychatterjee please let us know if you have any plans to push that PR to be merged.
Contributor
@daisy-ycguo We should use the interface from this PR: kai-scheduler/KAI-Scheduler#1552 |
||
| - Keep ownership boundaries documented and validated in tests. | ||
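The annotation-based mitigation can be sketched as a small idempotent helper. This is only an illustration: the annotation key and the function name are hypothetical placeholders, since this proposal does not name the real key consumed by the legacy KAI path.

```go
package main

// ignoreAnnotationKey is a hypothetical placeholder; the proposal does not
// specify the real annotation key read by the legacy KAI integration.
const ignoreAnnotationKey = "example.grove.io/ignore-by-legacy-kai"

// ensureIgnoreAnnotation returns the annotations unchanged when the ignore
// annotation is already set. Otherwise it returns a copy with the annotation
// added and reports that the PodGang needs a patch.
func ensureIgnoreAnnotation(annotations map[string]string) (map[string]string, bool) {
	if annotations[ignoreAnnotationKey] == "true" {
		return annotations, false
	}
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[ignoreAnnotationKey] = "true"
	return out, true
}
```

Returning a "changed" flag lets the caller skip the patch call on every reconciliation after the first one.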
|
|
||
| ## Design Details | ||
|
|
||
| ### Architecture Overview | ||
|
|
||
| The KAI backend extends GREP-375 by implementing KAI-specific translations and lifecycle handling while preserving framework-level control flow. | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| A[OperatorConfiguration] --> B[Backend Manager initializes kai-scheduler] | ||
| E[PodCliqueSet admission] --> F[Resolve backend and validate scheduler selection] | ||
| G[PodCliqueSet controller] --> H[Create PodGang with scheduler label] | ||
| I[PodClique controller] --> J[PreparePod sets schedulerName to kai-scheduler] | ||
| H --> K[PodGang backend controller] | ||
| K --> L[KAI Backend SyncPodGang] | ||
| L --> M[Patch migration ignore annotation] | ||
| L --> N[Create or update KAI PodGroup] | ||
| K --> O[OnPodGangDelete] | ||
| O --> P[Delete KAI PodGroup] | ||
| ``` | ||
|
|
||
| ### Backend Lifecycle Contract | ||
|
|
||
| The backend must cover the PodGroup-related backend surface from GREP-375: | ||
|
|
||
| | Lifecycle surface | Trigger | KAI backend responsibility | | ||
| | --- | --- | --- | | ||
| | Backend initialization | Operator startup after manager creation | Construct and initialize the `kai-scheduler` backend profile. | | ||
| | Admission validation | PodCliqueSet create/update webhook | Validate scheduler selection and run KAI-specific validation when defined. | | ||
| | Pod preparation | PodClique controller builds a Pod | Set Pod `schedulerName` to `kai-scheduler`. | | ||
| | PodGang sync | PodGang create or generation-changing update | Patch migration annotation and reconcile KAI PodGroup. | | ||
| | PodGang deletion | PodGang delete event | Delete associated KAI PodGroup, ignoring not-found errors. | | ||
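As a rough illustration of the table above, the hooks could be grouped behind one Go interface. The hook names `PreparePod`, `SyncPodGang`, and `OnPodGangDelete` come from this proposal; the signatures, the simplified `Pod` and `PodGang` types, and the in-memory `kaiBackend` are assumptions made for this sketch and may differ from the actual GREP-375 interface.

```go
package main

// Pod and PodGang are simplified stand-ins for the Kubernetes and Grove types.
type Pod struct{ SchedulerName string }
type PodGang struct{ Namespace, Name string }

// SchedulerBackend is a hypothetical view of the lifecycle hooks; the real
// framework interface may use different signatures.
type SchedulerBackend interface {
	Name() string
	PreparePod(pod *Pod)
	SyncPodGang(pg *PodGang) error
	OnPodGangDelete(pg *PodGang) error
}

// kaiBackend tracks PodGroups in memory instead of calling the API server.
type kaiBackend struct{ podGroups map[string]bool }

func newKAIBackend() *kaiBackend { return &kaiBackend{podGroups: map[string]bool{}} }

func (b *kaiBackend) Name() string { return "kai-scheduler" }

// PreparePod points the Pod at the KAI scheduler.
func (b *kaiBackend) PreparePod(pod *Pod) { pod.SchedulerName = b.Name() }

// SyncPodGang records a PodGroup with the same namespace and name.
func (b *kaiBackend) SyncPodGang(pg *PodGang) error {
	b.podGroups[pg.Namespace+"/"+pg.Name] = true
	return nil
}

// OnPodGangDelete removes the PodGroup; a missing PodGroup is not an error.
func (b *kaiBackend) OnPodGangDelete(pg *PodGang) error {
	delete(b.podGroups, pg.Namespace+"/"+pg.Name)
	return nil
}

// Compile-time check that kaiBackend satisfies the interface.
var _ SchedulerBackend = (*kaiBackend)(nil)
```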
|
|
||
| ### Operator Configuration and Scheduler Resolution | ||
|
|
||
| `OperatorConfiguration.scheduler.profiles` enables the KAI backend by including a profile named `kai-scheduler`. The profile name is also the string that Grove writes into `Pod.spec.schedulerName`; this keeps backend lookup and Kubernetes scheduler selection aligned. | ||
|
|
||
| Scheduler resolution follows the framework behavior: | ||
|
|
||
| - Empty PodClique template `schedulerName` resolves to `scheduler.defaultProfileName`. | ||
| - All PodClique templates in a PodCliqueSet must resolve to the same scheduler backend. | ||
| - PodGang resources created by Grove carry the resolved scheduler name in the `grove.io/scheduler-name` label. | ||
| - The PodGang backend controller resolves the backend from that label and falls back to the default backend only if the label is absent or invalid. | ||
| - A PodCliqueSet that references an enabled non-default scheduler is admitted; a PodCliqueSet that references a non-enabled scheduler is rejected. | ||
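The resolution rules above can be sketched in Go as follows. The `SchedulerConfig` struct, its field names, and `resolveScheduler` are illustrative assumptions rather than the actual Grove types; the sketch only covers defaulting, the same-backend rule, and rejection of non-enabled schedulers.

```go
package main

import "fmt"

// SchedulerConfig is a simplified, hypothetical view of the scheduler section
// of OperatorConfiguration.
type SchedulerConfig struct {
	DefaultProfileName string
	Profiles           []string // names of enabled profiles
}

// resolveScheduler returns the single scheduler name that all PodClique
// templates resolve to, or an error if admission should reject the set.
func resolveScheduler(cfg SchedulerConfig, templateSchedulerNames []string) (string, error) {
	enabled := make(map[string]bool, len(cfg.Profiles))
	for _, p := range cfg.Profiles {
		enabled[p] = true
	}
	resolved := ""
	for _, name := range templateSchedulerNames {
		if name == "" {
			// An empty schedulerName falls back to the default profile.
			name = cfg.DefaultProfileName
		}
		if !enabled[name] {
			return "", fmt.Errorf("scheduler %q is not enabled", name)
		}
		if resolved != "" && resolved != name {
			return "", fmt.Errorf("templates resolve to different schedulers: %q and %q", resolved, name)
		}
		resolved = name
	}
	if resolved == "" {
		// No templates were given; this sketch falls back to the default.
		resolved = cfg.DefaultProfileName
	}
	return resolved, nil
}
```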
|
Contributor
Scheduler resolution seems to be already done in Grove now. |
||
|
|
||
| ### KAI Backend Responsibilities | ||
|
|
||
| - Resolve only workloads assigned to `kai-scheduler`. | ||
| - Participate in PodCliqueSet validation through the framework hook. | ||
| - Prepare Pods by setting `schedulerName` to `kai-scheduler`. | ||
| - Translate PodGang group semantics to KAI PodGroup semantics. | ||
| - Reconcile KAI PodGroup state on PodGang create and update. | ||
| - Handle KAI resource cleanup on PodGang delete. | ||
| - Apply the migration ignore annotation to managed PodGang resources. | ||
|
|
||
| ### PodGang to PodGroup Mapping | ||
|
|
||
| The KAI backend translates a Grove PodGang to a KAI PodGroup with the following ownership and mapping rules: | ||
|
|
||
| | Grove source | KAI PodGroup target | | ||
| | --- | --- | | ||
| | PodGang name and namespace | PodGroup name and namespace | | ||
| | PodGang labels and annotations | PodGroup labels and annotations, preserving existing target-only keys | | ||
| | Sum of PodGang pod group minimum replicas | PodGroup `minMember` | | ||
| | PodGang priority class | PodGroup priority class | | ||
| | Queue label or annotation | PodGroup queue on initial creation | | ||
| | PodGang pod groups | Leaf KAI subgroups with min member and optional parent | | ||
| | PodGang owner reference | PodGroup controller owner reference | | ||
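A minimal Go sketch of part of this mapping is shown below, covering name and namespace, priority class, queue, `minMember` as the sum of minimum replicas, and one leaf subgroup per PodGang pod group. All type and field names are simplified assumptions, and the queue label key is a hypothetical placeholder. Label and annotation merging, subgroup parents, and owner references are left out.

```go
package main

// Simplified stand-ins; the real Grove and KAI API types differ.
type PodGangGroup struct {
	Name        string
	MinReplicas int32
}

type PodGang struct {
	Namespace, Name   string
	PriorityClassName string
	Labels            map[string]string
	PodGroups         []PodGangGroup
}

type SubGroup struct {
	Name      string
	MinMember int32
}

type KAIPodGroup struct {
	Namespace, Name   string
	PriorityClassName string
	Queue             string
	MinMember         int32
	SubGroups         []SubGroup
}

// queueLabelKey is a hypothetical label key carrying the KAI queue name.
const queueLabelKey = "example.com/queue"

// buildPodGroup computes the desired PodGroup for initial creation.
func buildPodGroup(pg PodGang) KAIPodGroup {
	out := KAIPodGroup{
		Namespace:         pg.Namespace,
		Name:              pg.Name,
		PriorityClassName: pg.PriorityClassName,
		Queue:             pg.Labels[queueLabelKey],
	}
	for _, g := range pg.PodGroups {
		// minMember is the sum of the minimum replicas of all pod groups.
		out.MinMember += g.MinReplicas
		out.SubGroups = append(out.SubGroups, SubGroup{Name: g.Name, MinMember: g.MinReplicas})
	}
	return out
}
```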
|
|
||
| This mapping focuses on PodGroup ownership and gang membership. KAI Topology resources and topology-aware scheduling semantics are outside the scope of this proposal. | ||
|
|
||
| ### PodGroup Update Semantics | ||
|
|
||
| After creation, some PodGroup fields are owned or mutated by KAI runtime components. The KAI backend must not blindly overwrite them on every Grove reconciliation. Existing runtime-managed values are inherited before comparison and update. This includes: | ||
|
|
||
| - Scheduler backoff state. | ||
| - Mark-unschedulable state. | ||
| - Existing queue value. | ||
| - Runtime-assigned KAI queue and node-pool labels. | ||
|
|
||
| For source-owned labels and annotations, Grove ensures values from the desired PodGang are present on the PodGroup while preserving unrelated existing keys. | ||
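One way to picture the inheritance step is the Go sketch below. The `PodGroup` fields and the runtime label keys are assumptions chosen for illustration; the real KAI PodGroup type names these differently.

```go
package main

// PodGroup is a simplified stand-in for the KAI PodGroup type.
type PodGroup struct {
	Labels            map[string]string
	Queue             string
	MinMember         int32
	MarkUnschedulable *bool
	SchedulingBackoff *int32
}

// runtimeLabelKeys are hypothetical keys for labels that KAI assigns at runtime.
var runtimeLabelKeys = []string{"example.com/queue", "example.com/node-pool"}

// inheritRuntimeFields copies KAI-owned values from the existing PodGroup into
// the desired one, so a later comparison and update do not overwrite them.
func inheritRuntimeFields(desired, existing PodGroup) PodGroup {
	out := desired
	out.MarkUnschedulable = existing.MarkUnschedulable
	out.SchedulingBackoff = existing.SchedulingBackoff
	if existing.Queue != "" {
		// The queue is only taken from the PodGang on initial creation.
		out.Queue = existing.Queue
	}
	merged := make(map[string]string, len(existing.Labels)+len(desired.Labels))
	for k, v := range existing.Labels {
		merged[k] = v // keep unrelated existing keys
	}
	for k, v := range desired.Labels {
		merged[k] = v // source-owned values win
	}
	for _, k := range runtimeLabelKeys {
		if v, ok := existing.Labels[k]; ok {
			merged[k] = v // runtime-assigned labels win over the source
		}
	}
	out.Labels = merged
	return out
}
```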
|
|
||
| ### Reconciliation Flow | ||
|
|
||
| 1. Backend controller receives PodGang event and resolves `kai-scheduler` backend. | ||
| 2. KAI backend patches the PodGang with the migration ignore annotation if missing. | ||
| 3. KAI backend computes desired PodGroup representation from PodGang state. | ||
| 4. Backend creates the KAI PodGroup if none exists. | ||
| 5. Backend inherits KAI runtime-managed fields from the existing PodGroup before comparing desired and actual state. | ||
| 6. Backend updates only when source-owned fields or desired scheduling intent changed. | ||
| 7. On PodGang deletion, backend removes the associated KAI PodGroup and ignores not-found errors. | ||
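Steps 4 to 6 amount to a create, update, or no-op decision. The Go sketch below shows that decision under simplifying assumptions: the `PodGroup` struct has only three fields, and `Queue` and `Backoff` stand in for all runtime-managed state. The names `planSync` and `Action` are hypothetical.

```go
package main

// PodGroup is a reduced stand-in; Queue and Backoff represent runtime-managed
// state in this sketch, MinMember represents source-owned state.
type PodGroup struct {
	MinMember int32
	Queue     string
	Backoff   int32
}

type Action string

const (
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
	ActionNone   Action = "none"
)

// planSync decides what the backend should do for one PodGang sync and returns
// the PodGroup that should exist afterwards. existing is nil when no PodGroup
// was found.
func planSync(desired PodGroup, existing *PodGroup) (Action, PodGroup) {
	if existing == nil {
		return ActionCreate, desired
	}
	// Inherit runtime-managed fields before comparing desired and actual state.
	desired.Queue = existing.Queue
	desired.Backoff = existing.Backoff
	if desired == *existing {
		return ActionNone, *existing
	}
	return ActionUpdate, desired
}
```

Comparing only after inheritance means a change made by KAI alone never produces an update from Grove.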
|
|
||
| The backend controller only handles PodGang create, delete, and generation-changing update events. Status-only transitions, such as the PodGang `Initialized` condition, do not trigger backend reconciliation. The KAI backend design must therefore rely on spec and metadata changes for PodGroup reconciliation. | ||
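The event filtering described here can be illustrated with a small Go predicate. The `PodGangEvent` type is a hypothetical simplification; an operator built on controller-runtime would more likely use that library's generation-changed predicate, which is not shown here.

```go
package main

type EventType int

const (
	Create EventType = iota
	Update
	Delete
)

// PodGangEvent is a simplified event carrying the metadata.generation values
// seen before and after an update.
type PodGangEvent struct {
	Type          EventType
	OldGeneration int64
	NewGeneration int64
}

// shouldReconcile reports whether the backend controller handles the event.
// Status-only updates leave the generation unchanged and are skipped.
func shouldReconcile(e PodGangEvent) bool {
	switch e.Type {
	case Create, Delete:
		return true
	case Update:
		return e.NewGeneration != e.OldGeneration
	default:
		return false
	}
}
```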
|
|
||
| ### API and Registration Requirements | ||
|
|
||
| - Grove runtime scheme includes KAI PodGroup API types for backend client operations. | ||
| - Operator RBAC grants read/write/delete access for KAI PodGroup resources. | ||
| - Backend initialization should validate required API availability before normal reconciliation where practical. | ||
| - KAI dependency imports should consistently use the same module path and version across backend code, scheme registration, unit tests, and e2e helpers. | ||
|
|
||
| ### RBAC Matrix | ||
|
|
||
| | API group | Resource | Scope | Required verbs | Purpose | | ||
| | --- | --- | --- | --- | --- | | ||
| | `scheduling.run.ai` | `podgroups` | Namespaced | create, get, list, watch, patch, update, delete | PodGang to KAI PodGroup reconciliation and cleanup. | | ||
|
|
||
| ### Test Plan | ||
|
|
||
| #### Unit Tests | ||
|
|
||
| - Validate `PreparePod` sets Pod `schedulerName` to `kai-scheduler`. | ||
| - Validate `SyncPodGang` creates and updates KAI PodGroup state, including required field mapping, migration annotation, and runtime-managed field preservation. | ||
| - Validate `OnPodGangDelete` removes the associated KAI PodGroup and ignores already-deleted resources. | ||
|
|
||
| #### E2E Tests | ||
|
|
||
| - Deploy a minimal PodCliqueSet that uses `schedulerName: kai-scheduler` through the existing e2e `PrepareTest` and `DeployAndVerifyWorkload` flow. Verify the created Pods use `kai-scheduler`, the backend creates a KAI PodGroup for the Grove PodGang, and the PodGroup contains the expected basic fields such as owner reference, `minMember`, queue, and subgroups. | ||
| - Delete the same PodCliqueSet with the existing workload deletion helper and verify the KAI PodGroup for that workload is removed. This covers the `OnPodGangDelete` path without adding topology-specific e2e coverage. | ||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| #### Alpha | ||
|
|
||
| - KAI backend is implemented behind framework lifecycle hooks. | ||
| - Unit tests cover pod preparation, PodGroup translation, sync, delete, and annotation behavior. | ||
|
|
||
| #### Beta | ||
|
|
||
| - E2E coverage validates KAI backend behavior in realistic cluster environments. | ||
|
|
||
| #### GA | ||
|
|
||
| - KAI backend is stable across multiple releases with no unresolved critical issues. | ||
|
|
||
| ## Appendix | ||
|
|
||
| - Scheduler Backend Framework baseline: GREP-375. | ||
| - KAI backend implementation context: [PR #524](https://github.com/ai-dynamo/grove/pull/524). | ||
Could you please check whether Grove has already migrated topology scheduling support in the repo, as in this PR?
@enoodle could you please help check this?
@kangclzjc The PR "refactor: Use the scheduler backend to implement topology scheduling" is merged.