implement KAI-Scheduler backend #524

Open
daisy-ycguo wants to merge 1 commit into ai-dynamo:main from daisy-ycguo:kai


daisy-ycguo commented Apr 14, 2026

What type of PR is this?

This PR implements the KAI scheduler backend SyncPodGang() flow in Grove’s scheduler backend framework.
Specifically, it:

  • Adds PodGang -> KAI PodGroup translation and reconciliation (create/update) in operator/internal/scheduler/kai/backend.go
  • Adds PodGroup cleanup on PodGang deletion via OnPodGangDelete()
  • Adds migration-safe behavior by annotating each PodGang with grove.io/ignore: "true" so that the legacy KAI Grove podgrouper ignores these PodGangs
  • Registers KAI PodGroup API types in Grove scheme (scheduling.run.ai/v2alpha2)
  • Adds RBAC permissions for scheduling.run.ai/podgroups in the Helm ClusterRole
  • Adds/extends unit tests for KAI backend sync and delete paths
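
As a rough illustration of the translation and create/update reconciliation listed above, here is a minimal sketch in Go. It is not the code in operator/internal/scheduler/kai/backend.go: the `PodGang` and `PodGroup` structs are simplified, hypothetical stand-ins for the real Grove and KAI (scheduling.run.ai/v2alpha2) types, and the rule that `MinMember` is the sum of the per-group minimums is an assumption made only for the example.

```go
package main

// PodGang is a simplified, hypothetical stand-in for Grove's PodGang.
// The real type carries far more information.
type PodGang struct {
	Name      string
	Namespace string
	// Groups maps a pod group name to its minimum replica count (assumed shape).
	Groups map[string]int32
}

// PodGroup is a simplified, hypothetical stand-in for the KAI PodGroup.
type PodGroup struct {
	Name      string
	Namespace string
	MinMember int32
	OwnerName string // name of the owning PodGang
}

// Action is the step a reconciler would take for one PodGang.
type Action string

const (
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
	ActionNone   Action = "none"
)

// translate builds the desired PodGroup for a PodGang.
// Assumption: MinMember is the sum of the per-group minimums.
func translate(pg PodGang) PodGroup {
	var total int32
	for _, n := range pg.Groups {
		total += n
	}
	return PodGroup{
		Name:      pg.Name,
		Namespace: pg.Namespace,
		MinMember: total,
		OwnerName: pg.Name,
	}
}

// plan compares the desired PodGroup with the one found in the cluster.
// existing is nil when no PodGroup exists yet.
func plan(desired PodGroup, existing *PodGroup) Action {
	switch {
	case existing == nil:
		return ActionCreate
	case *existing != desired:
		return ActionUpdate
	default:
		return ActionNone
	}
}
```

In a real backend, `plan` would be followed by a client call that creates or updates the object, and a delete hook such as OnPodGangDelete() would remove the PodGroup with the same name and namespace.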

What this PR does / why we need it:

  • Grove now owns backend-specific scheduling resources directly for KAI.
  • Without this change, the KAI backend remains incomplete and relies on legacy plugin behavior.
  • The ignore annotation prevents duplicate/competing PodGroup management during migration.
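
The ignore-annotation handshake described above can be sketched as follows. Only the annotation key grove.io/ignore and the value "true" come from this PR; the function names `markIgnored` and `legacyShouldSkip` are hypothetical, and the legacy podgrouper's actual check may differ.

```go
package main

// IgnoreAnnotation is the annotation key named in the PR description.
const IgnoreAnnotation = "grove.io/ignore"

// markIgnored makes sure the annotation is set to "true".
// If it is already set, the input map is returned unchanged and changed is false.
// Otherwise a copy with the annotation added is returned and changed is true,
// so the caller knows an update to the PodGang is needed.
func markIgnored(annotations map[string]string) (out map[string]string, changed bool) {
	if annotations[IgnoreAnnotation] == "true" {
		return annotations, false
	}
	out = make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[IgnoreAnnotation] = "true"
	return out, true
}

// legacyShouldSkip models the check a legacy podgrouper could perform
// before creating its own PodGroup (hypothetical helper).
func legacyShouldSkip(annotations map[string]string) bool {
	return annotations[IgnoreAnnotation] == "true"
}
```

Returning a `changed` flag keeps the operation idempotent: a second reconcile of an already annotated PodGang results in no write.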

Which issue(s) this PR fixes:

Fixes #525

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE


copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Daisy <daiguo@nvidia.com>
Contributor

enoodle commented Apr 15, 2026

Hi @daisy-ycguo
I think that KAI currently doesn't know how to read this new PodGroup, so it will still create its own PodGroup and ignore Grove's.
We should update KAI to allow using an external PodGroup like this one, so that this change has more effect.
I opened an issue in KAI for that: kai-scheduler/KAI-Scheduler#1420

Contributor

enoodle commented Apr 15, 2026

Now I see the updated version of the PR message, so I think we are on the same page on that 👍

Contributor

kangclzjc commented May 9, 2026

> Hi @daisy-ycguo
> I think that KAI currently doesn't know how to read this new PodGroup, so it will still create its own PodGroup and ignore Grove's.
> We should update KAI to allow using an external PodGroup like this one, so that this change has more effect.
> I opened an issue in KAI for that: kai-scheduler/KAI-Scheduler#1420

@enoodle Sanjay has a PR to fix this. Could you please check whether it helps?


Development

Successfully merging this pull request may close these issues.

KAI scheduler backend: sync PodGang to KAI PodGroup
