
feat: Add GREP-0376 for Volcano scheduler backend #560

Open
xianlubird wants to merge 2 commits into ai-dynamo:main from xianlubird:feature/volcano-gang-proposal

Conversation

xianlubird (Contributor) commented Apr 29, 2026

Summary

  • Proposes Volcano scheduler backend support for Grove through native Volcano PodGroup
  • Uses Volcano 1.14+ subGroupPolicy to map each Grove PodGang role into a Volcano subgroup
  • Documents the end-to-end scheduling flow from PodCliqueSet to Grove PodGang to Volcano PodGroup
  • Adds the Volcano 1.14+ version requirement and backend initialization capability check for PodGroup.spec.subGroupPolicy
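The capability check in the last bullet could look roughly like the sketch below. It assumes the backend has already fetched the Volcano PodGroup CRD and decoded its `openAPIV3Schema` into a generic map; `hasField` and `checkSubGroupPolicy` are hypothetical helper names, not functions from Grove or Volcano.

```go
package main

import "fmt"

// hasField reports whether a decoded OpenAPI v3 schema (the object stored under
// spec.versions[].schema.openAPIV3Schema in a CRD) declares the nested
// property path, for example ["spec", "subGroupPolicy"].
func hasField(schema map[string]any, path ...string) bool {
	cur := schema
	for _, name := range path {
		props, ok := cur["properties"].(map[string]any)
		if !ok {
			return false
		}
		next, ok := props[name].(map[string]any)
		if !ok {
			return false
		}
		cur = next
	}
	return true
}

// checkSubGroupPolicy is a hypothetical init-time guard: it fails fast when the
// installed PodGroup CRD does not expose spec.subGroupPolicy.
func checkSubGroupPolicy(podGroupSchema map[string]any) error {
	if !hasField(podGroupSchema, "spec", "subGroupPolicy") {
		return fmt.Errorf("volcano PodGroup CRD lacks spec.subGroupPolicy; Volcano 1.14+ is required")
	}
	return nil
}
```

Probing the CRD schema rather than parsing a version string is one possible design: it tests for the field the backend actually depends on, so it would also work against custom Volcano builds.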

Key design decisions

  • Keep Grove PodGang as the portable scheduler API and translate it to Volcano-native resources in the backend
  • Map each Grove PodGroup.minReplicas to the corresponding Volcano subGroupPolicy[].subGroupSize
  • Preserve the overall workload gang requirement with Volcano PodGroup.spec.minMember
  • Require Volcano 1.14+ because subGroupPolicy is needed now, and future topology-aware scheduling will also depend on Volcano 1.14+ capabilities
  • Follow the direction already taken by KubeRay and Volcano Kthena for subgroup-based gang scheduling
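The translation described in these decisions can be sketched as follows. All types here are simplified, hypothetical stand-ins rather than the real Grove or Volcano API types, and real field names may differ. The sketch also assumes that `minMember` is the sum of the per-role `minReplicas`, which the summary above does not state explicitly.

```go
package main

// GrovePodGroup is a hypothetical, simplified view of one role in a Grove PodGang.
type GrovePodGroup struct {
	Name        string
	MinReplicas int32
}

// GrovePodGang is a hypothetical, simplified view of a Grove PodGang.
type GrovePodGang struct {
	Name      string
	PodGroups []GrovePodGroup
}

// SubGroupPolicy is a hypothetical stand-in for one Volcano subGroupPolicy entry.
type SubGroupPolicy struct {
	Name         string
	SubGroupSize int32
}

// VolcanoPodGroupSpec is a hypothetical stand-in for the Volcano PodGroup spec.
type VolcanoPodGroupSpec struct {
	MinMember      int32
	SubGroupPolicy []SubGroupPolicy
}

// toVolcanoSpec maps each Grove PodGroup to one subgroup entry and derives the
// overall gang size.
func toVolcanoSpec(gang GrovePodGang) VolcanoPodGroupSpec {
	spec := VolcanoPodGroupSpec{
		SubGroupPolicy: make([]SubGroupPolicy, 0, len(gang.PodGroups)),
	}
	for _, pg := range gang.PodGroups {
		// minReplicas of the role becomes subGroupSize of the subgroup.
		spec.SubGroupPolicy = append(spec.SubGroupPolicy, SubGroupPolicy{
			Name:         pg.Name,
			SubGroupSize: pg.MinReplicas,
		})
		// Assumption: the workload-level gang requirement is the sum of role minimums.
		spec.MinMember += pg.MinReplicas
	}
	return spec
}
```

For example, a PodGang with a `prefill` role (`MinReplicas: 2`) and a `decode` role (`MinReplicas: 4`) would yield two subgroup entries and, under the summing assumption, `MinMember` of 6.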

Fixes #571

copy-pr-bot (Bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

sanjaychatterjee (Collaborator) commented

@xianlubird Thanks for the GREP. I agree that targeting Volcano 1.14+ is the right strategy, and overall the GREP looks good to me. I think it still needs more detail on topology-aware scheduling (TAS) support; please review PR #496 for the latest TAS updates in Grove.
@enoodle @kangclzjc Can you please review this GREP as well? Will approve once both of you give your LGTMs.

enoodle (Contributor) left a comment

Looks really good, but I think we need to address this issue:

Comment thread docs/proposals/376-volcano-scheduler-backend/README.md
Signed-off-by: xianlubird <xianlubird@gmail.com>
sanjaychatterjee force-pushed the feature/volcano-gang-proposal branch from 00e2305 to d552eca on May 1, 2026 07:48
sanjaychatterjee requested a review from danbar2 as a code owner on May 1, 2026 07:48
Comment thread docs/proposals/376-volcano-scheduler-backend/README.md
Comment thread docs/proposals/376-volcano-scheduler-backend/README.md
Signed-off-by: xianlubird <xianlubird@gmail.com>

Development

Successfully merging this pull request may close these issues.

Add volcano scheduler support as a backend in Grove
