KEP-5963: DRA Device Compatibility Groups #5964

Open
omeryahud wants to merge 3 commits into kubernetes:master from omeryahud:omeryahud/kep-5963-device-compatibility-groups

@omeryahud omeryahud commented Mar 12, 2026

  • One-line PR description: This KEP proposes extending the Dynamic Resource Allocation (DRA) API so that devices can declare compatibility constraints, enabling the scheduler to avoid allocating mutually exclusive hardware partitioning modes together.
  • Other comments: The formatting of the KEP documents is a work in progress.

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 12, 2026
@k8s-ci-robot
Contributor

Welcome @omeryahud!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 12, 2026
@k8s-ci-robot
Contributor

Hi @omeryahud. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 12, 2026
@omeryahud omeryahud force-pushed the omeryahud/kep-5963-device-compatibility-groups branch from 81a89cb to 9c9c306 Compare March 17, 2026 18:08
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omeryahud
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 17, 2026
Signed-off-by: Omer Yahud <oyahud@nvidia.com>
@omeryahud omeryahud force-pushed the omeryahud/kep-5963-device-compatibility-groups branch from 9c9c306 to a31625a Compare March 17, 2026 18:09
Signed-off-by: Omer Yahud <oyahud@nvidia.com>
Signed-off-by: Omer Yahud <oyahud@nvidia.com>

@rajatchopra rajatchopra left a comment

Looks good, subject to the scheduler maintainers being willing to adopt these suggestions.

Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which
named groups they belong to. For two devices consuming counters from the same
counter set to be co-allocated, they must share at least one compatibility group.
Devices without this field are considered compatible with all groups. This
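The co-allocation rule in the quoted KEP text can be sketched as a small predicate. This is only an illustration, not the scheduler implementation: the function name `may_coallocate` and the group names `mode-a` and `mode-b` are hypothetical, and treating an empty list the same as an omitted field is an assumption the quoted text does not state.

```python
from typing import Iterable, Optional


def may_coallocate(groups_a: Optional[Iterable[str]],
                   groups_b: Optional[Iterable[str]]) -> bool:
    """Check two devices that consume counters from the same counter set.

    Each argument stands for one device's compatibilityGroups value
    (None when the field is omitted).
    """
    set_a = set(groups_a or ())
    set_b = set(groups_b or ())
    # Quoted rule: a device without the field is compatible with all groups.
    # Assumption: an empty list is treated like an omitted field.
    if not set_a or not set_b:
        return True
    # Otherwise the devices must share at least one compatibility group.
    return not set_a.isdisjoint(set_b)
```

For example, `may_coallocate(["mode-a"], ["mode-b"])` is false, while `may_coallocate(["mode-a"], None)` is true because the second device does not set the field.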

@rajatchopra rajatchopra Apr 2, 2026

Should the default be 'not compatible with any group'? For compatibility with all (or some) groups we could then use a regex, like '*'. A regex may have further benefits: for example, 'fft-accelerator-*' would let an fmm-accelerator claim compatibility with all fft-accelerators, while devices within the fmm family, and within the fft family, stay mutually exclusive.
With that default, an older-version slice combined with a newer scheduler would automatically mean mutual exclusivity.
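One possible reading of this suggestion is sketched below, under several assumptions that the comment leaves open. The sketch uses shell-style glob matching from Python's `fnmatch` module rather than true regular expressions, because the examples given (`*`, `fft-accelerator-*`) read as globs. The function names are hypothetical. How two wildcard entries should compare is not specified, so the sketch treats them as non-matching, which keeps two devices that both declare `fft-accelerator-*` exclusive of each other.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional

_WILDCARDS = set("*?[")


def _is_pattern(entry: str) -> bool:
    # Assumption: an entry containing a glob metacharacter is a pattern.
    return any(ch in _WILDCARDS for ch in entry)


def may_coallocate_with_patterns(entries_a: Optional[Iterable[str]],
                                 entries_b: Optional[Iterable[str]]) -> bool:
    """Default-deny variant: a missing or empty list matches nothing."""
    for x in entries_a or ():
        for y in entries_b or ():
            x_pat, y_pat = _is_pattern(x), _is_pattern(y)
            if x_pat and y_pat:
                continue  # assumption: pattern vs. pattern never matches
            if x_pat and fnmatchcase(y, x):
                return True
            if y_pat and fnmatchcase(x, y):
                return True
            if not x_pat and not y_pat and x == y:
                return True
    return False
```

Under this reading, a device declaring `fft-accelerator-*` is compatible with one declaring `fft-accelerator-1`, two devices declaring `fft-accelerator-1` and `fft-accelerator-2` are not compatible, and a device from an older slice that declares nothing is compatible with no other device.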


### Scheduler Changes

The DRA scheduler plugin is enhanced to:

Would it help the scheduler to know the list of compatibility groups up front, or is collecting the list from the devices in a slice good enough?
We may want a `.sharedCounters[].compatibilityGroups` field if that makes things easier for the scheduler. It would also make the spec 'compile-correct'.
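The two options in this comment can be compared with a rough sketch. It models a slice as plain dictionaries that loosely mirror the field names under discussion. The dictionary shape, the helper names, and the sample values are assumptions for illustration and are not taken from the KEP. The first helper collects group names from the devices, and the second checks them against a list declared on each counter set, which is one way to read 'compile-correct'.

```python
from collections import defaultdict


def collect_groups(devices):
    """Derive, per counter set, the union of groups named by devices."""
    groups = defaultdict(set)
    for device in devices:
        for consumed in device.get("consumesCounters", []):
            groups[consumed["counterSet"]].update(
                consumed.get("compatibilityGroups", []))
    return dict(groups)


def undeclared_groups(shared_counters, devices):
    """Report groups used by devices but not declared on the counter set."""
    declared = {cs["name"]: set(cs.get("compatibilityGroups", []))
                for cs in shared_counters}
    problems = {}
    for counter_set, used in collect_groups(devices).items():
        extra = used - declared.get(counter_set, set())
        if extra:
            problems[counter_set] = extra
    return problems


# Hypothetical slice content, for illustration only.
devices = [
    {"name": "dev-a", "consumesCounters": [
        {"counterSet": "gpu-0", "compatibilityGroups": ["mode-a"]}]},
    {"name": "dev-b", "consumesCounters": [
        {"counterSet": "gpu-0", "compatibilityGroups": ["mode-b", "mode-a"]}]},
    {"name": "dev-c", "consumesCounters": [{"counterSet": "gpu-1"}]},
]
shared_counters = [
    {"name": "gpu-0", "compatibilityGroups": ["mode-a"]},
    {"name": "gpu-1"},
]
```

With a declared list, a typo in a device's group name shows up as an entry in the result of `undeclared_groups(shared_counters, devices)`. Without one, it would silently become a new group that nothing else shares.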

@alaypatel07
Contributor

/assign @alaypatel07

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
