diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md new file mode 100644 index 000000000000..fd05db386d9d --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -0,0 +1,422 @@ +# KEP-5963: DRA Device Compatibility Groups + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR) +- (R) KEP approvers have approved the KEP status as `implementable` +- (R) Design details are appropriately documented +- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - e2e Tests for all Beta API Operations (endpoints) + - (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - (R) Minimum Two Week Window for GA e2e tests to prove flake free +- (R) Graduation criteria is in place + - (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- (R) Production readiness review completed +- (R) Production readiness review approved +- "Implementation History" section is up-to-date for milestone +- User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/) +- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +## Summary + +This KEP proposes an extension to the Dynamic Resource Allocation (DRA) API to +support mutually exclusive device allocation constraints. Hardware devices often +support multiple partitioning or virtualization schemes (for example, GPU MIG +slicing vs. MPS sharing) that provide different trade-offs in terms of isolation, +performance, and resource sharing. These schemes are frequently mutually exclusive +at the hardware level: once a physical device is partitioned or configured using +one scheme, it cannot be reconfigured to use a different scheme until all existing +allocations are released. + +The current DRA Partitionable Devices API has no mechanism for drivers to express +these mutual exclusivity constraints. Without it, incompatible allocations are only +detected during resource preparation, after the scheduler has already made its +decisions, leading to pod startup failures and resource thrashing. This KEP +introduces API and scheduler changes so that compatibility constraints can be +declared in ResourceSlice objects and enforced at scheduling time. + +## Motivation + +Hardware devices often support multiple partitioning or virtualization schemes +that are mutually exclusive at the hardware level. For example, an NVIDIA GPU +can be configured for MIG (Multi-Instance GPU) slicing or MPS (Multi-Process +Service) sharing, but not both simultaneously on the same physical device. + +Without a mechanism to express these constraints in DRA, the following problems +arise: + +1. **Late Failure Detection**: Incompatible allocations are only detected during + resource preparation, after scheduling decisions have already been made. +2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices, + leading to pod startup failures. +3. **Poor User Experience**: Users receive cryptic preparation failures instead + of clear scheduling feedback. +4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible + allocations before giving up. + +The current workaround—having DRA drivers fail resource preparation when +incompatible allocations are attempted—is insufficient because it provides no +mechanism to inform the scheduler, and does not prevent repeated failed attempts. + +### Goals + +- Allow DRA drivers to specify compatibility between virtual devices within a +single physical device. +- Allow the scheduler to make informed allocation decisions that respect +compatibility rules declared in ResourceSlice objects. +- Provide a generic mechanism applicable to any hardware with partitioning +constraints, not just GPUs. +- Maintain backward compatibility with existing ResourceSlice specifications. + +### Non-Goals + +- Allow DRA drivers to specify compatibility between physical or virtual devices +across different physical devices or different device classes. The scope of +compatibility constraints is limited to virtual devices sharing the same +underlying physical device. + +## Proposal + +**CompatibilityGroups Assignment** + +Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which +named groups they belong to. For two devices consuming counters from the same +counter set to be co-allocated, they must share at least one compatibility group. +Devices without this field are considered compatible with all groups. This +approach is simpler and has minimal API surface. + +### User Stories + +#### Story 1 + +As a GPU operator using NVIDIA GPUs, I want to express in my ResourceSlice +that MIG-partitioned virtual devices and MPS-sharing virtual devices on the +same physical GPU are mutually exclusive. When a pod requesting a MIG partition +is already running on a GPU, I want the scheduler to automatically exclude all +MPS devices on that same GPU from consideration for new allocations, rather than +allowing an allocation that will fail at device preparation time. + +#### Story 2 + +As a hardware vendor publishing DRA drivers for an accelerator that supports +multiple exclusive operating modes (for example, exclusive mode, software +partitioning, and hardware partitioning), I want to declare the compatibility +constraints directly in my ResourceSlice, so that the Kubernetes scheduler +can enforce those constraints without requiring my driver to fail pod startup +with cryptic error messages. + +### Notes/Constraints/Caveats + +The compatibility constraint is bidirectional and transitive: if device A +specifies a constraint that excludes device B, then allocating A must prevent +B from being allocated, and vice versa. Both proposals implement this +bidirectional check in the scheduler. + +### Risks and Mitigations + +**Scheduler performance impact**: Evaluating compatibility constraints during +device selection adds work to each scheduling cycle that involves DRA devices. + +**Older schedulers ignoring new field**: A kube-scheduler that does not +understand `compatibilityGroups` will ignore this +field and may allocate incompatible devices. This degrades to the current +behavior (driver fails at preparation time). Mitigation: document the version +skew behavior clearly; drivers must still validate at preparation time even +when the scheduler enforces constraints. + +**Incorrect driver declarations**: If a driver declares incorrect compatibility +constraints, the scheduler may either reject valid allocations or permit invalid +ones. Mitigation: the API is driver-authored and opt-in; drivers are responsible +for correctness and documentation of their compatibility matrix. + +## Design Details + +### API + +#### CompatibilityGroups Assignment + +A new field `compatibilityGroups` is added inside each entry of +`device.consumesCounters[]`. It contains a list of string group names. +For two devices consuming counters from the same counter set to be allocated +together, they must share at least one group name. Devices that omit this +field are considered compatible with all groups. + +Example showing MIG and FOO partitions on the same physical GPU: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + sharedCounters: + - name: gpu-1-cs + counters: + multiprocessors: + value: "152" + devices: + - name: gpu-1-mig1 + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "2" + - name: gpu-1-foo-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - bar + counters: + multiprocessors: + value: "17" + - name: gpu-1-bar-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - bar + counters: + multiprocessors: + value: "17" +``` + +- `gpu-1-mig1` and `gpu-1-foo-part` share no compatibility group (`mig` vs +`foo`/`bar`), so they cannot be co-allocated on the same counter set. +- `gpu-1-foo-part` and `gpu-1-bar-part` share compatibility groups (`foo`, `bar`), +so they can be co-allocated on the same counter set. + +### Scheduler Changes + +The DRA scheduler plugin is enhanced to: + +1. Maintain a cache of allocated devices per node, including their compatibility + fields (`compatibilityGroups` values). +2. For each candidate device during allocation, evaluate whether it is compatible + with all currently allocated devices on the node, and whether all allocated + devices are compatible with it (bidirectional check). +3. Remove candidate devices from consideration if they violate compatibility + constraints. +4. Emit clear scheduling events when a device is rejected due to compatibility. + +### Driver Responsibilities + +Resource drivers are responsible for: + +1. Populating `compatibilityGroups` for all devices with compatibility requirements. +2. Ensuring compatibility rules are symmetric and consistent across all devices + in a ResourceSlice. +3. Documenting their compatibility matrix. +4. Continuing to validate at resource preparation time for version-skew safety. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +##### Unit tests + +- TBD + +##### Integration tests + +- TBD + +##### e2e tests + +- TBD + +### Graduation Criteria +#### Alpha +- API defined and implemented +- All relevant code is merged and placed behind a feature flag +- Unit and integration tests +- Documentation + +#### Beta +- E2E tests passing in CI +- Validated with at least one production DRA driver (out-of-tree testing) + +#### GA +- At least 2 releases as beta + +### Upgrade / Downgrade Strategy +#### Upgrade +Upon upgrading, no `ResourceSlice` leverages the new optional fields yet, so the current behavior remains as-is + +#### Downgrade +If downgrading to a version that does not have this enhancement implemented, older schedulers and api-servers do not know of the added optional field, and revert to their defined behavior prior to this enhancement + +Allocated devices that leveraged this new field will remain allocated, and future allocations will not take `compatibilityGroups` into consideration. + + +### Version Skew Strategy +No version skew concerns + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- Feature gate + - Feature gate name: DRADeviceCompatibilityGroups + - Components depending on the feature gate: kube-scheduler, kube-apiserver +- Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? +No, this KEP proposes an additional optional field to the `ResourceSlice` API + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes, rolling back the enablement will revert the cluster to its pre-enablemend behavior + +###### What happens if we reenable the feature if it was previously rolled back? +Existing `compatibilityGroup` configurations in `ResourceSlice`s will become effective again + +###### Are there any tests for feature enablement/disablement? +Yes, there will be integration tests to verify feature enablement/disablement + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout or rollback fail? Can it impact already running workloads? +I expect code changes in `kube-apiserver` and `kube-scheduler`, so something can go wrong with those. +No impact on already running workloads. + +###### What specific metrics should inform a rollback? +TBD + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? +TBD + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +Removal of a field from the `ResourceSlice` API + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? +This feature is not intended for use by workload usage, it is intended for DRA Drivers + +###### How can someone using this feature know that it is working for their instance? + +- Events + - Scheduling events: + - When all allocated devices in all Nodes are not compatible with any device that is considered for allocation the following event will be emitted by the scheduler for each Node: "No available nodes found: claim violates device conpatibility constraints" +- Pod.status + - Condition name: Unschedulable + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? +N/A + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +N/A + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? +No + +### Dependencies +DRA Partitionable Devices enabled + +###### Does this feature depend on any specific services running in the cluster? +No + +### Scalability + +###### Will enabling / using this feature result in any new API calls? +No + +###### Will enabling / using this feature result in introducing new API types? +No, only a new API field + +###### Will enabling / using this feature result in any new calls to the cloud provider? +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? +Yes, additional field to the `ResourceSlice` API + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? +Scheduling cycles will take longer to complete due to the additional responsibility the scheduler will recieve, I expect it to be negligible + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +No + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? +No + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? +No new side effects + +###### What are other known failure modes? +N/A + +###### What steps should be taken if SLOs are not being met to determine the problem? +TBD + +## Implementation History + +## Drawbacks + +Adding compatibility constraint support to the scheduler increases the +complexity of the DRA scheduling logic. The new field must be evaluated for +every device candidate during every scheduling cycle that involves DRA +resources, which adds latency and memory overhead. + +## Alternatives + +### Current Workaround: Driver-level Preparation Failure + +The existing workaround is for DRA drivers to fail resource preparation when +incompatible allocations are attempted. This approach is insufficient because: + +- It detects incompatibilities only after scheduling has committed to the +allocation, leading to pod startup failures. +- It provides no mechanism to inform the scheduler so it can try other nodes +or device combinations. +- It results in resource thrashing as the scheduler retries the same failing +combination. + +## Infrastructure Needed (Optional) + diff --git a/keps/sig-node/5963-device-compatibility-groups/kep.yaml b/keps/sig-node/5963-device-compatibility-groups/kep.yaml new file mode 100644 index 000000000000..4ccbeae4913b --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/kep.yaml @@ -0,0 +1,40 @@ +title: DRA Device Compatibility Groups +kep-number: 5963 +authors: + - "@omeryahud" +owning-sig: sig-node +participating-sigs: + - sig-scheduling +status: provisional +creation-date: 2026-03-17 +reviewers: + - TBD +approvers: + - TBD + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: v1.37 + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: v1.37 + beta: v1.38 + stable: v1.39 + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DRADeviceCompatibilityGroups + components: + - kube-scheduler + - kube-apiserver +disable-supported: true + +metrics: []