DRA: Contextual Capacity Management via Standardized Opaque Hints #5993

@ashvindeodhar

Enhancement Description

For high-density hardware (multi-tenant SmartNICs, MIG GPUs, and FPGAs), total capacity is often not a static property but a variable determined by the context of the first pod scheduled onto the device (e.g., a specific subnet or partition profile). While KEP-5075 handles the accounting, node-side ResourceSlice updates currently suffer from an "Informer Consistency Gap" during high-concurrency bursts: the scheduler can keep allocating from its cached view before that cache reconciles with the driver, which leads to physical over-subscription.

Adding a scheduler-decodable generic schema under OpaqueDeviceConfiguration provides an ideal way to solve this. By embedding a CapacityHint in the claim, the scheduler can perform capacity accounting synchronously, bypassing the informer lag without requiring core API changes.
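To make the idea concrete, here is a minimal sketch of what scheduler-side accounting driven by such a hint might look like. Everything in it is an assumption for illustration: the `CapacityHint` fields, the `Accountant` type, and the error values are hypothetical and not part of any existing Kubernetes API, and the hint is assumed to arrive as raw JSON taken from the opaque parameters. The first admitted claim fixes the device's context and total capacity; later claims are checked against that in-memory state instead of waiting for a ResourceSlice update.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// CapacityHint is a hypothetical payload embedded in the opaque parameters.
// Field names are illustrative assumptions, not a proposed schema.
type CapacityHint struct {
	Context       string `json:"context"`       // e.g. subnet or partition profile
	TotalCapacity int64  `json:"totalCapacity"` // capacity the device exposes under this context
	Request       int64  `json:"request"`       // units consumed by this claim
}

var (
	ErrContextMismatch      = errors.New("device already bound to a different context")
	ErrInsufficientCapacity = errors.New("insufficient capacity on device")
)

// deviceState is the scheduler's in-memory view of one device.
type deviceState struct {
	context   string
	total     int64
	allocated int64
}

// Accountant serializes reservations so concurrent claims cannot over-subscribe.
type Accountant struct {
	mu      sync.Mutex
	devices map[string]*deviceState
}

func NewAccountant() *Accountant {
	return &Accountant{devices: make(map[string]*deviceState)}
}

// DecodeHint parses and validates the raw JSON hint.
func DecodeHint(raw []byte) (CapacityHint, error) {
	var h CapacityHint
	if err := json.Unmarshal(raw, &h); err != nil {
		return CapacityHint{}, fmt.Errorf("decode capacity hint: %w", err)
	}
	if h.Context == "" || h.TotalCapacity <= 0 || h.Request <= 0 {
		return CapacityHint{}, errors.New("capacity hint: context, totalCapacity and request must be set")
	}
	return h, nil
}

// Reserve admits the claim only if the context matches and capacity remains.
// The first admitted claim binds the device to its context and total capacity.
func (a *Accountant) Reserve(device string, h CapacityHint) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	st, bound := a.devices[device]
	if !bound {
		st = &deviceState{context: h.Context, total: h.TotalCapacity}
	}
	if st.context != h.Context {
		return ErrContextMismatch
	}
	if st.allocated+h.Request > st.total {
		return ErrInsufficientCapacity
	}
	st.allocated += h.Request
	a.devices[device] = st // only persist state once the claim is admitted
	return nil
}

func main() {
	acct := NewAccountant()
	claims := []string{
		`{"context":"subnet-a","totalCapacity":4,"request":3}`,
		`{"context":"subnet-a","totalCapacity":4,"request":2}`,
		`{"context":"subnet-b","totalCapacity":8,"request":1}`,
	}
	for i, raw := range claims {
		h, err := DecodeHint([]byte(raw))
		if err != nil {
			fmt.Printf("claim %d invalid: %v\n", i, err)
			continue
		}
		if err := acct.Reserve("nic-0", h); err != nil {
			fmt.Printf("claim %d rejected: %v\n", i, err)
			continue
		}
		fmt.Printf("claim %d admitted\n", i)
	}
}
```

In this sketch the second claim is rejected for capacity and the third for a context mismatch, both decided from in-memory state. A real design would also need to release reservations on deallocation and reconcile this state with what the driver eventually publishes, which the sketch does not attempt.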

/assign @ashvindeodhar
/cc @johnbelamaric @pohly @sunya-ch

  • One-line enhancement description (can be used as a release note): Enable transactional capacity updates in DRA via standardized opaque hints to resolve context-dependent hardware capacity management and informer consistency gaps.
  • Kubernetes Enhancement Proposal:
  • Discussion Link: TBD
  • PRs by stage and milestone:
    • Alpha - v1.xx
      • KEP (k/enhancements) update PR(s):
      • Code (k/k) update PR(s):
      • Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Assignees

Labels

sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Type

No type

Projects

Status

🏗 In progress

Status

Needs Triage

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
