-
Notifications
You must be signed in to change notification settings - Fork 63
feat: GREP-0368 Preferred Topology Constraint API #523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
shayasoolin
wants to merge
9
commits into
ai-dynamo:main
Choose a base branch
from
shayasoolin:preferred-topology
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+201
−0
Open
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
6b97511
feat: add GREP-0368 for preferred topology constraint API
shayasoolin ca920da
docs: clarify wording and webhook validation in GREP-0368
shayasoolin 3d990e2
docs: remove redundant limitations entry in GREP-0368
shayasoolin 5f03adc
docs: update preferred topology GREP design
shayasoolin feb2afa
docs: generalize preferred topology examples
shayasoolin 9ceff8f
docs: restore preferred topology GREP API design (field as part of PC…
shayasoolin b0bfe63
docs: refine preferred topology GREP validation guidance
shayasoolin 633b4aa
docs: clarify preferred topology API field names
shayasoolin 8baa198
docs: align preferred topology GREP with topology constraint API
shayasoolin File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
201 changes: 201 additions & 0 deletions
201
docs/proposals/0368-preferred-topology-constraint/README.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,201 @@ | ||
| # GREP-0368: Preferred Topology Constraint API | ||
|
|
||
| <!-- toc --> | ||
| - [Summary](#summary) | ||
| - [Motivation](#motivation) | ||
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [Proposal](#proposal) | ||
| - [User Stories](#user-stories) | ||
| - [Story 1: Preferred host-level locality for a workload](#story-1-preferred-host-level-locality-for-a-workload) | ||
| - [Story 2: Automated preferred topology for externally managed workloads](#story-2-automated-preferred-topology-for-externally-managed-workloads) | ||
| - [Limitations/Risks & Mitigations](#limitationsrisks--mitigations) | ||
| - [Design Details](#design-details) | ||
| - [API Change](#api-change) | ||
| - [PodGang Propagation](#podgang-propagation) | ||
| - [Webhook Validation](#webhook-validation) | ||
| - [Monitoring](#monitoring) | ||
| - [Test Plan](#test-plan) | ||
| - [Graduation Criteria](#graduation-criteria) | ||
| - [Alpha](#alpha) | ||
| - [Beta](#beta) | ||
| - [GA](#ga) | ||
| - [Alternatives](#alternatives) | ||
| - [Always-on preferred topology at the lowest level](#always-on-preferred-topology-at-the-lowest-level) | ||
| - [Boolean opt-in on <code>PodCliqueSet</code>](#boolean-opt-in-on-podcliqueset) | ||
| <!-- /toc --> | ||
|
|
||
| ## Summary | ||
|
|
||
| Grove currently supports topology-aware scheduling through `packDomain`, a required constraint that leaves workloads pending if the requested topology cannot be satisfied. This GREP makes the API more explicit by renaming that field to `requiredPackDomain` and adding `preferredPackDomain`, a best-effort constraint that requests topology-aware placement without making it a hard requirement. If the preferred constraint cannot be met, the workload still schedules. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Grove is increasingly used as an orchestration layer by higher-level tools that manage workload submission on behalf of users. These tools often want to apply topology preferences automatically, without exposing scheduling decisions to end users, but cannot do so today because the only available topology constraint is required (`packDomain`). | ||
|
|
||
| Additionally, preferred topology scheduling is a compute-intensive operation for the scheduler. It therefore cannot be enabled by default for all workloads — it must be an explicit opt-in. Exposing a preferred constraint API gives tools and users a way to request best-effort topology placement, decoupled from the all-or-nothing semantics of the current required constraint. Renaming the required field from `packDomain` to `requiredPackDomain` makes the API clearer when required and preferred packing can be specified side by side. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Introduce a preferred (best-effort) topology constraint API on `PodClique`, `PodCliqueScalingGroup`, and `PodCliqueSet` | ||
| - Rename the required topology constraint field from `packDomain` to `requiredPackDomain` while preserving existing workloads that already use `packDomain` | ||
| - Ensure workloads with a preferred constraint are never blocked — if topology cannot be satisfied, scheduling proceeds without it | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - Automatically applying preferred topology to all workloads (it remains an explicit opt-in) | ||
| - Changing the semantics of the existing required constraint, which remains a hard scheduling requirement under the new `requiredPackDomain` name | ||
| - Guaranteeing topology-optimal placement — preferred is best-effort only | ||
|
|
||
| ## Proposal | ||
|
|
||
| This GREP proposes adding `requiredPackDomain` and `preferredPackDomain` fields to `TopologyConstraint`, which is used by `PodClique`, `PodCliqueScalingGroup`, and `PodCliqueSet`. `requiredPackDomain` is the explicit replacement for the existing `packDomain` field and keeps the same hard scheduling semantics. `preferredPackDomain` signals to the scheduler that topology-aware placement is desired at the specified domain level on a best-effort basis — the workload is never blocked if the preferred constraint cannot be satisfied. | ||
|
|
||
| `preferredPackDomain` and `requiredPackDomain` can coexist on the same resource. For example, a user may set `preferredPackDomain=host` and `requiredPackDomain=rack` to express that host-level packing is desired but not required, while rack-level packing is mandatory. When both are set, the scheduler attempts preferred placement first and falls back toward the required level if the preferred constraint cannot be satisfied. It is recommended that `preferredPackDomain` be stricter than or equal to `requiredPackDomain`. Setting a coarser preferred level than the required one is semantically redundant, so admission allows it but returns a warning to give users feedback without blocking the workload. | ||
|
|
||
| The existing `packDomain` field is deprecated. It remains in the API only for compatibility with existing workloads that were created before this GREP is implemented. The operator continues to honor `packDomain` for those workloads as the required packing domain. New workload creation that uses `packDomain` is rejected; new workloads must use `requiredPackDomain`. Updates to existing workloads that still use `packDomain` are allowed when they do not otherwise violate immutability rules, but admission returns a warning directing users to `requiredPackDomain` for future workloads. | ||
|
|
||
| ### User Stories | ||
|
|
||
| #### Story 1: Preferred host-level locality for a workload | ||
|
|
||
| As an ML engineer submitting a workload with Grove, I want to set `preferredPackDomain: host` to signal that my pods benefit from being placed close together, so that Grove and the scheduler attempt host-level locality without blocking the workload when that placement is unavailable. | ||
|
|
||
| #### Story 2: Automated preferred topology for externally managed workloads | ||
|
|
||
| As a platform operator managing workloads over Grove, I want to automatically apply a preferred topology constraint to submitted workloads — without requiring end users to configure scheduling parameters — so that topology-aware placement is attempted as a best-effort optimization while ensuring no workload is ever blocked due to an unsatisfiable constraint. | ||
|
|
||
| ### Limitations/Risks & Mitigations | ||
|
|
||
| - **Scheduler cost:** Preferred topology placement can be compute-intensive for scheduler backends. *Mitigation: preferred placement is explicit opt-in through `preferredPackDomain`; Grove does not apply it by default.* | ||
| - **API transition complexity:** Keeping deprecated `packDomain` for existing workloads while rejecting it for new workloads adds validation and controller complexity. *Mitigation: controller code resolves one effective required domain, and admission tests cover create, update, warning, and ambiguity cases.* | ||
| - **Admission warnings may be missed:** Some clients may not surface warnings for deprecated `packDomain` or redundant preferred domains. *Mitigation: warnings are advisory only; invalid or ambiguous configurations are still rejected.* | ||
|
|
||
| ## Design Details | ||
|
|
||
| ### API Change | ||
|
|
||
| The existing `TopologyConstraint` struct in the operator API (`operator/api/core/v1alpha1/podcliqueset.go`) is updated. The struct is already shared across `PodCliqueSet`, `PodCliqueTemplateSpec`, and `PodCliqueScalingGroupConfig`: | ||
|
|
||
| ```go | ||
| // TopologyConstraint defines topology placement requirements. | ||
| type TopologyConstraint struct { | ||
| // TopologyName is the name of the ClusterTopology resource to use for topology-aware scheduling. | ||
| // If topologyConstraint is set, topologyName must be specified. | ||
| // Immutable after creation. | ||
| // +required | ||
| TopologyName string `json:"topologyName"` | ||
|
|
||
| // RequiredPackDomain specifies the required topology domain for grouping replicas. | ||
| // Controls placement constraint for EACH individual replica instance. | ||
| // The workload will not be scheduled if this constraint cannot be satisfied. | ||
| // Must reference a domain in the topology levels defined in the ClusterTopology | ||
| // CR named by TopologyName. | ||
| // Example: "rack" means each replica is independently placed within one rack. | ||
| // Note: Does NOT constrain all replicas to the same rack together. | ||
| // Different replicas can be in different topology domains. | ||
| // +optional | ||
| RequiredPackDomain *TopologyDomain `json:"requiredPackDomain,omitempty"` | ||
|
|
||
| // PreferredPackDomain specifies a preferred (best-effort) topology domain. | ||
| // If the constraint cannot be satisfied, the workload is scheduled anyway. | ||
| // When set alongside RequiredPackDomain, it is recommended that | ||
| // PreferredPackDomain be stricter than or equal to RequiredPackDomain. | ||
| // +optional | ||
| PreferredPackDomain *TopologyDomain `json:"preferredPackDomain,omitempty"` | ||
|
|
||
| // PackDomain specifies the required topology domain using the legacy field name. | ||
| // Deprecated: use RequiredPackDomain. | ||
| // This field is honored for existing workloads, rejected on new workload | ||
| // creation, and warned on update when still in use. | ||
| // +optional | ||
| PackDomain *TopologyDomain `json:"packDomain,omitempty"` | ||
| } | ||
| ``` | ||
|
|
||
| `TopologyName` remains part of `TopologyConstraint` and continues to select the `ClusterTopology` used for domain-to-key resolution. `RequiredPackDomain` preserves the required placement behavior of the legacy `PackDomain` field, but is optional so a workload can use `PreferredPackDomain` without a hard topology requirement. `PreferredPackDomain` is optional and can be used alone for preferred-only topology placement. Admission requires a `TopologyConstraint` to include `topologyName` and at least one packing domain field. | ||
|
|
||
| The legacy `PackDomain` field changes from a required value to an optional pointer and remains in the API only for compatibility with existing workloads. New workloads must use `RequiredPackDomain` for hard placement requirements; `PackDomain` is honored for existing workloads, rejected on new workload creation, and warned on update when still in use. Controller code must resolve the effective required domain from `RequiredPackDomain` first, then fall back to deprecated `PackDomain` for existing workloads. | ||
|
|
||
| ### PodGang Propagation | ||
|
|
||
| The scheduler API (`scheduler/api/core/v1alpha1/podgang.go`) already supports both constraints via `TopologyPackConstraint.Required` and `TopologyPackConstraint.Preferred`. No changes are needed on the scheduler side. | ||
|
|
||
| The required implementation change is in `createTopologyPackConstraint` (`operator/internal/controller/podcliqueset/components/podgang/syncflow.go`), which translates Grove's `TopologyConstraint` into the scheduler's `TopologyPackConstraint`. It currently populates `Required` from `packDomain`. It must instead resolve the effective required domain from `requiredPackDomain`, falling back to deprecated `packDomain` for existing workloads, and populate `Required` with the resolved topology key. It must also resolve `preferredPackDomain` to its topology key and populate `Preferred`. | ||
|
|
||
| The domain-to-key translation follows the same existing path: looking up the domain name in the `ClusterTopology` levels to obtain the corresponding node label key. Admission validation rejects invalid domains on create and update. If a previously valid required or preferred domain is no longer found during reconciliation because the referenced `ClusterTopology` changed after admission, the missing scheduler constraint is silently skipped without blocking workload creation (same behavior as the existing required constraint). | ||
|
|
||
| ### Webhook Validation | ||
|
|
||
| The admission webhook (`operator/internal/webhook/admission/pcs/validation/topologyconstraints.go`) must be extended to validate the new field names and the deprecated `packDomain` compatibility path: | ||
|
|
||
| - **Domain existence:** Creation or update with a `requiredPackDomain` or `preferredPackDomain` value that does not exist in the referenced `ClusterTopology` CR is rejected at admission time, same as `packDomain` is today. Accepted legacy `packDomain` values on existing workloads continue to be validated against the same domain list on update. | ||
| - **Deprecated field on create:** New workload creation that sets `packDomain` is rejected. New workloads must use `requiredPackDomain`. | ||
| - **Deprecated field on update:** Existing workloads that already use `packDomain` remain valid and continue to use it as the required packing domain, but admission returns a warning on update while the deprecated field remains in use. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we need to support an update of legacy PCS with packDomain to requiredPackDomain with same value |
||
| - **Required domain ambiguity:** A single `TopologyConstraint` must not set both `requiredPackDomain` and deprecated `packDomain`. | ||
| - **Hierarchy:** `requiredPackDomain` follows the same hierarchy rules as the existing required `packDomain`. `preferredPackDomain` on a child resource must be stricter than or equal to the parent's `preferredPackDomain`. | ||
| - **Same-resource preferred vs. required domain:** When `preferredPackDomain` is coarser than the effective required domain on the same resource, the request is allowed but admission returns a warning because the preferred domain is semantically redundant. | ||
| - **Update immutability:** `requiredPackDomain`, `preferredPackDomain`, and deprecated `packDomain` cannot be changed on update, matching the existing immutability behavior for `packDomain`. | ||
|
|
||
| ### Monitoring | ||
|
|
||
| There is no scheduler-side feedback on whether preferred placement was achieved. This is consistent with the existing behavior for required topology placement, which also has no placement-outcome indicator. | ||
|
|
||
| The existing `TopologyLevelUnavailable` status condition on `PodCliqueSet`, which checks whether referenced topology domains exist in the `ClusterTopology` CR, will be extended to cover `requiredPackDomain` and `preferredPackDomain` domains. For existing workloads that still use deprecated `packDomain`, the condition continues to treat `packDomain` as the effective required domain. | ||
|
|
||
| ### Test Plan | ||
|
|
||
| **Unit — webhook validation:** | ||
| - `requiredPackDomain` and `preferredPackDomain` domain names not present in `ClusterTopology` are rejected | ||
| - Creation with deprecated `packDomain` is rejected | ||
| - Updates to existing workloads that still use deprecated `packDomain` are allowed with an admission warning | ||
| - Updates that change `requiredPackDomain`, `preferredPackDomain`, or deprecated `packDomain` are rejected | ||
| - A single `TopologyConstraint` that sets both `requiredPackDomain` and deprecated `packDomain` is rejected | ||
| - `preferredPackDomain` stricter than parent is accepted; coarser than the parent's `preferredPackDomain` is rejected | ||
| - `preferredPackDomain` coarser than the effective required domain on the same resource is allowed with an admission warning | ||
| - `preferredPackDomain` can be set without `requiredPackDomain`, and vice versa | ||
|
|
||
| **Unit — syncflow (`TestComputeExpectedPodGangsWithTopologyConstraints`):** | ||
| - `preferredPackDomain` alone → `TopologyPackConstraint.Preferred` set, `Required` nil | ||
| - `requiredPackDomain` alone → `TopologyPackConstraint.Required` set, `Preferred` nil | ||
| - Deprecated `packDomain` on an existing workload → `TopologyPackConstraint.Required` set from `packDomain` | ||
| - `requiredPackDomain` and `preferredPackDomain` set together → both `Required` and `Preferred` populated correctly | ||
| - Previously valid `requiredPackDomain` or `preferredPackDomain` domain missing from `ClusterTopology` during reconciliation after a topology change → the missing scheduler constraint is silently skipped, workload still created | ||
|
|
||
| **E2E:** | ||
| - Preferred constraint that can be satisfied → workload schedules successfully and scheduler receives the preferred constraint (validates the full propagation path including the scheduler) | ||
| - Preferred constraint that cannot be satisfied → workload schedules successfully (key behavioral difference from required) | ||
| - Existing workload created with deprecated `packDomain` before upgrade continues to reconcile and propagate the required scheduler constraint after upgrade | ||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| #### Alpha | ||
| - `requiredPackDomain` and `preferredPackDomain` fields available on all three resources (`PodCliqueSet`, `PodCliqueScalingGroupConfig`, `PodCliqueTemplateSpec`) | ||
| - Deprecated `packDomain` remains honored for existing workloads and rejected for new workloads | ||
| - Unit and e2e tests passing | ||
|
|
||
| #### Beta | ||
| - Validated in at least one production workload | ||
| - No breaking API changes since alpha | ||
|
|
||
| #### GA | ||
| - Stable API | ||
| - No open issues related to the feature | ||
|
|
||
| ## Alternatives | ||
|
|
||
| Two coarser-grained alternatives were considered and ruled out. Grove, as a generic infrastructure layer, should remain neutral — exposing fine-grained, explicit controls and leaving opinionated defaults to higher-level tools. | ||
|
|
||
| ### Always-on preferred topology at the lowest level | ||
|
|
||
| The scheduler could automatically apply a preferred constraint at the finest available topology level for every workload, with no user opt-in. | ||
|
|
||
| - **Pro:** Zero configuration required. | ||
| - **Con:** Preferred topology scheduling is compute-intensive and unsuitable as a blanket default. Removes user control entirely, and is incompatible with workloads that have no topology preference. | ||
|
|
||
| ### Boolean opt-in on `PodCliqueSet` | ||
|
|
||
| A single boolean field on `PodCliqueSet` (e.g. `preferredTopologyEnabled`) would enable preferred placement at the lowest topology level for all cliques, without allowing per-resource or per-level control. | ||
|
|
||
| - **Pro:** Simpler API surface. | ||
| - **Con:** Provides no control over which topology level is preferred, and cannot express different preferences across `PodCliqueSet`, `PodCliqueScalingGroup`, and `PodClique`. A literal boolean is also technically insufficient because the workload must still select a `ClusterTopology` through `topologyName`; this alternative would need a topology reference plus the boolean opt-in. Higher-level tools can trivially implement this coarser interface on top of `preferredPackDomain`, but the reverse is not true. | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be an additional user story for the case where the user really cares about some piece of the workload being scheduled at the lowest topology level even though preferred is not the lowest and decides to provide a packDomain that is lower than the preferred level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's such a user story, the proposed API should be changed pretty much - from ClusterTopology-driven to PCS-driven. Or both.