diff --git a/docs/api-reference/operator-api.md b/docs/api-reference/operator-api.md index 38cf36e6c..b816abc73 100644 --- a/docs/api-reference/operator-api.md +++ b/docs/api-reference/operator-api.md @@ -824,8 +824,8 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `topologyName` _string_ | TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling.
If topologyConstraint is set, topologyName and packDomain must both be specified.
Immutable after creation. | | | -| `packDomain` _[TopologyDomain](#topologydomain)_ | PackDomain specifies the topology domain for grouping replicas.
Controls placement constraint for EACH individual replica instance.
Must reference a domain in the topology levels defined in the ClusterTopologyBinding CR name as set in TopologyName
Example: "rack" means each replica independently placed within one rack.
Note: Does NOT constrain all replicas to the same rack together.
Different replicas can be in different topology domains. | | MaxLength: 63
MinLength: 1
Pattern: `^[a-z][a-z0-9-]*$`
| +| `topologyName` _string_ | TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling.
Setting TopologyName may be optional if the name can be inherited from a higher level scope.
When TopologyName is specified at a PCS/PCSG/PCLQ resource constraint, it will also be inherited
as the default ClusterTopologyBinding name on all sub-resources, unless overridden by another TopologyName
at a sub-resource.
For example, setting TopologyName at a PCS level makes it optional for child PCSG or PCLQ levels
when the sub-resources reuse the same ClusterTopologyBinding.
Immutable after creation. | | | +| `packDomain` _[TopologyDomain](#topologydomain)_ | PackDomain specifies the topology domain for grouping replicas.
Controls placement constraint for EACH individual replica instance.
Must reference a domain in the topology levels defined in the ClusterTopologyBinding named by TopologyName.
Example: "rack" means each replica independently placed within one rack.
Note: Does NOT constrain all replicas to the same rack together.
Different replicas can be in different topology domains. | | MaxLength: 63
MinLength: 1
Pattern: `^[a-z][a-z0-9-]*$`
| #### TopologyDomain diff --git a/docs/proposals/244-topology-aware-scheduling/README.md b/docs/proposals/244-topology-aware-scheduling/README.md index 421a11c88..153383914 100644 --- a/docs/proposals/244-topology-aware-scheduling/README.md +++ b/docs/proposals/244-topology-aware-scheduling/README.md @@ -135,15 +135,15 @@ spec: ``` **User Layer:** -Workload developers can specify topology constraints at the `PodCliqueSet`, `PodCliqueScalingGroup`, and `PodClique` levels using domain names. The detailed topology reference model, including the explicit `TopologyConstraint` semantics and the current Phase 1 restriction to one effective topology per workload, is defined in [Topology Reference](#topology-reference). +Workload developers can specify topology constraints at the `PodCliqueSet`, `PodCliqueScalingGroup`, and `PodClique` levels using domain names. The detailed topology reference model, including inherited `topologyName` semantics and the current Phase 1 restriction to one effective topology per workload, is defined in [Topology Reference](#topology-reference). The operator validates these constraints against the referenced ClusterTopologyBinding using three key validation rules: 1. *Domain existence*: All topology domains referenced in workload's topology constraints must exist in the ClusterTopologyBinding CR. This ensures workloads only reference valid, configured topology levels. 2. *Topology Constraint Hierarchy*: Topology levels are ordered by their position in the ClusterTopologyBinding's levels array (index 0 = broadest scope). When topology constraints are hierarchically applied to a workload from PodCliqueSet → PodCliqueScalingGroup → PodClique, each level's constraints must reference a domain that is equal to or narrower (higher index) than the parent level's domain. A child resource cannot specify a broader topology domain than its parent. For example, if the referenced ClusterTopologyBinding defines levels `[zone, block, host]` and the PodCliqueSet specifies `block`, then PodCliqueScalingGroup can specify `block` (equal) or `host` (narrower), but not `zone` (broader). -3. *Topology reference*: when a `TopologyConstraint` is used, both `topologyName` and `packDomain` must be set together, and `topologyName` must reference an existing `ClusterTopologyBinding`. The field is immutable after creation. +3. *Topology reference*: `packDomain` is required whenever a `TopologyConstraint` is used. `topologyName` is required at the `PodCliqueSet` level and may be inherited by child constraints from the nearest constrained ancestor. The effective topology must reference an existing `ClusterTopologyBinding` and cannot change after creation. -After validation, the operator translates the topology domain names (e.g., "rack", "host") into cluster-specific topology keys (e.g., "topology.kubernetes.io/zone", "kubernetes.io/hostname") using the referenced ClusterTopologyBinding and configures these hierarchical topology keys in the `PodGang` API. The `PodGang` serves as an intermediate representation that will eventually be mapped to the specific types that the configured scheduler backend understands. This abstraction allows workload portability across clusters with different topology configurations and scheduler implementations. +After validation, the operator translates the topology domain names (e.g., "rack", "host") into cluster-specific topology keys (e.g., "topology.kubernetes.io/zone", "kubernetes.io/hostname") using the effective referenced `ClusterTopologyBinding` and configures these hierarchical topology keys in the `PodGang` API. The `PodGang` serves as an intermediate representation that will eventually be mapped to the specific types that the configured scheduler backend understands. This abstraction allows workload portability across clusters with different topology configurations and scheduler implementations. **Workload portability across clusters** @@ -683,8 +683,12 @@ The shared API type is: type TopologyConstraint struct { // TopologyName is the name of the ClusterTopologyBinding resource // to use for topology-aware scheduling. - // +required - TopologyName string `json:"topologyName"` + // Required on PodCliqueSet constraints. + // Optional on PodCliqueScalingGroup and PodClique constraints, where it may + // be inherited from the nearest ancestor that also defines TopologyConstraint. + // In Phase 1, all effective topology names within one PodCliqueSet must match. + // +optional + TopologyName string `json:"topologyName,omitempty"` // PackDomain specifies the topology domain for grouping replicas. // Must reference a domain defined in the ClusterTopologyBinding's levels. // Controls placement constraint for EACH individual replica instance. @@ -732,6 +736,7 @@ type PodCliqueScalingGroupConfig struct { * Equal to or narrower (higher index) than the constraint defined at the `PodCliqueSet` level. * Equal to or broader (lower index) than the constraints defined for each constituent `PodClique`. +* Either set `topologyName` explicitly or inherit it from the `PodCliqueSet` constraint. At `PodClique` level, the constraint is attached using the same `TopologyConstraint` type: @@ -747,7 +752,7 @@ type PodCliqueTemplateSpec struct { } ``` -`TopologyConstraint` defined at the `PodClique` level should be equal to or narrower (higher index) than the constraints defined for the parent resources (`PodCliqueScalingGroup` and `PodCliqueSet`). +`TopologyConstraint` defined at the `PodClique` level should be equal to or narrower (higher index) than the constraints defined for the parent resources (`PodCliqueScalingGroup` and `PodCliqueSet`). It may omit `topologyName` when the effective topology can be inherited from the nearest constrained ancestor. Example PodCliqueSet with topology constraints (For brevity many parts of the PodCliqueSet spec has been omitted): @@ -765,7 +770,6 @@ spec: cliques: - name: router topologyConstraint: - topologyName: h100-topology packDomain: "block" spec: roleName: router @@ -774,7 +778,6 @@ spec: ... - name: p-leader topologyConstraint: - topologyName: h100-topology packDomain: "host" spec: roleName: prefill-leader @@ -783,7 +786,6 @@ spec: ... - name: p-worker topologyConstraint: - topologyName: h100-topology packDomain: "host" spec: roleName: prefill-worker @@ -792,7 +794,6 @@ spec: ... - name: d-leader topologyConstraint: - topologyName: h100-topology packDomain: "host" spec: roleName: decode-leader @@ -801,7 +802,6 @@ spec: ... - name: d-worker topologyConstraint: - topologyName: h100-topology packDomain: "host" spec: roleName: decode-worker @@ -811,7 +811,6 @@ spec: podCliqueScalingGroups: - name: prefill topologyConstraint: - topologyName: h100-topology packDomain: "block" replicas: 2 minAvailable: 1 @@ -820,7 +819,6 @@ spec: - p-leader - name: decode topologyConstraint: - topologyName: h100-topology packDomain: "host" replicas: 2 minAvailable: 1 @@ -833,7 +831,14 @@ The above example is only a representation of how users can set topology constra #### Topology Reference -The API uses the shared `TopologyConstraint` type at the `PodCliqueSet`, `PodCliqueScalingGroup`, and `PodClique` levels. A topology reference is always explicit: if a resource sets `TopologyConstraint`, it must specify both `topologyName` and `packDomain`. `PodCliqueScalingGroup` and `PodClique` constraints do not inherit `topologyName` from the `PodCliqueSet` or from any ancestor. In Phase 1, all explicit `topologyName` values within one `PodCliqueSet` must still be identical. +The API uses the shared `TopologyConstraint` type at the `PodCliqueSet`, `PodCliqueScalingGroup`, and `PodClique` levels. `packDomain` is always explicit. `topologyName` is resolved using the nearest constrained ancestor: + +* A `PodCliqueSet` `TopologyConstraint` must specify both `topologyName` and `packDomain`. +* A `PodCliqueScalingGroup` `TopologyConstraint` must specify `packDomain`; it may omit `topologyName` only if the `PodCliqueSet` has a `TopologyConstraint` from which it can inherit. +* A `PodClique` `TopologyConstraint` must specify `packDomain`; it may omit `topologyName` only if its nearest constrained ancestor (`PodCliqueScalingGroup` first, otherwise `PodCliqueSet`) provides one. +* A `TopologyConstraint` with `topologyName` but no `packDomain` is invalid. The field is not used as a pure propagation mechanism. + +In Phase 1, all effective topology names within one `PodCliqueSet` must still be identical. A child may restate the same `topologyName` explicitly, but it may not override the effective topology established for that workload. **Example YAML:** @@ -844,10 +849,12 @@ metadata: name: my-inference spec: template: + topologyConstraint: + topologyName: gb200-topology + packDomain: rack cliques: - name: worker topologyConstraint: - topologyName: gb200-topology # must match an existing ClusterTopologyBinding name packDomain: host ``` @@ -857,7 +864,7 @@ Existing validating webhook which validates `PodCliqueSet`, has been enhanced to *Rule-1: Check for supported TopologyDomains* -* All topology domains that are referenced in the `PodCliqueSet` must be amongst the defined topology levels in the ClusterTopologyBinding referenced by `topologyName`. If a non-supported topology domain is found then creation or update of the `PodCliqueSet` will be rejected. +* All topology domains that are referenced in the `PodCliqueSet` must be amongst the defined topology levels in the `ClusterTopologyBinding` referenced by the effective `topologyName`. If a non-supported topology domain is found then creation or update of the `PodCliqueSet` will be rejected. *Rule-2: Check for Hierarchical strictness* @@ -869,17 +876,20 @@ Example (assuming levels `[zone, block, rack, host, numa]`): a parent with `rack *Rule-3: Topology reference validation* -* A `TopologyConstraint` is valid only when it sets both `topologyName` and `packDomain`. It is invalid to set one without the other. -* In the current implementation, all `TopologyConstraint.topologyName` values within a single `PodCliqueSet` must match. -* Each `TopologyConstraint.topologyName` must reference an existing `ClusterTopologyBinding`. -* `topologyName` is immutable after creation. Updates that change the value are rejected. +* `packDomain` is required whenever a `TopologyConstraint` is set. +* A PCS-level `TopologyConstraint` must set `topologyName`. +* A PCSG- or PCLQ-level `TopologyConstraint` may omit `topologyName` only when the nearest constrained ancestor provides one. +* A `TopologyConstraint` with `topologyName` but no `packDomain` is invalid. +* In the current implementation, all effective topology names within a single `PodCliqueSet` must match. +* The effective `topologyName` must reference an existing `ClusterTopologyBinding`. +* The effective topology selection is immutable after creation. Updates that would change the resolved topology are rejected. * Reject if `topologyName` or any `TopologyConstraint` is set but TAS is disabled cluster-wide. -Rules 1 and 2 apply to `TopologyConstraint` fields. Rule-3 validates the explicit topology reference carried by each constraint. Together, these three rules ensure that workloads can only reference valid topology levels, maintain logical topology nesting throughout the resource hierarchy, and target existing `ClusterTopologyBinding` resources without relying on inherited topology names. +Rules 1 and 2 apply to `TopologyConstraint` fields. Rule-3 validates how the effective topology reference is established after inheritance is applied. Together, these three rules ensure that workloads can only reference valid topology levels, maintain logical topology nesting throughout the resource hierarchy, and target an existing `ClusterTopologyBinding` without allowing per-child topology overrides in Phase 1. ### PodGang: Scheduler API Enhancements -Grove operator translates the hierarchical topology constraints to infrastructure specific node labels in the `PodGang` scheduler API. For each explicit `TopologyConstraint`, the operator resolves the referenced `ClusterTopologyBinding` using that constraint's own `topologyName` and translates its `packDomain` to the corresponding node-label key. If no `TopologyConstraint` is present at a given level, no topology is applied at that level. +Grove operator translates the hierarchical topology constraints to infrastructure specific node labels in the `PodGang` scheduler API. For each `TopologyConstraint`, the operator first resolves the effective `topologyName` using the inheritance rules above, then translates that constraint's `packDomain` to the corresponding node-label key from the referenced `ClusterTopologyBinding`. If no `TopologyConstraint` is present at a given level, no topology is applied at that level. Phase 1 still resolves a single effective topology for the `PodCliqueSet` as a whole, so the existing `PodGang` topology identity model does not change. The following additional types have been defined to capture the topology constraints: @@ -966,16 +976,16 @@ The addition of multiple ClusterTopologyBinding resources and the `topologyName` #### Existing PodCliqueSets with topology constraints but no `topologyName` -Under the previous single-topology design, PodCliqueSets could specify `topologyConstraint` without a `topologyName` — the operator implicitly resolved constraints against the single `grove-topology` ClusterTopologyBinding. After upgrading to the multi-topology design, these resources are structurally invalid because every `TopologyConstraint` must now specify both `topologyName` and `packDomain`. These objects may already exist in the cluster — the validating webhook only runs on create and update, not on existing resources. +Under the previous single-topology design, PodCliqueSets could specify `topologyConstraint` without a `topologyName` — the operator implicitly resolved constraints against the single `grove-topology` ClusterTopologyBinding. After upgrading to the multi-topology design, these resources may be missing the explicit topology reference needed to establish an effective topology. These objects may already exist in the cluster — the validating webhook only runs on create and update, not on existing resources. This upgrade case is specifically about constraints that have `packDomain` but not `topologyName`. The inverse shape (`topologyName` without `packDomain`) is not expected from older versions, because `packDomain` was the only topology field exposed before this change. The PCS reconciler must handle these gracefully during upgrade: -* **Detection**: On each reconciliation, if the PCS contains any `TopologyConstraint` that does not fully specify both `topologyName` and `packDomain`, the reconciler treats this as an invalid state. -* **Condition**: The reconciler sets the `TopologyLevelsUnavailable` condition to `Unknown` with reason `TopologyNameMissing` and a message indicating that topology constraints must specify both fields explicitly. +* **Detection**: On each reconciliation, if the PCS contains any `TopologyConstraint` whose effective `topologyName` cannot be resolved from an explicit value or inheritance, the reconciler treats this as an invalid state. +* **Condition**: The reconciler sets the `TopologyLevelsUnavailable` condition to `Unknown` with reason `TopologyNameMissing` and a message indicating that a constrained resource is missing an effective `topologyName`. * **PodGang topology removal**: The reconciler removes topology constraints from the PodGang resources created for the PCS, since incomplete topology references cannot be resolved to node label keys. -* **Resolution**: The administrator must update each affected `TopologyConstraint` by adding the missing `topologyName` while keeping the existing `packDomain`. The validating webhook will then validate the update normally (domain existence, hierarchy rules, and topology reference existence). Once the fields are valid and the referenced ClusterTopologyBinding exists with the required domains, the reconciler clears the condition and restores topology constraints on the PodGang. +* **Resolution**: The administrator must update the highest constrained level that lacks a resolvable topology reference. In many upgraded objects, adding `topologyName` to the PCS-level constraint is sufficient because descendants can then inherit it. If a constrained child has no constrained ancestor, that child must carry `topologyName` explicitly. The validating webhook will then validate the update normally (domain existence, hierarchy rules, and topology reference existence). Once the effective topology is valid and the referenced `ClusterTopologyBinding` exists with the required domains, the reconciler clears the condition and restores topology constraints on the PodGang. This ensures that existing workloads are not silently broken — the condition makes the issue visible and the reconciler degrades gracefully by stripping unresolvable topology constraints rather than failing reconciliation entirely. @@ -1040,7 +1050,7 @@ Condition States: | --------- | ----------------------------------- | ------------------------------------------------------------ | | `True` | `ClusterTopologyNotFound` | When `ClusterTopologyBinding` CR is no longer existing — levels are definitively unavailable | | `Unknown` | `TopologyAwareSchedulingDisabled` | When TAS has been disabled cluster-wide while the PCS still has topology constraints | -| `Unknown` | `TopologyNameMissing` | When a topology constraint is incomplete and does not explicitly specify both `topologyName` and `packDomain` | +| `Unknown` | `TopologyNameMissing` | When a topology constraint has `packDomain` but no effective `topologyName` can be resolved from an explicit value or inherited ancestor | | `True` | `ClusterTopologyLevelsUnavailable` | When one or more topology levels used by a deployed `PodCliqueSet` are no longer present in `ClusterTopologyBinding` (e.g., the ClusterTopologyBinding levels were updated and no longer include the domains used by this PCS). The reconciler removes the invalid topology constraints from the PodGang resources so the scheduler no longer enforces them. Already-running pods are not evicted — they continue running at their current placement. | | `False` | `AllClusterTopologyLevelsAvailable` | All topology levels used by a deployed `PodCliqueSet` are amongst the supported topology levels as defined in `ClusterTopologyBinding` | @@ -1067,7 +1077,7 @@ This was the original topology-aware scheduling model prior to the changes in th **Phase 1: Multiple ClusterTopologyBinding resources, single effective topology per PodCliqueSet** -Phase 1 adds support for multiple admin-created `ClusterTopologyBinding` resources and uses the explicit `TopologyConstraint` model described in [Topology Reference](#topology-reference). Runtime behavior still requires all `topologyName` values within a single `PodCliqueSet` to match. This keeps the implementation compatible with the current `PodGang` API and with current scheduler backend capabilities, both of which assume a single resolved topology for the entire `PodCliqueSet`. +Phase 1 adds support for multiple admin-created `ClusterTopologyBinding` resources and uses the inherited `TopologyConstraint` model described in [Topology Reference](#topology-reference). Runtime behavior still requires all effective topology names within a single `PodCliqueSet` to match. Allowing child constraints to omit `topologyName` is therefore only a manifest simplification; it does not change the fact that Phase 1 still resolves one topology context for the workload. This keeps the implementation compatible with the current `PodGang` API and with current scheduler backend capabilities, both of which assume a single resolved topology for the entire `PodCliqueSet`. **Phase 2: Different topologies for different parts of a workload** @@ -1092,9 +1102,9 @@ Until those changes exist end-to-end, the implementation remains intentionally i **Unit tests** for multi-topology: * ClusterTopologyBinding validating webhook: domain uniqueness, key uniqueness on create and update -* PCS validating webhook: `topologyName` existence check on create, immutability on update, required when `TopologyConstraint` is set, rejection when TAS is disabled -* PCS reconciler: `TopologyLevelsUnavailable` condition logic — set when ClusterTopologyBinding is missing, topologyName is missing, or domains are unavailable, cleared when all domains are available -* PCS reconciler: topology resolution logic that resolves the PCS-level `topologyName` to the correct ClusterTopologyBinding when building the PodGang +* PCS validating webhook: inherited `topologyName` resolution on create, immutability of the effective topology on update, rejection when TAS is disabled +* PCS reconciler: `TopologyLevelsUnavailable` condition logic — set when `ClusterTopologyBinding` is missing, effective topology name is missing, or domains are unavailable, cleared when all domains are available +* PCS reconciler: topology resolution logic that resolves the effective `topologyName` to the correct `ClusterTopologyBinding` when building the `PodGang` **E2E tests** are defined in [Issue#305](https://github.com/ai-dynamo/grove/issues/305). diff --git a/operator/api/common/constants/constants.go b/operator/api/common/constants/constants.go index 83fe02032..612219099 100644 --- a/operator/api/common/constants/constants.go +++ b/operator/api/common/constants/constants.go @@ -103,7 +103,7 @@ const ( ConditionReasonTopologyNotFound = "TopologyNotFound" // ConditionReasonTopologyNameMissing is the reason when a PodCliqueSet has incomplete - // topology constraints or otherwise cannot resolve an explicit topology reference. + // topology constraints or otherwise cannot resolve one effective topology reference. ConditionReasonTopologyNameMissing = "TopologyNameMissing" // ConditionReasonTopologyAwareSchedulingDisabled is the reason when a PodCliqueSet has topology diff --git a/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml b/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml index 95a457eb4..8ff6c3bdb 100644 --- a/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml +++ b/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml @@ -9317,7 +9317,7 @@ spec: description: |- PackDomain specifies the topology domain for grouping replicas. Controls placement constraint for EACH individual replica instance. - Must reference a domain in the topology levels defined in the ClusterTopologyBinding CR name as set in TopologyName + Must reference a domain in the topology levels defined in the ClusterTopologyBinding named by TopologyName. Example: "rack" means each replica independently placed within one rack. Note: Does NOT constrain all replicas to the same rack together. Different replicas can be in different topology domains. @@ -9328,12 +9328,16 @@ spec: topologyName: description: |- TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling. - If topologyConstraint is set, topologyName and packDomain must both be specified. + Setting TopologyName may be optional if the name can be inherited from a higher level scope. + When TopologyName is specified at a PCS/PCSG/PCLQ resource constraint, it will also be inherited + as the default ClusterTopologyBinding name on all sub-resources, unless overridden by another TopologyName + at a sub-resource. + For example, setting TopologyName at a PCS level makes it optional for child PCSG or PCLQ levels + when the sub-resources reuse the same ClusterTopologyBinding. Immutable after creation. type: string required: - packDomain - - topologyName type: object required: - name @@ -9975,7 +9979,7 @@ spec: description: |- PackDomain specifies the topology domain for grouping replicas. Controls placement constraint for EACH individual replica instance. - Must reference a domain in the topology levels defined in the ClusterTopologyBinding CR name as set in TopologyName + Must reference a domain in the topology levels defined in the ClusterTopologyBinding named by TopologyName. Example: "rack" means each replica independently placed within one rack. Note: Does NOT constrain all replicas to the same rack together. Different replicas can be in different topology domains. @@ -9986,12 +9990,16 @@ spec: topologyName: description: |- TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling. - If topologyConstraint is set, topologyName and packDomain must both be specified. + Setting TopologyName may be optional if the name can be inherited from a higher level scope. + When TopologyName is specified at a PCS/PCSG/PCLQ resource constraint, it will also be inherited + as the default ClusterTopologyBinding name on all sub-resources, unless overridden by another TopologyName + at a sub-resource. + For example, setting TopologyName at a PCS level makes it optional for child PCSG or PCLQ levels + when the sub-resources reuse the same ClusterTopologyBinding. Immutable after creation. type: string required: - packDomain - - topologyName type: object required: - cliqueNames @@ -10771,7 +10779,7 @@ spec: description: |- PackDomain specifies the topology domain for grouping replicas. Controls placement constraint for EACH individual replica instance. - Must reference a domain in the topology levels defined in the ClusterTopologyBinding CR name as set in TopologyName + Must reference a domain in the topology levels defined in the ClusterTopologyBinding named by TopologyName. Example: "rack" means each replica independently placed within one rack. Note: Does NOT constrain all replicas to the same rack together. Different replicas can be in different topology domains. @@ -10782,12 +10790,16 @@ spec: topologyName: description: |- TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling. - If topologyConstraint is set, topologyName and packDomain must both be specified. + Setting TopologyName may be optional if the name can be inherited from a higher level scope. + When TopologyName is specified at a PCS/PCSG/PCLQ resource constraint, it will also be inherited + as the default ClusterTopologyBinding name on all sub-resources, unless overridden by another TopologyName + at a sub-resource. + For example, setting TopologyName at a PCS level makes it optional for child PCSG or PCLQ levels + when the sub-resources reuse the same ClusterTopologyBinding. Immutable after creation. type: string required: - packDomain - - topologyName type: object required: - cliques diff --git a/operator/api/core/v1alpha1/podcliqueset.go b/operator/api/core/v1alpha1/podcliqueset.go index b2ffa247a..19ba9b36f 100644 --- a/operator/api/core/v1alpha1/podcliqueset.go +++ b/operator/api/core/v1alpha1/podcliqueset.go @@ -264,13 +264,18 @@ type PodCliqueTemplateSpec struct { // TopologyConstraint defines topology placement requirements. type TopologyConstraint struct { // TopologyName is the name of the ClusterTopologyBinding resource to use for topology-aware scheduling. - // If topologyConstraint is set, topologyName and packDomain must both be specified. + // Setting TopologyName may be optional if the name can be inherited from a higher level scope. + // When TopologyName is specified at a PCS/PCSG/PCLQ resource constraint, it will also be inherited + // as the default ClusterTopologyBinding name on all sub-resources, unless overridden by another TopologyName + // at a sub-resource. + // For example, setting TopologyName at a PCS level makes it optional for child PCSG or PCLQ levels + // when the sub-resources reuse the same ClusterTopologyBinding. // Immutable after creation. - // +required - TopologyName string `json:"topologyName"` + // +optional + TopologyName string `json:"topologyName,omitempty"` // PackDomain specifies the topology domain for grouping replicas. // Controls placement constraint for EACH individual replica instance. - // Must reference a domain in the topology levels defined in the ClusterTopologyBinding CR name as set in TopologyName + // Must reference a domain in the topology levels defined in the ClusterTopologyBinding named by TopologyName. // Example: "rack" means each replica independently placed within one rack. // Note: Does NOT constrain all replicas to the same rack together. // Different replicas can be in different topology domains. diff --git a/operator/e2e/tests/topology_test.go b/operator/e2e/tests/topology_test.go index 871e339ed..761e5a2ac 100644 --- a/operator/e2e/tests/topology_test.go +++ b/operator/e2e/tests/topology_test.go @@ -33,12 +33,14 @@ import ( "github.com/ai-dynamo/grove/operator/e2e/grove/topology" "github.com/ai-dynamo/grove/operator/e2e/setup" "github.com/ai-dynamo/grove/operator/e2e/testctx" + testutils "github.com/ai-dynamo/grove/operator/test/utils" "github.com/ai-dynamo/grove/operator/e2e/waiter" groveschedulerv1alpha1 "github.com/ai-dynamo/grove/scheduler/api/core/v1alpha1" kaischedulingv2alpha2 "github.com/kai-scheduler/KAI-scheduler/pkg/apis/scheduling/v2alpha2" "github.com/samber/lo" v1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/util/uuid" "sigs.k8s.io/controller-runtime/pkg/client" ) @@ -1636,12 +1638,14 @@ func Test_TAS20_PCSTopologyLevelsUnavailableCondition(t *testing.T) { Logger.Info("TAS20: PCS TopologyLevelsUnavailable Condition test completed successfully!") } -// Test_TAS21_ClusterTopologyValidationWebhook verifies that the ClusterTopologyBinding validating webhook -// rejects invalid topology definitions and invalid schedulerTopologyReferences. -func Test_TAS21_ClusterTopologyValidationWebhook(t *testing.T) { +// Test_TAS21_TopologyValidationWebhooks verifies validation behavior for topology-related resources: +// 1. The ClusterTopologyBinding validating webhook rejects invalid topology definitions and scheduler references. +// 2. The PodCliqueSet validating webhook allows a child topologyConstraint without topologyName when it +// can inherit from the PCS topologyConstraint and the referenced ClusterTopologyBinding exists. +func Test_TAS21_TopologyValidationWebhooks(t *testing.T) { ctx := context.Background() - Logger.Info("1. Initialize a Grove cluster for ClusterTopologyBinding webhook validation testing") + Logger.Info("1. Initialize a Grove cluster for topology webhook validation testing") tc, cleanup := testctx.PrepareTest(ctx, t, 0) defer cleanup() @@ -1722,5 +1726,33 @@ func Test_TAS21_ClusterTopologyValidationWebhook(t *testing.T) { }) } - Logger.Info("TAS21: ClusterTopologyBinding validating webhook test completed successfully!") + topologyVerifier := topology.NewTopologyVerifier(tc.Client, Logger) + + Logger.Info("2. Ensure grove-topology ClusterTopologyBinding exists for PodCliqueSet validation") + ensureGroveTopology(ctx, t, topologyVerifier) + + Logger.Info("3. Create PodCliqueSet with PCS topologyName and child inherited topology constraint") + pcs := testutils.NewPodCliqueSetBuilder("tas21-pcs-optional-topology-name", "default", uuid.NewUUID()). + WithReplicas(1). + WithTopologyConstraint(&corev1alpha1.TopologyConstraint{ + TopologyName: "grove-topology", + PackDomain: corev1alpha1.TopologyDomainZone, + }). + WithPodCliqueTemplateSpec( + testutils.NewPodCliqueTemplateSpecBuilder("worker"). + WithReplicas(1). + WithRoleName("worker-role"). + WithMinAvailable(1). + WithTopologyConstraint(&corev1alpha1.TopologyConstraint{ + PackDomain: corev1alpha1.TopologyDomainHost, + }). + Build(), + ). + Build() + + if err := tc.Client.Create(ctx, pcs); err != nil { + t.Fatalf("Expected PodCliqueSet create to succeed with inherited optional topologyName, got: %v", err) + } + + Logger.Info("TAS21: Topology validation webhook test completed successfully!") } diff --git a/operator/internal/controller/common/component/utils/topologyconstraint.go b/operator/internal/controller/common/component/utils/topologyconstraint.go index a4243bdef..5149e78cb 100644 --- a/operator/internal/controller/common/component/utils/topologyconstraint.go +++ b/operator/internal/controller/common/component/utils/topologyconstraint.go @@ -25,8 +25,10 @@ import ( ) var ( - // ErrTopologyNameMissing indicates that a topology constraint is incomplete and does not specify both topologyName and packDomain. - ErrTopologyNameMissing = errors.New("topology constraints require both topologyName and packDomain") + // ErrTopologyNameMissing indicates that a constrained level has no effective topologyName after inheritance is applied. + ErrTopologyNameMissing = errors.New("topology constraints require an explicit or inherited topologyName") + // ErrPackDomainMissing indicates that a topologyConstraint exists but does not specify packDomain. + ErrPackDomainMissing = errors.New("topology constraints require packDomain") // ErrMultipleTopologyNamesUnsupported indicates that topology constraints within a single PCS reference different topology names. ErrMultipleTopologyNamesUnsupported = errors.New("multiple topology names within a single PodCliqueSet are not supported") ) @@ -49,25 +51,120 @@ func HasAnyTopologyConstraint(pcs *grovecorev1alpha1.PodCliqueSet) bool { return false } -// ResolveTopologyNameForPodCliqueSet resolves the single topologyName used by all explicit topology constraints in the PCS. +// ResolveEffectiveTopologyNameForConstraint resolves the topologyName for a single constrained level. +func ResolveEffectiveTopologyNameForConstraint(explicitTopologyName, inheritedTopologyName string) (string, error) { + if explicitTopologyName != "" { + if inheritedTopologyName != "" && explicitTopologyName != inheritedTopologyName { + return "", ErrMultipleTopologyNamesUnsupported + } + return explicitTopologyName, nil + } + if inheritedTopologyName != "" { + return inheritedTopologyName, nil + } + return "", ErrTopologyNameMissing +} + +// ResolveEffectiveTopologyNameForPodCliqueSet resolves the single effective topologyName for the PCS after inheritance is applied. // Callers that need to distinguish "no topology constraints at all" from invalid topology constraints // must first call HasAnyTopologyConstraint. -func ResolveTopologyNameForPodCliqueSet(pcs *grovecorev1alpha1.PodCliqueSet) (string, error) { - topologyNames := sets.New[string]() - for _, tc := range getAllTopologyConstraintsInPodCliqueSet(pcs) { - if tc.TopologyName == "" || tc.PackDomain == "" { - return "", ErrTopologyNameMissing +func ResolveEffectiveTopologyNameForPodCliqueSet(pcs *grovecorev1alpha1.PodCliqueSet) (string, error) { + resolvedTopologyName := "" + recordTopologyName := func(topologyName string) error { + if resolvedTopologyName == "" { + resolvedTopologyName = topologyName + return nil + } + if resolvedTopologyName != topologyName { + return ErrMultipleTopologyNamesUnsupported + } + return nil + } + + pcsEffectiveTopologyName := "" + if tc := pcs.Spec.Template.TopologyConstraint; tc != nil { + if tc.PackDomain == "" { + return "", ErrPackDomainMissing + } + effectiveTopologyName, err := ResolveEffectiveTopologyNameForConstraint(tc.TopologyName, "") + if err != nil { + return "", err + } + pcsEffectiveTopologyName = effectiveTopologyName + if err := recordTopologyName(effectiveTopologyName); err != nil { + return "", err + } + } + + pcsgTopologyNameByCliqueName := make(map[string]string) + for _, pcsgConfig := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { + if pcsgConfig.TopologyConstraint == nil { + continue + } + if pcsgConfig.TopologyConstraint.PackDomain == "" { + return "", ErrPackDomainMissing + } + effectiveTopologyName, err := ResolveEffectiveTopologyNameForConstraint(pcsgConfig.TopologyConstraint.TopologyName, pcsEffectiveTopologyName) + if err != nil { + return "", err + } + if err := recordTopologyName(effectiveTopologyName); err != nil { + return "", err + } + for _, cliqueName := range pcsgConfig.CliqueNames { + if _, exists := pcsgTopologyNameByCliqueName[cliqueName]; !exists { + pcsgTopologyNameByCliqueName[cliqueName] = effectiveTopologyName + } } - topologyNames.Insert(tc.TopologyName) } - switch topologyNames.Len() { - case 0: + + for _, pclqTemplateSpec := range pcs.Spec.Template.Cliques { + if pclqTemplateSpec.TopologyConstraint == nil { + continue + } + if pclqTemplateSpec.TopologyConstraint.PackDomain == "" { + return "", ErrPackDomainMissing + } + + inheritedTopologyName := pcsEffectiveTopologyName + if pcsgTopologyName, exists := pcsgTopologyNameByCliqueName[pclqTemplateSpec.Name]; exists { + inheritedTopologyName = pcsgTopologyName + } + + effectiveTopologyName, err := ResolveEffectiveTopologyNameForConstraint(pclqTemplateSpec.TopologyConstraint.TopologyName, inheritedTopologyName) + if err != nil { + return "", err + } + if err := recordTopologyName(effectiveTopologyName); err != nil { + return "", err + } + } + + if resolvedTopologyName == "" { return "", ErrTopologyNameMissing - case 1: - return sets.List(topologyNames)[0], nil - default: - return "", ErrMultipleTopologyNamesUnsupported } + return resolvedTopologyName, nil +} + +// FindExplicitTopologyNameForPodCliqueSet returns one explicit topologyName from the PCS. +// It is intended for callers operating on already-validated PCS objects where all explicit topologyName +// values are expected to match. Callers that need to distinguish "no topology constraints at all" from +// missing explicit topologyName values must first call HasAnyTopologyConstraint. +func FindExplicitTopologyNameForPodCliqueSet(pcs *grovecorev1alpha1.PodCliqueSet) (string, error) { + if tc := pcs.Spec.Template.TopologyConstraint; tc != nil && tc.TopologyName != "" { + return tc.TopologyName, nil + } + for _, pcsgConfig := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { + if tc := pcsgConfig.TopologyConstraint; tc != nil && tc.TopologyName != "" { + return tc.TopologyName, nil + } + } + for _, pclqTemplateSpec := range pcs.Spec.Template.Cliques { + if tc := pclqTemplateSpec.TopologyConstraint; tc != nil && tc.TopologyName != "" { + return tc.TopologyName, nil + } + } + return "", ErrTopologyNameMissing } // GetUniqueTopologyDomainsInPodCliqueSet returns all unique, non-empty pack domains referenced by the PCS. diff --git a/operator/internal/controller/common/component/utils/topologyconstraint_test.go b/operator/internal/controller/common/component/utils/topologyconstraint_test.go index b3c7c8e45..dfdb6ab6d 100644 --- a/operator/internal/controller/common/component/utils/topologyconstraint_test.go +++ b/operator/internal/controller/common/component/utils/topologyconstraint_test.go @@ -25,7 +25,7 @@ import ( "github.com/stretchr/testify/assert" ) -func TestResolveTopologyNameForPodCliqueSet(t *testing.T) { +func TestResolveEffectiveTopologyNameForPodCliqueSet(t *testing.T) { makePCS := func(mutate func(*grovecorev1alpha1.PodCliqueSet)) *grovecorev1alpha1.PodCliqueSet { pcs := &grovecorev1alpha1.PodCliqueSet{ Spec: grovecorev1alpha1.PodCliqueSetSpec{ @@ -86,6 +86,79 @@ func TestResolveTopologyNameForPodCliqueSet(t *testing.T) { wantTopology: "topo-a", wantHasAny: true, }, + { + name: "child inherits topology name from pcs", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + PackDomain: grovecorev1alpha1.TopologyDomainRack, + } + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + }) + }, + wantTopology: "topo-a", + wantHasAny: true, + }, + { + name: "pclq inherits topology name from constrained pcsg", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{ + { + Name: "sg1", + CliqueNames: []string{"worker"}, + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + PackDomain: grovecorev1alpha1.TopologyDomainRack, + }, + }, + } + }) + }, + wantTopology: "topo-a", + wantHasAny: true, + }, + { + name: "standalone clique does not inherit topology from unrelated pcsg", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.Cliques = []*grovecorev1alpha1.PodCliqueTemplateSpec{ + { + Name: "standalone", + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + }, + Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1}, + }, + { + Name: "grouped", + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + }, + Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1}, + }, + } + pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{ + { + Name: "sg1", + CliqueNames: []string{"grouped"}, + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + PackDomain: grovecorev1alpha1.TopologyDomainRack, + }, + }, + } + }) + }, + wantErr: ErrTopologyNameMissing, + wantHasAny: true, + }, { name: "child-only explicit topology name resolves", setupPCS: func() *grovecorev1alpha1.PodCliqueSet { @@ -104,11 +177,29 @@ func TestResolveTopologyNameForPodCliqueSet(t *testing.T) { setupPCS: func() *grovecorev1alpha1.PodCliqueSet { return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ - PackDomain: grovecorev1alpha1.TopologyDomainHost, + TopologyName: "topo-a", } }) }, - wantErr: ErrTopologyNameMissing, + wantErr: ErrPackDomainMissing, + wantHasAny: true, + }, + { + name: "pcsg topology constraint without packDomain is rejected", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{ + { + Name: "sg1", + CliqueNames: []string{"worker"}, + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + }, + }, + } + }) + }, + wantErr: ErrPackDomainMissing, wantHasAny: true, }, { @@ -134,7 +225,105 @@ func TestResolveTopologyNameForPodCliqueSet(t *testing.T) { t.Run(tc.name, func(t *testing.T) { pcs := tc.setupPCS() assert.Equal(t, tc.wantHasAny, HasAnyTopologyConstraint(pcs)) - topologyName, err := ResolveTopologyNameForPodCliqueSet(pcs) + topologyName, err := ResolveEffectiveTopologyNameForPodCliqueSet(pcs) + if tc.wantErr != nil { + assert.True(t, errors.Is(err, tc.wantErr)) + } else { + assert.NoError(t, err) + } + assert.Equal(t, tc.wantTopology, topologyName) + }) + } +} + +func TestFindExplicitTopologyNameForPodCliqueSet(t *testing.T) { + makePCS := func(mutate func(*grovecorev1alpha1.PodCliqueSet)) *grovecorev1alpha1.PodCliqueSet { + pcs := &grovecorev1alpha1.PodCliqueSet{ + Spec: grovecorev1alpha1.PodCliqueSetSpec{ + Template: grovecorev1alpha1.PodCliqueSetTemplateSpec{ + Cliques: []*grovecorev1alpha1.PodCliqueTemplateSpec{ + {Name: "worker", Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1}}, + }, + }, + }, + } + if mutate != nil { + mutate(pcs) + } + return pcs + } + + tests := []struct { + name string + setupPCS func() *grovecorev1alpha1.PodCliqueSet + wantTopology string + wantErr error + }{ + { + name: "no constraints", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { return makePCS(nil) }, + wantTopology: "", + wantErr: ErrTopologyNameMissing, + }, + { + name: "pcs topology name is returned first", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + } + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + }) + }, + wantTopology: "topo-a", + }, + { + name: "pcsg topology name is returned when pcs topology name is absent", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{ + { + Name: "sg1", + CliqueNames: []string{"worker"}, + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + }, + }, + } + }) + }, + wantTopology: "topo-a", + }, + { + name: "clique topology name is returned when it is the first explicit topology name", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + } + }) + }, + wantTopology: "topo-a", + }, + { + name: "constraints without any explicit topology name are rejected", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return makePCS(func(pcs *grovecorev1alpha1.PodCliqueSet) { + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + }) + }, + wantErr: ErrTopologyNameMissing, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + topologyName, err := FindExplicitTopologyNameForPodCliqueSet(tc.setupPCS()) if tc.wantErr != nil { assert.True(t, errors.Is(err, tc.wantErr)) } else { diff --git a/operator/internal/controller/podcliqueset/components/podgang/podgang.go b/operator/internal/controller/podcliqueset/components/podgang/podgang.go index 551e96a40..c602d3567 100644 --- a/operator/internal/controller/podcliqueset/components/podgang/podgang.go +++ b/operator/internal/controller/podcliqueset/components/podgang/podgang.go @@ -144,7 +144,10 @@ func (r _resource) buildResource(pcs *grovecorev1alpha1.PodCliqueSet, pgi *podGa } pg.Annotations = mirrorPCSMetadata(pg.Annotations, pcs.Annotations, nil) if r.tasConfig.Enabled && podGangHasTranslatedTopologyConstraints(pgi) { - if topologyName, err := componentutils.ResolveTopologyNameForPodCliqueSet(pcs); err == nil && topologyName != "" { + if topologyName, err := componentutils.FindExplicitTopologyNameForPodCliqueSet(pcs); err == nil && topologyName != "" { + if pg.Annotations == nil { + pg.Annotations = make(map[string]string) + } pg.Annotations[apicommonconstants.AnnotationTopologyName] = topologyName } else { delete(pg.Annotations, apicommonconstants.AnnotationTopologyName) diff --git a/operator/internal/controller/podcliqueset/components/podgang/syncflow.go b/operator/internal/controller/podcliqueset/components/podgang/syncflow.go index b1b0fff6c..6b40e64fa 100644 --- a/operator/internal/controller/podcliqueset/components/podgang/syncflow.go +++ b/operator/internal/controller/podcliqueset/components/podgang/syncflow.go @@ -74,7 +74,7 @@ func (r _resource) prepareSyncFlow(ctx context.Context, logger logr.Logger, pcs sc.tasEnabled = r.tasConfig.Enabled if r.tasConfig.Enabled && componentutils.HasAnyTopologyConstraint(pcs) { - topologyName, resolveErr := componentutils.ResolveTopologyNameForPodCliqueSet(pcs) + topologyName, resolveErr := componentutils.FindExplicitTopologyNameForPodCliqueSet(pcs) if resolveErr == nil && topologyName != "" { sc.topologyLevels, err = clustertopology.GetClusterTopologyLevels(ctx, r.client, topologyName) if err != nil { @@ -92,7 +92,7 @@ func (r _resource) prepareSyncFlow(ctx context.Context, logger logr.Logger, pcs sc.topologyLevels = nil } } - // If topologyName resolution fails, sc.topologyLevels stays nil — the PCS reconciler + // If explicit topologyName lookup fails, sc.topologyLevels stays nil — the PCS reconciler // handles this via the TopologyNameMissing condition. } diff --git a/operator/internal/controller/podcliqueset/components/podgang/syncflow_test.go b/operator/internal/controller/podcliqueset/components/podgang/syncflow_test.go index 301c62aa6..8d3eedfb4 100644 --- a/operator/internal/controller/podcliqueset/components/podgang/syncflow_test.go +++ b/operator/internal/controller/podcliqueset/components/podgang/syncflow_test.go @@ -1671,7 +1671,7 @@ func TestPrepareSyncFlowTopologyResolution(t *testing.T) { var objs []client.Object objs = append(objs, pcs) if tc.clusterTopologyExists { - topologyName, err := componentutils.ResolveTopologyNameForPodCliqueSet(pcs) + topologyName, err := componentutils.FindExplicitTopologyNameForPodCliqueSet(pcs) require.NoError(t, err) objs = append(objs, makeClusterTopologyWithLevels(topologyName, ctLevels)) } diff --git a/operator/internal/controller/podcliqueset/reconcilestatus.go b/operator/internal/controller/podcliqueset/reconcilestatus.go index 72c1866a9..9a44a63eb 100644 --- a/operator/internal/controller/podcliqueset/reconcilestatus.go +++ b/operator/internal/controller/podcliqueset/reconcilestatus.go @@ -316,30 +316,19 @@ func (r *Reconciler) computeTopologyLevelsUnavailableCondition(ctx context.Conte }, nil } - topologyName, err := componentutils.ResolveTopologyNameForPodCliqueSet(pcs) + topologyName, err := componentutils.FindExplicitTopologyNameForPodCliqueSet(pcs) if err != nil { - switch { - case errors.Is(err, componentutils.ErrTopologyNameMissing): + if errors.Is(err, componentutils.ErrTopologyNameMissing) { return metav1.Condition{ Type: apicommonconstants.ConditionTopologyLevelsUnavailable, Status: metav1.ConditionUnknown, Reason: apicommonconstants.ConditionReasonTopologyNameMissing, - Message: "topology constraints must specify both topologyName and packDomain", + Message: "PodCliqueSet topology constraints must include topologyName", ObservedGeneration: pcs.Generation, LastTransitionTime: metav1.Now(), }, nil - case errors.Is(err, componentutils.ErrMultipleTopologyNamesUnsupported): - return metav1.Condition{ - Type: apicommonconstants.ConditionTopologyLevelsUnavailable, - Status: metav1.ConditionUnknown, - Reason: apicommonconstants.ConditionReasonTopologyNameMissing, - Message: "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation", - ObservedGeneration: pcs.Generation, - LastTransitionTime: metav1.Now(), - }, nil - default: - return metav1.Condition{}, fmt.Errorf("failed to resolve topologyName: %w", err) } + return metav1.Condition{}, fmt.Errorf("failed to find explicit topologyName: %w", err) } topologyLevels, err := clustertopology.GetClusterTopologyLevels(ctx, r.client, topologyName) diff --git a/operator/internal/controller/podcliqueset/reconcilestatus_test.go b/operator/internal/controller/podcliqueset/reconcilestatus_test.go index eafdf98fc..d1097ca4c 100644 --- a/operator/internal/controller/podcliqueset/reconcilestatus_test.go +++ b/operator/internal/controller/podcliqueset/reconcilestatus_test.go @@ -432,7 +432,7 @@ func TestMutateTopologyLevelUnavailableConditions(t *testing.T) { }, wantStatus: metav1.ConditionUnknown, wantReason: apicommonconstants.ConditionReasonTopologyNameMissing, - wantMsgContain: "both topologyName and packDomain", + wantMsgContain: "include topologyName", }, { name: "TAS enabled, no constraints at all — False/AllClusterTopologyLevelsAvailable with no-constraints message", @@ -444,7 +444,7 @@ func TestMutateTopologyLevelUnavailableConditions(t *testing.T) { wantMsgContain: "No topology constraints defined", }, { - name: "TAS enabled, incomplete PCS topology constraint with valid PCSG constraint — Unknown/TopologyNameMissing", + name: "TAS enabled, incomplete PCS topology constraint with valid PCSG constraint — False/AllClusterTopologyLevelsAvailable", tasEnabled: true, setupPCS: func() *grovecorev1alpha1.PodCliqueSet { pcs := basePCS("") @@ -466,9 +466,9 @@ func TestMutateTopologyLevelUnavailableConditions(t *testing.T) { extraObjects: []client.Object{ clusterTopology("my-topology", standardLevels), }, - wantStatus: metav1.ConditionUnknown, - wantReason: apicommonconstants.ConditionReasonTopologyNameMissing, - wantMsgContain: "both topologyName and packDomain", + wantStatus: metav1.ConditionFalse, + wantReason: apicommonconstants.ConditionReasonAllTopologyLevelsAvailable, + wantMsgContain: "All topology levels are available", }, } diff --git a/operator/internal/controller/podcliqueset/register.go b/operator/internal/controller/podcliqueset/register.go index 44f1f0897..7c3466890 100644 --- a/operator/internal/controller/podcliqueset/register.go +++ b/operator/internal/controller/podcliqueset/register.go @@ -109,7 +109,7 @@ func mapClusterTopologyToPodCliqueSets(cl client.Client) handler.MapFunc { requests := make([]reconcile.Request, 0, len(pcsList.Items)) for i := range pcsList.Items { pcs := &pcsList.Items[i] - topologyName, err := componentutils.ResolveTopologyNameForPodCliqueSet(pcs) + topologyName, err := componentutils.FindExplicitTopologyNameForPodCliqueSet(pcs) if err != nil || topologyName != ct.Name { continue } diff --git a/operator/internal/webhook/admission/pcs/validation/podcliqueset.go b/operator/internal/webhook/admission/pcs/validation/podcliqueset.go index 9edeb7df7..395a1631f 100644 --- a/operator/internal/webhook/admission/pcs/validation/podcliqueset.go +++ b/operator/internal/webhook/admission/pcs/validation/podcliqueset.go @@ -696,7 +696,7 @@ func (v *pcsValidator) validateTopologyConstraintsUpdate(oldPCS *grovecorev1alph // packDomain without topologyName, because topologyName did not exist in the API yet. Other invalid shapes // are not treated as repairable legacy state here. if componentutils.HasAnyTopologyConstraint(oldPCS) && hasRepairableLegacyTopologyConstraint(oldPCS) { - if _, err := componentutils.ResolveTopologyNameForPodCliqueSet(oldPCS); err != nil { + if _, err := componentutils.ResolveEffectiveTopologyNameForPodCliqueSet(oldPCS); err != nil { if errors.Is(err, componentutils.ErrTopologyNameMissing) { if allErrs := immutabilityValidator.validateTopologyConstraintImmutability(oldPCS, field.NewPath("spec").Child("template"), true); len(allErrs) > 0 { return allErrs @@ -715,31 +715,17 @@ func (v *pcsValidator) validateTopologyConstraintsUpdate(oldPCS *grovecorev1alph } // resolveTopologyDomains resolves the ordered list of topology domains from the ClusterTopologyBinding -// referenced by the PCS's topologyName. Returns nil domains (no validation) if no topology constraints exist. -func (v *pcsValidator) resolveTopologyDomains(ctx context.Context) ([]string, field.ErrorList) { +// referenced by the PCS's effective topologyName. Returns nil domains (no validation) if no topology constraints exist. +func (v *pcsValidator) resolveTopologyDomains(ctx context.Context) (domains []string, allErrs field.ErrorList) { // No constraints at all — nothing to validate. if !componentutils.HasAnyTopologyConstraint(v.pcs) { return nil, nil } - if completenessErrs := topologyConstraintCompletenessFieldErrors(v.pcs); len(completenessErrs) > 0 { - return nil, completenessErrs - } - if mismatchErrs := multipleTopologyNamesFieldErrors(v.pcs); len(mismatchErrs) > 0 { - return nil, mismatchErrs - } - topologyName, err := componentutils.ResolveTopologyNameForPodCliqueSet(v.pcs) - if err != nil { - fldPath := field.NewPath("spec", "template", "topologyConstraint") - if errors.Is(err, componentutils.ErrTopologyNameMissing) { - return nil, field.ErrorList{field.Required(fldPath, - "topologyConstraint must specify both topologyName and packDomain")} - } - if errors.Is(err, componentutils.ErrMultipleTopologyNamesUnsupported) { - return nil, field.ErrorList{field.Invalid(fldPath, nil, - "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation")} - } - return nil, field.ErrorList{field.InternalError(fldPath, err)} + var topologyName string + topologyName, allErrs = resolveEffectiveTopologyNameFieldErrors(v.pcs) + if len(allErrs) > 0 { + return nil, allErrs } fldPath := field.NewPath("spec", "template", "topologyConstraint", "topologyName") @@ -755,89 +741,203 @@ func (v *pcsValidator) resolveTopologyDomains(ctx context.Context) ([]string, fi fmt.Errorf("failed to fetch ClusterTopologyBinding %q: %w", topologyName, err))} } - domains := make([]string, len(levels)) + domains = make([]string, len(levels)) for i, level := range levels { domains[i] = string(level.Domain) } return domains, nil } -func topologyConstraintCompletenessFieldErrors(pcs *grovecorev1alpha1.PodCliqueSet) field.ErrorList { - var allErrs field.ErrorList - - validateConstraint := func(tc *grovecorev1alpha1.TopologyConstraint, tcPath *field.Path) { - if tc == nil { - return - } - if tc.TopologyName == "" { - allErrs = append(allErrs, field.Required(tcPath.Child("topologyName"), - "topologyName is required when topologyConstraint is set")) - } - if tc.PackDomain == "" { - allErrs = append(allErrs, field.Required(tcPath.Child("packDomain"), - "packDomain is required when topologyConstraint is set")) - } +func validateResolvableTopologyConstraint( + tc *grovecorev1alpha1.TopologyConstraint, + tcPath *field.Path, + inheritedTopologyName string, + canInherit bool, +) (effectiveTopologyName string, resolved bool, allErrs field.ErrorList) { + if tc == nil { + return "", false, nil } - - validateConstraint(pcs.Spec.Template.TopologyConstraint, field.NewPath("spec", "template", "topologyConstraint")) - for i, clique := range pcs.Spec.Template.Cliques { - validateConstraint(clique.TopologyConstraint, field.NewPath("spec", "template", "cliques").Index(i).Child("topologyConstraint")) + if tc.PackDomain == "" { + return "", false, field.ErrorList{field.Required(tcPath.Child("packDomain"), "packDomain is required when topologyConstraint is set")} } - for i, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { - validateConstraint(pcsg.TopologyConstraint, field.NewPath("spec", "template", "podCliqueScalingGroups").Index(i).Child("topologyConstraint")) + if tc.TopologyName == "" && (!canInherit || inheritedTopologyName == "") { + return "", false, field.ErrorList{field.Required( + tcPath.Child("topologyName"), + "topologyName is required when topologyConstraint is set and cannot be inherited", + )} } + var err error + effectiveTopologyName, err = componentutils.ResolveEffectiveTopologyNameForConstraint(tc.TopologyName, inheritedTopologyName) + if err == nil { + return effectiveTopologyName, true, nil + } + if errors.Is(err, componentutils.ErrMultipleTopologyNamesUnsupported) { + return "", false, field.ErrorList{field.Invalid( + tcPath.Child("topologyName"), + tc.TopologyName, + "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation", + )} + } + return "", false, field.ErrorList{field.InternalError(tcPath.Child("topologyName"), err)} +} - return allErrs +type topologyNameObserver struct { + resolvedTopologyName string + hasConflictingTopology bool } -func multipleTopologyNamesFieldErrors(pcs *grovecorev1alpha1.PodCliqueSet) field.ErrorList { - var allErrs field.ErrorList - topologyNames := map[string]struct{}{} +func (o *topologyNameObserver) Observe(effectiveTopologyName string) { + if o.resolvedTopologyName == "" { + o.resolvedTopologyName = effectiveTopologyName + return + } + if o.resolvedTopologyName != effectiveTopologyName { + o.hasConflictingTopology = true + } +} - recordTopologyName := func(name string) { - if name != "" { - topologyNames[name] = struct{}{} +func resolvePCSAndPCSGTopologyNames( + pcs *grovecorev1alpha1.PodCliqueSet, + topologyObserver *topologyNameObserver, +) ( + pcsEffectiveTopologyName string, + pcsResolvable bool, + pcsgEffectiveTopologyNameByCliqueName map[string]string, + allErrs field.ErrorList, +) { + if pcs.Spec.Template.TopologyConstraint != nil { + effectiveTopologyName, resolved, errs := validateResolvableTopologyConstraint( + pcs.Spec.Template.TopologyConstraint, + field.NewPath("spec", "template", "topologyConstraint"), + "", + false, + ) + allErrs = append(allErrs, errs...) + if resolved { + pcsEffectiveTopologyName = effectiveTopologyName + pcsResolvable = true + topologyObserver.Observe(effectiveTopologyName) } } - if pcs.Spec.Template.TopologyConstraint != nil { - recordTopologyName(pcs.Spec.Template.TopologyConstraint.TopologyName) + pcsgEffectiveTopologyNameByCliqueName = make(map[string]string) + for i, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { + if pcsg.TopologyConstraint == nil { + continue + } + effectiveTopologyName, resolved, errs := validateResolvableTopologyConstraint( + pcsg.TopologyConstraint, + field.NewPath("spec", "template", "podCliqueScalingGroups").Index(i).Child("topologyConstraint"), + pcsEffectiveTopologyName, + pcsResolvable, + ) + allErrs = append(allErrs, errs...) + if resolved { + topologyObserver.Observe(effectiveTopologyName) + for _, cliqueName := range pcsg.CliqueNames { + if _, exists := pcsgEffectiveTopologyNameByCliqueName[cliqueName]; !exists { + pcsgEffectiveTopologyNameByCliqueName[cliqueName] = effectiveTopologyName + } + } + } } + return pcsEffectiveTopologyName, pcsResolvable, pcsgEffectiveTopologyNameByCliqueName, allErrs +} + +func resolvePCLQTopologyNames( + pcs *grovecorev1alpha1.PodCliqueSet, + pcsEffectiveTopologyName string, + pcsResolvable bool, + pcsgEffectiveTopologyNameByCliqueName map[string]string, + topologyObserver *topologyNameObserver, +) field.ErrorList { + var allErrs field.ErrorList + for i, clique := range pcs.Spec.Template.Cliques { - if clique.TopologyConstraint != nil { - recordTopologyName(clique.TopologyConstraint.TopologyName) + if clique.TopologyConstraint == nil { + continue } - if clique.TopologyConstraint != nil && len(topologyNames) > 1 { - allErrs = append(allErrs, field.Invalid( - field.NewPath("spec", "template", "cliques").Index(i).Child("topologyConstraint", "topologyName"), - clique.TopologyConstraint.TopologyName, - "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation")) + + inheritedTopologyName := pcsEffectiveTopologyName + canInherit := pcsResolvable + if pcsgTopologyName, exists := pcsgEffectiveTopologyNameByCliqueName[clique.Name]; exists { + inheritedTopologyName = pcsgTopologyName + canInherit = true + } + + effectiveTopologyName, resolved, errs := validateResolvableTopologyConstraint( + clique.TopologyConstraint, + field.NewPath("spec", "template", "cliques").Index(i).Child("topologyConstraint"), + inheritedTopologyName, + canInherit, + ) + allErrs = append(allErrs, errs...) + if resolved { + topologyObserver.Observe(effectiveTopologyName) } } + return allErrs +} + +func topologyNameConflictFieldErrors(pcs *grovecorev1alpha1.PodCliqueSet) field.ErrorList { + var allErrs field.ErrorList + + if tc := pcs.Spec.Template.TopologyConstraint; tc != nil && tc.TopologyName != "" { + allErrs = append(allErrs, field.Invalid( + field.NewPath("spec", "template", "topologyConstraint", "topologyName"), + tc.TopologyName, + "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation", + )) + } + for i, pcsg := range pcs.Spec.Template.PodCliqueScalingGroupConfigs { - if pcsg.TopologyConstraint != nil { - recordTopologyName(pcsg.TopologyConstraint.TopologyName) - } - if pcsg.TopologyConstraint != nil && len(topologyNames) > 1 { + if tc := pcsg.TopologyConstraint; tc != nil && tc.TopologyName != "" { allErrs = append(allErrs, field.Invalid( field.NewPath("spec", "template", "podCliqueScalingGroups").Index(i).Child("topologyConstraint", "topologyName"), - pcsg.TopologyConstraint.TopologyName, - "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation")) + tc.TopologyName, + "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation", + )) } } - if len(topologyNames) > 1 && pcs.Spec.Template.TopologyConstraint != nil { - allErrs = append(allErrs, field.Invalid( - field.NewPath("spec", "template", "topologyConstraint", "topologyName"), - pcs.Spec.Template.TopologyConstraint.TopologyName, - "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation")) + for i, clique := range pcs.Spec.Template.Cliques { + if tc := clique.TopologyConstraint; tc != nil && tc.TopologyName != "" { + allErrs = append(allErrs, field.Invalid( + field.NewPath("spec", "template", "cliques").Index(i).Child("topologyConstraint", "topologyName"), + tc.TopologyName, + "all topologyConstraint.topologyName values within a PodCliqueSet must match in the current implementation", + )) + } } return allErrs } +func resolveEffectiveTopologyNameFieldErrors(pcs *grovecorev1alpha1.PodCliqueSet) (resolvedTopologyName string, allErrs field.ErrorList) { + topologyObserver := &topologyNameObserver{} + + pcsEffectiveTopologyName, pcsResolvable, pcsgEffectiveTopologyNames, allErrs := resolvePCSAndPCSGTopologyNames(pcs, topologyObserver) + allErrs = append(allErrs, resolvePCLQTopologyNames( + pcs, + pcsEffectiveTopologyName, + pcsResolvable, + pcsgEffectiveTopologyNames, + topologyObserver, + )...) + if topologyObserver.hasConflictingTopology { + allErrs = append(allErrs, topologyNameConflictFieldErrors(pcs)...) + } + if len(allErrs) > 0 { + return "", allErrs + } + if topologyObserver.resolvedTopologyName == "" { + return "", nil + } + return topologyObserver.resolvedTopologyName, nil +} + // requiresOrderValidation checks if the StartupType requires clique order validation. func requiresOrderValidation(startupType *grovecorev1alpha1.CliqueStartupType) bool { return startupType != nil && (*startupType == grovecorev1alpha1.CliqueStartupTypeInOrder || *startupType == grovecorev1alpha1.CliqueStartupTypeExplicit) diff --git a/operator/internal/webhook/admission/pcs/validation/podcliqueset_test.go b/operator/internal/webhook/admission/pcs/validation/podcliqueset_test.go index e8baac0e4..a8e353c79 100644 --- a/operator/internal/webhook/admission/pcs/validation/podcliqueset_test.go +++ b/operator/internal/webhook/admission/pcs/validation/podcliqueset_test.go @@ -1723,6 +1723,32 @@ func TestValidateTopologyConstraintsPCSTopologyName(t *testing.T) { }, clusterObjs: []client.Object{createTestClusterTopology()}, }, + { + name: "update repairs legacy child topologyName through PCS inheritance", + operation: admissionv1.Update, + setupOldPCS: func() *grovecorev1alpha1.PodCliqueSet { + pcs := createTestPodCliqueSet("pcs-repair-inherited-child") + pcs.Spec.Template.TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainZone, + } + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + return pcs + }, + setupNewPCS: func() *grovecorev1alpha1.PodCliqueSet { + pcs := createTestPodCliqueSet("pcs-repair-inherited-child") + pcs.Spec.Template.TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "topo-a", + PackDomain: grovecorev1alpha1.TopologyDomainZone, + } + pcs.Spec.Template.Cliques[0].TopologyConstraint = &grovecorev1alpha1.TopologyConstraint{ + PackDomain: grovecorev1alpha1.TopologyDomainHost, + } + return pcs + }, + clusterObjs: []client.Object{createTestClusterTopology()}, + }, { name: "update does not treat topologyName-only PCS constraint as repairable legacy shape", operation: admissionv1.Update, diff --git a/operator/internal/webhook/admission/pcs/validation/topologyconstraints.go b/operator/internal/webhook/admission/pcs/validation/topologyconstraints.go index afce59abe..7a6dcda28 100644 --- a/operator/internal/webhook/admission/pcs/validation/topologyconstraints.go +++ b/operator/internal/webhook/admission/pcs/validation/topologyconstraints.go @@ -56,7 +56,7 @@ func (v *topologyConstraintsValidator) validate() field.ErrorList { return v.disallowConstraintsForCreateWhenTASIsDisabled(pcsTemplateFLDPath) } - if errs := v.validateTopologyConstraintCompleteness(pcsTemplateFLDPath); len(errs) > 0 { + if errs := v.validateTopologyConstraintPackDomains(pcsTemplateFLDPath); len(errs) > 0 { return errs } @@ -117,19 +117,15 @@ func (v *topologyConstraintsValidator) disallowConstraintsForCreateWhenTASIsDisa return allErrs } -// validateTopologyConstraintCompleteness ensures that a topology constraint never specifies only one of -// topologyName or packDomain. If the constraint is present, both fields are required. -func (v *topologyConstraintsValidator) validateTopologyConstraintCompleteness(fldPath *field.Path) field.ErrorList { +// validateTopologyConstraintPackDomains ensures that any declared topology constraint specifies packDomain. +// topologyName resolution, inheritance, and single-topology validation are handled earlier by pcsValidator. +func (v *topologyConstraintsValidator) validateTopologyConstraintPackDomains(fldPath *field.Path) field.ErrorList { var allErrs field.ErrorList validateConstraint := func(tc *grovecorev1alpha1.TopologyConstraint, tcPath *field.Path) { if tc == nil { return } - if tc.TopologyName == "" { - allErrs = append(allErrs, field.Required(tcPath.Child("topologyName"), - "topologyName is required when topologyConstraint is set")) - } if tc.PackDomain == "" { allErrs = append(allErrs, field.Required(tcPath.Child("packDomain"), "packDomain is required when topologyConstraint is set")) diff --git a/operator/internal/webhook/admission/pcs/validation/topologyconstraints_test.go b/operator/internal/webhook/admission/pcs/validation/topologyconstraints_test.go index c6ded8a9c..ab066a579 100644 --- a/operator/internal/webhook/admission/pcs/validation/topologyconstraints_test.go +++ b/operator/internal/webhook/admission/pcs/validation/topologyconstraints_test.go @@ -712,6 +712,78 @@ func TestResolveTopologyDomains(t *testing.T) { expectedErrorMatchers: []testutils.ErrorMatcher{}, setupClient: nil, }, + { + name: "Child inherits topologyName from PCS", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return testutils.NewPodCliqueSetBuilder("test-pcs", "default", uuid.NewUUID()). + WithReplicas(1). + WithTopologyConstraint(&grovecorev1alpha1.TopologyConstraint{ + TopologyName: "my-topo", + PackDomain: "zone", + }). + WithPodCliqueTemplateSpec(&grovecorev1alpha1.PodCliqueTemplateSpec{ + Name: "worker", + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + PackDomain: "host", + }, + Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1, RoleName: "worker-role"}, + }). + Build() + }, + clusterTopologyObjects: []client.Object{ + &grovecorev1alpha1.ClusterTopologyBinding{ + ObjectMeta: v1.ObjectMeta{Name: "my-topo"}, + Spec: grovecorev1alpha1.ClusterTopologyBindingSpec{ + Levels: []grovecorev1alpha1.TopologyLevel{ + {Domain: "zone", Key: "topology.kubernetes.io/zone"}, + {Domain: "host", Key: "kubernetes.io/hostname"}, + }, + }, + }, + }, + expectedDomains: []string{"zone", "host"}, + expectedErrorMatchers: []testutils.ErrorMatcher{}, + setupClient: nil, + }, + { + name: "Standalone clique does not inherit topologyName from unrelated PCSG", + setupPCS: func() *grovecorev1alpha1.PodCliqueSet { + return testutils.NewPodCliqueSetBuilder("test-pcs", "default", uuid.NewUUID()). + WithReplicas(1). + WithPodCliqueTemplateSpec(&grovecorev1alpha1.PodCliqueTemplateSpec{ + Name: "standalone", + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + PackDomain: "host", + }, + Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1, RoleName: "standalone-role"}, + }). + WithPodCliqueTemplateSpec(&grovecorev1alpha1.PodCliqueTemplateSpec{ + Name: "grouped", + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + PackDomain: "host", + }, + Spec: grovecorev1alpha1.PodCliqueSpec{Replicas: 1, RoleName: "grouped-role"}, + }). + WithPodCliqueScalingGroupConfig(grovecorev1alpha1.PodCliqueScalingGroupConfig{ + Name: "workers", + CliqueNames: []string{"grouped"}, + TopologyConstraint: &grovecorev1alpha1.TopologyConstraint{ + TopologyName: "my-topo", + PackDomain: "zone", + }, + }). + Build() + }, + clusterTopologyObjects: []client.Object{}, + expectedDomains: nil, + expectedErrorMatchers: []testutils.ErrorMatcher{ + { + ErrorType: field.ErrorTypeRequired, + Field: "spec.template.cliques[0].topologyConstraint.topologyName", + }, + }, + setupClient: nil, + }, { name: "Incomplete PCS topology constraint is rejected", setupPCS: func() *grovecorev1alpha1.PodCliqueSet { @@ -780,10 +852,6 @@ func TestResolveTopologyDomains(t *testing.T) { ErrorType: field.ErrorTypeInvalid, Field: "spec.template.cliques[0].topologyConstraint.topologyName", }, - { - ErrorType: field.ErrorTypeInvalid, - Field: "spec.template.topologyConstraint.topologyName", - }, }, setupClient: nil, },