feat: GREP-0368 Preferred Topology Constraint API #523
Conversation

What is the reason for providing preferred as an option, rather than just packing everything as close as possible whenever a topology constraint is provided? I remember we discussed earlier that preferred really just seems to save the scheduler cycles by allowing it to skip levels. The main case I can see is if we don't want to pack everything as close as possible by default, but then that really sounds more like a spread constraint.
@nvrohanv please see the short discussion in issue #368
@shayasoolin question on "Single field addition to the shared TopologyConstraint struct — covers

@gal-revach it will be possible to specify it in each level (with auto-propagation from each level to its children) - just like with
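A hedged sketch of what that per-level specification with propagation could look like on the workload side. The layout mirrors the `template.topologyConstraint` example quoted later in this thread; the clique names and the exact override mechanics are assumptions for illustration, not the settled API:

```yaml
# Illustrative only: a constraint set at the top level auto-propagates to
# child levels unless a child specifies its own.
kind: PodCliqueSet        # assumed workload kind (the "PCS" discussed in this thread)
spec:
  template:
    topologyConstraint:   # set here, propagates downward
      topologyName: h100-topology
      packDomain: zone
    cliques:
      - name: workers     # inherits the zone-level constraint
      - name: leader
        topologyConstraint:
          packDomain: rack   # hypothetical child-level override
```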
gflarity left a comment:

Looks good, just a few things to clarify.
9d98063 to 1c7700a
519e32a to 97af3ae
nvrohanv left a comment:

Great design! Overall I agree with it, but I think there are two changes that should occur:
- I think all `ClusterTopology` CRs should have a default value of `preferredPackDomain`, set to the lowest level. I explain my rationale in one of the comments. If you agree, I think we should explain the rationale for this in the motivations.
- In the motivations section I would like it addressed that the reason we don't provide this as a user API is that the whole point of not always setting preferred to the lowest level is that doing so might destroy scheduler performance. It is therefore difficult to trust the user to care about this, so we delegate the responsibility to the cluster admin.
> The resulting behavior is:
>
> - Workload has no `topologyName`: Grove emits no topology constraint for that workload.
> - Workload has `topologyName`, and the referenced `ClusterTopology` has no `preferredPackDomain`: existing required `packDomain` behavior is preserved.
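That behavior table can be sketched as a small resolution function. The type and field names below are illustrative stand-ins, not Grove's actual Go API:

```go
package main

import "fmt"

// Illustrative stand-ins for the workload and ClusterTopology fields
// discussed above; these are not Grove's real types.
type Workload struct {
	TopologyName string
	PackDomain   string // required packing domain, may be empty
}

type ClusterTopology struct {
	PreferredPackDomain string // empty when the admin has not set it
}

// PackConstraint mirrors the Required/Preferred split described in this PR.
type PackConstraint struct {
	Required  string
	Preferred string
}

// resolve returns nil when the workload opted out of topology-aware
// scheduling (no topologyName); otherwise it fills Required from the
// workload and Preferred from the referenced ClusterTopology.
func resolve(w Workload, ct *ClusterTopology) *PackConstraint {
	if w.TopologyName == "" {
		return nil // no topology constraint emitted at all
	}
	c := &PackConstraint{Required: w.PackDomain}
	if ct != nil {
		c.Preferred = ct.PreferredPackDomain
	}
	return c
}

func main() {
	// No topologyName: no constraint is emitted.
	fmt.Println(resolve(Workload{}, nil) == nil) // prints: true

	// topologyName set, topology has no preferredPackDomain:
	// existing required-only behavior is preserved.
	c := resolve(Workload{TopologyName: "h100-topology", PackDomain: "rack"}, &ClusterTopology{})
	fmt.Println(c.Required, c.Preferred == "") // prints: rack true
}
```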
This is an interesting point: under this regime it is not guaranteed that providing a `topologyName` without a `packDomain` always gets you some best-effort scheduling. I am thinking we should instead default `preferredPackDomain` on a `ClusterTopology` to the lowest level unless the cluster admin explicitly opts out. Without this, it seems unfair that the user cannot get best-effort topology packing without risking their workload going unscheduled via `packDomain` requirements.
I believe `preferredPackDomain` should be viewed more as an escape hatch for the cluster admin, to avoid making the scheduler slow in cases where the ideal behavior (preferring to pack at the lowest level) is not feasible; it is therefore the cluster admin's responsibility to decide to take that escape hatch.
I think those questions boil down to who is responsible for the preferred-topology behavior, i.e. who takes care of applying it without risking cluster operation.
If we agree it's the cluster admin (as we discussed so far), then "unfair to the user to not be able to get best-effort topology" is irrelevant. The user only controls required constraints, not preferred ones.
If we make the user responsible, then it should be a workload-level API field.
If it's the user, with some policies enforced by the admin, then again it should be a workload-level API field that can be controlled by admin policy tools.
If we make preferred topology the default behavior, then not only do we need `preferredPackDomain` as an "escape hatch" as you said, we also need some boolean `preferredPackEnabled` field to opt out altogether. IMHO setting the highest topology level as preferred is NOT the same as opting out. Opting out means the preferred level is not used at all, not even mentioned as part of the PodGang created for the PCS, while setting the highest level has different semantics and can have different implications in different scheduler backends.
Apologies for the long reply; I think there are some fundamental decisions to make about which persona we are serving with this new feature.
> #### Story 3: Admin-controlled preferred topology
>
> As a cluster administrator, I want to configure the preferred packing domain on each `ClusterTopology` so that workloads using that topology receive consistent best-effort placement behavior without each workload choosing infrastructure-specific policy.
I think rather than "each workload choosing infrastructure-specific policy", it's more that, as the cluster admin, you know and care about the compute load on the scheduler and want to keep the scheduler from becoming slow while enabling ideal best-effort placement as much as possible.
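For context, the admin-side configuration this story implies might look like the following sketch. `preferredPackDomain` is the field proposed in this PR; the manifest shape and level name are illustrative assumptions:

```yaml
# Sketch: the cluster admin sets the best-effort packing level once per topology.
kind: ClusterTopology
metadata:
  name: h100-topology
spec:
  preferredPackDomain: rack   # every workload referencing this topology
                              # gets best-effort rack packing automatically
```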
> ```yaml
> packDomain: rack
> ```
>
> ### User Stories
I think there should be an additional user story for the case where the user really cares about some piece of the workload being scheduled at the lowest topology level, even though preferred is not the lowest, and decides to provide a `packDomain` lower than the preferred level.
If there's such a user story, the proposed API would need to change substantially: from ClusterTopology-driven to PCS-driven, or both.
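A hedged sketch of that scenario: the admin keeps the preferred level conservative to protect the scheduler, while the user requires tighter packing for one workload. All names and the manifest shapes are illustrative assumptions:

```yaml
# Admin side: conservative best-effort level.
kind: ClusterTopology
metadata:
  name: h100-topology
spec:
  preferredPackDomain: zone
---
# User side: a hard requirement at a level lower than the preferred one.
kind: PodCliqueSet        # assumed workload kind
spec:
  template:
    topologyConstraint:
      topologyName: h100-topology
      packDomain: rack    # required rack packing wins over best-effort zone
```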
> ### Limitations/Risks & Mitigations
>
> - **Scheduler performance:** Preferred topology placement can trigger compute-intensive scheduler work. *Mitigation: preferred placement is only emitted for workloads that set `topologyName` and reference a `ClusterTopology` that defines `preferredPackDomain`; it is not a blanket default for all workloads.*
Here I think the main points are: 1. not all workloads do topology-aware scheduling (already covered), and 2. the cluster admin can set `preferredPackDomain` to something higher than the lowest level, guarding against people forcing the scheduler to do a lot of extra work unless they care enough about the level to set it as a `packDomain`. In the current design, 2 is not relevant because there's no default value for `preferredPackDomain`, but if we agree to change that, then I think this limitation should include 2 in its explanation.
> template:
>   topologyConstraint:
>     topologyName: h100-topology
>     packDomain: rack
IMHO, we should rename `packDomain` to `requiredPackDomain`.
As long as we keep `packDomain` alone, without the preferred one specified alongside it, I disagree. When you only have a pack domain, it's obvious that it's required: like any other field in a spec, specifying it means it's required.
If we add preferred to this struct, I definitely agree they should be called preferred and required.
Otherwise, when we add spread constraints, we'll have `packDomain` and `spreadDomain`.
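The two naming schemes under discussion, sketched side by side. This is illustrative Go only; neither struct is the settled API:

```go
package main

import "fmt"

// Option A: axis-based names. "Required" is implicit in specifying the
// field, and a future spread constraint slots in alongside it.
type TopologyConstraintAxisNamed struct {
	TopologyName string
	PackDomain   string // required, by virtue of being specified
	SpreadDomain string // hypothetical future field
}

// Option B: strength-based names, needed only if required and preferred
// pack domains ever live side by side in the same struct.
type TopologyConstraintStrengthNamed struct {
	TopologyName        string
	RequiredPackDomain  string
	PreferredPackDomain string
}

func main() {
	a := TopologyConstraintAxisNamed{TopologyName: "h100-topology", PackDomain: "rack"}
	b := TopologyConstraintStrengthNamed{TopologyName: "h100-topology", RequiredPackDomain: "rack", PreferredPackDomain: "zone"}
	fmt.Println(a.PackDomain, b.PreferredPackDomain) // prints: rack zone
}
```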
Non-blocking: I like the move to make preferred topology admin-controlled instead of another workload knob. One thing I'd love to make more explicit is the user/debug story: when preferred placement is requested but not achieved, how does a platform operator explain that to the user? Today the monitoring section says to inspect the generated PodGang / placement score, but from a PM/operator lens this optimization is otherwise pretty invisible. Should alpha/beta include an event, a condition, or at least a documented kubectl path that distinguishes "preferred constraint was propagated" from "preferred placement was not satisfied"?
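One possible shape for that signal, sketched as a status condition. The condition type, reason, and message below are entirely invented for illustration and do not exist in the proposal:

```yaml
# Hypothetical PodGang status condition an operator could surface via
# `kubectl describe`; all names are invented.
status:
  conditions:
    - type: PreferredTopologyPlacement
      status: "False"
      reason: PreferredDomainNotSatisfied
      message: "preferred pack domain 'rack' could not be satisfied; placed at 'zone'"
```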
Proposes adding `packDomainPreferred` alongside the existing `packDomain` field in `TopologyConstraint`, enabling best-effort topology scheduling without risking workload failure when the constraint cannot be satisfied.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replace 'fail/failing' with 'block/blocking' throughout: the workload remains pending, not failed, when a required topology constraint cannot be satisfied
- Expand the Webhook Validation section to clarify that unknown `packDomainPreferred` domains are rejected at admission time, same as `packDomain`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The `packDomain`/`packDomainPreferred` coexistence semantics are already fully described in the Proposal section.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…S, not ClusterTopology)
97af3ae to 9ceff8f
Summary

- Adds `ClusterTopology.spec.preferredPackDomain` as an administrator-controlled best-effort topology packing domain
- Applies to workloads that set `topologyName`, with or without a required `packDomain`
- Grove reads the preferred domain from the referenced `ClusterTopology` and propagates it to the scheduler as `TopologyPackConstraint.Preferred`; `packDomain` continues to map to `TopologyPackConstraint.Required`

Key design decisions

- The preferred domain lives on `ClusterTopology`, so administrators control the scheduler-cost tradeoff per topology
- Workloads without a `topologyName` do not receive preferred topology constraints
- When `preferredPackDomain` is unset on the referenced topology, Grove preserves the existing required-only behavior for `packDomain`

Closes #368
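The admission-time validation mentioned in the commits above (unknown pack domains rejected, same as `packDomain`) can be sketched as follows. The function name and level names are assumptions, not Grove's actual webhook code:

```go
package main

import "fmt"

// validatePackDomain is an illustrative admission check: a pack domain
// (required or preferred) must name a level defined in the referenced
// ClusterTopology; an unset domain is always allowed.
func validatePackDomain(domain string, levels []string) error {
	if domain == "" {
		return nil
	}
	for _, l := range levels {
		if l == domain {
			return nil
		}
	}
	return fmt.Errorf("unknown pack domain %q: not a level of the referenced topology", domain)
}

func main() {
	// Hypothetical level list for a topology.
	levels := []string{"datacenter", "zone", "rack", "node"}
	fmt.Println(validatePackDomain("rack", levels) == nil)  // prints: true
	fmt.Println(validatePackDomain("blade", levels) != nil) // prints: true
}
```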