feat: GREP-0368 Preferred Topology Constraint API #523

Open

shayasoolin wants to merge 9 commits into ai-dynamo:main from shayasoolin:preferred-topology

Conversation

@shayasoolin (Contributor) commented Apr 13, 2026

Summary

  • Proposes adding ClusterTopology.spec.preferredPackDomain as an administrator-controlled best-effort topology packing domain
  • Allows workloads to opt into topology-aware placement by setting topologyName with or without a required packDomain
  • Grove resolves the preferred domain from the referenced ClusterTopology and propagates it to the scheduler as TopologyPackConstraint.Preferred; packDomain continues to map to TopologyPackConstraint.Required (see the sketch below)
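
For concreteness, here is a minimal sketch of the two pieces involved. The field names `topologyName`, `packDomain`, and `preferredPackDomain` come from this proposal; the `apiVersion`, the level names, and the rest of the `ClusterTopology` schema are illustrative assumptions, not the actual CRDs.

```yaml
# Sketch only: apiVersion, level names, and surrounding schema are assumed.
apiVersion: grove.io/v1alpha1   # hypothetical group/version
kind: ClusterTopology
metadata:
  name: h100-topology
spec:
  preferredPackDomain: rack     # admin-controlled best-effort packing domain
---
# Workload side: opting into topology-aware placement.
# packDomain is optional; omitting it leaves only the preferred constraint.
topologyConstraint:
  topologyName: h100-topology   # references the ClusterTopology above
  packDomain: zone              # optional hard requirement
```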

Key design decisions

  • Preferred topology policy lives on ClusterTopology, so administrators control the scheduler-cost tradeoff per topology
  • Workloads that do not set topologyName do not receive preferred topology constraints
  • If preferredPackDomain is unset on the referenced topology, Grove preserves the existing required-only behavior
  • Preferred topology scheduling is best-effort and may be compute-intensive; hard placement requirements continue to use packDomain (the resulting scheduler-facing constraint is sketched below)
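
Under the mapping described above, the constraint Grove hands to the scheduler might look like the following excerpt. Only the `TopologyPackConstraint.Required`/`.Preferred` split comes from the proposal; the YAML shape and field spellings are hypothetical, not the actual PodGang schema.

```yaml
# Hypothetical rendering of the emitted constraint (not the real PodGang schema).
topologyPackConstraint:
  required: zone    # from the workload's packDomain
  preferred: rack   # resolved from ClusterTopology.spec.preferredPackDomain
# If the referenced topology sets no preferredPackDomain, only `required`
# is emitted, preserving today's behavior.
```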

Closes #368

@copy-pr-bot (Bot) commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvrohanv (Contributor) commented:

What is the reason for providing preferred as an option over just packing everything as close as possible if a topology constraint is provided? I remember we discussed earlier that preferred really seems to just be trying to save scheduler cycles by allowing it to skip levels. The main case I can see is if we don't want to pack everything as close as possible by default, but then that really sounds more like a spread constraint.

@shayasoolin (Contributor, Author) replied:

> What is the reason for providing preferred as an option over just packing everything as close as possible if a topology constraint is provided? I remember we discussed earlier that preferred really seems to just be trying to save scheduler cycles by allowing it to skip levels. The main case I can see is if we don't want to pack everything as close as possible by default, but then that really sounds more like a spread constraint.

@nvrohanv please see the short discussion in issue #368.
The cons of having 'prefer lowest level' by default are 1) it might be compute-intensive for the scheduler, and 2) now, with multiple topologies, we have to get the topology name from the user in order to resolve the 'lowest level' name.
Generally speaking, the approach is to treat Grove as an infrastructure-level component that allows the higher-level layers to decide on the appropriate default behaviors.

@gal-revach commented:

@shayasoolin, a question on "Single field addition to the shared TopologyConstraint struct — covers PodCliqueSet, PodCliqueScalingGroupConfig, and PodCliqueTemplateSpec in one change": does it mean that the level determined by packDomainPreferred will apply to all workload levels (i.e., the user will not be able to specify different levels for PodCliqueSet and PodCliqueScalingGroupConfig, for example)? Is it the same for the required packDomain?

@shayasoolin (Contributor, Author) replied:

> @shayasoolin, a question on "Single field addition to the shared TopologyConstraint struct — covers PodCliqueSet, PodCliqueScalingGroupConfig, and PodCliqueTemplateSpec in one change": does it mean that the level determined by packDomainPreferred will apply to all workload levels (i.e., the user will not be able to specify different levels for PodCliqueSet and PodCliqueScalingGroupConfig, for example)? Is it the same for the required packDomain?

@gal-revach it will be possible to specify it at each level (with auto-propagation from each level to its children), just like packDomain.
It's a single addition to that struct, but the struct is part of the PCS, PCSG, and PCLQ levels alike, so you get the full control you expect; see the sketch below.
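
A sketch of what that per-level control could look like; the clique/group names and the `podCliqueScalingGroupConfigs` field spelling are assumptions for illustration:

```yaml
# Sketch only: names and exact field spellings are hypothetical.
kind: PodCliqueSet
spec:
  template:
    topologyConstraint:
      topologyName: h100-topology
      packDomain: zone            # PCS-level constraint
    podCliqueScalingGroupConfigs:
      - name: prefill
        topologyConstraint:
          topologyName: h100-topology
          packDomain: rack        # tighter, PCSG-level override
      - name: decode              # no constraint set: inherits the PCS-level
                                  # constraint via auto-propagation
```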

@gflarity (Contributor) left a comment:


Looks good, just a few things to clarify.

danbar2 previously approved these changes Apr 28, 2026
shayasoolin force-pushed the preferred-topology branch from 519e32a to 97af3ae on May 3, 2026
@nvrohanv (Contributor) left a comment:


Great design! Overall I agree with it, but I think there are two changes that should occur:

  1. I think every ClusterTopology CR should have a default value for preferredPackDomain, set to the lowest level. I explain my rationale in one of the comments; if you agree, we should spell out that rationale in the motivations section.
  2. The motivations section should also state why we don't expose this as a user-facing API: always preferring the lowest level can destroy scheduler performance, it is hard to trust every user to care about that, and so we delegate the responsibility to the cluster admin.

> The resulting behavior is:
>
> - Workload has no `topologyName`: Grove emits no topology constraint for that workload.
> - Workload has `topologyName`, and the referenced `ClusterTopology` has no `preferredPackDomain`: existing required `packDomain` behavior is preserved.
Contributor commented:

This is an interesting point: under this regime, providing a topologyName without a packDomain does not guarantee you get any best-effort scheduling. I am thinking we should instead default preferredPackDomain on a ClusterTopology to the lowest level unless the cluster admin explicitly opts out. Without that, it seems unfair that a user cannot get best-effort topology packing without risking their workload going unscheduled via packDomain requirements.

I believe preferredPackDomain should be viewed more as an escape hatch for the cluster admin to avoid slowing the scheduler in cases where the ideal behavior, preferring to pack at the lowest level, is not feasible; it is therefore the cluster admin's responsibility to decide to take that escape hatch.

shayasoolin (Author) replied:

I think these questions boil down to who is responsible for preferred topology, who takes care of applying it without risking cluster operation.
If we agree it's the cluster admin (as discussed so far), then "unfair to the user to not be able to get best-effort topology" is irrelevant: the user only controls required constraints, not preferred ones.
If we make the user responsible, then it should be a workload-level API field.
If it's the user, with some policies enforced by the admin, then again it should be a workload-level API field that can be controlled by admin policy tools.
If we make preferred topology the default behavior, then we not only need preferredPackDomain as an "escape hatch" as you said, we also need some boolean preferredPackEnabled field to opt out altogether. IMHO, setting the highest topology level as preferred is NOT the same as opting out: opting out means the preferred level is not used at all, not even mentioned in the PodGang created for the PCS, while setting the highest level has different semantics and can have different implications across scheduler backends.

Apologies for the long reply; I think there are some fundamental decisions to make regarding the persona we aim to serve with this new feature.


> #### Story 3: Admin-controlled preferred topology
>
> As a cluster administrator, I want to configure the preferred packing domain on each `ClusterTopology` so that workloads using that topology receive consistent best-effort placement behavior without each workload choosing infrastructure-specific policy.
Contributor commented:

I think rather than "each workload choosing infrastructure-specific policy", it's more that, as the cluster admin, you know and care about the compute load on the scheduler and want to keep it from becoming slow while still enabling ideal best-effort placement as much as possible.

> ```
> packDomain: rack
> ```
>
> ### User Stories
Contributor commented:

I think there should be an additional user story for the case where the user really cares about some piece of the workload being scheduled at the lowest topology level, even though the preferred level is not the lowest, and therefore provides a packDomain that is lower than the preferred level.

shayasoolin (Author) replied:

If there's such a user story, the proposed API would need to change substantially: from ClusterTopology-driven to PCS-driven, or both.


> ### Limitations/Risks & Mitigations
>
> - **Scheduler performance:** Preferred topology placement can trigger compute-intensive scheduler work. *Mitigation: preferred placement is only emitted for workloads that set `topologyName` and reference a `ClusterTopology` that defines `preferredPackDomain`; it is not a blanket default for all workloads.*
Contributor commented:

Here I think the main points are: 1. not all workloads do topology-aware scheduling (already covered), and 2. the cluster admin can set preferredPackDomain to something higher than the lowest level, guarding against people forcing the scheduler to do a lot of extra work unless they care enough about a level to set it as a packDomain. In the current design, point 2 is not relevant because there is no default value for preferredPackDomain, but if we agree to change that, then this limitation should include point 2 in its explanation.

> ```yaml
> template:
>   topologyConstraint:
>     topologyName: h100-topology
>     packDomain: rack
> ```
Contributor commented:

IMHO, we should rename packDomain to requiredPackDomain.

shayasoolin (Author) replied:

As long as packDomain stands alone, without a preferred one specified alongside it, I disagree. When there is only a 'pack domain', it is obviously required: like any other field in spec, specifying it means it is required.
If we add preferred to this struct, I definitely agree the two should be called preferred and required.
Otherwise, when we add spread constraints, we'll have packDomain and spreadDomain.
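
To illustrate the naming tradeoff (hypothetical field names throughout):

```yaml
# Today: a lone packDomain is implicitly required.
topologyConstraint:
  topologyName: h100-topology
  packDomain: rack

# Hypothetical struct if preferred ever joins it; explicit prefixes then
# earn their keep, while spread stays distinct by semantics, not prefix:
# topologyConstraint:
#   topologyName: h100-topology
#   requiredPackDomain: rack
#   preferredPackDomain: numa
#   spreadDomain: zone
```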

@athreesh commented May 3, 2026

Non-blocking: I like the move to make preferred topology admin-controlled instead of another workload knob.

One thing I'd love to make more explicit is the user/debug story: when preferred placement is requested but not achieved, how does a platform operator explain that to the user? Today the monitoring section says to inspect the generated PodGang / placement score, but from a PM/operator lens this optimization is otherwise pretty invisible. Should alpha/beta include an event, a condition, or at least a documented kubectl path that distinguishes "preferred constraint was propagated" from "preferred placement was not satisfied"?

shayasoolin and others added 6 commits May 13, 2026 11:17
Proposes adding packDomainPreferred alongside the existing packDomain
field in TopologyConstraint, enabling best-effort topology scheduling
without risking workload failure when the constraint cannot be satisfied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace 'fail/failing' with 'block/blocking' throughout — the workload
  remains pending, not failed, when a required topology constraint cannot
  be satisfied
- Expand Webhook Validation section to clarify that unknown
  packDomainPreferred domains are rejected at admission time, same as
  packDomain

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The packDomain/packDomainPreferred coexistence semantics are already
fully described in the Proposal section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shayasoolin force-pushed the preferred-topology branch from 97af3ae to 9ceff8f on May 13, 2026


Development

Successfully merging this pull request may close these issues:

Support API for requesting Preferred topology constraint (#368)

7 participants