fix: publish scale selectors for user-managed HPAs #591

Open
yankay wants to merge 1 commit into ai-dynamo:main from yankay:feat/improve-scale-target-discoverability

@yankay yankay commented May 8, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Grove currently publishes the /scale selector for PodClique and PodCliqueScalingGroup only when Grove-managed autoscaling is configured. That blocks user-managed HPAs, because the HPA controller reads the selector from the target's /scale subresource and fails when it is missing.
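To illustrate why a missing selector is fatal, here is a minimal sketch of a consumer that, like an autoscaler, needs the selector string from /scale to find the pods it measures. It is not the HPA controller's code: `parseSelector` and `matches` are hypothetical helpers, only equality-based terms are handled, and the label used in `main` is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseSelector parses an equality-based selector string such as "a=b,c=d".
// An empty string is rejected: without a selector there is no way to find
// the pods behind the scale target.
func parseSelector(s string) (map[string]string, error) {
	if strings.TrimSpace(s) == "" {
		return nil, errors.New("selector is required")
	}
	terms := make(map[string]string)
	for _, part := range strings.Split(s, ",") {
		key, value, ok := strings.Cut(part, "=")
		if !ok || strings.TrimSpace(key) == "" {
			return nil, fmt.Errorf("invalid selector term %q", part)
		}
		terms[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return terms, nil
}

// matches reports whether every selector term is present in the pod labels.
func matches(selector, podLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Placeholder selector; a real one comes from the target's /scale status.
	sel, err := parseSelector("app.kubernetes.io/part-of=my-pcs")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pod := map[string]string{"app.kubernetes.io/part-of": "my-pcs"}
	fmt.Println("pod selected:", matches(sel, pod))

	// An unpublished selector surfaces as an error instead of a pod list.
	_, err = parseSelector("")
	fmt.Println("empty selector:", err)
}
```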

This PR makes PodCliqueSet, standalone PodClique, and PodCliqueScalingGroup always publish their scale selector. PodCliques that belong to a PodCliqueScalingGroup still do not publish their own selector, because they should be scaled through the PCSG target.

How to use:

  1. Create a PodCliqueSet without autoscaling configured for the target that should be managed by a user-created HPA.
  2. List the PodCliqueSet and generated targets:
    • Whole workload: kubectl get pcs <pcs-name>
    • PCSGs: kubectl get pcsg -l app.kubernetes.io/part-of=<pcs-name>
    • Standalone PodCliques: kubectl get pclq -l app.kubernetes.io/part-of=<pcs-name>
  3. Use the selected target name in the HPA scaleTargetRef with kind: PodCliqueSet, PodCliqueScalingGroup, or PodClique.
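Step 3 can be sketched as a small generator that prints an `autoscaling/v2` HPA manifest for a chosen target. This is an illustration, not part of the PR: `hpaManifest` is a hypothetical helper, the Grove API version `grove.io/v1alpha1` is an assumption to verify against the installed CRDs, the CPU utilization metric is an arbitrary choice, and `my-pcs-0-workers` stands in for a name returned by the `kubectl get` commands above.

```go
package main

import "fmt"

// groveAPIVersion is an assumption; confirm it against your installed CRDs.
const groveAPIVersion = "grove.io/v1alpha1"

// hpaManifest renders a HorizontalPodAutoscaler that targets a Grove resource.
// kind must be one of the three scale targets named in this PR.
func hpaManifest(kind, targetName string, minReplicas, maxReplicas int) (string, error) {
	switch kind {
	case "PodCliqueSet", "PodCliqueScalingGroup", "PodClique":
	default:
		return "", fmt.Errorf("unsupported scale target kind %q", kind)
	}
	if targetName == "" {
		return "", fmt.Errorf("target name must not be empty")
	}
	if minReplicas < 1 || maxReplicas < minReplicas {
		return "", fmt.Errorf("invalid replica bounds %d..%d", minReplicas, maxReplicas)
	}
	return fmt.Sprintf(`apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: %[1]s-hpa
spec:
  scaleTargetRef:
    apiVersion: %[2]s
    kind: %[3]s
    name: %[1]s
  minReplicas: %[4]d
  maxReplicas: %[5]d
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
`, targetName, groveAPIVersion, kind, minReplicas, maxReplicas), nil
}

func main() {
	// Placeholder target name; substitute one listed by kubectl get pcsg.
	out, err := hpaManifest("PodCliqueScalingGroup", "my-pcs-0-workers", 1, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The output can be piped to `kubectl apply -f -` once the target name and API version have been checked.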

See docs/user-guide/02_pod-and-resource-naming-conventions/02_naming-conventions.md for the full example.

Which issue(s) this PR fixes:

Fixes #158

Special notes for your reviewer:

Scope intentionally avoids new API fields, labels, or printer columns.

PCS-level handling was added in response to review feedback. The test matrix in reconcilestatus_test.go for each of PodCliqueSet / PodClique / PodCliqueScalingGroup documents which resources now (or still do not) publish a selector.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

docs/user-guide/02_pod-and-resource-naming-conventions/02_naming-conventions.md

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yankay yankay changed the title WIP: publish scale selectors for user-managed HPAs publish scale selectors for user-managed HPAs May 8, 2026
@yankay yankay marked this pull request as ready for review May 8, 2026 14:20
@yankay yankay changed the title publish scale selectors for user-managed HPAs fix: publish scale selectors for user-managed HPAs May 8, 2026
@yankay yankay force-pushed the feat/improve-scale-target-discoverability branch from d140bbb to fb99983 Compare May 8, 2026 14:38

@shayasoolin shayasoolin left a comment

Nice, thanks for your contribution!
What about the PCS level? It's also scalable. Does it need similar handling, or is its scale resource always published already?

@yankay yankay force-pushed the feat/improve-scale-target-discoverability branch from fb99983 to 6d43732 Compare May 14, 2026 14:01

yankay commented May 14, 2026

Thanks @shayasoolin — good catch. The PCS /scale selector was also unset, so the latest revision publishes one for PodCliqueSet, with TestMutateSelector covering it and the doc example extended. PR description updated to reflect the broader scope.

@yankay yankay force-pushed the feat/improve-scale-target-discoverability branch 3 times, most recently from 1cd7d9e to bea9fd3 Compare May 14, 2026 14:36
Grove only published the /scale subresource selector on PodClique and PodCliqueScalingGroup when Grove-managed autoscaling was configured. That blocks user-managed HPAs, because the HPA controller reads the selector from the target's /scale subresource and fails when it is missing.

PodCliqueSet, standalone PodClique, and PodCliqueScalingGroup now publish their scale selectors. PodCliques that belong to a PodCliqueScalingGroup still do not publish their own selector, because they should be scaled through the PCSG target.

Adds user-guide documentation for using generated names in user-managed HPAs, and unit tests covering:

- PCS publishes a selector
- standalone PCLQ with/without ScaleConfig publishes a selector
- PCSG-member PCLQ never publishes a selector, regardless of ScaleConfig
- PCSG with/without ScaleConfig publishes a selector
- PCSG whose name is not present in the PCS template does not publish
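The matrix above can be restated as a table-driven check. This is a simplified sketch, not the code in reconcilestatus_test.go: the `target` struct, its fields, and `publishesSelector` are hypothetical stand-ins for the real CRD types and reconcilers, and they encode only the rules listed in this commit message.

```go
package main

import "fmt"

// target is a hypothetical, flattened view of a Grove resource.
type target struct {
	kind           string // "PodCliqueSet", "PodClique", or "PodCliqueScalingGroup"
	hasScaleConfig bool   // Grove-managed autoscaling configured
	inPCSG         bool   // PodClique only: member of a PodCliqueScalingGroup
	inPCSTemplate  bool   // PodCliqueScalingGroup only: named in the PCS template
}

// publishesSelector applies the rules from the list above. hasScaleConfig is
// deliberately ignored: after this change it no longer affects the outcome.
func publishesSelector(t target) bool {
	switch t.kind {
	case "PodCliqueSet":
		return true
	case "PodClique":
		return !t.inPCSG
	case "PodCliqueScalingGroup":
		return t.inPCSTemplate
	default:
		return false
	}
}

type testCase struct {
	name string
	in   target
	want bool
}

// cases mirrors the bullet list, one or two rows per bullet.
func cases() []testCase {
	return []testCase{
		{"PCS", target{kind: "PodCliqueSet"}, true},
		{"standalone PCLQ with ScaleConfig", target{kind: "PodClique", hasScaleConfig: true}, true},
		{"standalone PCLQ without ScaleConfig", target{kind: "PodClique"}, true},
		{"PCSG-member PCLQ with ScaleConfig", target{kind: "PodClique", hasScaleConfig: true, inPCSG: true}, false},
		{"PCSG-member PCLQ without ScaleConfig", target{kind: "PodClique", inPCSG: true}, false},
		{"PCSG with ScaleConfig", target{kind: "PodCliqueScalingGroup", hasScaleConfig: true, inPCSTemplate: true}, true},
		{"PCSG without ScaleConfig", target{kind: "PodCliqueScalingGroup", inPCSTemplate: true}, true},
		{"PCSG missing from PCS template", target{kind: "PodCliqueScalingGroup"}, false},
	}
}

// report formats one row of the matrix for printing.
func report(c testCase) string {
	status := "ok"
	if publishesSelector(c.in) != c.want {
		status = "MISMATCH"
	}
	return fmt.Sprintf("%-40s want=%-5v %s", c.name, c.want, status)
}

func main() {
	for _, c := range cases() {
		fmt.Println(report(c))
	}
}
```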

Fixes ai-dynamo#158

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
@yankay yankay force-pushed the feat/improve-scale-target-discoverability branch from bea9fd3 to 2620f03 Compare May 14, 2026 14:44

Development

Successfully merging this pull request may close these issues.

It should be easy for users to construct name for scaleTargetRef
