fix: stabilize PodCliqueSet hashes against map slice reorder #566

Open

gflarity wants to merge 15 commits into ai-dynamo:main from gflarity:nvbug-6109874

Conversation

@gflarity (Contributor) commented Apr 29, 2026

Summary

Fixes two related bugs that together caused Grove to gang-roll a PodCliqueSet on every spec update where the upstream operator happened to emit +listType=map slices in a different order. See #565 for the full repro and pod-lifecycle trace.

Root cause: dump.ForHash (used by both computeGenerationHash and ComputeHash) sorts map keys but preserves slice order. Several Kubernetes API slices that are declared +listType=map (and PodCliqueSet.Spec.Template.Cliques itself) are name-keyed sets with no inherent order, but they were being treated as ordered by the hash. Non-deterministic Go map iteration in upstream code (e.g. the Dynamo operator) flipped the hashes on every reconcile, which scheduled real rolling updates and silently destroyed in-flight scale state.
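For illustration (hypothetical writer code, not taken from Grove or the Dynamo operator), this is the failure mode in miniature: a +listType=map slice built from a Go map comes out in a different order on each run, so any hash that preserves slice order flips even though the desired state is identical.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Go map iteration order is deliberately randomized, so an upstream
	// operator that assembles Containers (or any other name-keyed
	// +listType=map slice) from a map emits the same set in varying order.
	containersByName := map[string]corev1.Container{
		"main":    {Name: "main", Image: "app:v1"},
		"sidecar": {Name: "sidecar", Image: "proxy:v1"},
	}
	containers := make([]corev1.Container, 0, len(containersByName))
	for _, c := range containersByName { // iteration order is not stable
		containers = append(containers, c)
	}
	// Two runs may print main/sidecar or sidecar/main; both describe the
	// same set, but an order-sensitive serialization hashes differently.
	for _, c := range containers {
		fmt.Println(c.Name)
	}
}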

Fixes:

  1. computeGenerationHash (operator/internal/controller/podcliqueset/reconcilespec.go) sorts Spec.Template.Cliques by name before hashing. Slice order is mixed back into the hash only for CliqueStartupTypeInOrder (via a startupOrderingMarker), where the slice index does encode startup-chain semantics. AnyOrder and Explicit no longer leak slice order into the hash.

  2. ComputeHash (operator/internal/utils/kubernetes/pod.go) canonicalizes order-independent +listType=map slices in PodSpec before hashing:

    • PodSpec: Containers, Volumes, ImagePullSecrets, HostAliases, TopologySpreadConstraints, ResourceClaims, EphemeralContainers
    • Container (regular, init, ephemeral): Ports, VolumeMounts, VolumeDevices, ResizePolicy

    InitContainers (sequential execution), Container.Env ($(VAR) substitution), EnvFrom / Tolerations (+listType=atomic), and Args / Command are intentionally NOT sorted — their order is part of the desired state.
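A rough sketch of the canonicalization idea behind both fixes (the function name and the exact slice list here are illustrative; the real logic lives in canonicalizePodTemplateSpecForHashing and computeGenerationHash in the diff): hash a deep copy whose name-keyed +listType=map slices are sorted, and leave the order-sensitive slices alone.

package sketch

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// canonicalizeForHashingSketch returns a deep copy of the pod template whose
// name-keyed +listType=map slices are sorted by name, so that two templates
// describing the same desired state serialize identically. Order-sensitive
// slices (InitContainers, Env, Args/Command, atomic lists) are left untouched.
func canonicalizeForHashingSketch(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
	out := in.DeepCopy()
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	sort.Slice(out.Spec.Volumes, func(i, j int) bool {
		return out.Spec.Volumes[i].Name < out.Spec.Volumes[j].Name
	})
	for i := range out.Spec.Containers {
		c := &out.Spec.Containers[i]
		sort.Slice(c.VolumeMounts, func(a, b int) bool {
			return c.VolumeMounts[a].Name < c.VolumeMounts[b].Name
		})
		sort.Slice(c.Ports, func(a, b int) bool {
			return c.Ports[a].Name < c.Ports[b].Name
		})
	}
	return out
}

In this sketch the sorting happens on a deep copy, so only the hash input is canonicalized; the template that is actually applied to the cluster keeps its original order.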

Test plan

  • go vet ./... clean on the affected packages
  • go test ./internal/utils/kubernetes/... ./internal/controller/common/component/utils/... ./internal/controller/podcliqueset/... ./internal/controller/podclique/components/pod/... all pass
  • Sort-invariance tests for every canonicalized slice (pod_test.go::TestComputeHash_AdditionalListTypeMapSlices + 8 sub-tests; mirrored at the per-PCLQ layer in podclique_test.go::TestComputePCLQPodTemplateHash_AdditionalListTypeMapSlices)
  • Reverse-direction regression tests (TestComputeHash_RealSpecChangesStillFlipHash, TestComputePCLQPodTemplateHash_RealSpecChangesStillFlipHash, TestComputeGenerationHash_RealCliqueTemplateChangeFlipsHash) — pin that real desired-state changes still flip the hash, so a future over-canonicalization can't silently break rolling updates
  • StartupType boundary cases (TestComputeGenerationHash_InOrderToAnyOrderFlipsHash, TestComputeGenerationHash_StartupTypeChangeFlipsHash, TestComputeGenerationHash_InOrderStartupIsSensitiveToCliqueOrder)
  • Realistic Dynamo-shaped PodSpec end-to-end test (TestComputePCLQPodTemplateHash_RealisticDynamoLikePodSpec) — same desired state, every +listType=map slice (including VolumeMounts) shuffled, hash must be stable
  • Nil-safety (TestComputeHash_NilSafety)
  • Combined replica-bump + clique-reorder no-op (TestComputeGenerationHash_CombinedReplicaChangeAndCliqueReorderIsNoOp) — the realistic scale-up patch shape from the bug report
  • E2E tests which upgrade Grove and confirm everything works as expected

Resolves #565: Rapid PodCliqueSet spec updates trigger full gang roll instead of in-place replica scale
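For reference, the sort-invariance tests above follow roughly this shape (the fixture, the hash helper, and the test name here are hypothetical stand-ins, not the actual tests listed in the plan):

package sketch

import (
	"encoding/json"
	"sort"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// computeHashSketch stands in for the real ComputeHash: it serializes a copy
// whose name-keyed slices are sorted, so reordering them cannot change the result.
func computeHashSketch(in *corev1.PodTemplateSpec) string {
	out := in.DeepCopy()
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	b, _ := json.Marshal(out.Spec)
	return string(b)
}

// A sort-invariance test: shuffling a +listType=map slice must not flip the hash.
func TestComputeHashSketch_ContainerReorderIsStable(t *testing.T) {
	base := &corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "main", Image: "app:v1"},
				{Name: "sidecar", Image: "proxy:v1"},
			},
		},
	}
	reordered := base.DeepCopy()
	reordered.Spec.Containers[0], reordered.Spec.Containers[1] =
		reordered.Spec.Containers[1], reordered.Spec.Containers[0]

	if computeHashSketch(base) != computeHashSketch(reordered) {
		t.Fatal("hash must not depend on +listType=map slice order")
	}
}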

Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal)

@copy-pr-bot (Bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gflarity changed the title from "fix: stabilize PodCliqueSet hashes against +listType=map slice reorder" to "fix: stabilize PodCliqueSet hashes against map slice reorder" on Apr 29, 2026
@gflarity (Contributor, Author) commented Apr 29, 2026

/ok to test e21f27e

@copy-pr-bot (Bot) commented Apr 29, 2026

/ok to test 85411b5

@gflarity, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@gflarity (Contributor, Author)

/ok to test e21f27e

@gflarity gflarity marked this pull request as ready for review April 30, 2026 00:54
danbar2 previously approved these changes Apr 30, 2026
@gflarity (Contributor, Author) commented Apr 30, 2026

After discussing with the group I see two viable options. See below for more information.

I've decided to implement the two-hash compatibility window fix because it keeps the upgrade self-contained in Grove: no external migration job, no need to coordinate with upstream writers such as Dynamo, and no requirement to block CR changes during the operator restart.

Option 1: Pre-Start Migration During Operator Restart

During upgrade, the old Grove pod stops. The new Grove pod starts but runs a migration before starting reconcilers. The migration computes canonical hashes and patches existing stored hashes/labels so normal reconciliation sees a consistent canonical state.

It must patch PCS, PCLQ, PCSG, and Pod hash fields, and should fail if an update is actively in progress.

Pros:

  • Avoids long-term controller complexity.
  • No legacy hash path in normal reconcile logic.
  • Can avoid the one-time roll if migration completes before reconcilers start.

Cons:

  • Upgrade path becomes more operationally complex.
  • Migration must be complete and idempotent.
  • Upstream spec writers, including Dynamo, should be quiesced or at least not racing PCS spec changes during migration.
  • If any hash-bearing field is missed, the new operator may still roll or report stale status.

Option 2: Two-Hash Compatibility

A transition Grove release computes both hashes from the current desired spec:

canonicalHash(current spec)
legacyHash(current spec)

If stored hash matches either, the object is treated as current. If it matches legacy, Grove patches it to canonical during reconciliation. If it matches neither, normal rolling-update behaviour applies.

Pros:

  • No special upgrade choreography.
  • Avoids the one-time upgrade roll for steady-state workloads.
  • Safer than blind relabeling because it only accepts legacyHash(current spec).

Cons:

  • Requires preserving the old hash algorithm temporarily.
  • Needs a follow-up release to remove legacy support.

Comment thread on operator/internal/controller/podcliqueset/reconcilespec.go (outdated)

func podTemplateSpecForGenerationHash(pclqTemplateSpec *grovecorev1alpha1.PodCliqueTemplateSpec, priorityClassName string) *corev1.PodTemplateSpec {
podTemplateSpec := &corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Collaborator

I am not sure that updating the PodCliqueTemplateSpec labels and annotations should necessarily trigger rolling updates. They need to be propagated to all the resources, not trigger updates. This is OK for now since it preserves current behavior.
cc @renormalize who is also looking into the fix.

@gflarity (Contributor, Author) commented Apr 30, 2026

Yeah, I wanted to preserve the current behaviour as much as is practical, since it's a pretty impactful change.

// so that two specs representing the same desired state always produce the
// same byte-for-byte serialization. See the doc on ComputeHash for the full
// list of slices canonicalized and the rationale for the slices left alone.
func canonicalizePodTemplateSpecForHashing(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
Collaborator

I agree the canonicalization is necessary to avoid triggering unnecessary rolling updates but I have a couple of questions here:

  • I am concerned how much performance hit this code will now introduce.
  • Is there a standard utility that is used by Kubernetes operators that we can reuse here?

cc @unmarshall @kangclzjc

Contributor (Author)

  • I am concerned how much performance hit this code will now introduce.

The existing hash path already reflects over the full template via dump.ForHash, so I expect the extra deep copy and bounded slice sorts to be small. I'm not sure what the state of the performance testing suite is, but a before-and-after run might make sense when it's ready.

  • Is there a standard utility that is used by Kubernetes operators that we can reuse here?

The thought occurred to me, but a quick search didn't find much. I also wanted to keep the canonical hashing logic close to the current hashing logic to make reviewing easy.

Contributor

The infra for the scale test is ready. I think it's a good use case to add a scale test with a scenario that highlights this issue. Run it on main as a baseline and on your branch to compare the results; I did a similar thing here, you can use it as an example.

Contributor (Author)

Done, no noticeable degradation in performance during the scale test. I even made one that looked more "Dynamo-like" with more volumes, containers, mounts etc.

@danbar2 (Contributor) commented May 3, 2026

> After discussing with the group I see two viable options. [...]
I'm in favor of the first option.
Note that we can never know whether the user upgraded Grove through every released version; if they jump from the current behavior to a version where the old annotation is removed completely, all of their PCSes will be updated.

@gflarity (Contributor, Author) commented May 9, 2026

Scale-test follow-up for the performance concern: I ran both 1k-pod scale tests against a fair local main baseline (09da63b, with only the new scale-test harness/YAML applied) and this branch (ce2b190) on the same 100-KWOK node scale profile. The existing ScaleTest_1000 showed no measurable regression, with total time at 547.8s on baseline vs 535.0s on this branch (-2.3%). The "Dynamo-like" variant, which better exercises the hash-canonicalization concern, was also not worse overall: total time was 688.7s baseline vs 605.7s branch (-12.1%); early deploy markers were slightly slower (pods-created +2.1s, pcs-available +3.3s), but pods-ready was 86.1s faster and the steady-state availability check was not slower.

2026-05-09-pr566-performance.tar.gz

@gflarity (Contributor, Author)

/ok to test ce2b190

@gflarity force-pushed the nvbug-6109874 branch 2 times, most recently from 1dccd9f to 5a2e834 on May 13, 2026 15:26
gflarity added 7 commits May 13, 2026 13:29
Grove computed two hashes (PCS-level generation hash, per-PCLQ
pod-template hash) by serializing API objects with dump.ForHash, which
preserves slice order. The Kubernetes API declares many PodSpec slices
+listType=map (Containers, Volumes, ImagePullSecrets, container Ports,
VolumeMounts, VolumeDevices, ResizePolicy, HostAliases,
TopologySpreadConstraints, ResourceClaims, EphemeralContainers) — they
are name-keyed sets with no inherent order. An upstream operator that
emits the same content in a different order (e.g. from non-deterministic
Go map iteration) flipped both hashes, causing Grove to gang-roll every
PodClique and lose any in-flight scale state.

Same root cause for PodCliqueSet.Spec.Template.Cliques, also
+listType=map +listMapKey=name.

Fixes:
- computeGenerationHash sorts cliques by name before hashing. Slice
  order is mixed back in via a startup-ordering marker only for
  CliqueStartupTypeInOrder, where slice index encodes the startup chain.
- ComputeHash canonicalizes order-independent +listType=map slices in
  PodSpec before hashing. InitContainers (sequential execution), Env
  ($(VAR) substitution), EnvFrom/Tolerations (atomic) are intentionally
  not sorted — those orderings are part of the desired state.

Resolves ai-dynamo#565

- Document Explicit-mode StartsAfter limitation on startupOrderingMarker:
  Spec.StartsAfter is not part of Spec.PodSpec, so today neither slice
  order nor StartsAfter mutations participate in the Explicit-mode
  generation hash. Tracked as a follow-up.
- TestProcessGenerationHashChange_CliqueReorderIsNoOp: also assert
  Status.CurrentGenerationHash is unchanged on a clique reorder.
- TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch:
  add the Explicit + reorder no-op corollary, which is the property
  most at risk of regressing if someone later "fixes" the marker logic
  to also engage on Explicit.
- Fix doc preamble for TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants:
  list Container.Env / InitContainers as order-sensitive +listType=map
  cases (not atomic) and use EnvFrom/Tolerations as the genuine atomic
  examples instead of VolumeMounts (which is in fact +listType=map).
- Rename env_var_reorder_changes_hash_atomic_listtype and
  resize_policy_reorder_changes_hash_atomic_listtype to drop the
  incorrect "atomic_listtype" suffix; clarify the assertion messages.
- Note in resize_policy subtest that the assertion would need to flip
  to assert.Equal if canonicalization is later extended to ResizePolicy.
- Update dead cross-reference in
  TestProcessPendingUpdatesPCSHashFlipDoesNotDeletePods comment to point
  at TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants in
  podclique_test.go.

Two doc-only corrections from review:

- pod.go: PodSpec.ReadinessGates and Container.RestartPolicyRules are
  +listType=atomic slice fields that ComputeHash correctly leaves
  untouched, but they were neither enumerated in the "Intentionally NOT
  sorted" list nor covered by the catch-all paragraph (which only
  mentioned scalar/struct, +listType=set, and absent fields). Add both
  fields to the enumeration and broaden the catch-all to include
  un-enumerated +listType=atomic slices.

- pod_test.go: TestComputeHash_DoesNotCanonicalizeAtomicSlices was
  misnamed — Container.Env and PodSpec.InitContainers are both
  +listType=map by name, not atomic. Their order is preserved due to
  runtime semantics ($(VAR) substitution / sequential init execution).
  Rename to TestComputeHash_DoesNotCanonicalizeOrderSensitiveSlices and
  rewrite the preamble to classify each subtest's field correctly.
  Also fix the env subtest assertion message which mislabelled
  Container.Env as +listType=atomic.

No behavior change.

TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch built
cliques without distinguishing labels, so the three per-clique pod
templates were byte-identical. The slice-order-invariance assertion
under Explicit would have passed trivially even against the bugged
pre-canonicalization code. Add a per-clique WithLabels so each clique
hashes distinctly, plus a comment explaining why it's load-bearing.
