fix: stabilize PodCliqueSet hashes against map slice reorder #566

Open

gflarity wants to merge 15 commits into ai-dynamo:main from gflarity:nvbug-6109874

Conversation

@gflarity (Contributor) commented Apr 29, 2026

Summary

Fixes two related bugs that together caused Grove to gang-roll a PodCliqueSet on every spec update where the upstream operator happened to emit +listType=map slices in a different order. See #565 for the full repro and pod-lifecycle trace.

Root cause: dump.ForHash (used by both computeGenerationHash and ComputeHash) sorts map keys but preserves slice order. Several Kubernetes API slices that are declared +listType=map (and PodCliqueSet.Spec.Template.Cliques itself) are name-keyed sets with no inherent order, but they were being treated as ordered by the hash. Non-deterministic Go map iteration in upstream code (e.g. the Dynamo operator) flipped the hashes on every reconcile, which scheduled real rolling updates and silently destroyed in-flight scale state.
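For illustration (hypothetical writer code, not taken from Grove or the Dynamo operator), this is the failure mode in miniature: a +listType=map slice built from a Go map comes out in a different order on each run, so any hash that preserves slice order flips even though the desired state is identical.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Go map iteration order is deliberately randomized, so an upstream
	// operator that assembles Containers (or any other name-keyed
	// +listType=map slice) from a map emits the same set in varying order.
	containersByName := map[string]corev1.Container{
		"main":    {Name: "main", Image: "app:v1"},
		"sidecar": {Name: "sidecar", Image: "proxy:v1"},
	}
	containers := make([]corev1.Container, 0, len(containersByName))
	for _, c := range containersByName { // iteration order is not stable
		containers = append(containers, c)
	}
	// Two runs may print main/sidecar or sidecar/main; both describe the
	// same set, but an order-sensitive serialization hashes differently.
	for _, c := range containers {
		fmt.Println(c.Name)
	}
}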

Fixes:

  1. computeGenerationHash (operator/internal/controller/podcliqueset/reconcilespec.go) sorts Spec.Template.Cliques by name before hashing. Slice order is mixed back into the hash only for CliqueStartupTypeInOrder (via a startupOrderingMarker), where the slice index does encode startup-chain semantics. AnyOrder and Explicit no longer leak slice order into the hash.

  2. ComputeHash (operator/internal/utils/kubernetes/pod.go) canonicalizes order-independent +listType=map slices in PodSpec before hashing:

    • PodSpec: Containers, Volumes, ImagePullSecrets, HostAliases, TopologySpreadConstraints, ResourceClaims, EphemeralContainers
    • Container (regular, init, ephemeral): Ports, VolumeMounts, VolumeDevices, ResizePolicy

    InitContainers (sequential execution), Container.Env ($(VAR) substitution), EnvFrom / Tolerations (+listType=atomic), and Args / Command are intentionally NOT sorted — their order is part of the desired state.
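A rough sketch of the canonicalization idea behind both fixes (the function name and the exact slice list here are illustrative; the real logic lives in canonicalizePodTemplateSpecForHashing and computeGenerationHash in the diff): hash a deep copy whose name-keyed +listType=map slices are sorted, and leave the order-sensitive slices alone.

package sketch

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// canonicalizeForHashingSketch returns a deep copy of the pod template whose
// name-keyed +listType=map slices are sorted by name, so that two templates
// describing the same desired state serialize identically. Order-sensitive
// slices (InitContainers, Env, Args/Command, atomic lists) are left untouched.
func canonicalizeForHashingSketch(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
	out := in.DeepCopy()
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	sort.Slice(out.Spec.Volumes, func(i, j int) bool {
		return out.Spec.Volumes[i].Name < out.Spec.Volumes[j].Name
	})
	for i := range out.Spec.Containers {
		c := &out.Spec.Containers[i]
		sort.Slice(c.VolumeMounts, func(a, b int) bool {
			return c.VolumeMounts[a].Name < c.VolumeMounts[b].Name
		})
		sort.Slice(c.Ports, func(a, b int) bool {
			return c.Ports[a].Name < c.Ports[b].Name
		})
	}
	return out
}

In this sketch the sorting happens on a deep copy, so only the hash input is canonicalized; the template that is actually applied to the cluster keeps its original order.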

Test plan

  • go vet ./... clean on the affected packages
  • go test ./internal/utils/kubernetes/... ./internal/controller/common/component/utils/... ./internal/controller/podcliqueset/... ./internal/controller/podclique/components/pod/... all pass
  • Sort-invariance tests for every canonicalized slice (pod_test.go::TestComputeHash_AdditionalListTypeMapSlices + 8 sub-tests; mirrored at the per-PCLQ layer in podclique_test.go::TestComputePCLQPodTemplateHash_AdditionalListTypeMapSlices)
  • Reverse-direction regression tests (TestComputeHash_RealSpecChangesStillFlipHash, TestComputePCLQPodTemplateHash_RealSpecChangesStillFlipHash, TestComputeGenerationHash_RealCliqueTemplateChangeFlipsHash) — pin that real desired-state changes still flip the hash, so a future over-canonicalization can't silently break rolling updates
  • StartupType boundary cases (TestComputeGenerationHash_InOrderToAnyOrderFlipsHash, TestComputeGenerationHash_StartupTypeChangeFlipsHash, TestComputeGenerationHash_InOrderStartupIsSensitiveToCliqueOrder)
  • Realistic Dynamo-shaped PodSpec end-to-end test (TestComputePCLQPodTemplateHash_RealisticDynamoLikePodSpec) — same desired state, every +listType=map slice (including VolumeMounts) shuffled, hash must be stable
  • Nil-safety (TestComputeHash_NilSafety)
  • Combined replica-bump + clique-reorder no-op (TestComputeGenerationHash_CombinedReplicaChangeAndCliqueReorderIsNoOp) — the realistic scale-up patch shape from the bug report
  • E2E tests which upgrade Grove and confirm everything works as expected

Resolves #565: Rapid PodCliqueSet spec updates trigger full gang roll instead of in-place replica scale
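For reference, the sort-invariance tests above follow roughly this shape (the fixture, the hash helper, and the test name here are hypothetical stand-ins, not the actual tests listed in the plan):

package sketch

import (
	"encoding/json"
	"sort"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// computeHashSketch stands in for the real ComputeHash: it serializes a copy
// whose name-keyed slices are sorted, so reordering them cannot change the result.
func computeHashSketch(in *corev1.PodTemplateSpec) string {
	out := in.DeepCopy()
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	b, _ := json.Marshal(out.Spec)
	return string(b)
}

// A sort-invariance test: shuffling a +listType=map slice must not flip the hash.
func TestComputeHashSketch_ContainerReorderIsStable(t *testing.T) {
	base := &corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "main", Image: "app:v1"},
				{Name: "sidecar", Image: "proxy:v1"},
			},
		},
	}
	reordered := base.DeepCopy()
	reordered.Spec.Containers[0], reordered.Spec.Containers[1] =
		reordered.Spec.Containers[1], reordered.Spec.Containers[0]

	if computeHashSketch(base) != computeHashSketch(reordered) {
		t.Fatal("hash must not depend on +listType=map slice order")
	}
}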

Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal)

@copy-pr-bot (Bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gflarity changed the title from "fix: stabilize PodCliqueSet hashes against +listType=map slice reorder" to "fix: stabilize PodCliqueSet hashes against map slice reorder" on Apr 29, 2026
@gflarity (Contributor, Author) commented Apr 29, 2026

/ok to test e21f27e

@copy-pr-bot (Bot) commented Apr 29, 2026

/ok to test 85411b5

@gflarity, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@gflarity (Contributor, Author)

/ok to test e21f27e

@gflarity gflarity marked this pull request as ready for review April 30, 2026 00:54
danbar2 previously approved these changes Apr 30, 2026
@gflarity (Contributor, Author) commented Apr 30, 2026

After discussing with the group I see two viable options. See below for more information.

I've decided to implement the two-hash compatibility window fix because it keeps the upgrade self-contained in Grove: no external migration job, no need to coordinate with upstream writers such as Dynamo, and no requirement to block CR changes during the operator restart.

Option 1: Pre-Start Migration During Operator Restart

During upgrade, the old Grove pod stops. The new Grove pod starts but runs a migration before starting reconcilers. The migration computes canonical hashes and patches existing stored hashes/labels so normal reconciliation sees a consistent canonical state.

It must patch PCS, PCLQ, PCSG, and Pod hash fields, and should fail if an update is actively in progress.

Pros:

  • Avoids long-term controller complexity.
  • No legacy hash path in normal reconcile logic.
  • Can avoid the one-time roll if migration completes before reconcilers start.

Cons:

  • Upgrade path becomes more operationally complex.
  • Migration must be complete and idempotent.
  • Upstream spec writers, including Dynamo, should be quiesced or at least not racing PCS spec changes during migration.
  • If any hash-bearing field is missed, the new operator may still roll or report stale status.

Option 2: Two-Hash Compatibility

A transition Grove release computes both hashes from the current desired spec:

canonicalHash(current spec)
legacyHash(current spec)

If stored hash matches either, the object is treated as current. If it matches legacy, Grove patches it to canonical during reconciliation. If it matches neither, normal rolling-update behaviour applies.

Pros:

  • No special upgrade choreography.
  • Avoids the one-time upgrade roll for steady-state workloads.
  • Safer than blind relabeling because it only accepts legacyHash(current spec).

Cons:

  • Requires preserving the old hash algorithm temporarily.
  • Needs a follow-up release to remove legacy support.

Comment thread on operator/internal/controller/podcliqueset/reconcilespec.go (outdated)

func podTemplateSpecForGenerationHash(pclqTemplateSpec *grovecorev1alpha1.PodCliqueTemplateSpec, priorityClassName string) *corev1.PodTemplateSpec {
podTemplateSpec := &corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Collaborator

I am not sure that updating the PodCliqueTemplateSpec labels and annotations should necessarily trigger rolling updates. They need to be propagated to all the resources, not trigger updates. This is OK for now since it preserves current behavior.
cc @renormalize who is also looking into the fix.

@gflarity (Contributor, Author) commented Apr 30, 2026

Yeah, I wanted to preserve the current behaviour as much as is practical, since it's a pretty impactful change.

// so that two specs representing the same desired state always produce the
// same byte-for-byte serialization. See the doc on ComputeHash for the full
// list of slices canonicalized and the rationale for the slices left alone.
func canonicalizePodTemplateSpecForHashing(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
Collaborator

I agree the canonicalization is necessary to avoid triggering unnecessary rolling updates but I have a couple of questions here:

  • I am concerned how much performance hit this code will now introduce.
  • Is there a standard utility that is used by Kubernetes operators that we can reuse here?

cc @unmarshall @kangclzjc

Contributor (Author)

  • I am concerned how much performance hit this code will now introduce.

The existing hash path already reflects over the full template via dump.ForHash, so I expect the extra deep copy and bounded slice sorts to be small. I'm not sure what the state of the performance testing suite is, but a before-and-after run might make sense when it's ready.

  • Is there a standard utility that is used by Kubernetes operators that we can reuse here?

The thought occurred to me, but a quick search didn't find much. I also wanted to keep the canonical hashing logic close to the current hashing logic to make reviewing easy.

Contributor

The infra for the scale test is ready. I think it's a good use case to add a scale test with a scenario that highlights this issue. Run it on main as a baseline and on your branch to compare the results; I did a similar thing here, you can use it as an example.

Contributor (Author)

Done, no noticeable degradation in performance during the scale test. I even made one that looked more "Dynamo-like" with more volumes, containers, mounts etc.

@danbar2 (Contributor) commented May 3, 2026

> After discussing with the group I see two viable options. [...]
I'm in favor of the first option.
Note that we can never know whether the user upgraded Grove through every released version; if they jump from the current behavior to a version where the old annotation is removed completely, all of their PCSes will be updated.

@gflarity (Contributor, Author) commented May 9, 2026

Scale-test follow-up for the performance concern: I ran both 1k-pod scale tests against a fair local main baseline (09da63b, with only the new scale-test harness/YAML applied) and this branch (ce2b190) on the same 100-KWOK node scale profile. The existing ScaleTest_1000 showed no measurable regression, with total time at 547.8s on baseline vs 535.0s on this branch (-2.3%). The "Dynamo-like" variant, which better exercises the hash-canonicalization concern, was also not worse overall: total time was 688.7s baseline vs 605.7s branch (-12.1%); early deploy markers were slightly slower (pods-created +2.1s, pcs-available +3.3s), but pods-ready was 86.1s faster and the steady-state availability check was not slower.

2026-05-09-pr566-performance.tar.gz

@gflarity (Contributor, Author)

/ok to test ce2b190

@gflarity force-pushed the nvbug-6109874 branch 2 times, most recently from 1dccd9f to 5a2e834 on May 13, 2026 15:26
gflarity added 7 commits May 13, 2026 13:29
Grove computed two hashes (PCS-level generation hash, per-PCLQ
pod-template hash) by serializing API objects with dump.ForHash, which
preserves slice order. The Kubernetes API declares many PodSpec slices
+listType=map (Containers, Volumes, ImagePullSecrets, container Ports,
VolumeMounts, VolumeDevices, ResizePolicy, HostAliases,
TopologySpreadConstraints, ResourceClaims, EphemeralContainers) — they
are name-keyed sets with no inherent order. An upstream operator that
emits the same content in a different order (e.g. from non-deterministic
Go map iteration) flipped both hashes, causing Grove to gang-roll every
PodClique and lose any in-flight scale state.

Same root cause for PodCliqueSet.Spec.Template.Cliques, also
+listType=map +listMapKey=name.

Fixes:
- computeGenerationHash sorts cliques by name before hashing. Slice
  order is mixed back in via a startup-ordering marker only for
  CliqueStartupTypeInOrder, where slice index encodes the startup chain.
- ComputeHash canonicalizes order-independent +listType=map slices in
  PodSpec before hashing. InitContainers (sequential execution), Env
  ($(VAR) substitution), EnvFrom/Tolerations (atomic) are intentionally
  not sorted — those orderings are part of the desired state.

Resolves ai-dynamo#565

- Document Explicit-mode StartsAfter limitation on startupOrderingMarker:
  Spec.StartsAfter is not part of Spec.PodSpec, so today neither slice
  order nor StartsAfter mutations participate in the Explicit-mode
  generation hash. Tracked as a follow-up.
- TestProcessGenerationHashChange_CliqueReorderIsNoOp: also assert
  Status.CurrentGenerationHash is unchanged on a clique reorder.
- TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch:
  add the Explicit + reorder no-op corollary, which is the property
  most at risk of regressing if someone later "fixes" the marker logic
  to also engage on Explicit.
- Fix doc preamble for TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants:
  list Container.Env / InitContainers as order-sensitive +listType=map
  cases (not atomic) and use EnvFrom/Tolerations as the genuine atomic
  examples instead of VolumeMounts (which is in fact +listType=map).
- Rename env_var_reorder_changes_hash_atomic_listtype and
  resize_policy_reorder_changes_hash_atomic_listtype to drop the
  incorrect "atomic_listtype" suffix; clarify the assertion messages.
- Note in resize_policy subtest that the assertion would need to flip
  to assert.Equal if canonicalization is later extended to ResizePolicy.
- Update dead cross-reference in
  TestProcessPendingUpdatesPCSHashFlipDoesNotDeletePods comment to point
  at TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants in
  podclique_test.go.

Two doc-only corrections from review:

- pod.go: PodSpec.ReadinessGates and Container.RestartPolicyRules are
  +listType=atomic slice fields that ComputeHash correctly leaves
  untouched, but they were neither enumerated in the "Intentionally NOT
  sorted" list nor covered by the catch-all paragraph (which only
  mentioned scalar/struct, +listType=set, and absent fields). Add both
  fields to the enumeration and broaden the catch-all to include
  un-enumerated +listType=atomic slices.

- pod_test.go: TestComputeHash_DoesNotCanonicalizeAtomicSlices was
  misnamed — Container.Env and PodSpec.InitContainers are both
  +listType=map by name, not atomic. Their order is preserved due to
  runtime semantics ($(VAR) substitution / sequential init execution).
  Rename to TestComputeHash_DoesNotCanonicalizeOrderSensitiveSlices and
  rewrite the preamble to classify each subtest's field correctly.
  Also fix the env subtest assertion message which mislabelled
  Container.Env as +listType=atomic.

No behavior change.

TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch built
cliques without distinguishing labels, so the three per-clique pod
templates were byte-identical. The slice-order-invariance assertion
under Explicit would have passed trivially even against the bugged
pre-canonicalization code. Add a per-clique WithLabels so each clique
hashes distinctly, plus a comment explaining why it's load-bearing.
