fix: stabilize PodCliqueSet hashes against map slice reorder #566
gflarity wants to merge 15 commits into
Conversation
/ok to test e21f27e

@gflarity, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

/ok to test e21f27e
After discussing with the group I see two viable options; see below for more information. I've decided to implement the two-hash compatibility window fix because it keeps the upgrade self-contained in Grove: no external migration job, no need to coordinate with upstream writers such as Dynamo, and no requirement to block CR changes during the operator restart.

**Option 1: Pre-Start Migration During Operator Restart**

During upgrade, the old Grove pod stops. The new Grove pod starts but runs a migration before starting reconcilers. The migration computes canonical hashes and patches existing stored hashes/labels so normal reconciliation sees a consistent canonical state. It must patch PCS, PCLQ, PCSG, and Pod hash fields, and should fail if an update is actively in progress.

Pros:

Cons:

**Option 2: Two-Hash Compatibility**

A transition Grove release computes both hashes from the current desired spec: `canonicalHash(current spec)` and the corresponding legacy hash. If the stored hash matches either, the object is treated as current. If it matches the legacy hash, Grove patches it to canonical during reconciliation. If it matches neither, normal rolling-update behaviour applies.

Pros:

Cons:
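To make the chosen option concrete, here is a minimal sketch of the compatibility check. `classifyStoredHash` and the action names are hypothetical illustrations, not the actual Grove implementation:

```go
package main

import "fmt"

type action int

const (
	actionCurrent          action = iota // stored hash already canonical: nothing to do
	actionPatchToCanonical               // same desired state under the old scheme: rewrite the hash in place
	actionRollingUpdate                  // genuinely different desired state: normal update flow
)

// classifyStoredHash decides how reconciliation treats an object whose stored
// hash may predate the canonical scheme. canonical and legacy are both
// computed from the *current* desired spec by the transition release.
func classifyStoredHash(stored, canonical, legacy string) action {
	switch stored {
	case canonical:
		return actionCurrent
	case legacy:
		return actionPatchToCanonical
	default:
		return actionRollingUpdate
	}
}

func main() {
	// A pre-upgrade object carries the legacy hash; the transition release
	// patches it to canonical instead of rolling the gang.
	fmt.Println(classifyStoredHash("legacy-abc", "canon-123", "legacy-abc") == actionPatchToCanonical) // true
}
```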
```go
func podTemplateSpecForGenerationHash(pclqTemplateSpec *grovecorev1alpha1.PodCliqueTemplateSpec, priorityClassName string) *corev1.PodTemplateSpec {
	podTemplateSpec := &corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{
```
I am not sure that updating the PodCliqueTemplateSpec labels and annotations should necessarily trigger rolling updates. They need to be propagated to all the resources, not trigger updates. This is OK for now since it preserves current behavior.
cc @renormalize who is also looking into the fix.
Ya, I wanted to preserve the current behaviour as much as is practical right now, since it's a pretty impactful change.
```go
// so that two specs representing the same desired state always produce the
// same byte-for-byte serialization. See the doc on ComputeHash for the full
// list of slices canonicalized and the rationale for the slices left alone.
func canonicalizePodTemplateSpecForHashing(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
```
I agree the canonicalization is necessary to avoid triggering unnecessary rolling updates, but I have a couple of questions here:

- I am concerned about how much of a performance hit this code will introduce.
- Is there a standard utility used by Kubernetes operators that we can reuse here?
> I am concerned about how much of a performance hit this code will introduce.

The existing hash path already reflects over the full template via dump.ForHash, so I expect the extra deep copy and bounded slice sorts to be small. I'm not sure what the state of the performance testing suite is, but a before-and-after run might make sense when it's ready.
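If a quick local number is useful before the scale suite is ready, a micro-benchmark of just the added work is easy to sketch. This is hypothetical and not part of the PR; it measures a deep copy plus two representative sorts, not Grove's actual canonicalize function:

```go
package kubernetes

import (
	"fmt"
	"sort"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// BenchmarkCanonicalize measures only the work this PR adds on top of the
// existing dump.ForHash call: one deep copy plus a handful of bounded sorts.
func BenchmarkCanonicalize(b *testing.B) {
	tmpl := &corev1.PodTemplateSpec{}
	for i := 0; i < 8; i++ {
		name := fmt.Sprintf("n%d", 7-i) // reverse-sorted names so the sorts do real work
		tmpl.Spec.Containers = append(tmpl.Spec.Containers, corev1.Container{Name: name})
		tmpl.Spec.Volumes = append(tmpl.Spec.Volumes, corev1.Volume{Name: name})
	}
	b.ReportAllocs()
	for n := 0; n < b.N; n++ {
		out := tmpl.DeepCopy()
		sort.Slice(out.Spec.Containers, func(i, j int) bool {
			return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
		})
		sort.Slice(out.Spec.Volumes, func(i, j int) bool {
			return out.Spec.Volumes[i].Name < out.Spec.Volumes[j].Name
		})
	}
}
```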
> Is there a standard utility used by Kubernetes operators that we can reuse here?

The thought occurred to me, but a quick search didn't find much. I also wanted to keep the canonical hashing logic close to the current hashing logic to make reviewing easy.
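For readers of the thread, this is roughly the shape of the canonicalization. It is a simplified sketch covering only two of the slices, not the exact PR code:

```go
package kubernetes

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// canonicalizePodTemplateSpecForHashing returns a deep copy of in with
// order-independent +listType=map slices sorted by their map key, so two
// specs describing the same desired state serialize identically.
func canonicalizePodTemplateSpecForHashing(in *corev1.PodTemplateSpec) *corev1.PodTemplateSpec {
	out := in.DeepCopy() // never mutate the caller's spec

	// Containers is +listType=map keyed by name: order carries no meaning.
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	// Volumes likewise; the real change also covers ImagePullSecrets,
	// HostAliases, TopologySpreadConstraints, and the per-container Ports,
	// VolumeMounts, VolumeDevices, and ResizePolicy slices.
	sort.Slice(out.Spec.Volumes, func(i, j int) bool {
		return out.Spec.Volumes[i].Name < out.Spec.Volumes[j].Name
	})
	// InitContainers and Env are deliberately left alone: their order is
	// semantically meaningful (sequential init execution, $(VAR) substitution).
	return out
}
```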
The infra for the scale test is ready. I think it's a good use case to add a scale test with a scenario that highlights this issue. Run it on main as a baseline and on your branch to compare the results. I did a similar thing here; you can use it as an example.
Done: no noticeable degradation in performance during the scale test. I even made one that looked more "Dynamo-like", with more volumes, containers, mounts, etc.
I'm in favor of the first option.
Scale-test follow-up for the performance concern: I ran both 1k-pod scale tests against a fair local main baseline (09da63b, with only the new scale-test harness/YAML applied) and this branch (ce2b190) on the same 100-KWOK-node scale profile. The existing ScaleTest_1000 showed no measurable regression, with total time at 547.8s on baseline vs 535.0s on this branch (-2.3%). The "Dynamo-like" variant, which better exercises the hash-canonicalization concern, was also not worse overall: total time was 688.7s baseline vs 605.7s branch (-12.1%); early deploy markers were slightly slower (pods-created +2.1s, pcs-available +3.3s), but pods-ready was 86.1s faster and the steady-state availability check was not slower.
/ok to test ce2b190 |
force-pushed from 1dccd9f to 5a2e834
Grove computed two hashes (PCS-level generation hash, per-PCLQ pod-template hash) by serializing API objects with dump.ForHash, which preserves slice order. The Kubernetes API declares many PodSpec slices +listType=map (Containers, Volumes, ImagePullSecrets, container Ports, VolumeMounts, VolumeDevices, ResizePolicy, HostAliases, TopologySpreadConstraints, ResourceClaims, EphemeralContainers): they are name-keyed sets with no inherent order. An upstream operator that emits the same content in a different order (e.g. from non-deterministic Go map iteration) flipped both hashes, causing Grove to gang-roll every PodClique and lose any in-flight scale state. The same root cause applies to PodCliqueSet.Spec.Template.Cliques, also +listType=map +listMapKey=name.

Fixes:

- computeGenerationHash sorts cliques by name before hashing. Slice order is mixed back in via a startup-ordering marker only for CliqueStartupTypeInOrder, where slice index encodes the startup chain.
- ComputeHash canonicalizes order-independent +listType=map slices in PodSpec before hashing. InitContainers (sequential execution), Env ($(VAR) substitution), and EnvFrom/Tolerations (atomic) are intentionally not sorted; those orderings are part of the desired state.

Resolves ai-dynamo#565
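A simplified sketch of the hashing shape this commit describes, using stand-in types and an FNV digest rather than Grove's actual helpers:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// cliqueTemplate stands in for grovecorev1alpha1.PodCliqueTemplateSpec.
type cliqueTemplate struct {
	Name string
	Spec string // stand-in digest of the per-clique pod template
}

// hashCliques sketches the fixed computeGenerationHash behaviour: clique
// slice order never feeds the hash directly; only InOrder startup, where the
// index encodes the startup chain, mixes each clique's original position
// back in via an explicit marker.
func hashCliques(cliques []cliqueTemplate, startupType string) string {
	origIndex := make(map[string]int, len(cliques))
	for i, c := range cliques {
		origIndex[c.Name] = i
	}
	sorted := append([]cliqueTemplate(nil), cliques...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	h := fnv.New64a()
	for _, c := range sorted {
		if startupType == "InOrder" {
			// The startup-ordering marker: re-introduce slice position only
			// where it is actually part of the desired state.
			fmt.Fprintf(h, "order/%d/", origIndex[c.Name])
		}
		fmt.Fprintf(h, "%s/%s/", c.Name, c.Spec)
	}
	return fmt.Sprintf("%x", h.Sum64())
}

func main() {
	a := []cliqueTemplate{{"decode", "specA"}, {"prefill", "specB"}}
	b := []cliqueTemplate{{"prefill", "specB"}, {"decode", "specA"}}
	fmt.Println(hashCliques(a, "AnyOrder") == hashCliques(b, "AnyOrder")) // true: reorder is a no-op
	fmt.Println(hashCliques(a, "InOrder") == hashCliques(b, "InOrder"))   // false: order is startup semantics
}
```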
- Document the Explicit-mode StartsAfter limitation on startupOrderingMarker: Spec.StartsAfter is not part of Spec.PodSpec, so today neither slice order nor StartsAfter mutations participate in the Explicit-mode generation hash. Tracked as a follow-up.
- TestProcessGenerationHashChange_CliqueReorderIsNoOp: also assert Status.CurrentGenerationHash is unchanged on a clique reorder.
- TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch: add the Explicit + reorder no-op corollary, which is the property most at risk of regressing if someone later "fixes" the marker logic to also engage on Explicit.
- Fix the doc preamble for TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants: list Container.Env / InitContainers as order-sensitive +listType=map cases (not atomic) and use EnvFrom/Tolerations as the genuine atomic examples instead of VolumeMounts (which is in fact +listType=map).
- Rename env_var_reorder_changes_hash_atomic_listtype and resize_policy_reorder_changes_hash_atomic_listtype to drop the incorrect "atomic_listtype" suffix; clarify the assertion messages.
- Note in the resize_policy subtest that the assertion would need to flip to assert.Equal if canonicalization is later extended to ResizePolicy.
- Update the dead cross-reference in the TestProcessPendingUpdatesPCSHashFlipDoesNotDeletePods comment to point at TestComputePCLQPodTemplateHash_PodSpecSliceOrderInvariants in podclique_test.go.
Two doc-only corrections from review:

- pod.go: PodSpec.ReadinessGates and Container.RestartPolicyRules are +listType=atomic slice fields that ComputeHash correctly leaves untouched, but they were neither enumerated in the "Intentionally NOT sorted" list nor covered by the catch-all paragraph (which only mentioned scalar/struct, +listType=set, and absent fields). Add both fields to the enumeration and broaden the catch-all to include un-enumerated +listType=atomic slices.
- pod_test.go: TestComputeHash_DoesNotCanonicalizeAtomicSlices was misnamed: Container.Env and PodSpec.InitContainers are both +listType=map by name, not atomic. Their order is preserved due to runtime semantics ($(VAR) substitution / sequential init execution). Rename to TestComputeHash_DoesNotCanonicalizeOrderSensitiveSlices and rewrite the preamble to classify each subtest's field correctly. Also fix the env subtest assertion message, which mislabelled Container.Env as +listType=atomic.

No behavior change.
TestComputeGenerationHash_AnyOrderEqualsExplicit_WhenPodSpecsMatch built cliques without distinguishing labels, so the three per-clique pod templates were byte-identical. The slice-order-invariance assertion under Explicit would have passed trivially even against the bugged pre-canonicalization code. Add a per-clique WithLabels so each clique hashes distinctly, plus a comment explaining why it's load-bearing.
Summary
Fixes two related bugs that together caused Grove to gang-roll a PodCliqueSet on every spec update where the upstream operator happened to emit `+listType=map` slices in a different order. See #565 for the full repro and pod-lifecycle trace.

Root cause: `dump.ForHash` (used by both `computeGenerationHash` and `ComputeHash`) sorts map keys but preserves slice order. Several Kubernetes API slices that are declared `+listType=map` (and `PodCliqueSet.Spec.Template.Cliques` itself) are name-keyed sets with no inherent order, but they were being treated as ordered by the hash. Non-deterministic Go map iteration in upstream code (e.g. the Dynamo operator) flipped the hashes on every reconcile, which scheduled real rolling updates and silently destroyed in-flight scale state.

Fixes:

- `computeGenerationHash` (operator/internal/controller/podcliqueset/reconcilespec.go) sorts `Spec.Template.Cliques` by name before hashing. Slice order is mixed back into the hash only for `CliqueStartupTypeInOrder` (via a `startupOrderingMarker`), where the slice index does encode startup-chain semantics. `AnyOrder` and `Explicit` no longer leak slice order into the hash.
- `ComputeHash` (operator/internal/utils/kubernetes/pod.go) canonicalizes order-independent `+listType=map` slices in PodSpec before hashing: `Containers`, `Volumes`, `ImagePullSecrets`, `HostAliases`, `TopologySpreadConstraints`, `ResourceClaims`, `EphemeralContainers` at the PodSpec level, and `Ports`, `VolumeMounts`, `VolumeDevices`, `ResizePolicy` per container. `InitContainers` (sequential execution), `Container.Env` (`$(VAR)` substitution), `EnvFrom`/`Tolerations` (`+listType=atomic`), and `Args`/`Command` are intentionally NOT sorted; their order is part of the desired state.

Test plan

- `go vet ./...` clean on the affected packages
- `go test ./internal/utils/kubernetes/... ./internal/controller/common/component/utils/... ./internal/controller/podcliqueset/... ./internal/controller/podclique/components/pod/...` all pass
- Slice-reorder regression coverage (`pod_test.go::TestComputeHash_AdditionalListTypeMapSlices` + 8 sub-tests; mirrored at the per-PCLQ layer in `podclique_test.go::TestComputePCLQPodTemplateHash_AdditionalListTypeMapSlices`)
- Real-change sensitivity (`TestComputeHash_RealSpecChangesStillFlipHash`, `TestComputePCLQPodTemplateHash_RealSpecChangesStillFlipHash`, `TestComputeGenerationHash_RealCliqueTemplateChangeFlipsHash`): pin that real desired-state changes still flip the hash, so a future over-canonicalization can't silently break rolling updates
- Startup-ordering sensitivity (`TestComputeGenerationHash_InOrderToAnyOrderFlipsHash`, `TestComputeGenerationHash_StartupTypeChangeFlipsHash`, `TestComputeGenerationHash_InOrderStartupIsSensitiveToCliqueOrder`)
- Dynamo-like spec shuffle (`TestComputePCLQPodTemplateHash_RealisticDynamoLikePodSpec`): same desired state, every `+listType=map` slice (including `VolumeMounts`) shuffled, hash must be stable
- Nil safety (`TestComputeHash_NilSafety`)
- Combined patch shape (`TestComputeGenerationHash_CombinedReplicaChangeAndCliqueReorderIsNoOp`): the realistic scale-up patch shape from the bug report

Resolves "Rapid PodCliqueSet spec updates trigger full gang roll instead of in-place replica scale" #565
Original NVIDIA bug report: https://nvbugspro.nvidia.com/bug/6109874 (internal)
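To illustrate the two-sided property the test plan pins down (reorders are no-ops, real changes still flip the hash), here is a minimal standalone sketch. `computeHash` below is a simplified hypothetical stand-in for Grove's `ComputeHash`, not the PR's code:

```go
package kubernetes_test

import (
	"fmt"
	"hash/fnv"
	"sort"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// computeHash is a hypothetical stand-in: deep-copy, sort the +listType=map
// Containers slice by name, then hash a deterministic serialization.
func computeHash(in *corev1.PodTemplateSpec) string {
	out := in.DeepCopy()
	sort.Slice(out.Spec.Containers, func(i, j int) bool {
		return out.Spec.Containers[i].Name < out.Spec.Containers[j].Name
	})
	h := fnv.New64a()
	fmt.Fprintf(h, "%+v", out.Spec)
	return fmt.Sprintf("%x", h.Sum64())
}

// Reordering a name-keyed slice must not move the hash; a real content
// change (e.g. an image bump) must.
func TestHashOrderInvariantButContentSensitive(t *testing.T) {
	base := &corev1.PodTemplateSpec{Spec: corev1.PodSpec{
		Containers: []corev1.Container{{Name: "a", Image: "img:1"}, {Name: "b", Image: "img:1"}},
	}}

	reordered := base.DeepCopy()
	reordered.Spec.Containers[0], reordered.Spec.Containers[1] =
		reordered.Spec.Containers[1], reordered.Spec.Containers[0]
	if computeHash(base) != computeHash(reordered) {
		t.Fatal("container reorder is not a desired-state change and must not flip the hash")
	}

	changed := base.DeepCopy()
	changed.Spec.Containers[0].Image = "img:2"
	if computeHash(base) == computeHash(changed) {
		t.Fatal("a real image change must still flip the hash")
	}
}
```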