diff --git a/docs/proposals/292-in-place-pod-image-update/README.md b/docs/proposals/292-in-place-pod-image-update/README.md new file mode 100644 index 000000000..3454e408c --- /dev/null +++ b/docs/proposals/292-in-place-pod-image-update/README.md @@ -0,0 +1,327 @@ +# GREP-292: In-Place Pod Image Update + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Limitations/Risks & Mitigations](#limitationsrisks--mitigations) + - [Unsupported Pod Template Changes](#unsupported-pod-template-changes) + - [Premature Update Completion](#premature-update-completion) + - [Readiness and Traffic Impact](#readiness-and-traffic-impact) + - [Image Pull or Startup Failures](#image-pull-or-startup-failures) +- [Design Details](#design-details) + - [API Changes](#api-changes) + - [High-Level Architecture](#high-level-architecture) + - [Eligibility Detection](#eligibility-detection) + - [Pod In-Place Update State](#pod-in-place-update-state) + - [Completion Detection](#completion-detection) + - [Standalone PodClique Flow](#standalone-podclique-flow) + - [PodCliqueScalingGroup Flow](#podcliquescalinggroup-flow) + - [Status and Conditions](#status-and-conditions) + - [Monitoring](#monitoring) + - [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [Integration Tests](#integration-tests) + - [E2E Tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + + +## Summary + +This GREP extends `PodCliqueSet` updates with opt-in in-place Pod image updates. When a workload update only changes regular container images, Grove can patch existing Pods by updating `spec.containers[*].image` and wait for kubelet to restart the affected containers, instead of deleting Pods and forcing rescheduling. Existing `RollingRecreate` and `OnDelete` behavior remains unchanged unless users explicitly select one of the new in-place strategies.
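The image-only change described in the summary amounts to a strategic-merge patch on the Pod that touches nothing but `spec.containers[*].image`, with containers matched by name. A minimal sketch of building such a patch body (the helper name and shape are illustrative, not Grove's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildImagePatch returns a strategic-merge-patch body that updates only the
// named containers' images. Containers are matched by name, which is the
// merge key Kubernetes uses for the podSpec containers list.
func buildImagePatch(images map[string]string) ([]byte, error) {
	containers := make([]map[string]string, 0, len(images))
	for name, image := range images {
		containers = append(containers, map[string]string{"name": name, "image": image})
	}
	patch := map[string]any{
		"spec": map[string]any{"containers": containers},
	}
	return json.Marshal(patch)
}

func main() {
	body, _ := buildImagePatch(map[string]string{"server": "ghcr.io/example/decode:v2"})
	fmt.Println(string(body))
}
```

In a real controller this body would be sent as a strategic merge patch against the Pod; everything not named in the patch is left untouched, which is what keeps the update in place.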
+ +## Motivation + +Grove currently applies PodCliqueSet template changes through either automatic recreate with `RollingRecreate` or user-driven deletion with `OnDelete`. Recreating Pods is expensive for AI workloads even when the only change is an image tag: + +- Pods enter scheduler queues again, and PodCliqueScalingGroup replacements may require gang scheduling. +- Scarce accelerator placement can be lost between deletion and replacement. +- Node-local placement, warm caches, mounted resources, IP address continuity, and scheduler backend state can be disrupted. +- Under cluster pressure, rescheduling can take longer than pulling and restarting a new image on the same node. + +Kubernetes allows updating regular container images on an existing Pod. Grove should use this capability for image-only changes while preserving recreate semantics for unsupported changes. + +### Goals + +- Add explicit `PodCliqueSet` update strategy types for in-place image updates. +- Apply in-place updates when the effective Pod template change is limited to regular container image changes. +- Preserve `RollingRecreate` as the default update strategy. +- Reuse Grove's existing update progress and template hash model. +- Keep PodCliqueScalingGroup updates gang-aware by updating one selected replica at a time. +- Surface in-place update progress, completion, blocked state, and fallback behavior through status and events. + +### Non-Goals + +- Supporting in-place updates for fields other than regular container images. +- Supporting in-place updates for `initContainers`, ephemeral containers, resources, commands, args, env, probes, volumes, scheduling fields, resource claims, topology constraints, startup order, or service discovery fields. +- Changing the default `RollingRecreate` behavior. +- Providing application-level compatibility checks between old and new images. +- Guaranteeing zero traffic impact while kubelet restarts updated containers. 
+- Introducing a new history resource for PodCliqueSet revisions. + +## Proposal + +Introduce two new `PodCliqueSet` update strategy types: + +- `InPlaceIfPossible`: Grove attempts eligible image-only updates in place. If the change is not eligible, Grove falls back to the existing `RollingRecreate` behavior. +- `InPlaceOnly`: Grove only attempts in-place updates. If the change is not eligible, Grove blocks the update and does not delete Pods. + +The new strategies are opt-in. Existing users continue to get the current `RollingRecreate` behavior when `spec.updateStrategy` is omitted or explicitly set to `RollingRecreate`. The existing `OnDelete` strategy remains manual and does not automatically patch or delete existing Pods. + +In-place update is evaluated per Pod. For each Pod with an outdated template hash, Grove builds the desired Pod from the current PodCliqueSet template and compares it with the existing Pod. If the only Pod spec differences are regular container image changes, Grove patches the existing Pod. After kubelet restarts the updated containers and reports new container status, Grove marks the Pod as updated by changing its Grove template hash label to the target hash. + +### Limitations/Risks & Mitigations + +#### Unsupported Pod Template Changes + +Only regular container image changes are eligible for in-place update. Any other Pod spec change requires recreate semantics. + +*Mitigation*: `InPlaceIfPossible` falls back to `RollingRecreate`; `InPlaceOnly` blocks the update with status and events that describe why the Pod is not eligible. + +#### Premature Update Completion + +If Grove updates the Pod template hash label before kubelet actually restarts containers, `status.updatedReplicas` could incorrectly report success. 
+ +*Mitigation*: Grove records in-place update state separately and updates the Pod template hash label only after kubelet reports changed `status.containerStatuses[*].imageID` for updated containers and those containers are ready. + +#### Readiness and Traffic Impact + +Patching an image causes kubelet to restart the container. The Pod should not stay ready from Grove's perspective during the update. + +*Mitigation*: Grove injects a Grove-managed readiness gate into newly created Pods when in-place update is enabled. Before patching images, Grove sets the condition to `False`; after completion, Grove sets it back to `True`. Existing Pods without the readiness gate are not patched in place. + +#### Image Pull or Startup Failures + +If the new image cannot be pulled or fails to become ready, the Pod remains in an update-in-progress state. + +*Mitigation*: Grove keeps the old template hash label until completion and surfaces the failure through status and events. After a Pod has been patched, Grove waits for kubelet or user intervention rather than switching strategies mid-update. + +## Design Details + +### API Changes + +Extend `UpdateStrategyType` with two new values: + +```go +// UpdateStrategyType defines the type of update strategy for PodCliqueSet. +// +kubebuilder:validation:Enum={RollingRecreate,OnDelete,InPlaceIfPossible,InPlaceOnly} +type UpdateStrategyType string + +const ( + // RollingRecreateStrategy indicates that replicas will be progressively + // deleted and recreated when templates change. This remains the default. + RollingRecreateStrategy UpdateStrategyType = "RollingRecreate" + + // OnDeleteStrategy indicates that replicas will only be updated when users + // manually delete Pods or PodCliqueScalingGroup replicas. + OnDeleteStrategy UpdateStrategyType = "OnDelete" + + // InPlaceIfPossibleStrategy indicates that Grove should update Pods in place + // when the template change is eligible, and fall back to RollingRecreate when + // it is not. 
+ InPlaceIfPossibleStrategy UpdateStrategyType = "InPlaceIfPossible" + + // InPlaceOnlyStrategy indicates that Grove should only update Pods in place. + // Unsupported changes block the update instead of deleting Pods. + InPlaceOnlyStrategy UpdateStrategyType = "InPlaceOnly" +) +``` + +Example usage: + +```yaml +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: inference +spec: + replicas: 2 + updateStrategy: + type: InPlaceIfPossible + template: + cliques: + - name: decode + spec: + replicas: 4 + podSpec: + containers: + - name: server + image: ghcr.io/example/decode:v2 +``` + +### High-Level Architecture + +```mermaid +flowchart TD + User["User updates PodCliqueSet image"] --> PCS["PodCliqueSet controller"] + PCS --> Hash["Compute target generation hash"] + Hash --> Strategy{"updateStrategy.type"} + Strategy -->|RollingRecreate| ExistingRR["Existing rolling recreate flow"] + Strategy -->|OnDelete| ExistingOD["Existing OnDelete flow"] + Strategy -->|InPlaceIfPossible or InPlaceOnly| Select["Select next PCS replica to update"] + Select --> Child["PodClique / PodCliqueScalingGroup controllers"] + Child --> PodCtl["Pod component sync"] + PodCtl --> Diff{"Pod eligible for in-place update?"} + Diff -->|yes| NotReady["Set in-place readiness gate False"] + NotReady --> Patch["Patch spec.containers[*].image"] + Patch --> Kubelet["Kubelet pulls image and restarts container"] + Kubelet --> Complete{"Container status shows new image running and ready?"} + Complete -->|no| Wait["Requeue and keep update in progress"] + Complete -->|yes| Mark["Patch Grove template hash label and readiness gate True"] + Mark --> Status["Update updatedReplicas and updateProgress"] + Diff -->|no, InPlaceIfPossible| ExistingRR + Diff -->|no, InPlaceOnly| Blocked["Record blocked reason and emit event"] +``` + +### Eligibility Detection + +Grove builds the desired Pod using the same code path used for new Pod creation. 
It then compares the existing Pod against the desired Pod after normalizing fields that Grove or Kubernetes mutate at runtime. + +An update is eligible for in-place patch when all of the following are true: + +- The Pod is managed by a PodClique in the current update scope. +- The Pod is not terminating. +- The Pod carries the Grove in-place readiness gate. +- The existing Pod and desired Pod have the same regular container set, matched by container name. +- The only Pod spec changes are `spec.containers[*].image`. +- `initContainers`, ephemeral containers, volumes, resource claims, scheduling fields, resources, probes, env, command, args, security context, restart policy, and DNS settings are unchanged. + +When eligibility fails, Grove records a structured reason for status and events. + +### Pod In-Place Update State + +Grove stores in-place update state on the Pod: + +```go +type InPlaceUpdateState struct { + // PodTemplateHash is the target Pod template hash. + PodTemplateHash string `json:"podTemplateHash"` + + // PodCliqueSetGenerationHash is the target PodCliqueSet generation hash. + PodCliqueSetGenerationHash string `json:"podCliqueSetGenerationHash"` + + // UpdateStartedAt is when Grove started this Pod in-place update. + UpdateStartedAt metav1.Time `json:"updateStartedAt,omitempty"` + + // LastContainerStatuses records image IDs before the image patch. + LastContainerStatuses map[string]InPlaceUpdateContainerStatus `json:"lastContainerStatuses,omitempty"` + + // ContainerImages records target images by container name. + ContainerImages map[string]string `json:"containerImages,omitempty"` +} + +type InPlaceUpdateContainerStatus struct { + ImageID string `json:"imageID,omitempty"` +} +``` + +Grove uses: + +- Pod annotation `grove.io/in-place-update-state` +- Pod condition `grove.io/InPlaceUpdateReady` + +The condition is injected as a Pod readiness gate for new Pods created when the selected update strategy is `InPlaceIfPossible` or `InPlaceOnly`. 
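The state above round-trips through the `grove.io/in-place-update-state` annotation as JSON. A minimal sketch of that storage, using standalone structs that mirror a subset of the fields above (plain strings stand in for `metav1.Time`; these are not the operator's actual definitions):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inPlaceUpdateState mirrors part of the annotation payload described above.
type inPlaceUpdateState struct {
	PodTemplateHash            string            `json:"podTemplateHash"`
	PodCliqueSetGenerationHash string            `json:"podCliqueSetGenerationHash"`
	ContainerImages            map[string]string `json:"containerImages,omitempty"`
}

const annotationKey = "grove.io/in-place-update-state"

// writeState serializes the state into a Pod annotation map.
func writeState(annotations map[string]string, s inPlaceUpdateState) error {
	raw, err := json.Marshal(s)
	if err != nil {
		return err
	}
	annotations[annotationKey] = string(raw)
	return nil
}

// readState deserializes the state; ok is false when the annotation is absent.
func readState(annotations map[string]string) (inPlaceUpdateState, bool, error) {
	raw, present := annotations[annotationKey]
	if !present {
		return inPlaceUpdateState{}, false, nil
	}
	var s inPlaceUpdateState
	err := json.Unmarshal([]byte(raw), &s)
	return s, err == nil, err
}

func main() {
	ann := map[string]string{}
	_ = writeState(ann, inPlaceUpdateState{
		PodTemplateHash: "abc123",
		ContainerImages: map[string]string{"server": "ghcr.io/example/decode:v2"},
	})
	s, ok, _ := readState(ann)
	fmt.Println(ok, s.PodTemplateHash)
}
```

Keeping the state in an annotation rather than in labels means a half-finished update survives controller restarts, while the authoritative `grove.io/pod-template-hash` label only moves once completion is confirmed.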
+ +### Completion Detection + +An in-place Pod update is complete when: + +- The Pod still has the target in-place update state. +- For each updated container, the current `status.containerStatuses[name].imageID` differs from the image ID recorded before the patch. +- Each updated container is ready. + +After completion, Grove patches the Pod: + +- `metadata.labels[grove.io/pod-template-hash]` to the target hash. +- `grove.io/InPlaceUpdateReady=True`. +- Removes stale in-place update state. + +The existing PodClique status calculation can then count the Pod in `updatedReplicas` because the Pod template hash label matches the target hash. + +### Standalone PodClique Flow + +For a standalone PodClique under `InPlaceIfPossible` or `InPlaceOnly`: + +1. `PodCliqueSet` initializes `status.updateProgress` with the target generation hash. +2. `PodClique` initializes or resets `status.updateProgress` with the target Pod template hash. +3. The Pod component lists existing Pods and identifies Pods whose `grove.io/pod-template-hash` does not match the target hash. +4. The Pod component updates one eligible ready Pod at a time. +5. Old non-ready Pods are not patched in place. `InPlaceIfPossible` recreates them; `InPlaceOnly` blocks until they become suitable or are manually handled. +6. The PodClique update completes when all Pods have the target template hash and the current minAvailable condition is satisfied. + +### PodCliqueScalingGroup Flow + +For PodCliqueScalingGroups, Grove continues to select one PodCliqueSet replica for update at a time. + +1. PCSG records update progress for the selected replica. +2. Child PodCliques receive updated target template hashes. +3. Each child PodClique's Pod controller applies the standalone in-place logic to its Pods. +4. PCSG marks the replica updated when every child PodClique in the replica has reached the target Pod template hash and minAvailable requirements. 
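The per-container completion criteria from the Completion Detection section above can be sketched with pared-down types (illustrative stand-ins for the `corev1` bindings, not the operator's actual code): every updated container must report an image ID different from the one recorded before the patch, and must be ready again.

```go
package main

import "fmt"

// containerStatus is a pared-down stand-in for corev1.ContainerStatus.
type containerStatus struct {
	Name    string
	ImageID string
	Ready   bool
}

// updateComplete reports whether every container recorded before the patch
// now runs a different image ID and is ready again.
func updateComplete(before map[string]string, current []containerStatus) bool {
	byName := make(map[string]containerStatus, len(current))
	for _, cs := range current {
		byName[cs.Name] = cs
	}
	for name, oldImageID := range before {
		cs, ok := byName[name]
		if !ok || !cs.Ready {
			return false
		}
		if oldImageID != "" && cs.ImageID == oldImageID {
			return false // kubelet has not restarted onto the new image yet
		}
	}
	return true
}

func main() {
	before := map[string]string{"server": "sha256:old"}
	fmt.Println(updateComplete(before, []containerStatus{{Name: "server", ImageID: "sha256:old", Ready: true}}))
	fmt.Println(updateComplete(before, []containerStatus{{Name: "server", ImageID: "sha256:new", Ready: true}}))
}
```

Comparing image IDs rather than image names matters because a mutable tag (for example `latest`) can change digest without the `spec` image string changing; the recorded pre-patch image ID is the only reliable "before" marker.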
+ +If any Pod in the selected PCSG replica is not eligible: + +- `InPlaceIfPossible` falls back to the existing PCSG rolling recreate logic for that replica. +- `InPlaceOnly` blocks the PCSG update and records the blocked reason. + +### Status and Conditions + +The existing fields remain authoritative: + +- `PodCliqueSet.status.updatedReplicas` +- `PodCliqueSet.status.updateProgress` +- `PodCliqueScalingGroup.status.updatedReplicas` +- `PodCliqueScalingGroup.status.updateProgress` +- `PodClique.status.updatedReplicas` +- `PodClique.status.updateProgress` + +When `InPlaceOnly` cannot progress, Grove records an update-blocked condition on the smallest relevant resource: PodClique for standalone updates, PodCliqueScalingGroup for PCSG replica updates, and PodCliqueSet when a child update blocks top-level progress. + +### Monitoring + +Grove should emit Kubernetes Events: + +- `StartedPodInPlaceUpdate`: Pod image patch flow has started. +- `SuccessfulPodInPlaceUpdate`: Pod reached the target image and target template hash. +- `FailedPodInPlaceUpdate`: Pod patch failed or completion check failed with a terminal error. +- `SkippedPodInPlaceUpdate`: Pod is not eligible and Grove is falling back to recreate. +- `BlockedInPlaceUpdate`: `InPlaceOnly` cannot apply the update in place. + +### Test Plan + +#### Unit Tests + +- API validation accepts `InPlaceIfPossible` and `InPlaceOnly`. +- Eligibility detection returns true for image-only changes. +- Eligibility detection returns false for env, resources, command, args, init container image, volume, scheduler, resource claim, and other non-image changes. +- Pod patch generation only updates expected container images and Grove-managed annotations/labels. +- In-place update completion waits for container status changes before updating the Pod template hash label. +- `InPlaceIfPossible` falls back to recreate when eligibility fails before patching. 
+- `InPlaceOnly` records a blocked condition and does not delete Pods when eligibility fails. + +#### Integration Tests + +- Standalone PodClique image-only update patches Pods in place and preserves Pod names. +- PodCliqueScalingGroup image-only update patches Pods in place while preserving PCSG replica PodClique names. +- Updating a non-image field with `InPlaceIfPossible` recreates the affected Pod or PCSG replica. +- Updating a non-image field with `InPlaceOnly` blocks without deleting Pods. + +#### E2E Tests + +- Deploy a PodCliqueSet with `InPlaceIfPossible`, update an image, and verify Pod names stay unchanged while container image IDs change. +- Verify `updatedReplicas` progresses from old value to desired replicas only after kubelet reports the new image. +- Verify an image pull failure leaves the Pod not updated and surfaces Events/status. + +### Graduation Criteria + +The implementation is complete when: + +- `PodCliqueSet` accepts `InPlaceIfPossible` and `InPlaceOnly`. +- Image-only Pod updates are patched in place. +- In-place updates use a Grove-managed readiness gate. +- Completion is detected from container status before the Pod template hash label is advanced. +- `InPlaceIfPossible` falls back to recreate for unsupported changes. +- `InPlaceOnly` blocks without deleting Pods for unsupported changes. +- Standalone PodClique and PodCliqueScalingGroup flows preserve existing update ordering. +- Unit and integration tests cover successful, fallback, blocked, and failed update paths. diff --git a/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml b/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml index 8a3217178..ae857f448 100644 --- a/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml +++ b/operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml @@ -10776,6 +10776,17 @@ spec: templates change. This applies to both standalone PodCliques and PodCliqueScalingGroups. 
properties: + inPlaceUpdate: + description: InPlaceUpdate configures behavior for in-place update + strategies. + properties: + gracePeriodSeconds: + description: |- + GracePeriodSeconds is the delay between marking a Pod not ready through the + Grove in-place readiness gate and patching container images. + format: int32 + type: integer + type: object type: default: RollingRecreate description: |- @@ -10786,6 +10797,8 @@ spec: enum: - RollingRecreate - OnDelete + - InPlaceIfPossible + - InPlaceOnly type: string type: object required: diff --git a/operator/api/core/v1alpha1/podcliqueset.go b/operator/api/core/v1alpha1/podcliqueset.go index 9d1f3677b..fd448c7d9 100644 --- a/operator/api/core/v1alpha1/podcliqueset.go +++ b/operator/api/core/v1alpha1/podcliqueset.go @@ -112,6 +112,17 @@ type PodCliqueSetUpdateStrategy struct { // Default is RollingRecreate. // +kubebuilder:default=RollingRecreate Type UpdateStrategyType `json:"type,omitempty"` + // InPlaceUpdate configures behavior for in-place update strategies. + // +optional + InPlaceUpdate *InPlaceUpdateStrategy `json:"inPlaceUpdate,omitempty"` +} + +// InPlaceUpdateStrategy defines settings for in-place Pod image updates. +type InPlaceUpdateStrategy struct { + // GracePeriodSeconds is the delay between marking a Pod not ready through the + // Grove in-place readiness gate and patching container images. + // +optional + GracePeriodSeconds *int32 `json:"gracePeriodSeconds,omitempty"` } // PodCliqueSetRollingUpdateProgress captures the progress of a rolling update of the PodCliqueSet. @@ -424,7 +435,7 @@ type HeadlessServiceConfig struct { } // UpdateStrategyType defines the type of update strategy for PodCliqueSet. -// +kubebuilder:validation:Enum={RollingRecreate,OnDelete} +// +kubebuilder:validation:Enum={RollingRecreate,OnDelete,InPlaceIfPossible,InPlaceOnly} type UpdateStrategyType string const ( @@ -439,6 +450,13 @@ const ( // they are manually deleted. 
Changes to templates do not automatically // trigger replica deletions. OnDeleteStrategy UpdateStrategyType = "OnDelete" + // InPlaceIfPossibleStrategy indicates that Pods should be updated in-place + // when their changes are eligible for in-place update. Unsupported changes + // fall back to RollingRecreate. + InPlaceIfPossibleStrategy UpdateStrategyType = "InPlaceIfPossible" + // InPlaceOnlyStrategy indicates that Pods should only be updated in-place. + // Unsupported changes block the update instead of deleting Pods. + InPlaceOnlyStrategy UpdateStrategyType = "InPlaceOnly" ) // CliqueStartupType defines the order in which each PodClique is started. diff --git a/operator/api/core/v1alpha1/zz_generated.deepcopy.go b/operator/api/core/v1alpha1/zz_generated.deepcopy.go index 44633fe16..f999d46b7 100644 --- a/operator/api/core/v1alpha1/zz_generated.deepcopy.go +++ b/operator/api/core/v1alpha1/zz_generated.deepcopy.go @@ -186,6 +186,27 @@ func (in *HeadlessServiceConfig) DeepCopy() *HeadlessServiceConfig { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *InPlaceUpdateStrategy) DeepCopyInto(out *InPlaceUpdateStrategy) { + *out = *in + if in.GracePeriodSeconds != nil { + in, out := &in.GracePeriodSeconds, &out.GracePeriodSeconds + *out = new(int32) + **out = **in + } + return +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new InPlaceUpdateStrategy. +func (in *InPlaceUpdateStrategy) DeepCopy() *InPlaceUpdateStrategy { + if in == nil { + return nil + } + out := new(InPlaceUpdateStrategy) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
func (in *LastError) DeepCopyInto(out *LastError) { *out = *in @@ -840,7 +861,7 @@ func (in *PodCliqueSetSpec) DeepCopyInto(out *PodCliqueSetSpec) { if in.UpdateStrategy != nil { in, out := &in.UpdateStrategy, &out.UpdateStrategy *out = new(PodCliqueSetUpdateStrategy) - **out = **in + (*in).DeepCopyInto(*out) } in.Template.DeepCopyInto(&out.Template) return @@ -1027,6 +1048,11 @@ func (in *PodCliqueSetUpdateProgress) DeepCopy() *PodCliqueSetUpdateProgress { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *PodCliqueSetUpdateStrategy) DeepCopyInto(out *PodCliqueSetUpdateStrategy) { *out = *in + if in.InPlaceUpdate != nil { + in, out := &in.InPlaceUpdate, &out.InPlaceUpdate + *out = new(InPlaceUpdateStrategy) + (*in).DeepCopyInto(*out) + } return } diff --git a/operator/internal/controller/common/component/utils/podcliqueset.go b/operator/internal/controller/common/component/utils/podcliqueset.go index 3370cccd6..bcfe675ad 100644 --- a/operator/internal/controller/common/component/utils/podcliqueset.go +++ b/operator/internal/controller/common/component/utils/podcliqueset.go @@ -121,6 +121,23 @@ func IsAutoUpdateStrategy(pcs *grovecorev1alpha1.PodCliqueSet) bool { return pcs.Spec.UpdateStrategy == nil || pcs.Spec.UpdateStrategy.Type != grovecorev1alpha1.OnDeleteStrategy } +// IsInPlaceUpdateStrategy returns true when PodCliqueSet update strategy should attempt in-place Pod image updates. +func IsInPlaceUpdateStrategy(pcs *grovecorev1alpha1.PodCliqueSet) bool { + if pcs == nil || pcs.Spec.UpdateStrategy == nil { + return false + } + return pcs.Spec.UpdateStrategy.Type == grovecorev1alpha1.InPlaceIfPossibleStrategy || + pcs.Spec.UpdateStrategy.Type == grovecorev1alpha1.InPlaceOnlyStrategy +} + +// IsInPlaceOnlyStrategy returns true when unsupported in-place changes should block instead of falling back to recreate. 
+func IsInPlaceOnlyStrategy(pcs *grovecorev1alpha1.PodCliqueSet) bool { + if pcs == nil || pcs.Spec.UpdateStrategy == nil { + return false + } + return pcs.Spec.UpdateStrategy.Type == grovecorev1alpha1.InPlaceOnlyStrategy +} + // GetExpectedPCLQNamesGroupByOwner returns the expected unqualified PodClique names which are either owned by PodCliqueSet or PodCliqueScalingGroup. func GetExpectedPCLQNamesGroupByOwner(pcs *grovecorev1alpha1.PodCliqueSet) (expectedPCLQNamesForPCS []string, expectedPCLQNamesForPCSG []string) { pcsgConfigs := pcs.Spec.Template.PodCliqueScalingGroupConfigs diff --git a/operator/internal/controller/common/component/utils/podcliqueset_test.go b/operator/internal/controller/common/component/utils/podcliqueset_test.go index 450b0c44a..824261bfa 100644 --- a/operator/internal/controller/common/component/utils/podcliqueset_test.go +++ b/operator/internal/controller/common/component/utils/podcliqueset_test.go @@ -301,6 +301,24 @@ func TestIsAutoUpdateStrategy(t *testing.T) { }, expected: false, }, + { + name: "in_place_if_possible_is_auto", + pcs: &grovecorev1alpha1.PodCliqueSet{ + Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.InPlaceIfPossibleStrategy}, + }, + }, + expected: true, + }, + { + name: "in_place_only_is_auto", + pcs: &grovecorev1alpha1.PodCliqueSet{ + Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.InPlaceOnlyStrategy}, + }, + }, + expected: true, + }, } for _, tc := range tests { @@ -310,6 +328,51 @@ func TestIsAutoUpdateStrategy(t *testing.T) { } } +func TestIsInPlaceUpdateStrategy(t *testing.T) { + tests := []struct { + name string + pcs *grovecorev1alpha1.PodCliqueSet + expected bool + }{ + {name: "nil_pcs", pcs: nil, expected: false}, + {name: "nil_strategy", pcs: &grovecorev1alpha1.PodCliqueSet{}, expected: false}, + { + name: "rolling_recreate", + pcs: 
&grovecorev1alpha1.PodCliqueSet{Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.RollingRecreateStrategy}, + }}, + expected: false, + }, + { + name: "on_delete", + pcs: &grovecorev1alpha1.PodCliqueSet{Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.OnDeleteStrategy}, + }}, + expected: false, + }, + { + name: "in_place_if_possible", + pcs: &grovecorev1alpha1.PodCliqueSet{Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.InPlaceIfPossibleStrategy}, + }}, + expected: true, + }, + { + name: "in_place_only", + pcs: &grovecorev1alpha1.PodCliqueSet{Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: grovecorev1alpha1.InPlaceOnlyStrategy}, + }}, + expected: true, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + assert.Equal(t, tc.expected, IsInPlaceUpdateStrategy(tc.pcs)) + }) + } +} + // TestGetPodCliqueSet tests the GetPodCliqueSet function func TestGetPodCliqueSet(t *testing.T) { tests := []struct { diff --git a/operator/internal/controller/podclique/components/pod/inplaceupdate.go b/operator/internal/controller/podclique/components/pod/inplaceupdate.go new file mode 100644 index 000000000..6d77dadc2 --- /dev/null +++ b/operator/internal/controller/podclique/components/pod/inplaceupdate.go @@ -0,0 +1,247 @@ +// /* +// Copyright 2026 The Grove Authors. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +// */ + +package pod + +import ( + "encoding/json" + "fmt" + + apicommon "github.com/ai-dynamo/grove/operator/api/common" + grovecorev1alpha1 "github.com/ai-dynamo/grove/operator/api/core/v1alpha1" + componentutils "github.com/ai-dynamo/grove/operator/internal/controller/common/component/utils" + + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/api/equality" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" +) + +const ( + annotationInPlaceUpdateState = "grove.io/in-place-update-state" + annotationInPlaceUpdateGrace = "grove.io/in-place-update-grace" +) + +const conditionInPlaceUpdateReady corev1.PodConditionType = "grove.io/InPlaceUpdateReady" + +type inPlaceUpdateSpec struct { + PodTemplateHash string `json:"podTemplateHash"` + PodCliqueSetGenerationHash string `json:"podCliqueSetGenerationHash"` + ContainerImages map[string]string `json:"containerImages,omitempty"` +} + +type inPlaceUpdateState struct { + PodTemplateHash string `json:"podTemplateHash"` + PodCliqueSetGenerationHash string `json:"podCliqueSetGenerationHash"` + UpdateStartedAt metav1.Time `json:"updateStartedAt,omitempty"` + LastContainerStatuses map[string]inPlaceUpdateContainerStatus `json:"lastContainerStatuses,omitempty"` + ContainerImages map[string]string `json:"containerImages,omitempty"` +} + +type inPlaceUpdateContainerStatus struct { + ImageID string `json:"imageID,omitempty"` +} + +func injectInPlaceReadinessGate(pcs *grovecorev1alpha1.PodCliqueSet, pod *corev1.Pod) { + if !componentutils.IsInPlaceUpdateStrategy(pcs) || hasInPlaceReadinessGate(pod) { + return + } 
+ pod.Spec.ReadinessGates = append(pod.Spec.ReadinessGates, corev1.PodReadinessGate{ConditionType: conditionInPlaceUpdateReady}) +} + +func hasInPlaceReadinessGate(pod *corev1.Pod) bool { + for _, gate := range pod.Spec.ReadinessGates { + if gate.ConditionType == conditionInPlaceUpdateReady { + return true + } + } + return false +} + +func computeInPlaceUpdateSpec(existingPod, desiredPod *corev1.Pod, podTemplateHash, pcsGenerationHash string) (*inPlaceUpdateSpec, string) { + if !hasInPlaceReadinessGate(existingPod) { + return nil, "missing Grove in-place readiness gate" + } + if len(existingPod.Spec.InitContainers) != len(desiredPod.Spec.InitContainers) || + !equality.Semantic.DeepEqual(existingPod.Spec.InitContainers, desiredPod.Spec.InitContainers) { + return nil, "initContainers changed" + } + if len(existingPod.Spec.Containers) != len(desiredPod.Spec.Containers) { + return nil, "container set changed" + } + + normalizedExisting := existingPod.DeepCopy() + normalizedDesired := desiredPod.DeepCopy() + normalizedDesired.Spec.SchedulingGates = normalizedExisting.Spec.SchedulingGates + normalizedDesired.Spec.NodeName = normalizedExisting.Spec.NodeName + + containerImages := make(map[string]string) + existingContainersByName := make(map[string]corev1.Container, len(normalizedExisting.Spec.Containers)) + for _, container := range normalizedExisting.Spec.Containers { + existingContainersByName[container.Name] = container + } + for i := range normalizedDesired.Spec.Containers { + desiredContainer := &normalizedDesired.Spec.Containers[i] + existingContainer, ok := existingContainersByName[desiredContainer.Name] + if !ok { + return nil, fmt.Sprintf("container %s is new", desiredContainer.Name) + } + if desiredContainer.Image != existingContainer.Image { + containerImages[desiredContainer.Name] = desiredContainer.Image + desiredContainer.Image = existingContainer.Image + } + } + + if !equality.Semantic.DeepEqual(normalizedExisting.Spec, normalizedDesired.Spec) { + for _, 
desiredContainer := range normalizedDesired.Spec.Containers { + existingContainer := existingContainersByName[desiredContainer.Name] + if !equality.Semantic.DeepEqual(existingContainer, desiredContainer) { + return nil, fmt.Sprintf("container %s has non-image changes", desiredContainer.Name) + } + } + return nil, "pod spec has non-image changes" + } + if len(containerImages) == 0 { + return nil, "no container image changes" + } + return &inPlaceUpdateSpec{ + PodTemplateHash: podTemplateHash, + PodCliqueSetGenerationHash: pcsGenerationHash, + ContainerImages: containerImages, + }, "" +} + +func applyInPlaceUpdateSpec(pod *corev1.Pod, spec *inPlaceUpdateSpec) { + if pod.Labels == nil { + pod.Labels = make(map[string]string) + } + if pod.Annotations == nil { + pod.Annotations = make(map[string]string) + } + + statuses := make(map[string]inPlaceUpdateContainerStatus, len(spec.ContainerImages)) + for _, status := range pod.Status.ContainerStatuses { + if _, ok := spec.ContainerImages[status.Name]; ok { + statuses[status.Name] = inPlaceUpdateContainerStatus{ImageID: status.ImageID} + } + } + state := inPlaceUpdateState{ + PodTemplateHash: spec.PodTemplateHash, + PodCliqueSetGenerationHash: spec.PodCliqueSetGenerationHash, + UpdateStartedAt: metav1.Now(), + LastContainerStatuses: statuses, + ContainerImages: spec.ContainerImages, + } + stateJSON, _ := json.Marshal(state) + pod.Annotations[annotationInPlaceUpdateState] = string(stateJSON) + + for i := range pod.Spec.Containers { + if image, ok := spec.ContainerImages[pod.Spec.Containers[i].Name]; ok { + pod.Spec.Containers[i].Image = image + } + } + setInPlaceUpdateCondition(pod, corev1.ConditionFalse, "StartInPlaceUpdate") +} + +func markInPlaceUpdateCompleteIfReady(pod *corev1.Pod) (bool, error) { + state, ok, err := getInPlaceUpdateState(pod) + if err != nil || !ok { + return false, err + } + for containerName, oldStatus := range state.LastContainerStatuses { + currentStatus := getContainerStatus(pod, containerName) + 
if currentStatus == nil { + return false, nil + } + if oldStatus.ImageID != "" && currentStatus.ImageID == oldStatus.ImageID { + return false, nil + } + if !currentStatus.Ready { + return false, nil + } + } + if pod.Labels == nil { + pod.Labels = make(map[string]string) + } + pod.Labels[apicommon.LabelPodTemplateHash] = state.PodTemplateHash + delete(pod.Annotations, annotationInPlaceUpdateGrace) + delete(pod.Annotations, annotationInPlaceUpdateState) + setInPlaceUpdateCondition(pod, corev1.ConditionTrue, "InPlaceUpdateComplete") + return true, nil +} + +func markInPlaceUpdateReadyIfIdle(pod *corev1.Pod) (bool, error) { + if !hasInPlaceReadinessGate(pod) { + return false, nil + } + if _, ok, err := getInPlaceUpdateState(pod); err != nil || ok { + return false, err + } + condition := getInPlaceUpdateCondition(pod) + if condition != nil && condition.Status == corev1.ConditionTrue { + return false, nil + } + setInPlaceUpdateCondition(pod, corev1.ConditionTrue, "NoInPlaceUpdate") + return true, nil +} + +func getInPlaceUpdateState(pod *corev1.Pod) (*inPlaceUpdateState, bool, error) { + if pod.Annotations == nil { + return nil, false, nil + } + rawState, ok := pod.Annotations[annotationInPlaceUpdateState] + if !ok { + return nil, false, nil + } + state := &inPlaceUpdateState{} + if err := json.Unmarshal([]byte(rawState), state); err != nil { + return nil, false, err + } + return state, true, nil +} + +func getContainerStatus(pod *corev1.Pod, name string) *corev1.ContainerStatus { + for i := range pod.Status.ContainerStatuses { + if pod.Status.ContainerStatuses[i].Name == name { + return &pod.Status.ContainerStatuses[i] + } + } + return nil +} + +func getInPlaceUpdateCondition(pod *corev1.Pod) *corev1.PodCondition { + for i := range pod.Status.Conditions { + if pod.Status.Conditions[i].Type == conditionInPlaceUpdateReady { + return &pod.Status.Conditions[i] + } + } + return nil +} + +func setInPlaceUpdateCondition(pod *corev1.Pod, status corev1.ConditionStatus, reason 
string) { + condition := corev1.PodCondition{ + Type: conditionInPlaceUpdateReady, + Status: status, + Reason: reason, + LastTransitionTime: metav1.Now(), + } + for i := range pod.Status.Conditions { + if pod.Status.Conditions[i].Type == conditionInPlaceUpdateReady { + pod.Status.Conditions[i] = condition + return + } + } + pod.Status.Conditions = append(pod.Status.Conditions, condition) +} diff --git a/operator/internal/controller/podclique/components/pod/inplaceupdate_test.go b/operator/internal/controller/podclique/components/pod/inplaceupdate_test.go new file mode 100644 index 000000000..cc66065c3 --- /dev/null +++ b/operator/internal/controller/podclique/components/pod/inplaceupdate_test.go @@ -0,0 +1,186 @@ +// /* +// Copyright 2026 The Grove Authors. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+// */ + +package pod + +import ( + "testing" + + apicommon "github.com/ai-dynamo/grove/operator/api/common" + grovecorev1alpha1 "github.com/ai-dynamo/grove/operator/api/core/v1alpha1" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" +) + +func TestInjectInPlaceReadinessGateForInPlaceStrategy(t *testing.T) { + pcs := newInPlaceTestPCS(grovecorev1alpha1.InPlaceIfPossibleStrategy) + pod := &corev1.Pod{} + + injectInPlaceReadinessGate(pcs, pod) + + assert.True(t, hasInPlaceReadinessGate(pod), "in-place update Pods should carry the readiness gate") +} + +func TestInjectInPlaceReadinessGateSkipsRollingRecreate(t *testing.T) { + pcs := newInPlaceTestPCS(grovecorev1alpha1.RollingRecreateStrategy) + pod := &corev1.Pod{} + + injectInPlaceReadinessGate(pcs, pod) + + assert.False(t, hasInPlaceReadinessGate(pod), "rolling recreate Pods should preserve existing readiness gate behavior") +} + +func TestComputeInPlaceUpdateSpecAllowsContainerImageOnlyChange(t *testing.T) { + oldPod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + desiredPod := oldPod.DeepCopy() + desiredPod.Spec.Containers[0].Image = "app:v2" + + spec, reason := computeInPlaceUpdateSpec(oldPod, desiredPod, testNewHash, "pcs-hash") + + require.NotNil(t, spec, "image-only changes should be eligible for in-place update: %s", reason) + assert.Equal(t, map[string]string{"main": "app:v2"}, spec.ContainerImages) + assert.Equal(t, testNewHash, spec.PodTemplateHash) + assert.Equal(t, "pcs-hash", spec.PodCliqueSetGenerationHash) +} + +func TestComputeInPlaceUpdateSpecRejectsEnvChange(t *testing.T) { + oldPod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + desiredPod := oldPod.DeepCopy() + desiredPod.Spec.Containers[0].Env = []corev1.EnvVar{{Name: "NEW_ENV", Value: "true"}} + + spec, reason := computeInPlaceUpdateSpec(oldPod, desiredPod, testNewHash, "pcs-hash") + + 
require.Nil(t, spec) + assert.Contains(t, reason, "container main has non-image changes") +} + +func TestApplyInPlaceUpdateSpecPatchesImageAndDefersTemplateHash(t *testing.T) { + pod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + spec := &inPlaceUpdateSpec{ + PodTemplateHash: testNewHash, + PodCliqueSetGenerationHash: "pcs-hash", + ContainerImages: map[string]string{"main": "app:v2"}, + } + + applyInPlaceUpdateSpec(pod, spec) + + assert.Equal(t, "app:v2", pod.Spec.Containers[0].Image) + assert.Equal(t, testOldHash, pod.Labels[apicommon.LabelPodTemplateHash], "template hash should change only after kubelet reports completion") + assert.Contains(t, pod.Annotations, annotationInPlaceUpdateState) + assert.Equal(t, corev1.ConditionFalse, getInPlaceUpdateCondition(pod).Status) +} + +func TestMarkInPlaceUpdateCompleteUpdatesTemplateHashAfterImageIDChanges(t *testing.T) { + pod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + spec := &inPlaceUpdateSpec{ + PodTemplateHash: testNewHash, + PodCliqueSetGenerationHash: "pcs-hash", + ContainerImages: map[string]string{"main": "app:v2"}, + } + applyInPlaceUpdateSpec(pod, spec) + pod.Status.ContainerStatuses[0].ImageID = "new-image-id" + pod.Status.ContainerStatuses[0].Image = "app:v2" + + completed, err := markInPlaceUpdateCompleteIfReady(pod) + + require.NoError(t, err) + require.True(t, completed) + assert.Equal(t, testNewHash, pod.Labels[apicommon.LabelPodTemplateHash]) + assert.Equal(t, corev1.ConditionTrue, getInPlaceUpdateCondition(pod).Status) +} + +func TestMarkInPlaceUpdateReadyIfIdleSetsGateTrue(t *testing.T) { + pod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + pod.Status.Conditions = nil + + changed, err := markInPlaceUpdateReadyIfIdle(pod) + + require.NoError(t, err) + require.True(t, changed) + assert.Equal(t, corev1.ConditionTrue, getInPlaceUpdateCondition(pod).Status) +} + +func TestMarkInPlaceUpdateReadyIfIdleSkipsUpdatingPod(t *testing.T) 
{ + pod := newReadyInPlacePod("pod-1", testOldHash, "app:v1", "old-image-id") + spec := &inPlaceUpdateSpec{ + PodTemplateHash: testNewHash, + PodCliqueSetGenerationHash: "pcs-hash", + ContainerImages: map[string]string{"main": "app:v2"}, + } + applyInPlaceUpdateSpec(pod, spec) + + changed, err := markInPlaceUpdateReadyIfIdle(pod) + + require.NoError(t, err) + require.False(t, changed) + assert.Equal(t, corev1.ConditionFalse, getInPlaceUpdateCondition(pod).Status) +} + +func newInPlaceTestPCS(strategy grovecorev1alpha1.UpdateStrategyType) *grovecorev1alpha1.PodCliqueSet { + return &grovecorev1alpha1.PodCliqueSet{ + ObjectMeta: metav1.ObjectMeta{Name: "pcs", Namespace: testNS}, + Spec: grovecorev1alpha1.PodCliqueSetSpec{ + UpdateStrategy: &grovecorev1alpha1.PodCliqueSetUpdateStrategy{Type: strategy}, + Template: grovecorev1alpha1.PodCliqueSetTemplateSpec{ + Cliques: []*grovecorev1alpha1.PodCliqueTemplateSpec{{ + Name: "worker", + Spec: grovecorev1alpha1.PodCliqueSpec{ + RoleName: "worker", + PodSpec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "main", Image: "app:v1"}}, + }, + }, + }}, + }, + }, + } +} + +func newInPlaceTestPCLQ(name string) *grovecorev1alpha1.PodClique { + return &grovecorev1alpha1.PodClique{ + ObjectMeta: metav1.ObjectMeta{ + Name: name, + Namespace: testNS, + Labels: map[string]string{ + apicommon.LabelPartOfKey: "pcs", + apicommon.LabelPodTemplateHash: testOldHash, + }, + }, + Spec: grovecorev1alpha1.PodCliqueSpec{ + RoleName: "worker", + PodSpec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "main", Image: "app:v1"}}, + }, + }, + } +} + +func newReadyInPlacePod(name, templateHash, image, imageID string) *corev1.Pod { + pod := newTestPod(name, templateHash, + withPhase(corev1.PodRunning), + withReadyCondition(), + withContainerStatus(&[]bool{true}[0], true), + ) + pod.Spec.ReadinessGates = []corev1.PodReadinessGate{{ConditionType: conditionInPlaceUpdateReady}} + pod.Spec.Containers = []corev1.Container{{Name: "main", Image: 
image}} + pod.Status.ContainerStatuses[0].Name = "main" + pod.Status.ContainerStatuses[0].Image = image + pod.Status.ContainerStatuses[0].ImageID = imageID + return pod +} diff --git a/operator/internal/controller/podclique/components/pod/pod.go b/operator/internal/controller/podclique/components/pod/pod.go index 87cba4985..ddfba084d 100644 --- a/operator/internal/controller/podclique/components/pod/pod.go +++ b/operator/internal/controller/podclique/components/pod/pod.go @@ -175,6 +175,7 @@ func (r _resource) buildResource(pcs *grovecorev1alpha1.PodCliqueSet, pclq *grov ) } backend.PreparePod(pod) + injectInPlaceReadinessGate(pcs, pod) // Add GROVE specific Pod environment variables addEnvironmentVariables(pod, pclq, pcsName, pcsReplicaIndex) diff --git a/operator/internal/controller/podclique/components/pod/rollingupdate.go b/operator/internal/controller/podclique/components/pod/rollingupdate.go index 22e870943..ca1bbfb1e 100644 --- a/operator/internal/controller/podclique/components/pod/rollingupdate.go +++ b/operator/internal/controller/podclique/components/pod/rollingupdate.go @@ -20,6 +20,7 @@ import ( "context" "fmt" "slices" + "strconv" "github.com/ai-dynamo/grove/operator/api/common" grovecorev1alpha1 "github.com/ai-dynamo/grove/operator/api/core/v1alpha1" @@ -73,6 +74,13 @@ func (w *updateWork) getNextPodToUpdate() *corev1.Pod { // processPendingUpdates processes pending updates for the PodClique. // This is the main entry point for handling rolling updates of pods in the PodClique. func (r _resource) processPendingUpdates(logger logr.Logger, sc *syncContext) error { + if componentutils.IsInPlaceUpdateStrategy(sc.pcs) { + handled, err := r.processPendingInPlaceUpdates(logger, sc) + if handled || err != nil { + return err + } + } + updateWork := r.computeUpdateWork(logger, sc) pclq := sc.pclq // Always delete old-hash pods that are not Ready (pending, unhealthy, starting, or uncategorized). 
@@ -134,6 +142,126 @@ func (r _resource) processPendingUpdates(logger logr.Logger, sc *syncContext) er return r.markRollingUpdateEnd(sc.ctx, logger, pclq) } +func (r _resource) processPendingInPlaceUpdates(logger logr.Logger, sc *syncContext) (bool, error) { + for _, pod := range sc.existingPCLQPods { + if _, ok, err := getInPlaceUpdateState(pod); err != nil { + return true, err + } else if ok { + original := pod.DeepCopy() + completed, err := markInPlaceUpdateCompleteIfReady(pod) + if err != nil { + return true, err + } + if !completed { + return true, groveerr.New( + groveerr.ErrCodeContinueReconcileAndRequeue, + component.OperationSync, + fmt.Sprintf("in-place update of pod %s is not complete, requeuing", pod.Name), + ) + } + if err := r.client.Status().Patch(sc.ctx, pod, client.MergeFrom(original)); err != nil { + return true, err + } + if err := r.client.Patch(sc.ctx, pod, client.MergeFrom(original)); err != nil { + return true, err + } + logger.Info("Marked Pod in-place update complete", "pod", client.ObjectKeyFromObject(pod)) + return true, groveerr.New( + groveerr.ErrCodeContinueReconcileAndRequeue, + component.OperationSync, + fmt.Sprintf("marked in-place update complete for pod %s, requeuing", pod.Name), + ) + } + } + + updateWork := r.computeUpdateWork(logger, sc) + if hasOldNonReadyPods(updateWork) { + if componentutils.IsInPlaceOnlyStrategy(sc.pcs) { + return true, groveerr.New( + groveerr.ErrCodeContinueReconcileAndRequeue, + component.OperationSync, + "in-place only update is waiting for old non-ready pods to become eligible", + ) + } + return false, nil + } + + podNamesPendingUpdate := updateWork.getPodNamesPendingUpdate(r.expectationsStore.GetDeleteExpectations(sc.pclqExpectationsStoreKey)) + if len(podNamesPendingUpdate) == 0 { + return true, r.markRollingUpdateEnd(sc.ctx, logger, sc.pclq) + } + if sc.pclq.Status.ReadyReplicas < *sc.pclq.Spec.MinAvailable { + return true, groveerr.New( + groveerr.ErrCodeContinueReconcileAndRequeue, + 
component.OperationSync,
+ fmt.Sprintf("ready replicas %d less than minAvailable %d, requeuing", sc.pclq.Status.ReadyReplicas, *sc.pclq.Spec.MinAvailable),
+ )
+ }
+
+ nextPodToUpdate := updateWork.getNextPodToUpdate()
+ if nextPodToUpdate == nil {
+ return true, groveerr.New(
+ groveerr.ErrCodeContinueReconcileAndRequeue,
+ component.OperationSync,
+ "in-place update is waiting for a ready old-hash pod",
+ )
+ }
+ desiredPod, err := r.buildDesiredPodForInPlaceUpdate(sc, nextPodToUpdate)
+ if err != nil {
+ return true, err
+ }
+ spec, reason := computeInPlaceUpdateSpec(nextPodToUpdate, desiredPod, sc.expectedPodTemplateHash, sc.pclq.Status.UpdateProgress.PodCliqueSetGenerationHash)
+ if spec == nil {
+ if componentutils.IsInPlaceOnlyStrategy(sc.pcs) {
+ return true, groveerr.New(
+ groveerr.ErrCodeContinueReconcileAndRequeue,
+ component.OperationSync,
+ fmt.Sprintf("in-place only update is blocked for pod %s: %s", nextPodToUpdate.Name, reason),
+ )
+ }
+ logger.Info("Pod is not eligible for in-place update, falling back to rolling recreate", "pod", client.ObjectKeyFromObject(nextPodToUpdate), "reason", reason)
+ return false, nil
+ }
+
+ originalStatus := nextPodToUpdate.DeepCopy()
+ setInPlaceUpdateCondition(nextPodToUpdate, corev1.ConditionFalse, "StartInPlaceUpdate")
+ if err := r.client.Status().Patch(sc.ctx, nextPodToUpdate, client.MergeFrom(originalStatus)); err != nil {
+ return true, err
+ }
+
+ original := nextPodToUpdate.DeepCopy()
+ applyInPlaceUpdateSpec(nextPodToUpdate, spec)
+ if err := r.client.Patch(sc.ctx, nextPodToUpdate, client.MergeFrom(original)); err != nil {
+ return true, err
+ }
+ logger.Info("Started Pod in-place update", "pod", client.ObjectKeyFromObject(nextPodToUpdate), "targetPodTemplateHash", sc.expectedPodTemplateHash)
+ return true, groveerr.New(
+ groveerr.ErrCodeContinueReconcileAndRequeue,
+ component.OperationSync,
+ fmt.Sprintf("started in-place update for pod %s, requeuing", nextPodToUpdate.Name),
+ )
+}
+
+func (r 
_resource) buildDesiredPodForInPlaceUpdate(sc *syncContext, existingPod *corev1.Pod) (*corev1.Pod, error) { + podIndexStr := existingPod.Labels[common.LabelPodCliquePodIndex] + podIndex, err := strconv.Atoi(podIndexStr) + if err != nil { + return nil, fmt.Errorf("invalid pod index label %q on pod %s: %w", podIndexStr, existingPod.Name, err) + } + desiredPod := &corev1.Pod{} + if err := r.buildResource(sc.pcs, sc.pclq, sc.associatedPodGangName, desiredPod, podIndex); err != nil { + return nil, err + } + return desiredPod, nil +} + +func hasOldNonReadyPods(work *updateWork) bool { + return len(work.oldTemplateHashPendingPods) > 0 || + len(work.oldTemplateHashUnhealthyPods) > 0 || + len(work.oldTemplateHashStartingPods) > 0 || + len(work.oldTemplateHashUncategorizedPods) > 0 +} + // computeUpdateWork categorizes pods by template hash and state. // Old-hash pods: Pending, Unhealthy, Starting, Uncategorized, or Ready. // New-hash pods: Ready only. diff --git a/operator/internal/controller/podclique/components/pod/syncflow.go b/operator/internal/controller/podclique/components/pod/syncflow.go index 27c8be554..bd1678b0c 100644 --- a/operator/internal/controller/podclique/components/pod/syncflow.go +++ b/operator/internal/controller/podclique/components/pod/syncflow.go @@ -151,6 +151,12 @@ func (r _resource) runSyncFlow(logger logr.Logger, sc *syncContext) syncFlowResu } } + if componentutils.IsInPlaceUpdateStrategy(sc.pcs) { + if err := r.syncIdleInPlaceReadinessGates(logger, sc); err != nil { + result.recordError(err) + } + } + if componentutils.IsAutoUpdateStrategy(sc.pcs) && componentutils.IsPCLQAutoUpdateInProgress(sc.pclq) { if err := r.processPendingUpdates(logger, sc); err != nil { result.recordError(err) @@ -165,6 +171,24 @@ func (r _resource) runSyncFlow(logger logr.Logger, sc *syncContext) syncFlowResu return result } +func (r _resource) syncIdleInPlaceReadinessGates(logger logr.Logger, sc *syncContext) error { + for _, pod := range sc.existingPCLQPods { + 
original := pod.DeepCopy() + changed, err := markInPlaceUpdateReadyIfIdle(pod) + if err != nil { + return err + } + if !changed { + continue + } + if err := r.client.Status().Patch(sc.ctx, pod, client.MergeFrom(original)); err != nil { + return err + } + logger.Info("Marked idle Pod in-place readiness gate ready", "pod", client.ObjectKeyFromObject(pod)) + } + return nil +} + // syncExpectationsAndComputeDifference reconciles create/delete expectations with actual pod state and computes the replica difference // It takes in the existing pods and adjusts the captured create/delete expectations in the ExpectationStore. Post synchronization // it computes the difference of pods using => as-is-pods + pods-expecting-creation - desired-pods - pods-expecting-deletion