feat: switch core StatefulSet to in-place rolling updates#1179
Merged
Replace blue-green deployment (hash-suffixed StatefulSet per template change) with a single deterministically-named StatefulSet using OnDelete update strategy. The operator detects outdated pods by comparing controller-revision-hash against StatefulSet.status.updateRevision, and deletes them one at a time (highest ordinal first) after session evacuation. Replicant multi-ReplicaSet pattern is intentionally retained.
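The detection and ordering described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `Pod` type and the `ordinal` and `outdatedPods` helpers are hypothetical names, and session evacuation is left out entirely.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Pod is a simplified, hypothetical stand-in for a StatefulSet pod.
type Pod struct {
	Name         string // e.g. "emqx-core-2"
	RevisionHash string // value of the controller-revision-hash label
}

// ordinal parses the numeric suffix after the last "-" in a pod name.
// It returns -1 when the name has no numeric suffix.
func ordinal(name string) int {
	i := strings.LastIndex(name, "-")
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// outdatedPods returns the pods whose revision differs from updateRevision
// (StatefulSet.status.updateRevision), ordered highest ordinal first.
func outdatedPods(pods []Pod, updateRevision string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.RevisionHash != updateRevision {
			out = append(out, p)
		}
	}
	// Compare ordinals numerically so that "-10" sorts above "-2".
	sort.Slice(out, func(a, b int) bool {
		return ordinal(out[a].Name) > ordinal(out[b].Name)
	})
	return out
}

func main() {
	pods := []Pod{
		{"emqx-core-0", "rev-a"},
		{"emqx-core-1", "rev-b"},
		{"emqx-core-2", "rev-a"},
	}
	// With OnDelete, only the first candidate would be deleted per pass.
	for _, p := range outdatedPods(pods, "rev-b") {
		fmt.Println(p.Name)
	}
}
```

With the OnDelete strategy the StatefulSet controller recreates a deleted pod from the update revision, so a reconciler built along these lines would only need to delete the first candidate and wait for it to come back.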
Remove fields that:
* Are no longer relevant for single-core-set operation.
* Duplicate existing status fields from other resources.
This field was duplicating spec fields verbatim. This commit drops it, making the status slimmer and enforcing a single source of truth.
This commit makes API field naming simpler and clearer.
This Kubernetes version has been unsupported for quite some time.
This commit simplifies the operational model by removing extra readiness gates.
1. No more `on-serving` readiness gate. Instead, the container readiness probe points directly to the "availability check" endpoint of the Evacuation API. Users can still override it, but this is strongly discouraged.
2. As a result, `oldestCoreRequester()` now effectively does not consider nodes that are in the process of evacuation.
3. In turn, the direct `forPod(...)` API is allowed to point to non-ready-but-running EMQX nodes.
This commit ensures that core set rolling update can progress even if an outdated core node is a DS replication site. Since such rolling updates preserve persistent data, this should be safe.
This commit re-evaluates and significantly simplifies the set of EMQX CR conditions, and makes condition descriptions more informative.
1. Retires the separate `Initialized`, `CoreNodesReady`, and `ReplicantNodesReady` conditions.
2. Conditions are evaluated independently, with no more state machine transitions.
3. Reconcilers no longer consult conditions and use internal APIs instead.

Also stop ignoring errors in the load state reconciler.
This commit fixes an issue with in-place rolling updates where core nodes constantly tried to leave and then rejoin the cluster. This behavior was needed for blue-green updates but does not make sense for the new approach.
This makes EMQX CR status a bit more consistent and helps with observability.
This commit corrects the "cores are available" criterion for core pod removal safety: all cores except the candidate are now counted, and NumReplicas-1 available cores are considered sufficient.
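A minimal sketch of this criterion, assuming that "available" can be approximated by a per-pod readiness flag. The `Core` type and the `safeToRemove` function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Core is a simplified, hypothetical view of a core pod.
type Core struct {
	Name  string
	Ready bool // assumption: readiness stands in for "available"
}

// safeToRemove reports whether the candidate core may be removed:
// counting every core except the candidate, at least numReplicas-1
// of them must be ready.
func safeToRemove(cores []Core, candidate string, numReplicas int) bool {
	ready := 0
	for _, c := range cores {
		if c.Name != candidate && c.Ready {
			ready++
		}
	}
	return ready >= numReplicas-1
}

func main() {
	cores := []Core{{"core-0", true}, {"core-1", true}, {"core-2", false}}
	// The candidate's own readiness is not counted.
	fmt.Println(safeToRemove(cores, "core-2", 3))
	// Besides core-1, only core-0 is ready, which is fewer than NumReplicas-1.
	fmt.Println(safeToRemove(cores, "core-1", 3))
}
```

Note that with a single replica the threshold becomes zero, so under this formulation a lone core is always removable.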
This commit allows single-node core sets to complete rolling updates successfully, instead of them becoming stuck on node evacuation.
This commit ensures that users are not allowed to create non-rolling-updatable EMQX clusters. Since replicants are still updated in a blue-green manner, during updates involving EMQX version upgrade at least 1 older-version core and 1 newer-version core need to be running in a cluster.
This commit adds additional safeguards around core and replicant rolling updates, to accommodate the EMQX requirement that a replicant connects to a core of exactly the same version:
1. Updated replicant sets are not allowed to spin up until at least 1 core is rolling-updated.
2. The update of the last core is postponed until the "current" replicant set is fully migrated.
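The two safeguards above can be read as a pair of gating predicates. The sketch below is only an illustration under that reading: `RolloutState` and both functions are hypothetical names, and the real reconcilers derive this state from cluster resources rather than from a plain struct.

```go
package main

import "fmt"

// RolloutState is a hypothetical snapshot of a rolling update in progress.
type RolloutState struct {
	UpdatedCores          int // cores already on the update revision
	OutdatedCores         int // cores still on the old revision
	CurrentReplicantsLeft int // pods of the "current" replicant set not yet migrated
}

// canStartUpdatedReplicants: updated replicants need at least one
// updated core to connect to.
func canStartUpdatedReplicants(s RolloutState) bool {
	return s.UpdatedCores >= 1
}

// canUpdateNextCore: the last outdated core is kept until the "current"
// replicant set is fully migrated, so old replicants keep a matching core.
func canUpdateNextCore(s RolloutState) bool {
	if s.OutdatedCores == 0 {
		return false // nothing left to update
	}
	if s.OutdatedCores == 1 && s.CurrentReplicantsLeft > 0 {
		return false // postpone the last core
	}
	return true
}

func main() {
	start := RolloutState{UpdatedCores: 0, OutdatedCores: 3, CurrentReplicantsLeft: 2}
	fmt.Println(canStartUpdatedReplicants(start), canUpdateNextCore(start))

	late := RolloutState{UpdatedCores: 2, OutdatedCores: 1, CurrentReplicantsLeft: 2}
	fmt.Println(canStartUpdatedReplicants(late), canUpdateNextCore(late))
}
```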
This commit ensures that scale-down picks candidates statically and deterministically.
This should give relevant controllers enough time to update respective resource statuses, to make the rest of the reconcilers chain rely on up-to-date information.
This commit also improves `dsCleanupSites` reconciler observability.
This commit improves stability of `dsUpdateReplicaSets` reconciler. It now avoids consulting both EMQX runtime cluster state and DS cluster state (using instead only the latter) as the former can sometimes fail to include nodes that are in the process of starting or stopping, which can cause unwarranted target replica set changes.
This is an important prerequisite for in-place rolling updates: persistent data is now expected to survive pod deletion. Also enforce a suitable PVC retention policy: PVs / PVCs should survive regular pod deletions (as part of rolling updates) but not parent StatefulSet scale-down or deletion.
This commit fixes the consequence of introducing volume persistence to core set pods by default: node evacuation state survives pod recreation, and thus needs to be stopped explicitly.
This new reconciler is responsible for force-leaving nodes out of the EMQX cluster view, to keep it consistent with current core and replicant sets.
This ensures that potentially not-entirely-complete reconciliations have a chance to be retried earlier than in 30 seconds, in case EMQX CR is considered Ready.
This ensures there are no special conditions for choosing between shorter and longer requeue timeouts.
This commit ensures that the listeners service targets the correct set of pods at all times: cores if there are no replicants, otherwise "update" replicants if at least 1 is ready, otherwise "current" replicants if at least 1 is ready.
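The selection order can be sketched as a small decision function. The `Target` constants and `listenerTarget` are hypothetical names. The commit message does not say what happens when replicants exist but none are ready, so the `TargetNone` fallback here is an assumption.

```go
package main

import "fmt"

// Target names the pod set a listeners service would select (hypothetical).
type Target string

const (
	TargetNone              Target = "" // assumption: no ready replicants at all
	TargetCores             Target = "cores"
	TargetUpdateReplicants  Target = "update-replicants"
	TargetCurrentReplicants Target = "current-replicants"
)

// listenerTarget applies the priority order from the commit message:
// cores when there are no replicants, then "update" replicants,
// then "current" replicants, each requiring at least one ready pod.
func listenerTarget(hasReplicants bool, updateReady, currentReady int) Target {
	switch {
	case !hasReplicants:
		return TargetCores
	case updateReady >= 1:
		return TargetUpdateReplicants
	case currentReady >= 1:
		return TargetCurrentReplicants
	default:
		return TargetNone
	}
}

func main() {
	fmt.Println(listenerTarget(false, 0, 0)) // core-only cluster
	fmt.Println(listenerTarget(true, 1, 3))  // update set has a ready pod
	fmt.Println(listenerTarget(true, 0, 3))  // fall back to the current set
}
```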
This is a workaround to allow label-only core set updates to complete.
This commit ensures that label-only changes are applicable to managed resources: without this, changing core template labels was either ignored or might have caused the controller to get stuck in a retry loop, because selector updates for StatefulSets are in general prohibited.
2685270 to
966507e
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main-3.x #1179 +/- ##
===========================================
Coverage ? 74.17%
===========================================
Files ? 48
Lines ? 3667
Branches ? 0
===========================================
Hits ? 2720
Misses ? 801
Partials ? 146
966507e to
62bd468
This commit ensures that DS reconcilers prefer to work with the same cluster view, preferably sourced from 6.x, primarily because EMQX clusters running 6.1.0 and newer can have separate cluster views different from 6.0.x and earlier.
This commit eliminates a few linter complaints.
62bd468 to
8a234d6
Member
|
when would it be the good timing to change |
Contributor
Author
|
@zmstone The plan is to switch to |
zmstone
approved these changes
Apr 8, 2026
This commit fixes an uncommon issue where a core set having more than 10 pods was rolling updated in an incorrect order.
Specifically, this commit describes how evacuations are managed and what users should expect.
zmstone
approved these changes
Apr 8, 2026
Summary
This PR reworks how the Operator manages the set of EMQX core nodes.
`v3alpha1` version.
`minReadySeconds`, consistent naming).
`on-serving` gate has been retired.
See individual commits for details.
Core set rolling update
OnDelete strategy.
Important notes