
feat: switch core StatefulSet to in-place rolling updates #1179

Merged
keynslug merged 51 commits into main-3.x from feat/EMQX-15033/rolling-upgrade
Apr 8, 2026

Conversation

@keynslug
Contributor

@keynslug keynslug commented Apr 2, 2026

Summary

This PR reworks how the Operator manages the set of EMQX core nodes.

  1. There is now a single managed StatefulSet for core nodes (aka the "core set").
  2. Rolling updates happen in-place, without migrations across 2 or more separate StatefulSets.
  3. The core template now includes a default PVC; running with ephemeral volumes is strongly discouraged.
  4. The EMQX CRD is now at version v3alpha1.
  5. EMQX CR Status became slimmer: less duplication, fewer status conditions.
  6. EMQX CR Spec is slightly more conventional (e.g. minReadySeconds, consistent naming).
  7. There are no extra readiness gates: the on-serving gate has been retired.
  8. Replicants are still updated in the blue-green manner (will be addressed in follow-up PRs).

See individual commits for details.

Core set rolling update

  • The core set uses the underlying StatefulSet's OnDelete update strategy.
  • Because replicants require a same-version core to function:
    • At least 1 core needs to be updated before a new replicant set is spun up.
    • At least 1 old-version core node needs to be preserved until old replicants are migrated and scaled down.
    • There is now a strict requirement to have more than 1 core in core-replicant clusters.

Important notes

  1. This should be considered WIP for the most part.
  2. There are no backward compatibility measures.
  3. Rebalance CRD is temporarily disabled.

keynslug added 30 commits April 2, 2026 15:04
Replace blue-green deployment (hash-suffixed StatefulSet per template change)
with a single deterministically-named StatefulSet using OnDelete update strategy.
The operator detects outdated pods by comparing controller-revision-hash against
StatefulSet.status.updateRevision, and deletes them one at a time (highest ordinal
first) after session evacuation.

Replicant multi-ReplicaSet pattern is intentionally retained.
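
The selection step described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `Pod` struct, `outdatedPods` and `nextToDelete` are hypothetical names, pods are reduced to an ordinal plus the value of their `controller-revision-hash` label, and the session evacuation that precedes deletion is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type Pod struct {
	Ordinal  int    // StatefulSet ordinal of the pod
	Revision string // value of the "controller-revision-hash" label
}

// outdatedPods returns the pods whose revision differs from the
// StatefulSet's status.updateRevision, highest ordinal first.
func outdatedPods(pods []Pod, updateRevision string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Revision != updateRevision {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Ordinal > out[j].Ordinal })
	return out
}

// nextToDelete picks at most one pod per pass, so pods are replaced
// one at a time. The boolean is false when nothing is outdated.
func nextToDelete(pods []Pod, updateRevision string) (Pod, bool) {
	out := outdatedPods(pods, updateRevision)
	if len(out) == 0 {
		return Pod{}, false
	}
	return out[0], true
}

func main() {
	// Revision strings below are made up for the example.
	pods := []Pod{
		{Ordinal: 0, Revision: "rev-old"},
		{Ordinal: 1, Revision: "rev-old"},
		{Ordinal: 2, Revision: "rev-new"},
	}
	if p, ok := nextToDelete(pods, "rev-new"); ok {
		fmt.Printf("next pod to delete: ordinal %d\n", p.Ordinal)
	}
}
```

With the OnDelete strategy the StatefulSet controller recreates a deleted pod from the updated template, which is why deleting one outdated pod at a time amounts to a rolling update.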
Remove fields that:
 * Are no longer relevant for single-core-set style operation.
 * Duplicate existing status fields from other resources.
This field duplicated spec fields verbatim. This commit drops
it, making status slimmer and enforcing a single source of
truth.
This commit makes API field naming simpler and clearer.
This Kubernetes version has been unsupported for quite some time.
This commit simplifies the operational model by removing extra readiness
gates.
1. No more `on-serving` readiness gate.
   Instead, the container readiness probe points directly to the
   "availability check" endpoint of the Evacuation API. Users can still
   override it, but this is strongly discouraged.
2. As a result, `oldestCoreRequester()` now effectively does not
   consider nodes that are in the process of evacuation.
3. In turn, the direct `forPod(...)` API is allowed to point to
   non-ready-but-running EMQX nodes.
This commit ensures that core set rolling update can progress even
if an outdated core node is a DS replication site. Since such
rolling updates preserve persistent data, this should be safe.
This commit re-evaluates and significantly simplifies the set of
EMQX CR conditions, and makes condition descriptions more
informative.
1. Retires separate `Initialized`, `CoreNodesReady`,
   `ReplicantNodesReady`.
2. Conditions are evaluated independently, no more state machine
   transitions.
3. Reconcilers do not consult conditions anymore, prefer to use
   internal APIs instead.

Also stop ignoring errors in load state reconciler.
This commit fixes an issue with in-place rolling updates where
core nodes constantly tried to leave and then rejoin the cluster.
This was needed for blue-green updates but does not make sense
for the new approach.
This makes EMQX CR status a bit more consistent and helps with
observability.
This commit corrects the "cores are available" criterion for core
pod removal safety: now all cores except the candidate are
counted, and NumReplicas-1 available cores are considered sufficient.
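
A rough sketch of this criterion, under the assumption that "available" can be reduced to a per-core readiness flag. The `Core` type and `safeToRemove` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Core is a simplified, hypothetical view of a core pod.
type Core struct {
	Name  string
	Ready bool
}

// safeToRemove reports whether the candidate core may be deleted.
// Every core other than the candidate is counted, and replicas-1
// ready cores are treated as sufficient.
func safeToRemove(cores []Core, candidate string, replicas int) bool {
	ready := 0
	for _, c := range cores {
		if c.Name != candidate && c.Ready {
			ready++
		}
	}
	return ready >= replicas-1
}

func main() {
	// Pod names are made up for the example.
	cores := []Core{
		{Name: "core-0", Ready: true},
		{Name: "core-1", Ready: true},
		{Name: "core-2", Ready: false},
	}
	fmt.Println(safeToRemove(cores, "core-2", 3)) // the two other cores are ready
	fmt.Println(safeToRemove(cores, "core-1", 3)) // core-2 is not ready yet
}
```

Note that with a single replica the threshold is zero, so under this sketch a one-node core set is never blocked by the criterion.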
This commit allows single-node core sets to complete rolling
updates successfully, instead of them becoming stuck on node
evacuation.
This commit ensures that users are not allowed to create
non-rolling-updatable EMQX clusters.

Since replicants are still updated in a blue-green manner, during
updates involving EMQX version upgrade at least 1 older-version
core and 1 newer-version core need to be running in a cluster.
This commit adds additional safeguards around the rolling update of
cores and replicants, to accommodate the "replicant connects to a core
of exactly the same version" EMQX requirement:
1. Updated replicant sets are not allowed to spin up until at least
   1 core is rolling-updated.
2. The update of the last core is postponed until the "current"
   replicant set is fully migrated.
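
The two safeguards above can be sketched as a pair of predicates. This is only an illustration under simplifying assumptions: the `UpdateState` struct and both function names are hypothetical, and "fully migrated" is approximated as "no pods remain in the current replicant set".

```go
package main

import "fmt"

// UpdateState is a hypothetical summary of rolling update progress.
type UpdateState struct {
	CoresTotal                 int // desired number of core pods
	CoresUpdated               int // cores already running the new revision
	CurrentReplicantsRemaining int // pods left in the "current" (old) replicant set
}

// canStartUpdatedReplicants mirrors safeguard 1: an updated replicant
// set needs at least one updated core to connect to.
func canStartUpdatedReplicants(s UpdateState) bool {
	return s.CoresUpdated >= 1
}

// canUpdateNextCore mirrors safeguard 2: the last outdated core is kept
// until the "current" replicant set no longer needs it.
func canUpdateNextCore(s UpdateState) bool {
	outdated := s.CoresTotal - s.CoresUpdated
	if outdated <= 0 {
		return false // nothing left to update
	}
	if outdated == 1 && s.CurrentReplicantsRemaining > 0 {
		return false // old replicants still depend on this core
	}
	return true
}

func main() {
	s := UpdateState{CoresTotal: 3, CoresUpdated: 2, CurrentReplicantsRemaining: 4}
	fmt.Println(canStartUpdatedReplicants(s), canUpdateNextCore(s))
}
```

In this sketch a cluster without replicants has `CurrentReplicantsRemaining` equal to zero, so the last core is never held back.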
This commit ensures that scale-down picks candidates statically and
deterministically.
This should give relevant controllers enough time to update
respective resource statuses, to make the rest of the reconcilers
chain rely on up-to-date information.
This commit also improves `dsCleanupSites` reconciler observability.
This commit improves stability of `dsUpdateReplicaSets` reconciler.
It now avoids consulting both EMQX runtime cluster state and DS
cluster state (using instead only the latter) as the former can
sometimes fail to include nodes that are in the process of starting
or stopping, which can cause unwarranted target replica set changes.
This is an important prerequisite for in-place rolling updates:
persistent data is now expected to survive pod deletion.

Also enforce suitable PVC retention policy: PVs / PVCs should
survive regular pod deletions (as part of rolling updates) but
not parent StatefulSet scale-down or deletion.
This commit fixes the consequence of introducing volume persistence
to core set pods by default: node evacuation state survives pod
recreation, and thus needs to be stopped explicitly.
This new reconciler is responsible for force-leaving nodes out of
the EMQX cluster view, to keep it consistent with current core and
replicant sets.
keynslug added 8 commits April 7, 2026 10:39
This ensures that potentially incomplete reconciliations
have a chance to be retried sooner than after 30 seconds, when the
EMQX CR is considered Ready.
This ensures there are no special conditions for choosing between
shorter and longer requeue timeouts.
This commit ensures that the listeners service targets the correct set of
pods at all times: cores if there are no replicants, otherwise "update"
replicants if at least 1 is ready, otherwise "current"
replicants if at least 1 is ready.
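
The selection order described here could look roughly like the sketch below. The `Target` type, its values and `selectListenerTarget` are hypothetical names. The commit message does not say what happens when replicants exist but none are ready, so the sketch reports that case as undecided rather than guessing.

```go
package main

import "fmt"

// Target names the set of pods the listeners service should select.
type Target string

const (
	TargetCores             Target = "cores"
	TargetUpdateReplicants  Target = "update-replicants"
	TargetCurrentReplicants Target = "current-replicants"
)

// selectListenerTarget applies the priority order from the commit
// message. The boolean is false when no rule applies (replicants are
// configured but none are ready); that case is left to the caller.
func selectListenerTarget(hasReplicants bool, updateReady, currentReady int) (Target, bool) {
	switch {
	case !hasReplicants:
		return TargetCores, true
	case updateReady >= 1:
		return TargetUpdateReplicants, true
	case currentReady >= 1:
		return TargetCurrentReplicants, true
	default:
		return "", false
	}
}

func main() {
	target, ok := selectListenerTarget(true, 0, 3)
	fmt.Println(target, ok)
}
```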
This is a workaround to allow label-only core set updates to
complete.
This commit ensures that label-only changes are applicable to
managed resources: without this, changes to core template labels
were either ignored or could cause the controller to get stuck in
a retry loop, because selector updates for StatefulSets are in
general prohibited.
@keynslug keynslug force-pushed the feat/EMQX-15033/rolling-upgrade branch from 2685270 to 966507e Compare April 7, 2026 09:38
@codecov

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 83.58663% with 162 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main-3.x@a2885da). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/controller/sync_core_set.go | 79.18% | 37 Missing and 14 partials ⚠️ |
| internal/controller/load_state.go | 61.53% | 19 Missing and 11 partials ⚠️ |
| internal/controller/add_replicant_set.go | 84.21% | 10 Missing and 2 partials ⚠️ |
| internal/controller/ds_update_replica_sets.go | 64.00% | 6 Missing and 3 partials ⚠️ |
| internal/controller/rebalance_controller.go | 0.00% | 8 Missing ⚠️ |
| internal/controller/sync_emqx_config.go | 61.11% | 7 Missing ⚠️ |
| internal/controller/update_emqx_status.go | 96.47% | 5 Missing and 1 partial ⚠️ |
| internal/controller/util/pod.go | 75.00% | 4 Missing and 2 partials ⚠️ |
| internal/controller/add_core_set.go | 95.57% | 3 Missing and 2 partials ⚠️ |
| internal/emqx/api/evacuation.go | 50.00% | 4 Missing and 1 partial ⚠️ |

... and 11 more
Additional details and impacted files
@@             Coverage Diff             @@
##             main-3.x    #1179   +/-   ##
===========================================
  Coverage            ?   74.17%           
===========================================
  Files               ?       48           
  Lines               ?     3667           
  Branches            ?        0           
===========================================
  Hits                ?     2720           
  Misses              ?      801           
  Partials            ?      146           


@keynslug keynslug force-pushed the feat/EMQX-15033/rolling-upgrade branch from 966507e to 62bd468 Compare April 7, 2026 11:27
keynslug added 6 commits April 7, 2026 13:41
This commit ensures that DS reconcilers prefer to work with the same
cluster view, preferably sourced from 6.x, primarily because EMQX
clusters running 6.1.0 and newer can have separate cluster views
different from 6.0.x and earlier.
This commit eliminates a few linter complaints.
@keynslug keynslug force-pushed the feat/EMQX-15033/rolling-upgrade branch from 62bd468 to 8a234d6 Compare April 7, 2026 11:44
@keynslug keynslug marked this pull request as ready for review April 7, 2026 13:47
@zmstone
Member

zmstone commented Apr 7, 2026

When would be a good time to change v3alpha1 to v3?

@keynslug
Contributor Author

keynslug commented Apr 8, 2026

@zmstone The plan is to switch to v3beta1 once the API looks reasonable, future-proof, and ready for release, and to v3 once the rest of the Operator v3 features are in and the API is not going to change anymore.

keynslug added 5 commits April 8, 2026 13:05
This commit fixes an uncommon issue where a core set with more
than 10 pods was rolling-updated in an incorrect order.
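
The commit message does not state the root cause. One common way to hit this kind of bug, offered here purely as an assumption, is ordering pod names as strings, where an ordinal of 10 sorts before 2. A minimal sketch of ordering by numeric ordinal instead follows; the pod names, `ordinal` and `sortByOrdinalDesc` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ordinal extracts the numeric suffix that StatefulSets append to pod
// names, e.g. "emqx-core-10" yields 10.
func ordinal(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", podName)
	}
	return strconv.Atoi(podName[i+1:])
}

// sortByOrdinalDesc orders pod names highest ordinal first. Names
// without a parseable ordinal are placed last.
func sortByOrdinalDesc(names []string) {
	sort.SliceStable(names, func(a, b int) bool {
		oa, errA := ordinal(names[a])
		ob, errB := ordinal(names[b])
		if errA != nil || errB != nil {
			return errA == nil && errB != nil
		}
		return oa > ob
	})
}

func main() {
	names := []string{"emqx-core-2", "emqx-core-10", "emqx-core-1"}
	sortByOrdinalDesc(names)
	fmt.Println(names)
}
```

A plain string sort in descending order would put `emqx-core-2` ahead of `emqx-core-10`, which only shows up once a set has more than 10 pods.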
Specifically, this commit mentions how evacuations are managed and
what should be user expectations.
@keynslug keynslug merged commit 2abd471 into main-3.x Apr 8, 2026
14 checks passed
@keynslug keynslug deleted the feat/EMQX-15033/rolling-upgrade branch April 8, 2026 12:29