pool: release ownership during rebalance#75
Merged
Conversation
Rebalance now moves job ownership instead of only moving execution, preserving the singleton owner invariant in the replicated job map while keeping payloads available for the new worker.
Internal requeues and ownership moves now keep the durable payload if a successor worker fails to start, while stop and notify events route to the current owner instead of the current hash target.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes a Pulse pool ownership bug that could let the same durable job run on more than one worker after worker membership changed, especially during rolling updates or rebalances.
The important model is:
Before this change, rebalance moved local execution but did not reliably move replicated ownership. That left stale owners in the ownership map, so later routing decisions and metrics could see multiple active owners for the same job key.
What Changed
Reviewer Guide
Start with
pool/worker.goandpool/node.go:Worker.releaseJobis the shared path for stopping local execution and removing this worker's ownership while preserving the payload.Worker.stopJobadds payload deletion on top of release, so deleting a job remains explicit.Node.workerForEventkeeps start events hash-routed, but routes stop and notify through current ownership.Node.jobPayloadExistschecks Redis directly when no active owner is visible, because the durable payload is the source of truth during handoff.Then review the tests in
pool/worker_test.go,pool/node_test.go, andpool/marshal_test.go. They cover ownership transfer, handoff failure, stale control-event delivery, and the new requeue marker in job marshaling.Lifecycle Coverage
This PR intentionally covers the main pool lifecycle transitions:
ErrJobExistswhen a durable payload or active pending guard exists.Test Plan
GOWORK=off go test ./pool -count=1GOWORK=off go test -p 1 ./... -count=1CI is green for the current head.