
pool: release ownership during rebalance #75

Merged
raphael merged 5 commits into main from fix-rebalance-job-ownership on May 18, 2026

raphael commented May 18, 2026

Summary

This PR fixes a Pulse pool ownership bug that could let the same durable job run on more than one worker after worker membership changed, especially during rolling updates or rebalances.

The important model is:

  • The job payload in Redis is the durable record that says the job exists.
  • The replicated job ownership map says which worker is currently executing that durable job.
  • During handoff, ownership may briefly move from one worker to another, but the payload must survive until either a worker has stopped the job or another worker can recover it.

Before this change, rebalance moved local execution but did not reliably move replicated ownership. That left stale owners in the ownership map, so later routing decisions and metrics could see multiple active owners for the same job key.
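The stale-owner problem can be illustrated with a small in-memory model. This is only a sketch: the `cluster` type and its methods are hypothetical stand-ins invented for illustration, not Pulse's types, and real Pulse keeps this state in Redis rather than in Go maps.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for the two pieces of state
// described above: durable payloads and the replicated ownership map.
type cluster struct {
	payloads map[string][]byte          // job key -> durable payload
	owners   map[string]map[string]bool // job key -> worker IDs claiming the job
}

func newCluster() *cluster {
	return &cluster{
		payloads: map[string][]byte{},
		owners:   map[string]map[string]bool{},
	}
}

// claim records that worker is executing the job.
func (c *cluster) claim(key, worker string) {
	if c.owners[key] == nil {
		c.owners[key] = map[string]bool{}
	}
	c.owners[key][worker] = true
}

// moveWithoutRelease models the old behaviour: the successor claims the job
// but the previous claim is never removed.
func (c *cluster) moveWithoutRelease(key, from, to string) {
	c.claim(key, to)
}

// moveWithRelease models the fixed behaviour: the old claim is removed and
// the payload is left untouched for the successor.
func (c *cluster) moveWithRelease(key, from, to string) {
	delete(c.owners[key], from)
	c.claim(key, to)
}

// ownerCount reports how many workers currently claim the job.
func (c *cluster) ownerCount(key string) int { return len(c.owners[key]) }

func main() {
	for _, fixed := range []bool{false, true} {
		c := newCluster()
		c.payloads["job-1"] = []byte("payload")
		c.claim("job-1", "worker-a")
		if fixed {
			c.moveWithRelease("job-1", "worker-a", "worker-b")
		} else {
			c.moveWithoutRelease("job-1", "worker-a", "worker-b")
		}
		_, payloadKept := c.payloads["job-1"]
		fmt.Printf("fixed=%v owners=%d payloadKept=%v\n", fixed, c.ownerCount("job-1"), payloadKept)
	}
}
```

In this model the unfixed move leaves two claimants for `job-1`, while the fixed move leaves exactly one and keeps the payload in both cases.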

What Changed

  • Rebalance now releases the old worker's ownership when a job moves, while preserving the durable payload for the successor worker.
  • Internal requeues and handoffs are marked as recoverable moves. If the successor worker fails to start, Pulse removes the failed ownership claim but keeps the payload so orphan recovery can retry the job instead of losing it.
  • Stop and notify events now route to the worker that currently owns the job instead of the worker selected by the current hash ring. This keeps job control events correct when workers are added, removed, or rebalanced.
  • Control events survive handoff gaps. If a durable payload exists but no active owner is visible yet, the event is left pending for redelivery instead of being acknowledged or dropped. If the payload is gone, the event is acknowledged because the job no longer exists.
  • Workers now require local ownership before handling stop or notify events. A stale worker that receives a control event for a job it no longer owns requeues the event rather than acknowledging it or running handler side effects.
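The control-event rules in the last three bullets reduce to two small decisions, one on the routing side and one on the receiving worker. The sketch below is an assumption-laden illustration: `action`, `routeControlEvent`, and `onWorker` are hypothetical names, and checking the owner before the payload is a choice made here, not something the PR description specifies.

```go
package main

import "fmt"

// action is a hypothetical label for what happens to a stop or notify event.
type action int

const (
	deliverToOwner action = iota // route to the worker that currently owns the job
	leavePending                 // keep the event pending so it is redelivered
	acknowledge                  // the job no longer exists, drop the event
	runHandler                   // worker owns the job locally, run the handler
	requeue                      // worker is stale, hand the event back
)

func (a action) String() string {
	switch a {
	case deliverToOwner:
		return "deliver to owner"
	case leavePending:
		return "leave pending"
	case acknowledge:
		return "acknowledge"
	case runHandler:
		return "run handler"
	case requeue:
		return "requeue"
	}
	return "unknown"
}

// routeControlEvent decides what the routing side does with a control event.
// owner is the worker ID from the ownership map, or "" if none is visible.
func routeControlEvent(owner string, payloadExists bool) action {
	switch {
	case owner != "":
		return deliverToOwner
	case payloadExists:
		return leavePending // handoff gap: payload exists, owner not visible yet
	default:
		return acknowledge
	}
}

// onWorker decides what a worker does when a control event reaches it.
// owned is the set of job keys the worker currently owns locally.
func onWorker(owned map[string]bool, key string) action {
	if owned[key] {
		return runHandler
	}
	return requeue
}

func main() {
	fmt.Println(routeControlEvent("worker-b", true))
	fmt.Println(routeControlEvent("", true))
	fmt.Println(routeControlEvent("", false))
	fmt.Println(onWorker(map[string]bool{}, "job-1"))
}
```

The point of the two-step check is that neither side acknowledges an event on a guess: the router only acknowledges when the payload is gone, and a stale worker never runs handler side effects.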

Reviewer Guide

Start with pool/worker.go and pool/node.go:

  • Worker.releaseJob is the shared path for stopping local execution and removing this worker's ownership while preserving the payload.
  • Worker.stopJob adds payload deletion on top of release, so deleting a job remains explicit.
  • Node.workerForEvent keeps start events hash-routed, but routes stop and notify through current ownership.
  • Node.jobPayloadExists checks Redis directly when no active owner is visible, because the durable payload is the source of truth during handoff.
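As a reading aid for the first two bullets, here is a minimal sketch of how a release path and a stop path can be layered. It is not the code in pool/worker.go: the `store` and `worker` types are hypothetical in-memory stand-ins, and only the method names are borrowed from the description above.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the shared state (Redis in real Pulse).
type store struct {
	payloads map[string][]byte // job key -> durable payload
	owners   map[string]string // job key -> owning worker ID
}

// worker is a hypothetical stand-in for a pool worker.
type worker struct {
	id      string
	running map[string]bool // jobs executing locally
	st      *store
}

// releaseJob stops local execution and removes this worker's ownership claim.
// The payload is deliberately left in place so a successor can pick it up.
func (w *worker) releaseJob(key string) {
	delete(w.running, key)
	if w.st.owners[key] == w.id {
		delete(w.st.owners, key)
	}
}

// stopJob is release plus explicit payload deletion.
func (w *worker) stopJob(key string) {
	w.releaseJob(key)
	delete(w.st.payloads, key)
}

func main() {
	st := &store{
		payloads: map[string][]byte{"job-1": []byte("p1"), "job-2": []byte("p2")},
		owners:   map[string]string{"job-1": "worker-a", "job-2": "worker-a"},
	}
	w := &worker{id: "worker-a", running: map[string]bool{"job-1": true, "job-2": true}, st: st}

	w.releaseJob("job-1") // rebalance: ownership gone, payload kept
	w.stopJob("job-2")    // explicit stop: ownership and payload gone

	_, p1 := st.payloads["job-1"]
	_, p2 := st.payloads["job-2"]
	fmt.Println("job-1 payload kept:", p1, "owners left:", len(st.owners))
	fmt.Println("job-2 payload kept:", p2)
}
```

Keeping deletion in the outer function means every caller that only wants to hand a job off goes through the payload-preserving path by default.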

Then review the tests in pool/worker_test.go, pool/node_test.go, and pool/marshal_test.go. They cover ownership transfer, handoff failure, stale control-event delivery, and the new requeue marker in job marshaling.

Lifecycle Coverage

This PR intentionally covers the main pool lifecycle transitions:

  • Startup and orphan recovery: durable payloads remain recoverable when ownership is missing.
  • Shutdown and explicit stop: payloads are deleted only through the stop path after local execution is released.
  • Rebalance and new workers: old ownership is removed, successor starts are recoverable, and stop/notify events are not lost during the handoff window.
  • New jobs: external dispatch still fails fast with ErrJobExists when a durable payload or active pending guard exists.
  • Worker removal: jobs are requeued as recoverable handoffs rather than treated as fresh dispatches.
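The difference between a fresh dispatch and a recoverable handoff can be sketched as follows. Everything here is hypothetical: `pool`, `dispatch`, `orphans`, and the local `errJobExists` are illustration-only stand-ins, the pending guard is omitted, and dropping the payload when a fresh dispatch fails to start is an assumption of this sketch that the PR description does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// errJobExists is a local stand-in for the ErrJobExists error named above.
var errJobExists = errors.New("job already exists")

// pool is a hypothetical in-memory model of payloads and ownership.
type pool struct {
	payloads map[string][]byte
	owners   map[string]string
}

// dispatch places a job on worker. recoverable marks an internal requeue or
// handoff; external dispatches pass false and fail fast on an existing payload.
func (p *pool) dispatch(key string, payload []byte, worker string, recoverable bool, start func() error) error {
	if _, exists := p.payloads[key]; exists && !recoverable {
		return errJobExists
	}
	p.payloads[key] = payload
	p.owners[key] = worker
	if err := start(); err != nil {
		delete(p.owners, key) // always remove the failed ownership claim
		if !recoverable {
			delete(p.payloads, key) // assumption: a failed fresh dispatch leaves nothing behind
		}
		return err
	}
	return nil
}

// orphans lists job keys that have a payload but no owner, which is what
// orphan recovery would retry.
func (p *pool) orphans() []string {
	var keys []string
	for k := range p.payloads {
		if _, owned := p.owners[k]; !owned {
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	p := &pool{payloads: map[string][]byte{}, owners: map[string]string{}}
	ok := func() error { return nil }
	fail := func() error { return errors.New("start failed") }

	fmt.Println(p.dispatch("job-1", []byte("p"), "worker-a", false, ok))   // first dispatch
	fmt.Println(p.dispatch("job-1", []byte("p"), "worker-b", false, ok))   // duplicate dispatch
	fmt.Println(p.dispatch("job-1", []byte("p"), "worker-b", true, fail))  // failed handoff
	fmt.Println(p.orphans())
}
```

After the failed handoff, `job-1` has a payload and no owner, so it shows up in `orphans` and can be retried rather than lost.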

Test Plan

  • GOWORK=off go test ./pool -count=1
  • GOWORK=off go test -p 1 ./... -count=1

CI is green for the current head.

raphael added 5 commits May 18, 2026 10:12
Rebalance now moves job ownership instead of only moving execution, preserving the singleton owner invariant in the replicated job map while keeping payloads available for the new worker.
Internal requeues and ownership moves now keep the durable payload if a successor worker fails to start, while stop and notify events route to the current owner instead of the current hash target.
raphael merged commit 4c3bed6 into main on May 18, 2026 (5 checks passed)
raphael deleted the fix-rebalance-job-ownership branch on May 18, 2026 21:38