
Synchronize gateway liveness (#15096) #15150

Open
officialasishkumar wants to merge 2 commits into linkerd:main from officialasishkumar:officialasishkumar/service-mirror-gatewayalive-sync

@officialasishkumar

Problem

RemoteClusterServiceWatcher updates gateway liveness from a background goroutine while mirror endpoint reconciliation reads the same state without synchronization. This leaves the multicluster service mirror vulnerable to race detector failures and inconsistent readiness updates under gateway flapping.

Solution

Protect gateway liveness with a dedicated RWMutex-backed accessor, route the liveness loop through a helper that uses those accessors, and update existing tests to use the synchronized setter. Add a race-focused regression test that exercises the liveness watcher concurrently with endpoint readiness updates.
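As a rough illustration of the RWMutex-backed accessor pattern described above, here is a minimal sketch. The type `gatewayState` and the methods `setAlive` and `isAlive` are hypothetical names chosen for this example; they are not taken from the PR's diff, and the real watcher carries far more state.

```go
package main

import (
	"fmt"
	"sync"
)

// gatewayState is a hypothetical stand-in for the watcher's liveness field.
// The real RemoteClusterServiceWatcher is assumed to hold much more state.
type gatewayState struct {
	mu    sync.RWMutex
	alive bool
}

// setAlive takes the write lock, so a concurrent reader never observes
// the field while it is being written.
func (g *gatewayState) setAlive(alive bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.alive = alive
}

// isAlive takes the read lock; multiple readers may hold it at once.
func (g *gatewayState) isAlive() bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.alive
}

func main() {
	var g gatewayState
	var wg sync.WaitGroup

	// Writer: plays the role of the background liveness loop.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			g.setAlive(i%2 == 0)
		}
	}()

	// Reader: plays the role of endpoint reconciliation.
	for i := 0; i < 1000; i++ {
		_ = g.isAlive()
	}

	wg.Wait()
	fmt.Println("final liveness:", g.isAlive())
}
```

Running this with `go run -race` should not report a race, because every access to `alive` goes through the lock.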

Validation

  • go test ./multicluster/service-mirror/...
  • go test -race ./multicluster/service-mirror/...
  • go test -race ./multicluster/service-mirror -run TestGatewayAliveSynchronization -count=1
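For readers unfamiliar with race-focused tests, the following sketch shows the general shape of what such a test exercises: one writer flipping liveness while several readers poll it. It is not the PR's `TestGatewayAliveSynchronization`; the helpers `newLockedFlag` and `hammer` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// newLockedFlag returns a setter and getter over a mutex-protected bool.
// Hypothetical helper standing in for the watcher's synchronized accessors.
func newLockedFlag() (set func(bool), get func() bool) {
	var mu sync.RWMutex
	alive := false
	set = func(v bool) {
		mu.Lock()
		defer mu.Unlock()
		alive = v
	}
	get = func() bool {
		mu.RLock()
		defer mu.RUnlock()
		return alive
	}
	return set, get
}

// hammer runs one writer that toggles liveness and `readers` goroutines that
// read it concurrently, then returns the total number of reads performed.
func hammer(set func(bool), get func() bool, readers, iterations int) int {
	var wg sync.WaitGroup
	counts := make([]int, readers) // one slot per reader, so no sharing

	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < iterations; i++ {
			set(i%2 == 0)
		}
	}()

	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				_ = get()
				counts[r]++
			}
		}(r)
	}

	wg.Wait()
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	set, get := newLockedFlag()
	fmt.Println("reads:", hammer(set, get, 4, 1000))
}
```

If the setter and getter touched the bool without any synchronization, running the same program under `-race` would be expected to flag the concurrent access; with the lock in place it should run clean.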

Fixes #15096

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

@officialasishkumar officialasishkumar requested a review from a team as a code owner April 9, 2026 17:17
@raykroeker raykroeker self-assigned this Apr 23, 2026
Contributor

@raykroeker raykroeker left a comment

The lack of gatewayAlive synchronization was discussed as part of the initial code's introduction.

I agree with the original conversation: this is not practically a data race, even though static analysis suggests it is.

Member

@adleong adleong left a comment

Since this field is just a single bool, did you consider using an atomic bool instead of adding a mutex?

@raykroeker
Contributor

> Since this field is just a single bool, did you consider using an atomic bool instead of adding a mutex?

I false-started an implementation that used an atomic bool before I found the attached PR.
@officialasishkumar, do you mind adjusting your approach?

Problem

RemoteClusterServiceWatcher updates gateway liveness from a background goroutine while mirror endpoint reconciliation reads the same state without synchronization. This leaves the multicluster service mirror vulnerable to race detector failures and inconsistent readiness updates under gateway flapping.

Solution

Protect gateway liveness with an atomic bool accessor, route the liveness loop through a helper that uses those accessors, and update existing tests to use the synchronized setter. Add a race-focused regression test that exercises the liveness watcher concurrently with endpoint readiness updates.

Validation

- go test ./multicluster/service-mirror/...
- go test -race ./multicluster/service-mirror/...
- go test -race ./multicluster/service-mirror -run TestGatewayAliveSynchronization -count=1

Fixes linkerd#15096

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@officialasishkumar officialasishkumar force-pushed the officialasishkumar/service-mirror-gatewayalive-sync branch from c02c418 to de0bb5f Compare April 29, 2026 21:07
@officialasishkumar
Author

Replaced the mutex-protected bool with sync/atomic.Bool in de0bb5f and kept access through the existing helpers.
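For context, a minimal sketch of what swapping a mutex-protected bool for `sync/atomic.Bool` (available since Go 1.19) can look like while keeping the same accessor surface. The type and method names here are hypothetical and are not copied from de0bb5f.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gatewayState is a hypothetical stand-in for the watcher's liveness field.
type gatewayState struct {
	alive atomic.Bool // zero value is false, ready to use
}

// setAlive and isAlive keep the helper-based access pattern, so callers
// do not need to know whether a mutex or an atomic sits underneath.
func (g *gatewayState) setAlive(alive bool) { g.alive.Store(alive) }
func (g *gatewayState) isAlive() bool       { return g.alive.Load() }

func main() {
	var g gatewayState
	var wg sync.WaitGroup

	// Writer: simulates a flapping gateway probe.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			g.setAlive(i%2 == 0)
		}
	}()

	// Reader: simulates endpoint readiness checks.
	for i := 0; i < 1000; i++ {
		_ = g.isAlive()
	}

	wg.Wait()
	fmt.Println("final liveness:", g.isAlive())
}
```

For a single independent bool, an atomic avoids lock bookkeeping entirely; a mutex would only become necessary if liveness had to change together with other fields.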

Development

Successfully merging this pull request may close these issues.

Data race in service-mirror RemoteClusterServiceWatcher

4 participants