feat: resilient Watch RPC with reconnect and resourceVersion tracking #7188
Open
AdilFayyaz wants to merge 1 commit into adil/apps-app-service from
Conversation
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
pingsutw reviewed Apr 9, 2026
Comment on lines +239 to +245
```go
if resourceVersion != "" {
	opts = append(opts, &client.ListOptions{
		Raw: &metav1.ListOptions{
			ResourceVersion:     resourceVersion,
			AllowWatchBookmarks: true,
		},
	})
```
Member
Don't we still need to request bookmarks on the first call, before any resourceVersion has been observed?
Comment on lines +276 to +283
```go
delay := state.nextBackoff()
logger.Warnf(ctx, "KService watch in namespace %s closed unexpectedly (attempt %d); reconnecting in %v",
	ns, state.consecutiveErrors, delay)

select {
case <-ctx.Done():
	return
case <-time.After(delay):
```
Member
Do we need backoff for routine Kubernetes watch timeouts? After the app service has been running for an hour, it would always wait the full 30-second cap on every ~5-minute watch timeout. I think we only need backoff for other errors.
```go
	return true
}

c.updateResourceVersion(event, state)
```
Member
Should we update the resource version only after we successfully send the response? That way drainWatcher will process the same item on the next loop if sending the event failed in the previous one.
Tracking issue
Depends on: #7176, #7175, #7166
Why are the changes needed?
K8s watches time out every ~5 minutes by default. The previous implementation closed the client's stream silently on disconnect with no reconnect, making Watch unreliable for long-lived connections.
What changes were proposed in this pull request?
How was this patch tested?
- channel close reconnect
- Error event reconnect
- Bookmark RV propagation to the next watch call
- exponential backoff timing
- ctx cancel stops the goroutine
- initial watch error surfaces synchronously
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link