
feat: resilient Watch RPC with reconnect and resourceVersion tracking#7188

Open
AdilFayyaz wants to merge 1 commit into adil/apps-app-service from adil/apps-watch-reconnect

Conversation

@AdilFayyaz commented Apr 9, 2026

Tracking issue

Depends on: #7176, #7175, #7166

Why are the changes needed?

K8s watches time out every ~5 minutes by default. The previous implementation closed the client's stream silently on disconnect with no reconnect, making Watch unreliable for long-lived connections.

What changes were proposed in this pull request?

  • Replaced the single-shot watch goroutine in AppK8sClient.Watch() with a reconnect loop (watchLoop + drainWatcher) that transparently reopens the K8s watch on unexpected closes or Error events
  • Added resourceVersion tracking — extracted from every Added/Modified/Deleted/Bookmark event and passed to the next watch call, ensuring no events are missed or replayed across reconnects
  • Added exponential backoff (1s → 2s → 4s → … capped at 30s) between reconnect attempts; backoff resets on any successful event or Bookmark
  • K8s Error events are now logged with code/reason/message instead of being silently dropped

How was this patch tested?

  • go test ./app/internal/k8s/... -run TestWatch — 6 new tests covering:
    channel close reconnect, Error event reconnect, Bookmark RV propagation to next watch call, exponential backoff timing, ctx cancel stops the goroutine, initial watch error surfaces synchronously
  • go test ./app/... — full suite passes with no regressions

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Comment on lines +239 to +245
if resourceVersion != "" {
    opts = append(opts, &client.ListOptions{
        Raw: &metav1.ListOptions{
            ResourceVersion:     resourceVersion,
            AllowWatchBookmarks: true,
        },
    })
Member
Don't we still need to bookmark at first call?
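One way to address this question is to request bookmarks unconditionally and only pin the resourceVersion once one has been observed. The sketch below models only the two relevant `metav1.ListOptions` fields with a local struct; it is a suggestion in the spirit of the comment, not the PR's actual fix:

```go
package main

import "fmt"

// listOptions stands in for metav1.ListOptions; only the two fields
// relevant to this review thread are modeled.
type listOptions struct {
	ResourceVersion     string
	AllowWatchBookmarks bool
}

// buildWatchOptions always asks the server for Bookmark events — the
// first call included — and only pins ResourceVersion once one exists.
func buildWatchOptions(resourceVersion string) listOptions {
	opts := listOptions{AllowWatchBookmarks: true} // needed on the first call too
	if resourceVersion != "" {
		opts.ResourceVersion = resourceVersion
	}
	return opts
}

func main() {
	fmt.Printf("%+v\n", buildWatchOptions(""))
	fmt.Printf("%+v\n", buildWatchOptions("12345"))
}
```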

Comment on lines +276 to +283
delay := state.nextBackoff()
logger.Warnf(ctx, "KService watch in namespace %s closed unexpectedly (attempt %d); reconnecting in %v",
    ns, state.consecutiveErrors, delay)

select {
case <-ctx.Done():
    return
case <-time.After(delay):
Member
@pingsutw pingsutw Apr 9, 2026

Do we need backoff for Kubernetes watch timeouts? After the app service has been running for 1 hour, it always waits the full 30 seconds (min(30, 2^(60/5)) s) every 5 minutes. I think we only need backoff for other errors

    return true
}

c.updateResourceVersion(event, state)
Member
Should we update the resource version only after we successfully send the response? That way drainWatcher will reprocess the same item in the next loop if it failed to send the event in the previous loop.
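The ordering this comment proposes — commit the resourceVersion only once the event is actually delivered downstream — can be sketched with a plain channel. `event`, `watchState`, and `forward` are stand-ins for the watch.Event forwarding path, not the PR's code:

```go
package main

import "fmt"

// event is a stand-in for watch.Event; only the resourceVersion matters here.
type event struct{ resourceVersion string }

// watchState tracks the last resourceVersion that was delivered.
type watchState struct{ lastRV string }

// forward tries a non-blocking send and commits the RV only on success,
// so a failed send leaves lastRV at the previous delivered event and a
// reconnect from lastRV replays the undelivered one.
func forward(ev event, out chan<- event, state *watchState) bool {
	select {
	case out <- ev:
		state.lastRV = ev.resourceVersion // commit only after successful send
		return true
	default:
		return false // client not ready; RV unchanged, event will be replayed
	}
}

func main() {
	out := make(chan event, 1)
	state := &watchState{}
	fmt.Println(forward(event{"100"}, out, state), state.lastRV) // true 100
	fmt.Println(forward(event{"101"}, out, state), state.lastRV) // false 100
}
```

(The real drainWatcher presumably blocks on send with a ctx cancel case rather than dropping events; the non-blocking send here just keeps the ordering point visible.)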


Labels

flyte2

2 participants