fix(server): make WatchList=false effective and stop pre-closed watch retry loops#330

Open
matthyx wants to merge 1 commit into main from fix/318-effective-feature-gates-idle-watch

Conversation

@matthyx

@matthyx matthyx commented Jun 5, 2026

Contributor

Summary

Fixes the Rancher cattle-agent tight retry loop reported in #318, which the mitigation in #321 could not resolve (confirmed by the reporter). Two independent server defects were found and fixed:

1. The WatchList=false feature-gate override was a silent no-op

ComponentGlobalsRegistry.Set() — the only bridge propagating the parsed feature-gates flag into utilfeature.DefaultMutableFeatureGate — ran before flags.Set("feature-gates", ...) populated the flag value, and was never re-invoked. The WatchList gate therefore silently stayed at its default (enabled), so the server accepted watch?sendInitialEvents=true requests (HTTP 200) it could never complete: the file-based storage has no Cacher and never emits the terminal initial-events-end BOOKMARK. This matches the reporter's awaiting required bookmark event for initial events stream log lines.

The calls are now reordered so the override is effective at request time. The apiserver then rejects sendInitialEvents watches pre-stream (HTTP 422), and WatchList-enabled clients (client-go ≥ v0.35 ships WatchListClient Beta default-on; Rancher uses v0.35.1) fall back gracefully to legacy list+watch.
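
The ordering problem can be illustrated with a toy model. This is a minimal sketch, not the project's code: the `registry` type, its `flagValue` field, and the `buggyOrder` / `fixedOrder` helpers are hypothetical stand-ins for ComponentGlobalsRegistry and the feature-gates flag. It assumes only what is described above, namely that Set() copies whatever the flag holds at call time into the live gates.

```go
package main

import (
	"fmt"
	"strings"
)

// registry is a hypothetical stand-in for the bridge between the parsed
// --feature-gates flag and the live, process-wide gate map.
type registry struct {
	flagValue string          // raw flag value, e.g. "WatchList=false"
	live      map[string]bool // effective gates consulted at request time
}

func newRegistry() *registry {
	// WatchList starts at its default (enabled), as in the PR description.
	return &registry{live: map[string]bool{"WatchList": true}}
}

// Set copies whatever the flag holds right now into the live gates.
// If the flag is still empty, nothing changes.
func (r *registry) Set() {
	for _, kv := range strings.Split(r.flagValue, ",") {
		if name, val, ok := strings.Cut(kv, "="); ok {
			r.live[name] = val == "true"
		}
	}
}

// buggyOrder calls Set before the flag is populated: the override is lost.
func buggyOrder() bool {
	r := newRegistry()
	r.Set()
	r.flagValue = "WatchList=false"
	return r.live["WatchList"]
}

// fixedOrder populates the flag first, then calls Set: the override applies.
func fixedOrder() bool {
	r := newRegistry()
	r.flagValue = "WatchList=false"
	r.Set()
	return r.live["WatchList"]
}

func main() {
	fmt.Println("buggy order, WatchList enabled:", buggyOrder())
	fmt.Println("fixed order, WatchList enabled:", fixedOrder())
}
```

In the buggy order the gate stays enabled even though the flag later reads WatchList=false, which is the silent no-op described above.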

Note: ServerSideApply is GA/non-gated in Kubernetes 1.35; #321's ServerSideApply=false token was always inert and is removed to prevent a boot-time unrecognized feature gate error now that flag ordering is fixed. There is no SSA behavior change in this PR. A negative test guards against reintroducing stale gate tokens.

2. Pre-closed watch channels caused "very short watch" tight loops

watch.NewEmptyWatch() returns a pre-closed channel. A reflector reading it observes 0 events in <1s → VeryShortWatchError → immediate re-watch → tight loop, in both legacy and WatchList modes — which is why disabling WatchList alone could never fully fix #318. Two sites were affected:

  • namespaced watch keys in StorageImpl.Watch
  • immutableStorage.Watch (ConfigurationScanSummary, VulnerabilitySummary, GeneratedNetworkPolicy)
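
The loop condition can be modeled in a few lines. This is a simplified sketch based only on the rule stated above (0 events in under one second); the `Event` type and the `preClosedWatch` and `isVeryShort` helpers are hypothetical, and the real check lives in client-go's reflector.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified stand-in for a watch event.
type Event struct{}

// preClosedWatch mimics the behavior described for watch.NewEmptyWatch():
// the result channel is already closed when the caller receives it.
func preClosedWatch() <-chan Event {
	ch := make(chan Event)
	close(ch)
	return ch
}

// isVeryShort drains the channel until it closes and applies the rule:
// zero events in under one second counts as a "very short watch".
func isVeryShort(ch <-chan Event) bool {
	start := time.Now()
	events := 0
	for range ch {
		events++
	}
	return events == 0 && time.Since(start) < time.Second
}

func main() {
	retries := 0
	for i := 0; i < 5; i++ {
		if isVeryShort(preClosedWatch()) {
			retries++ // a reflector would immediately re-watch here
		}
	}
	fmt.Println("very short watches out of 5 attempts:", retries)
}
```

Because the channel is closed before the reader ever sees it, every attempt ends at once with no events, so nothing slows the retry cycle down.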

Both now return idleWatch: a zero-goroutine watch.Interface that stays open and event-free until client disconnect or Stop() (closure driven by context.AfterFunc + sync.Once). Reflectors hold a quiet open connection instead of tight-looping. The real watchDispatcher path for cluster-scoped resources is untouched.
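
A minimal sketch of the pattern, assuming a simplified local `Event` type instead of the real watch.Interface from k8s.io/apimachinery. The `isOpen`, `lifecycle`, and `stopTwice` helpers are hypothetical and exist only to exercise the sketch; the PR's actual implementation may differ in detail.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Event is a simplified stand-in for watch.Event.
type Event struct{}

// idleWatch never emits events; its channel stays open until the context
// is cancelled or Stop is called.
type idleWatch struct {
	ch   chan Event
	once sync.Once
	stop func() bool // returned by context.AfterFunc; detaches the callback
}

func newIdleWatch(ctx context.Context) *idleWatch {
	w := &idleWatch{ch: make(chan Event)}
	// No goroutine is parked waiting on ctx.Done(); the callback only runs
	// (in its own goroutine) once ctx is cancelled.
	w.stop = context.AfterFunc(ctx, w.close)
	return w
}

// close is safe to call from both Stop and the AfterFunc callback.
func (w *idleWatch) close() { w.once.Do(func() { close(w.ch) }) }

func (w *idleWatch) Stop() {
	w.stop() // unregister the callback if it has not fired yet
	w.close()
}

func (w *idleWatch) ResultChan() <-chan Event { return w.ch }

// isOpen probes the channel without blocking.
func isOpen(w *idleWatch) bool {
	select {
	case _, ok := <-w.ResultChan():
		return ok
	default:
		return true
	}
}

// lifecycle reports the channel state before and after context cancellation.
func lifecycle() (openAtStart, openAfterCancel bool) {
	ctx, cancel := context.WithCancel(context.Background())
	w := newIdleWatch(ctx)
	openAtStart = isOpen(w)
	cancel()
	<-w.ResultChan() // blocks until the callback closes the channel
	return openAtStart, isOpen(w)
}

// stopTwice checks that a second Stop does not panic on a closed channel.
func stopTwice() bool {
	w := newIdleWatch(context.Background())
	w.Stop()
	w.Stop()
	return isOpen(w)
}

func main() {
	before, after := lifecycle()
	fmt.Println("open at start:", before, "open after cancel:", after)
	fmt.Println("open after double Stop:", stopTwice())
}
```

The sync.Once guard is what makes Stop-before-cancel, cancel-before-Stop, and double Stop all safe in this sketch, since each path ends in the same single close.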

Scope

This does not implement WatchList/sendInitialEvents semantics (#320) for the resources served by the real watch dispatcher — they still never emit initial-events-end bookmarks. Correctness relies on the now-effective WatchList=false producing the pre-stream 422 so clients never enter that path.
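
The reliance on the pre-stream rejection can be sketched as a decision table. This is an illustrative model only: `validateWatch` and `chooseMode` are hypothetical names, not the apiserver's ValidateListOptions or client-go's fallback code, and the status codes are the ones quoted in this description.

```go
package main

import (
	"fmt"
	"net/http"
)

// validateWatch models the pre-stream decision: a sendInitialEvents watch
// is rejected up front when the WatchList gate is off.
func validateWatch(sendInitialEvents, watchListEnabled bool) int {
	if sendInitialEvents && !watchListEnabled {
		return http.StatusUnprocessableEntity // 422
	}
	// With the gate on, the request is accepted. Per the description above,
	// this storage would then never send the closing bookmark.
	return http.StatusOK
}

// chooseMode models a client that tries streaming first and falls back to
// legacy list+watch when the server rejects the request before streaming.
func chooseMode(watchListEnabled bool) string {
	if validateWatch(true, watchListEnabled) == http.StatusUnprocessableEntity {
		return "list+watch"
	}
	return "watchlist-stream"
}

func main() {
	fmt.Println("gate off:", chooseMode(false))
	fmt.Println("gate on: ", chooseMode(true))
}
```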

Testing

  • Regression test proving the gate is effective after PersistentPreRunE (fails on pre-fix code) + boot-safety + unrecognized-token guard + skip contract
  • 422 pre-stream assertion at the exact handler decision point (ValidateListOptions)
  • Idle-watch tests: stays open with zero events, closes on ctx-cancel / Stop() / double-Stop() / Stop-before-cancel; N=200 concurrent watches add ~0 goroutines (goleak-verified)
  • Full suite green (make test — target added in this PR), go vet clean, -race -count=3 clean
  • Pending before merge: kind-cluster smoke with a client-go v0.35 reflector / Rancher to confirm fallback behavior end-to-end (a test image will be posted on #318, "WatchList and sendInitialEvents support in storage")

Refs #318, #320

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Improved watch behavior with idle watch support to prevent tight retry loops.
    • Enhanced feature-gate configuration handling.
  • Bug Fixes

    • Watches now properly manage lifecycle and context cancellation.
  • Tests

    • Added feature-gate validation and configuration tests.
    • Added idle watch behavior and performance tests.
  • Chores

    • Updated build configuration with test runner.
    • Added testing dependency.

… retry loops

Two fixes for the Rancher tight retry loop (#318), which PR #321 could not
resolve:

1. The feature-gates override never applied: ComponentGlobalsRegistry.Set()
   ran before flags.Set("feature-gates", ...) populated the flag value and was
   never re-invoked, so the WatchList gate silently stayed at its default
   (enabled). Reorder the calls so the override is effective at request time;
   the apiserver then rejects watch?sendInitialEvents=true pre-stream (422)
   and WatchList-enabled clients (client-go >= v0.35, Rancher) fall back to
   legacy list+watch instead of hanging while awaiting an initial-events-end
   bookmark that the file-based storage never sends.

   The ServerSideApply=false token is removed: ServerSideApply is GA and
   non-gated in Kubernetes 1.35, so the token was always inert and would now
   fail gate validation at boot ("unrecognized feature gate").

2. Pre-closed watch channels (watch.NewEmptyWatch) made reflectors observe
   0 events in <1s, triggering a "very short watch" tight retry loop in both
   legacy and streaming modes. Replace them with idleWatch, a zero-goroutine
   watch.Interface that stays open and event-free until client disconnect or
   Stop, in both affected sites: namespaced watch keys in StorageImpl.Watch
   and immutableStorage.Watch (ConfigurationScanSummary, VulnerabilitySummary,
   GeneratedNetworkPolicy).

Note: this does not implement WatchList semantics (#320) for the resources
served by the real watch dispatcher; correctness relies on the pre-stream 422
fallback.

Also adds a missing "make test" target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 5, 2026


Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements idle watch support to prevent "very short watch" retry loops when the WatchList feature gate is disabled, addressing rapid reconnection cycles that were destabilizing Rancher integration. Idle watches keep result channels open indefinitely until explicit cancellation or stop, replacing pre-closed empty watches that triggered immediate client retries.

Changes

WatchList Support via Idle Watch

| Layer / File(s) | Summary |
| --- | --- |
| Idle watch core implementation (pkg/registry/file/watch.go) | Introduces idleWatch that implements watch.Interface with a result channel that never emits events but stays open until context cancellation or Stop() is called, using context.AfterFunc and sync.Once for lifecycle management without spawning goroutines. |
| Storage integration with idle watches (pkg/registry/file/storage.go) | Updates StorageImpl.Watch and immutableStorage.Watch to return idle watches instead of pre-closed empty watches when rejecting namespaced watch requests, eliminating the "very short watch" retry loop. |
| Idle watch testing and goleak dependency (pkg/registry/file/idlewatch_test.go, go.mod) | Adds comprehensive test coverage for idle watch lifecycle (context cancellation, idempotent stop, concurrent behavior) and storage integration, with goroutine leak detection via go.uber.org/goleak v1.3.0. |
| WatchList feature gate configuration (pkg/cmd/server/start.go) | Configures server startup to disable WatchList feature gate by default: imports Kubernetes features module, updates PersistentPreRunE to set feature-gates=WatchList=false, calls ComponentGlobalsRegistry.Set() after flag population, and logs the effective gate state. |
| Feature gate configuration testing (pkg/cmd/server/start_test.go) | Adds test helper newTestOptions() for isolated test instances and three test cases validating that PersistentPreRunE correctly handles gate state preservation, disables WatchList to reject watch requests with sendInitialEvents, and rejects unrecognized feature gate tokens. |
| Build target for testing (Makefile) | Declares phony targets and adds test target to enable make test execution of go test ./.... |

Sequence Diagram

sequenceDiagram
  participant Client as Rancher Agent
  participant StorageWatch as StorageImpl.Watch
  participant IdleWatch as idleWatch
  participant Ctx as context.AfterFunc
  Client->>StorageWatch: Watch(key with namespace)
  StorageWatch->>IdleWatch: newIdleWatch(ctx)
  IdleWatch->>Ctx: Register cancellation handler
  Note over IdleWatch: Channel open, never emits
  Client->>IdleWatch: ResultChan()
  IdleWatch-->>Client: result channel (stays open)
  Note over Client,IdleWatch: No rapid close/retry cycle
  Client->>IdleWatch: context cancelled
  Ctx->>IdleWatch: close result channel
  IdleWatch-->>Client: channel closed, no events emitted

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • Implement WatchList / sendInitialEvents protocol support #320: The changes directly address the root cause of the "very short watch" error by replacing pre-closed watches with long-lived idle watches in StorageImpl.Watch and immutableStorage.Watch, eliminating the rapid retry loops described in that issue.

Possibly related PRs

  • kubescape/storage#298: Both modify pkg/cmd/server/start.go's NewCommandStartWardleServer PersistentPreRunE, changing ComponentGlobalsRegistry setup and feature-gates configuration.
  • kubescape/storage#321: Extends PersistentPreRunE logic to disable the WatchList feature gate, building on related feature-gates configuration changes.

Poem

🐰 A watch that never closes tight,
Keeps channels flowing, day and night,
No rapid loops to cause a fright,
Rancher's agents rest in peace, alright!
WatchList support shines so bright.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main fixes: making WatchList=false effective and replacing pre-closed watch retry loops with idle watches. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #318 by implementing WatchList/sendInitialEvents support with proper feature-gate handling and idle-watch behavior to prevent retry loops. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to address the linked issue: feature-gate ordering, idle-watch implementation, and storage layer modifications for WatchList compliance. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
Makefile (1)

13-15: ⚡ Quick win

Consider aligning test flags with README for CI reliability.

The README documents running tests with additional flags: go test -v -failfast -count=1 ./.... The current target omits -count=1; that flag disables test caching and so prevents false passes from cached results in CI. Adding -failfast also saves time by stopping on the first failure.

🔧 Proposed alignment with README test command
 test:
-	go test ./...
+	go test -v -failfast -count=1 ./...

As per coding guidelines, this aligns with the documented test execution in README.md:155-159.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 13 - 15, Update the Makefile test target to run the
documented test command instead of the bare "go test ./...": change the "test"
target invocation (currently using go test ./...) to use "go test -v -failfast
-count=1 ./..." so tests run verbosely, stop on first failure, and disable
caching for CI consistency.
pkg/cmd/server/start_test.go (1)

73-79: 🏗️ Heavy lift

Avoid relying on file order for process-global feature-gate state (pkg/cmd/server/start_test.go:73-79)

This test already documents that ordering is required: utilfeature.DefaultMutableFeatureGate is process-global, and the rejected ServerSideApply=false override would make later NewCommandStartWardleServer calls panic during re-validation. The repo’s CI and go test commands don’t appear to enable -shuffle, so the risk of breakage is low today, but future test additions or removals could still make the suite brittle. Save the feature-gate state before this test and restore it afterwards (or run the test in isolation) so that ordering becomes irrelevant.
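
The save-and-restore idea can be sketched generically. This is an illustration under stated assumptions, not code against the real feature-gate API: `liveGates`, `snapshotGates`, and `mutatingTest` are hypothetical names, and a plain map stands in for the process-global gate.

```go
package main

import (
	"fmt"
	"maps"
)

// liveGates is a hypothetical stand-in for process-global gate state.
var liveGates = map[string]bool{"WatchList": true}

// snapshotGates copies the current state and returns a func that restores it.
func snapshotGates() (restore func()) {
	saved := maps.Clone(liveGates)
	return func() { liveGates = saved }
}

// mutatingTest plays the role of a test that changes global state.
// The deferred restore runs even if the body fails partway through.
func mutatingTest() {
	defer snapshotGates()()
	liveGates["WatchList"] = false
	liveGates["Bogus"] = true
}

func main() {
	mutatingTest()
	fmt.Println("WatchList after test:", liveGates["WatchList"], "entries:", len(liveGates))
}
```

With the restore deferred at the top of the test, later tests see the same gate state regardless of execution order.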

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cmd/server/start_test.go` around lines 73 - 79, The test
TestPersistentPreRunERejectsUnknownGate mutates the process-global
utilfeature.DefaultMutableFeatureGate which makes test order-dependent; before
creating/setting flags on the command (or immediately after creating cmd via
NewCommandStartWardleServer) capture the current feature-gate state from
utilfeature.DefaultMutableFeatureGate (e.g., copy its internal map or export a
snapshot), then run the existing Set/ PersistentPreRunE assertions, and finally
restore the saved state (use defer to ensure restoration) so
utilfeature.DefaultMutableFeatureGate is exactly reset after the test; reference
TestPersistentPreRunERejectsUnknownGate, NewCommandStartWardleServer and
utilfeature.DefaultMutableFeatureGate when locating where to snapshot and
restore.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5302860-fd9d-4e32-9ace-0115db6fe53f

📥 Commits

Reviewing files that changed from the base of the PR and between fd58401 and 1176394.

📒 Files selected for processing (7)
  • Makefile
  • go.mod
  • pkg/cmd/server/start.go
  • pkg/cmd/server/start_test.go
  • pkg/registry/file/idlewatch_test.go
  • pkg/registry/file/storage.go
  • pkg/registry/file/watch.go

@github-actions

github-actions Bot commented Jun 5, 2026


Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

@matthyx

matthyx commented Jun 5, 2026

Contributor Author

Review — no blockers found ✅

Checked out fix/318-effective-feature-gates-idle-watch and verified both fixes locally (Go 1.25): targeted tests pass, idle-watch tests clean under -race -count=3, go vet clean.

Feature-gate ordering (start.go)

Reordering flags.Set("feature-gates", "WatchList=false") before ComponentGlobalsRegistry.Set() is the correct fix — Set() is the only bridge that copies the parsed flag into the live gates, so it has to run after the flag is populated. Dropping the inert ServerSideApply=false token (GA/non-gated since 1.35) is sound and well-guarded by the negative unrecognized feature gate test. The 422 assertion exercises the real ValidateListOptions decision point. 👍

idleWatch (watch.go, storage.go)

Implementation is correct and the concurrency is safe:

  • close() guarded by sync.Once; Stop() is idempotent and safe before/after/concurrent with ctx cancellation.
  • w.stop is written once in newIdleWatch before the pointer escapes, so no race with the AfterFunc goroutine.
  • Zero goroutines at steady state (verified by the goleak/NumGoroutine test and by -race).

Both pre-closed NewEmptyWatch() sites are converted; no others remain. No behavioral regression — namespaced and immutable watches never delivered events before either, they just no longer pre-close into a "very short watch" loop.

Non-blocking nits (already raised by CodeRabbit, optional)

  1. The test suite relies on process-global DefaultMutableFeatureGate state and source ordering (TestPersistentPreRunERejectsUnknownGate must stay last). It's documented, and CI doesn't use -shuffle, so it's fine today — but a save/restore (snapshot + defer) of the gate state would make it robust to -shuffle=on and future test additions.
  2. The new make test target could match the README's go test -v -failfast -count=1 ./... to avoid cached-result false passes in CI.

Neither blocks merge. Pending kind/Rancher smoke (already noted in the PR description) is the right last gate before release.

Labels

None yet

Projects

Status: WIP

Development

Successfully merging this pull request may close these issues.

WatchList and sendInitialEvents support in storage

1 participant