fix(server): make WatchList=false effective and stop pre-closed watch retry loops#330
fix(server): make WatchList=false effective and stop pre-closed watch retry loops#330matthyx wants to merge 1 commit into
Conversation
… retry loops Two fixes for the Rancher tight retry loop (#318), which PR #321 could not resolve: 1. The feature-gates override never applied: ComponentGlobalsRegistry.Set() ran before flags.Set("feature-gates", ...) populated the flag value and was never re-invoked, so the WatchList gate silently stayed at its default (enabled). Reorder the calls so the override is effective at request time; the apiserver then rejects watch?sendInitialEvents=true pre-stream (422) and WatchList-enabled clients (client-go >= v0.35, Rancher) fall back to legacy list+watch instead of hanging while awaiting an initial-events-end bookmark that the file-based storage never sends. The ServerSideApply=false token is removed: ServerSideApply is GA and non-gated in Kubernetes 1.35, so the token was always inert and would now fail gate validation at boot ("unrecognized feature gate"). 2. Pre-closed watch channels (watch.NewEmptyWatch) made reflectors observe 0 events in <1s, triggering a "very short watch" tight retry loop in both legacy and streaming modes. Replace them with idleWatch, a zero-goroutine watch.Interface that stays open and event-free until client disconnect or Stop, in both affected sites: namespaced watch keys in StorageImpl.Watch and immutableStorage.Watch (ConfigurationScanSummary, VulnerabilitySummary, GeneratedNetworkPolicy). Note: this does not implement WatchList semantics (#320) for the resources served by the real watch dispatcher; correctness relies on the pre-stream 422 fallback. Also adds a missing "make test" target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
📝 WalkthroughWalkthroughThis PR implements idle watch support to prevent "very short watch" retry loops when the WatchList feature gate is disabled, addressing rapid reconnection cycles that were destabilizing Rancher integration. Idle watches keep result channels open indefinitely until explicit cancellation or stop, replacing pre-closed empty watches that triggered immediate client retries. ChangesWatchList Support via Idle Watch
Sequence DiagramsequenceDiagram
participant Client as Rancher Agent
participant StorageWatch as StorageImpl.Watch
participant IdleWatch as idleWatch
participant Ctx as context.AfterFunc
Client->>StorageWatch: Watch(key with namespace)
StorageWatch->>IdleWatch: newIdleWatch(ctx)
IdleWatch->>Ctx: Register cancellation handler
Note over IdleWatch: Channel open, never emits
Client->>IdleWatch: ResultChan()
IdleWatch-->>Client: result channel (stays open)
Note over Client,IdleWatch: No rapid close/retry cycle
Client->>IdleWatch: context cancelled
Ctx->>IdleWatch: close result channel
IdleWatch-->>Client: channel closed, no events emitted
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
Makefile (1)
13-15: ⚡ Quick winConsider aligning test flags with README for CI reliability.
The README documents running tests with additional flags:
go test -v -failfast -count=1 ./.... The current target omits-count=1, which disables test caching and prevents false passes from cached results in CI. Adding-failfastalso saves time by stopping on first failure.🔧 Proposed alignment with README test command
test: - go test ./... + go test -v -failfast -count=1 ./...As per coding guidelines, this aligns with the documented test execution in README.md:155-159.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 13 - 15, Update the Makefile test target to run the documented test command instead of the bare "go test ./...": change the "test" target invocation (currently using go test ./...) to use "go test -v -failfast -count=1 ./..." so tests run verbosely, stop on first failure, and disable caching for CI consistency.pkg/cmd/server/start_test.go (1)
73-79: 🏗️ Heavy liftAvoid relying on file order for process-global feature-gate state (pkg/cmd/server/start_test.go:73-79)
This test already documents that ordering is required because
utilfeature.DefaultMutableFeatureGateis process-global and the rejectedServerSideApply=falseoverride would make laterNewCommandStartWardleServercalls panic during re-validation. The repo’s CI/go test commands don’t appear to enable-shuffle, so breakage risk is lower today—but future test additions/removal could still make it brittle. Restore/save-and-reapply the feature-gate state after this test (or run it in isolation) so ordering is irrelevant.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cmd/server/start_test.go` around lines 73 - 79, The test TestPersistentPreRunERejectsUnknownGate mutates the process-global utilfeature.DefaultMutableFeatureGate which makes test order-dependent; before creating/setting flags on the command (or immediately after creating cmd via NewCommandStartWardleServer) capture the current feature-gate state from utilfeature.DefaultMutableFeatureGate (e.g., copy its internal map or export a snapshot), then run the existing Set/ PersistentPreRunE assertions, and finally restore the saved state (use defer to ensure restoration) so utilfeature.DefaultMutableFeatureGate is exactly reset after the test; reference TestPersistentPreRunERejectsUnknownGate, NewCommandStartWardleServer and utilfeature.DefaultMutableFeatureGate when locating where to snapshot and restore.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@Makefile`:
- Around line 13-15: Update the Makefile test target to run the documented test
command instead of the bare "go test ./...": change the "test" target invocation
(currently using go test ./...) to use "go test -v -failfast -count=1 ./..." so
tests run verbosely, stop on first failure, and disable caching for CI
consistency.
In `@pkg/cmd/server/start_test.go`:
- Around line 73-79: The test TestPersistentPreRunERejectsUnknownGate mutates
the process-global utilfeature.DefaultMutableFeatureGate which makes test
order-dependent; before creating/setting flags on the command (or immediately
after creating cmd via NewCommandStartWardleServer) capture the current
feature-gate state from utilfeature.DefaultMutableFeatureGate (e.g., copy its
internal map or export a snapshot), then run the existing Set/ PersistentPreRunE
assertions, and finally restore the saved state (use defer to ensure
restoration) so utilfeature.DefaultMutableFeatureGate is exactly reset after the
test; reference TestPersistentPreRunERejectsUnknownGate,
NewCommandStartWardleServer and utilfeature.DefaultMutableFeatureGate when
locating where to snapshot and restore.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5302860-fd9d-4e32-9ace-0115db6fe53f
📒 Files selected for processing (7)
Makefilego.modpkg/cmd/server/start.gopkg/cmd/server/start_test.gopkg/registry/file/idlewatch_test.gopkg/registry/file/storage.gopkg/registry/file/watch.go
|
Summary:
|
Review — no blockers found ✅Checked out Feature-gate ordering (
|
Summary
Fixes the Rancher cattle-agent tight retry loop reported in #318, which the mitigation in #321 could not resolve (confirmed by the reporter). Two independent server defects were found and fixed:
1. The
WatchList=falsefeature-gate override was a silent no-opComponentGlobalsRegistry.Set()— the only bridge propagating the parsedfeature-gatesflag intoutilfeature.DefaultMutableFeatureGate— ran beforeflags.Set("feature-gates", ...)populated the flag value, and was never re-invoked. The WatchList gate therefore silently stayed at its default (enabled), so the server acceptedwatch?sendInitialEvents=truerequests (HTTP 200) it could never complete: the file-based storage has no Cacher and never emits the terminalinitial-events-endBOOKMARK. This matches the reporter'sawaiting required bookmark event for initial events streamlog lines.The calls are now reordered so the override is effective at request time. The apiserver then rejects
sendInitialEventswatches pre-stream (HTTP 422), and WatchList-enabled clients (client-go ≥ v0.35 shipsWatchListClientBeta default-on; Rancher uses v0.35.1) fall back gracefully to legacy list+watch.2. Pre-closed watch channels caused "very short watch" tight loops
watch.NewEmptyWatch()returns a pre-closed channel. A reflector reading it observes 0 events in <1s →VeryShortWatchError→ immediate re-watch → tight loop, in both legacy and WatchList modes — which is why disabling WatchList alone could never fully fix #318. Two sites were affected:StorageImpl.WatchimmutableStorage.Watch(ConfigurationScanSummary, VulnerabilitySummary, GeneratedNetworkPolicy)Both now return
idleWatch: a zero-goroutinewatch.Interfacethat stays open and event-free until client disconnect orStop()(closure driven bycontext.AfterFunc+sync.Once). Reflectors hold a quiet open connection instead of tight-looping. The realwatchDispatcherpath for cluster-scoped resources is untouched.Scope
This does not implement WatchList/
sendInitialEventssemantics (#320) for the resources served by the real watch dispatcher — they still never emitinitial-events-endbookmarks. Correctness relies on the now-effectiveWatchList=falseproducing the pre-stream 422 so clients never enter that path.Testing
PersistentPreRunE(fails on pre-fix code) + boot-safety + unrecognized-token guard +skipcontractValidateListOptions)Stop()/ double-Stop()/ Stop-before-cancel; N=200 concurrent watches add ~0 goroutines (goleak-verified)make test— target added in this PR),go vetclean,-race -count=3cleanWatchListandsendInitialEventssupport in storage #318)Refs #318, #320
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores