Add DNS retry intervals#4081
Open
comebackto2021 wants to merge 60 commits intoSagerNet:testingfrom
Open
Conversation
`SecTrustEvaluateWithError` is serial
This reverts commit 62cb06c.
Serialize probe rounds in startProber to eliminate unbounded fan-out of fire-and-forget probe goroutines (up to 100/sec per direction), and close HTTP/3 transports via transport.Close() in addition to CloseIdleConnections.
8f13330 to
85a08e5
Compare
Member
|
Considering that Linux libc, Darwin mDNSResponder, Windows Dnscache all actually retry requests, adding it repeatedly to the sing-box doesn't seem like a good idea. Although the sing-box DNS transport does lack a unified recovery mechanism, I think what we need is not short timeout and retry. |
1b0e6c5 to
abedea4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a
dns.retry_intervalsfield — a list of per-attempt DNS querytimeouts. When set, each list entry is the deadline for one attempt; the
list length is the attempt count. Total elapsed time is bounded by
dns.timeout(#4079).This builds on #4079 and follows the same pattern used by Linux glibc,
Unbound, BIND, dnsmasq, and the Microsoft Windows DNS client — all of which
expose per-attempt timeouts rather than a single deadline.
Motivation
On unstable mobile networks (carrier-grade NAT with aggressive timeouts,
DPI middleboxes on TCP/53), plain TCP DNS sockets become "zombies" after
long idle periods — bytes are silently dropped while the kernel still
considers the socket valid. The full timeout becomes user-visible latency
on every cold query.
A single, longer total timeout (already configurable since #4079) doesn't
help: it just makes the wait longer. The fix is to retry with a fresh
connection. Microsoft Windows DNS client uses the schedule
1s / 2s / 4s / 8s / 2s— short first attempts catch transient packetloss; later attempts give a slow resolver more room.
Industry comparison:
RES_TIMEOUT+attemptsattempts+timeoutChanges
option/dns.go: addRetryIntervals badoption.Listable[badoption.Duration]to
DNSClientOptions. TheListable[T]shape lets users write either"retry_intervals": "1s"or"retry_intervals": ["1s", "2s"].dns/client.go: add a newexchangeToTransportRetrymethod (used on theforeground exchange path when the schedule is non-empty) plus an
isRetriablehelper. The existingexchangeToTransportis leftuntouched —
backgroundRefreshDNSand the empty-schedule case still gothrough it, returning raw transport errors with no wrap (strict
byte-identical backward compatibility).
dns/router.go: convert[]badoption.Duration→[]time.Duration,validate (non-positive durations rejected with a clear
dns: retry_intervals[N]: must be positiveerror), pass through toNewClient.docs/configuration/dns/index.md+index.zh.md: document the new field.Design notes
dns.timeoutis the outer cap on total elapsed time;each
retry_intervalsentry is the per-attempt deadline. They areorthogonal — the outer 10s
cancelerinprotocol/dns/handle.go:86already imposes such a cap upstream of the client, so this is consistent
with existing semantics.
retry loop.
backgroundRefreshDNSkeeps calling the legacyexchangeToTransportdirectly, so concurrent foreground retries do notreset transports out from under in-flight background queries.
transport.Reset()is best-effort — meaningful for UDP'sConnPoolSingle(the zombie-socket case), no-op for TCP/Local/Hosts/Fakeip, redundant for HTTPS/QUIC (which already self-heal internally).
Wrapped in
defer recover()for panic safety against third-partytransports.
RcodeError(NXDOMAIN, SERVFAIL,REFUSED) ends the loop immediately — never amplify load against a server
that already answered. Parent context cancellation surfaces directly.
Non-retriable errors (TLS handshake, "connection refused", IPv6
unreachable) also exit immediately.
isRetriableonly catchescontext.DeadlineExceeded,net.ErrClosed,syscall.ECONNRESET, andnetErr.Timeout().Backward compatibility
retry_intervalsdefaults to nil. When nil or empty, the foreground pathcalls the unchanged
exchangeToTransportand returns raw transport errorswithout any
E.Causewrap — observable behaviour byte-identical to today.No migration needed.
Example
{ "dns": { "timeout": "10s", "retry_intervals": ["500ms", "1s", "2s", "4s"] } }A stuck DNS socket recovers within 500ms (first retry) instead of 10s.
Verification
go build ./...— passesgo vet ./option/... ./dns/...— cleango test -count=1 ./option/... ./dns/...— all passdns/client_retry_test.go: success-on-retry, allattempts timeout, RCODE early-exit, non-retriable early-exit, parent
context cancel, empty-schedule legacy path (verifies no
E.Causewrapfor backward compat), single-entry, outer-cap truncation, zombie-socket
recovery, background-refresh-no-reset, concurrent foreground retries.
dns/client_retry_test.go: zero and negativedurations rejected with clear error path.
option/dns_test.go: list, single-stringListable form, empty list, absent, null, invalid duration.
sing-box checkvalidates the config schema and rejectsnon-positive durations with a clear error.