
Add DNS retry intervals #4081

Open
comebackto2021 wants to merge 60 commits into SagerNet:testing from comebackto2021:feat-dns-retry-intervals

Conversation

@comebackto2021

Summary

Adds a dns.retry_intervals field — a list of per-attempt DNS query
timeouts. When set, each list entry is the deadline for one attempt; the
list length is the attempt count. Total elapsed time is bounded by
dns.timeout (#4079).

This builds on #4079 and follows the same pattern used by Linux glibc,
Unbound, BIND, dnsmasq, and the Microsoft Windows DNS client — all of which
expose per-attempt timeouts rather than a single deadline.

Motivation

On unstable mobile networks (carrier-grade NAT with aggressive timeouts,
DPI middleboxes on TCP/53), plain TCP DNS sockets become "zombies" after
long idle periods — bytes are silently dropped while the kernel still
considers the socket valid. The full timeout becomes user-visible latency
on every cold query.

A single, longer total timeout (already configurable since #4079) doesn't
help: it just makes the wait longer. The fix is to retry with a fresh
connection. Microsoft Windows DNS client uses the schedule
1s / 2s / 4s / 8s / 2s — short first attempts catch transient packet
loss; later attempts give a slow resolver more room.

Industry comparison:

| System | Default |
| --- | --- |
| Microsoft Windows DNS client | 5 attempts: 1s/2s/4s/8s/2s in 10s budget |
| Microsoft Windows Server forwarder | 3s per attempt |
| Linux glibc (RES_TIMEOUT + attempts) | 5s × 2-3 attempts |
| Unbound / BIND / dnsmasq | configurable attempts + timeout |
| sing-box (today) | single attempt, 10s, no per-attempt control |

Changes

  • option/dns.go: add RetryIntervals badoption.Listable[badoption.Duration]
    to DNSClientOptions. The Listable[T] shape lets users write either
    "retry_intervals": "1s" or "retry_intervals": ["1s", "2s"].
  • dns/client.go: add a new exchangeToTransportRetry method (used on the
    foreground exchange path when the schedule is non-empty) plus an
    isRetriable helper. The existing exchangeToTransport is left
    untouched: backgroundRefreshDNS and the empty-schedule case still go
    through it, returning raw transport errors with no wrap (strict
    byte-identical backward compatibility).
  • dns/router.go: convert []badoption.Duration to []time.Duration,
    validate (non-positive durations are rejected with a clear
    "dns: retry_intervals[N]: must be positive" error), and pass the
    result through to NewClient.
  • docs/configuration/dns/index.md + index.zh.md: document the new field.
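
The conversion and validation step described for dns/router.go can be sketched roughly as follows. This is a minimal illustration, not the PR's code: parseRetryIntervals is a hypothetical name, and time.ParseDuration stands in for badoption.Duration so the example stays self-contained. Only the error text for non-positive entries is taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// parseRetryIntervals is a hypothetical stand-in for the conversion step in
// dns/router.go. It turns raw config strings into []time.Duration and rejects
// non-positive entries. time.ParseDuration replaces badoption.Duration here.
func parseRetryIntervals(raw []string) ([]time.Duration, error) {
	if len(raw) == 0 {
		return nil, nil // nil schedule: the caller keeps the single-attempt path
	}
	schedule := make([]time.Duration, 0, len(raw))
	for i, s := range raw {
		d, err := time.ParseDuration(s)
		if err != nil {
			return nil, fmt.Errorf("dns: retry_intervals[%d]: %w", i, err)
		}
		if d <= 0 {
			return nil, fmt.Errorf("dns: retry_intervals[%d]: must be positive", i)
		}
		schedule = append(schedule, d)
	}
	return schedule, nil
}

func main() {
	fmt.Println(parseRetryIntervals([]string{"500ms", "1s", "2s", "4s"}))
	fmt.Println(parseRetryIntervals([]string{"1s", "0s"}))
	fmt.Println(parseRetryIntervals([]string{"1s", "-2s"}))
}
```

Returning a nil schedule for an empty list keeps the "unset" and "empty" cases indistinguishable to the caller, which matches the backward-compatibility goal of routing both through the legacy path.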

Design notes

  • Two-cap model. dns.timeout is the outer cap on total elapsed time;
    each retry_intervals entry is the per-attempt deadline. They are
    orthogonal — the outer 10s canceler in protocol/dns/handle.go:86
    already imposes such a cap upstream of the client, so this is consistent
    with existing semantics.
  • Foreground/background separation. Only the foreground path uses the
    retry loop. backgroundRefreshDNS keeps calling the legacy
    exchangeToTransport directly, so concurrent foreground retries do not
    reset transports out from under in-flight background queries.
  • transport.Reset() is best-effort — meaningful for UDP's
    ConnPoolSingle (the zombie-socket case), no-op for TCP/Local/Hosts/
    Fakeip, redundant for HTTPS/QUIC (which already self-heal internally).
    Wrapped in defer recover() for panic safety against third-party
    transports.
  • Stop conditions. Authoritative RcodeError (NXDOMAIN, SERVFAIL,
    REFUSED) ends the loop immediately — never amplify load against a server
    that already answered. Parent context cancellation surfaces directly.
    Non-retriable errors (TLS handshake, "connection refused", IPv6
    unreachable) also exit immediately. isRetriable only catches
    context.DeadlineExceeded, net.ErrClosed, syscall.ECONNRESET, and
    netErr.Timeout().
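
The stop conditions and the per-attempt deadlines above can be sketched as a small loop. This is an illustration under stated assumptions rather than the PR's exchangeToTransportRetry: attemptFunc, errRcode, exchangeWithRetry, safeReset, and simulate are hypothetical names, the attempt callback stands in for the transport exchange, and reset stands in for transport.Reset(). Only the four isRetriable checks are taken from the list above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// errRcode is a hypothetical stand-in for an authoritative RcodeError.
var errRcode = errors.New("rcode: NXDOMAIN")

// isRetriable reports whether err matches one of the four listed conditions.
func isRetriable(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) ||
		errors.Is(err, net.ErrClosed) ||
		errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// attemptFunc is a hypothetical stand-in for one exchange on the transport.
type attemptFunc func(ctx context.Context, n int) (string, error)

// exchangeWithRetry runs one attempt per schedule entry, each under its own
// deadline. The caller is assumed to use the legacy path for an empty schedule.
func exchangeWithRetry(ctx context.Context, schedule []time.Duration, attempt attemptFunc, reset func()) (string, error) {
	var lastErr error
	for n, d := range schedule {
		// A child context never outlives its parent, so the outer cap still applies.
		attemptCtx, cancel := context.WithTimeout(ctx, d)
		resp, err := attempt(attemptCtx, n)
		cancel()
		if err == nil {
			return resp, nil
		}
		if ctx.Err() != nil {
			return "", ctx.Err() // parent cancelled or outer cap reached
		}
		if errors.Is(err, errRcode) || !isRetriable(err) {
			return "", err // the server answered, or a retry cannot help
		}
		lastErr = err
		safeReset(reset)
	}
	return "", fmt.Errorf("all %d attempts failed: %w", len(schedule), lastErr)
}

// safeReset calls reset and swallows any panic it raises.
func safeReset(reset func()) {
	defer func() { _ = recover() }()
	reset()
}

// simulate drives the loop with a fake transport that hangs until the attempt
// deadline for the first `stuck` attempts and answers afterwards.
func simulate(stuck int) (resp string, resets int, err error) {
	schedule := []time.Duration{20 * time.Millisecond, 40 * time.Millisecond, 80 * time.Millisecond}
	attempt := func(ctx context.Context, n int) (string, error) {
		if n < stuck {
			<-ctx.Done() // zombie socket: nothing arrives before the deadline
			return "", ctx.Err()
		}
		return "answer", nil
	}
	resp, err = exchangeWithRetry(context.Background(), schedule, attempt, func() { resets++ })
	return resp, resets, err
}

func main() {
	fmt.Println(simulate(1)) // one stuck attempt, then success
	fmt.Println(simulate(3)) // every attempt times out
}
```

Checking the parent context before classifying the error keeps a cancelled caller from being reported as a retriable timeout, which is one way to make "parent context cancellation surfaces directly" hold.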

Backward compatibility

retry_intervals defaults to nil. When nil or empty, the foreground path
calls the unchanged exchangeToTransport and returns raw transport errors
without any E.Cause wrap — observable behaviour byte-identical to today.
No migration needed.

Example

{
  "dns": {
    "timeout": "10s",
    "retry_intervals": ["500ms", "1s", "2s", "4s"]
  }
}

With this schedule, a stuck DNS socket is abandoned after 500ms and the query
is retried on a fresh connection, instead of waiting out the full 10s.
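
To see how the two caps interact, the worst case (every attempt runs to its deadline) can be worked through with a small helper. effectiveDeadlines is a hypothetical function written for this illustration and is not part of the PR; it assumes an attempt is truncated to whatever remains of dns.timeout and that no further attempt starts once the budget is spent.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveDeadlines returns how long each attempt may run when every attempt
// uses its full interval and the total is capped by timeout.
// Hypothetical helper for illustration only.
func effectiveDeadlines(timeout time.Duration, schedule ...time.Duration) []time.Duration {
	var out []time.Duration
	remaining := timeout
	for _, d := range schedule {
		if remaining <= 0 {
			break // outer cap reached: later entries never run
		}
		if d > remaining {
			d = remaining // truncate the attempt to the remaining budget
		}
		out = append(out, d)
		remaining -= d
	}
	return out
}

func main() {
	s := time.Second
	// The example schedule sums to 7.5s, so it fits inside the 10s cap.
	fmt.Println(effectiveDeadlines(10*s, s/2, s, 2*s, 4*s))
	// A 1s/2s/4s/8s/2s schedule sums to 17s and is cut off by the cap.
	fmt.Println(effectiveDeadlines(10*s, s, 2*s, 4*s, 8*s, 2*s))
}
```

Under this model the example schedule runs in full, while the longer schedule has its fourth attempt truncated to 3s and its fifth entry is never reached.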

Verification

  • go build ./... — passes
  • go vet ./option/... ./dns/... — clean
  • go test -count=1 ./option/... ./dns/... — all pass
  • 11 retry subtests in dns/client_retry_test.go: success-on-retry, all
    attempts timeout, RCODE early-exit, non-retriable early-exit, parent
    context cancel, empty-schedule legacy path (verifies no E.Cause wrap
    for backward compat), single-entry, outer-cap truncation, zombie-socket
    recovery, background-refresh-no-reset, concurrent foreground retries.
  • 2 validation subtests in dns/client_retry_test.go: zero and negative
    durations rejected with clear error path.
  • 6 JSON unmarshal subtests in option/dns_test.go: list, single-string
    Listable form, empty list, absent, null, invalid duration.
  • Manual sing-box check validates the config schema and rejects
    non-positive durations with a clear error.
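
The JSON shapes covered by the unmarshal subtests (list, single string, empty list, absent, null, invalid duration) can be sketched with simplified types. The listable and duration types below are hypothetical stand-ins written for this example; they are not the badoption.Listable or badoption.Duration implementations, and time.ParseDuration is assumed as the duration parser.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// duration is a simplified stand-in for badoption.Duration: a JSON string
// parsed with time.ParseDuration.
type duration time.Duration

func (d *duration) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	v, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	*d = duration(v)
	return nil
}

// listable is a simplified stand-in for badoption.Listable[T]: it accepts
// either a JSON array of T or a single T.
type listable[T any] []T

func (l *listable[T]) UnmarshalJSON(data []byte) error {
	var many []T
	if err := json.Unmarshal(data, &many); err == nil {
		*l = many // array form, including [] and null
		return nil
	}
	var one T
	if err := json.Unmarshal(data, &one); err != nil {
		return err // neither a valid list nor a valid single value
	}
	*l = listable[T]{one}
	return nil
}

// dnsOptions is a hypothetical, reduced version of the options struct.
type dnsOptions struct {
	RetryIntervals listable[duration] `json:"retry_intervals,omitempty"`
}

// decode returns the number of parsed intervals and any unmarshal error.
func decode(doc string) (int, error) {
	var o dnsOptions
	err := json.Unmarshal([]byte(doc), &o)
	return len(o.RetryIntervals), err
}

func main() {
	for _, doc := range []string{
		`{"retry_intervals": ["1s", "2s"]}`,
		`{"retry_intervals": "1s"}`,
		`{"retry_intervals": []}`,
		`{}`,
		`{"retry_intervals": null}`,
		`{"retry_intervals": "soon"}`,
	} {
		n, err := decode(doc)
		fmt.Println(doc, "->", n, err)
	}
}
```

Trying the array form first and falling back to a single value keeps the two spellings equivalent for users, at the cost of reporting only the single-value error when both forms fail.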

@nekohasekai force-pushed the testing branch 3 times, most recently from 8f13330 to 85a08e5 on April 28, 2026 00:24
@nekohasekai
Member

Considering that Linux libc, Darwin mDNSResponder, and Windows Dnscache all already retry requests, adding another retry layer in sing-box doesn't seem like a good idea. Although the sing-box DNS transports do lack a unified recovery mechanism, I don't think short timeouts and retries are what we need.

@nekohasekai force-pushed the testing branch 2 times, most recently from 1b0e6c5 to abedea4 on April 28, 2026 07:12

2 participants