fix(snell): drain pending data before pool reuse#2830
Open
missuo wants to merge 1 commit into
Open
Conversation
When the local side of a relay closes first (typical for short HTTP/1 responses), the snell server may still have data queued in our TCP receive buffer — most importantly its own zero-chunk half-close. Returning the conn to the pool without draining surfaces that stale data on the next session's first read, breaking every reuse. Also cap each pooled TCP at 2 CONNECT sessions: Surge's snell-server (v5.0.1 observed) closes a reuse-mode TCP after the second session, so anything beyond the cap is a soon-to-be-dead socket. The outbound DialContext now retries a stale pool conn once so a write failure on a checked-out conn falls through to a fresh dial transparently. Repro against a real Surge snell-server v5.0.1: 6 sequential HTTP GETs through a reuse=true snell outbound. Before this change, 5/6 fail with EOF after the first session; after, 6/6 succeed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The reuse-mode snell client pool surfaces stale server data on every reuse, so against a real Surge snell-server (v5.0.1 observed) only the first session per pooled TCP succeeds and every subsequent one fails with EOF.
Three related changes:
Drain server data before pool reuse (
transport/snell/pool.go). When the local side closes first (typical for short HTTP/1 responses), the snell server may still have its reply tail + its own zero-chunk half-close queued in our TCP receive buffer. The previous `PoolConn.Close` just sent our zero chunk and put the conn back; the next session's first read then saw that leftover data. The new `drainPendingForReuse` reads frames until `shadowaead.ErrZeroChunk` is observed, with a 500 ms deadline. If drain times out or errors, the conn is closed instead of pooled.Cap each pooled TCP at 2 CONNECT sessions (`defaultMaxUsesPerConn`). Surge's snell-server closes a reuse-mode TCP after the second session, so without a cap the pool would hand the next caller a soon-to-be-dead socket. The cap is tracked per-conn via a new internal `pooledEntry` wrapper (no public API change beyond removing the never-called `Pool.Put`).
Retry once on stale pool conn (`adapter/outbound/snell.go`). Pooled conns can also die between put and re-get (e.g. server idle timeout). `DialContext` now retries the `pool.GetContext` + `writeHeaderContext` pair once — the second attempt either yields another idle conn or falls through to the pool factory and dials fresh.
Reproduction
Tested against a real Surge snell-server v5.0.1 with 6 sequential HTTP GETs through a `reuse=true` outbound (forced fresh `proxy.DialContext` per request via `DisableKeepAlives: true`):
Before this change:
```
[1] OK
[2] FAIL ... EOF
[3] FAIL ... EOF
[4] FAIL ... EOF
[5] FAIL ... EOF
[6] FAIL ... EOF
5/6 requests failed
```
After this change:
```
[1] OK dur=272ms
[2] OK dur=365ms
[3] OK dur=245ms
[4] OK dur=278ms
[5] OK dur=292ms
[6] OK dur=300ms
all 6 requests succeeded
```
The alternating timing matches the `maxUsesPerConn=2` cap: fresh dial (~290 ms) followed by pool hit (~250 ms).
Test plan