Skip to content

[RSDK-13303] Fix flaky TestRawClientOperation by draining stream before reading headers#6087

Open
viam-ci-org-reader[bot] wants to merge 1 commit into
mainfrom
claude/RSDK-13303
Open

[RSDK-13303] Fix flaky TestRawClientOperation by draining stream before reading headers#6087
viam-ci-org-reader[bot] wants to merge 1 commit into
mainfrom
claude/RSDK-13303

Conversation

@viam-ci-org-reader

Copy link
Copy Markdown

Summary

Fixes a flaky test failure in TestRawClientOperation where echoStreamClient.Header() returns context canceled.

  • Over WebRTC, the server-side EchoMultiple stream completes nearly instantly (sends one response and returns), which can cancel the stream context before the client calls Header() to retrieve response metadata
  • The fix drains the stream by calling Recv() before Header(), ensuring headers are received and cached before the stream context is canceled
  • The log line wrtc_server_stream.go:269: message received after EOS in the failure logs confirms the race between stream completion and header retrieval

Test plan

  • go test -v -run TestRawClientOperation -count=5 ./robot/web/ passes consistently
  • go vet ./robot/web/ passes

DRI

@ale7714 is the responsible engineer for this PR.

🤖 Generated with Claude Code

…re reading headers

Drain the EchoMultiple stream before calling Header(). Over WebRTC, the
server-side stream can complete and cancel the context before Header()
retrieves the response metadata, causing a "context canceled" error.
Receiving all messages first ensures headers have arrived and are cached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot requested a review from aldenh-viam June 5, 2026 03:41
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Jun 5, 2026

@aldenh-viam aldenh-viam left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this looked a little strange and asked Claude to dig deeper: this fix would likely work, bit might actually be covering up a race the WebRTC code (and behavioral difference with HTTP/2):

The WebRTC Header() is a different shape from HTTP/2 and has a real race here:

  // wrtc_client_stream.go:127
  func (s *webrtcClientStream) Header() (metadata.MD, error) {
      select {
      case <-s.headersReceived:
          return s.headers, nil
      default:
      }

      select {
      case <-s.ctx.Done():
          return nil, s.ctx.Err()
      case <-s.headersReceived:
          return s.headers, nil
      }
  }

Two important differences from the HTTP/2 path:

  1. It returns the error directly — no swallow-and-make-you-call-Recv. s.ctx.Err() lands in the test's err and would fail test.That(t, err, test.ShouldBeNil).
  2. The stream's ctx is cancelled by trailers, not just by external cancellation. In wrtc_base_stream.go:160, closeWithError (called from processTrailers → closeFromTrailers at wrtc_client_stream.go:378) does s.cancel(). So
    the moment the server's trailers arrive at the client, s.ctx.Done() fires — even on a successful RPC.

The key difference

Both transports cancel the per-stream context when the stream completes (HTTP/2 does it from trailer/end-of-stream handling, WebRTC does it from closeFromTrailers → closeWithError → s.cancel() at wrtc_base_stream.go:170).
So both races are present in principle — both can be holding "headers were received cleanly" AND "stream's ctx is cancelled" at the same time.

What protects HTTP/2 is a guard after the wait, not the wait itself:

  // internal/transport/client_stream.go:125
  func (s *ClientStream) Header() (metadata.MD, error) {
      s.waitOnHeader()                          // races ctx.Done vs headerChan
      if !s.headerValid || s.noHeaders {        // <-- re-check: were headers actually received?
          return nil, s.status.Err()
      }
      return s.header.Copy(), nil
  }

So even if the select inside waitOnHeader woke on ctx.Done(), the post-wait headerValid check returns the headers anyway when they really did arrive. WebRTC's Header() skips that step and trusts whichever case won the
select.

(Separately, the outer (*clientStream).Header() in HTTP/2 swallows errors and returns (nil, nil) per gRPC-Go convention, but that's an orthogonal design choice — not what makes the race benign.)

@ale7714

ale7714 commented Jun 5, 2026

Copy link
Copy Markdown
Member

@aldenh-viam in this case, it looks like this is still a reasonable fix for the flakyness. or you think we should take a different approach?

@aldenh-viam

Copy link
Copy Markdown
Contributor

@aldenh-viam in this case, it looks like this is still a reasonable fix for the flakyness. or you think we should take a different approach?

As a workaround quick fix, yes, but I think it would be better to confirm whether the Headers() implementation in goutils is indeed unintentionally non-compliant and if so, fix it there.

@ale7714

ale7714 commented Jun 5, 2026

Copy link
Copy Markdown
Member

i think i'm gonna try to get goutils work with this flow too. as there are a number of flaky test that seem to be related to go utils instead

ale7714 added a commit that referenced this pull request Jun 6, 2026
The check-approvals job posts a Review Required comment on the PR when approvals are unmet. That comment goes through the issues-comments endpoint, but for a pull request resource the GITHUB_TOKEN needs pull-requests: write -- issues: write does not cover PRs. The job only granted pull-requests: read + issues: write, so the POST returned 403 'Resource not accessible by integration' and crashed the check on every Claude PR at open time (e.g. #6087).

Verified on the same PR: assign-netcode-reviewer (pull-requests: write) edited the PR fine, while check-approvals (issues: write) 403'd -- confirming the scope, not the token/actor, is the issue. Brings this job in line with the other claude/* workflows (claude-pr-assistant, claude-ci-fix, claude-jira), which all already grant pull-requests: write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test This pull request is marked safe to test from a trusted zone

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants