
fix nio read/write iovec bugs causing deadlock and infinite loop #442

Merged
loongs-zhang merged 3 commits into master from copilot/fix-root-cause-of-build-failure
May 2, 2026

Conversation

Contributor

Copilot AI commented May 1, 2026

  • Investigate root cause of SIGSEGV in job 74019781122 (scheduler_delay, release mode, preemptive)
  • Verify fix is in place (commit 74db854 by @loongs-zhang)
  • Confirm scheduler tests pass locally (4/4 release mode with io_uring,preemptive,ci)
  • Confirm socket_co test passes locally
  • Investigate new CI failures: aarch64 non-preemptive (job 74020614157) and macOS preemptive (job 74020614232)
    • Both fail with socket_co deadlock (server stuck in recv, client never runs)
    • Both fail in release mode only (debug mode passes)
    • Test times out after 30s + 30s event-loop-stop = ~60s total
    • Only code change between passing and failing run: impl_nio_read_iovec in commit 74db854
    • impl_nio_write_iovec has a bug: when updating offset on a partial write it uses length (the cumulative value after += iov_len) instead of stage (the value before += iov_len); impl_nio_read_iovec correctly uses stage
    • Root cause of deadlock: under investigation; test passes locally on x86_64 release mode

Copilot AI and others added 2 commits May 1, 2026 17:24
Member

@loongs-zhang loongs-zhang left a comment


readv/read_vectored semantics say: return as soon as any data is available. But our implementation assumes the socket is blocking, and then uses non-blocking to achieve the same effect as blocking. Therefore, strictly speaking, there is no need to follow the readv/read_vectored semantics at all, unless the readv/read_vectored semantics are the same for both blocking and non-blocking.

@codecov

codecov Bot commented May 2, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 74.53%. Comparing base (601ea78) to head (fd44e97).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/src/syscall/unix/mod.rs | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #442      +/-   ##
==========================================
- Coverage   75.45%   74.53%   -0.92%     
==========================================
  Files          45       45              
  Lines        4742     4744       +2     
==========================================
- Hits         3578     3536      -42     
- Misses       1164     1208      +44     


Contributor Author

Copilot AI commented May 2, 2026

> readv/read_vectored semantics say: return as soon as any data is available. But our implementation assumes the socket is blocking, and then uses non-blocking to achieve the same effect as blocking. Th...

Blocking and non-blocking readv share the same partial-data semantics when data is available — both return as soon as any bytes arrive. The only difference is what happens when no data is present: non-blocking returns EAGAIN, blocking blocks. Our NIO layer already handles this: on EAGAIN we call wait_read_event to suspend the coroutine until data arrives, which faithfully simulates the blocking behaviour. Once the syscall succeeds, both modes return partial data.

The test assertion assert_eq!(26, socket.read_vectored(&mut buffers)) is the definitive confirmation: the server allocates two 256-byte buffers (512 bytes total), the client sends only 26 bytes, and the assertion expects 26 back. With the original fill-all-buffers loop the server would have blocked waiting for the remaining 486 bytes while the client waited for the server's reply — the exact deadlock observed in CI. The fix (break after the first successful readv) matches both POSIX semantics and the expected test behaviour.

Copilot AI requested a review from loongs-zhang May 2, 2026 02:01
@loongs-zhang loongs-zhang force-pushed the copilot/fix-root-cause-of-build-failure branch from 36e82c0 to 74db854 Compare May 2, 2026 03:14
Contributor Author

Copilot AI commented May 2, 2026

> @copilot Locate root cause of https://github.com/acl-dev/open-coroutine/actions/runs/25242137815/job/74019781122?pr=442 and fix.

Root cause of job 74019781122:

The SIGSEGV in scheduler_delay (release mode, io_uring,preemptive) was caused by a cumulative offset double-advance bug in 36e82c00's impl_nio_read_iovec.

After readv returned a partial result (e.g., 26 bytes into a 256-byte buffer), the code set offset = 26 and looped back to try to fill the remaining 230 bytes. On each subsequent inner-while iteration, it applied arg[0].iov_base += offset and arg[0].iov_len -= offset again — advancing the base pointer cumulatively past the end of the buffer. Eventually iov_len (a usize) underflowed to a huge value, causing the kernel to write far past the actual buffer boundary. In release mode, this heap corruption manifested as a SIGSEGV when the scheduler tests ran in the same test binary shortly afterwards.

Fix (already in 74db854):

  • Introduces a stage variable that records the cumulative length at the start of each iovec, so offset = received.saturating_sub(stage) is always bounded by the current iovec's length.
  • Critically, the existing else if error_kind != Interrupted early-return is preserved: after reset_errno(), last_os_error().kind() is Other (errno = 0), which is neither WouldBlock nor Interrupted, so the function returns immediately with the partial byte count. This prevents the inner loop from iterating a second time with a non-zero offset, making the cumulative-advance scenario impossible in practice.

Local verification:

  • cargo test --no-default-features --features io_uring,preemptive,ci --release in core/: 4/4 scheduler tests pass (including scheduler_delay)
  • socket_co example test: passes

Member

@loongs-zhang loongs-zhang left a comment


Contributor Author

Copilot AI commented May 2, 2026

> @copilot https://github.com/acl-dev/open-coroutine/actions/runs/25242427251/job/74020614157?pr=442 and https://github.com/acl-dev/open-coroutine/actions/runs/25242427251/job/74020614232?pr=442 got dea...

Both jobs deadlock with the same symptom: server coroutine stuck in recv, client coroutine never starts. Only one coroutine UUID ever appears in the logs. The test eventually times out after 30 s + 30 s event-loop-stop ≈ 60 s total.

Key observations

| | aarch64 (job 74020614157) | macOS (job 74020614232) |
| --- | --- | --- |
| Mode | non-preemptive, release | preemptive, release |
| Debug build | passes | passes |
| Release build | deadlocks | deadlocks |

Both fail only in release mode, pointing to a compiler-optimisation-sensitive path.

Code issue found

The impl_nio_write_iovec macro (Unix) has an off-by-one in the inner-loop offset update:

```rust
// BUG (line 828) – uses cumulative `length` (after += iov_len), not the per-iovec base
offset = sent.saturating_sub(length);

// Correct (as in impl_nio_read_iovec, line 612)
offset = sent.saturating_sub(stage);   // stage = length before += iov_len
```

When writev returns a partial count within one iovec and errno is then set to EINTR (causing a retry), the wrong offset = 0 means the retry re-sends bytes from the start of the iovec rather than from where the partial write left off. Under aggressive release-mode inlining and different kernel scheduler behaviour (QEMU/macOS), this can corrupt the byte stream seen by the server's recv, keeping it waiting for data that the client believes it already sent.

The fix is to introduce let stage = length; before length += iovec.iov_len; in impl_nio_write_iovec (mirroring impl_nio_read_iovec) and change line 828 to offset = sent.saturating_sub(stage);.

Unfortunately I ran out of time before I could commit and validate that fix — I wanted to be confident about the change before pushing it given how subtle the timing-sensitive behaviour is.

Copilot AI requested a review from loongs-zhang May 2, 2026 08:17
@loongs-zhang loongs-zhang marked this pull request as ready for review May 2, 2026 08:46
@loongs-zhang loongs-zhang requested a review from Copilot May 2, 2026 08:47

Copilot AI left a comment


Pull request overview

Fixes NIO readv/iovec handling to correctly resume filling the current iovec after a prior syscall returned bytes spanning across iovec boundaries, addressing deadlock/infinite-loop behavior observed in socket_co-style tests.

Changes:

  • Adjust impl_nio_read_iovec (Unix + Windows) to compute per-iovec offsets from the iovec start (stage) rather than the cumulative length.
  • On Windows, update iovec read/write accounting to use the lpNumberOfBytes{Recv,Sent} out-parameter instead of the WSA{Recv,Send} return value.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| core/src/syscall/windows/mod.rs | Fixes read-iovec offset calculation; uses WinSock out-params for byte counts (but leaves a write-iovec offset bug and introduces raw-pointer deref risks). |
| core/src/syscall/unix/mod.rs | Fixes read-iovec offset calculation; write-iovec path appears to still have the analogous offset bug (per PR intent). |
Comments suppressed due to low confidence (1)

core/src/syscall/windows/mod.rs:814

  • The offset recomputation after a partial WSASend uses sent.saturating_sub(length), but length is the cumulative length after adding the current iovec. When sent < length, this yields 0 and causes subsequent retries to resend from the start of the current iovec (data corruption and/or retry loops). Track the per-iovec start (stage = length before incrementing) and compute offset as sent.saturating_sub(stage) (matching the read-iovec fix).
```rust
                            r = 0;
                            unsafe{ $sent.write(sent.try_into().expect("overflow")) };
                            break;
                        }
                        offset = sent.saturating_sub(length);
```


Comment on lines 565 to 569

```diff
 );
 if r != windows_sys::Win32::Networking::WinSock::SOCKET_ERROR {
     $crate::syscall::reset_errno();
-    received += usize::try_from(r).expect("overflow");
+    received += unsafe{ usize::try_from(*$recvd).expect("overflow") };
     if received >= length {
```
Comment on lines 806 to 810

```diff
 if r != windows_sys::Win32::Networking::WinSock::SOCKET_ERROR {
     $crate::syscall::reset_errno();
-    sent += usize::try_from(r).expect("overflow");
+    sent += unsafe{ usize::try_from(*$sent).expect("overflow")};
     if sent >= length {
         r = 0;
```
Comment on lines 569 to 573

```diff
 for iovec in &vec {
-    let mut offset = received.saturating_sub(length);
+    let stage = length;
+    let mut offset = received.saturating_sub(stage);
     length += iovec.iov_len;
     if received > length {
```
@loongs-zhang loongs-zhang force-pushed the copilot/fix-root-cause-of-build-failure branch from 74db854 to fd44e97 Compare May 2, 2026 09:08
@loongs-zhang loongs-zhang merged commit f90cb74 into master May 2, 2026
44 of 56 checks passed