fix: rebind asyncio Semaphore and HTTP client on event-loop change #1858

Open
leofan-lab wants to merge 1 commit into THUDM:main from leofan-lab:fix/ray-loop-rebind

@leofan-lab
Contributor

Summary

Ray actors can serve calls on different asyncio event loops across re-entries. asyncio.Semaphore and the internal pool locks of httpx.AsyncClient bind to the loop that created them; using them later from a different loop either raises RuntimeError: ... is attached to a different event loop or silently stalls.

Make GenerateState.semaphore and _get_http_client() lazily rebuild when the running loop differs from the one the cached primitive was bound to. Loop identity is tracked via weakref.ref(loop) rather than id(loop), because an id() can be recycled by a fresh loop once the old one is garbage-collected.

When this fires

Encountered while wiring up in-training eval for async rollouts. The first eval call after a rollout crashed at async with state.semaphore: because the rollout had bound the Semaphore to loop A, and Ray re-entered the actor on loop B for eval.
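The failure mode can be reproduced outside Ray with two `asyncio.run()` calls, which stand in for loop A and loop B. This is a sketch under that assumption; `contend`, `rollout`, `reproduce`, and the `shared` dict are hypothetical names. Note that on CPython 3.10+ a Semaphore only binds to a loop when an acquire actually has to wait, so the sketch forces contention with a one-slot semaphore and two workers.

```python
import asyncio
from typing import Optional

shared = {}  # stands in for long-lived actor state


async def contend(sem: asyncio.Semaphore) -> None:
    """Run two workers against a one-slot semaphore so the second must wait."""

    async def worker() -> None:
        async with sem:
            await asyncio.sleep(0.01)

    await asyncio.gather(worker(), worker())


async def rollout() -> None:
    # First call: create the semaphore and use it under contention on loop A.
    shared["sem"] = asyncio.Semaphore(1)
    await contend(shared["sem"])


def reproduce() -> Optional[RuntimeError]:
    asyncio.run(rollout())  # loop A
    try:
        asyncio.run(contend(shared["sem"]))  # loop B reuses the cached semaphore
    except RuntimeError as exc:
        return exc
    return None
```

`reproduce()` returns the RuntimeError raised on the second loop; the exact message differs between Python versions, so the sketch does not depend on it.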

Tests

tests/test_http_utils_loop_rebind.py covers rebinding across asyncio.run() calls and same-loop reuse. Mutation-tested: breaking the rebind condition causes test_http_client_rebinds_across_fresh_event_loops to fail with a clear assertion message.

asyncio primitives (Semaphore, httpx.AsyncClient's internal locks / pool)
bind to the loop they were created on. Ray actors can serve calls on a
different loop across re-entries (e.g. rollout → eval), and reusing a
loop-bound primitive from a new loop raises "attached to a different
event loop" or silently stalls.

Make both GenerateState.semaphore and _get_http_client() lazily (re)build
when the current loop is not the one the cached primitive was bound to.
Loop identity is tracked via weakref so a recycled id() on a fresh loop
doesn't look like the old one. The old HTTP client is dropped rather
than aclose()'d from a foreign loop (which races on its own pool locks);
sockets close when GC collects it.

Tests: tests/test_http_utils_loop_rebind.py covers the rebind firing
across two asyncio.run() calls and the no-op within a single loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>