fix(io): generalize PR #54 write-path contract to non-TLS sites (P3)#12
fix(io): generalize PR #54 write-path contract to non-TLS sites (P3)#12andypost wants to merge 1 commit into
Conversation
…(P3) Phase 3 of roadmap/plan-graceful-shutdown.md (Pattern D' on the `roadmap` branch). Audits the five upstream TODO/FIXME sites flagged for the "silent fall-through on write paths" bug family and applies PR nginx#54's contract where it matters. Of the five, two are real NULL-deref bugs that needed code fixes; three are documentation-only dispositions. Real fixes ---------- src/nxt_port_socket.c:746 nxt_port_read_handler The empty `if (b == NULL) {}` body fell through to `iov[1].iov_base = b->mem.pos`, NULL-derefing on transient OOM in the port mem_pool. Now alerts, disarms the read event via nxt_fd_event_block_read, and `goto fail` (the existing fail: label at :802 queues nxt_port_error_handler). Strict improvement: crash -> orderly teardown. Recovery via timer-driven retry was rejected as out of P3 scope. src/nxt_port_socket.c:889 nxt_port_queue_read_handler Same NULL-deref class -- `b->mem.pos` was used in both the dequeue memcpy and the iov[1] recv branch. Mirror the fix: alert, disarm, decrement queue->nitems (balances the entry +1 at :830), and queue nxt_port_error_handler. Both sites now carry an `nxt_assert(b != NULL)` immediately past the OOM-teardown block, documenting the invariant for future audits and providing a deterministic guard against a future regression that re-introduces an empty if-body. Documentation-only dispositions ------------------------------- src/nxt_port_socket.c:1369 nxt_port_error_handler The bare `/* TODO */` historically asked for richer error context (e.g. surfacing socket.error to peers). Audit found the handler already drains queued sends, releases buffers, and decrements port refcounts -- no silent-success or NULL-deref bug hidden here. Replaced bare TODO with a disposition note; richer error responses are a future enhancement and not part of PR nginx#54's write-path contract. src/nxt_router.c:5895,5917 nxt_router_get_mmap_handler Two `// FIXME` markers on best-effort RPC_ERROR sends. The function is `void`, delivery failures surface in port->socket.error and the port-side teardown (nxt_port_error_handler) fires; peer-side RPC timeout is the ultimate backstop. Replaced FIXMEs with a one-line disposition note and added explicit `(void)` casts to match the convention at the end of the same function. Tests ----- No new tests. The NULL-deref sites cannot be reliably reproduced from Python without malloc-failure injection (LD_PRELOAD or AddressSanitizer with allocator hooks); ASan validation is recommended in CI follow-up. The nxt_assert(b != NULL) lines provide deterministic in-tree documentation of the invariant in lieu of a behavioural test. Verified -------- ./configure --tests --modules=python && ./configure python \ --config=python3-config && make -j$(nproc) # clean python3 -m pytest test/test_idle_close_wait.py # 2 pass python3 -m pytest test/test_python_application.py # 40 pass, 6 skip Refs ---- - roadmap/plan-graceful-shutdown.md P3 (Pattern D') - PR nginx#54 (TLS write-path contract; the canonical shape mirrored by the two real fixes above)
There was a problem hiding this comment.
Code Review
This pull request improves error handling in src/nxt_port_socket.c by addressing potential NULL pointer dereferences during buffer allocation failures. It also updates code comments and explicitly handles return values in src/nxt_router.c. The review feedback suggests using specific format specifiers (%PI and %uD) for PID and ID types in log messages to improve type safety and remove unnecessary casts.
| nxt_alert(task, "port{%d,%d} %d: buf alloc failed; " | ||
| "disabling read", | ||
| (int) port->pid, (int) port->id, port->socket.fd); |
There was a problem hiding this comment.
The format specifiers in the nxt_alert call should be updated for correctness and consistency:
%PIshould be used fornxt_pid_t(port->pid).%uDshould be used foruint32_t(port->id) to avoid potential overflow when casting to a signedint.
This also removes the need for casting the arguments.
nxt_alert(task, "port{%PI,%uD} %d: buf alloc failed; "
"disabling read",
port->pid, port->id, port->socket.fd);| nxt_alert(task, "port{%d,%d} %d: buf alloc failed; " | ||
| "disabling read", | ||
| (int) port->pid, (int) port->id, port->socket.fd); |
There was a problem hiding this comment.
The format specifiers in the nxt_alert call should be updated for correctness and consistency:
%PIshould be used fornxt_pid_t(port->pid).%uDshould be used foruint32_t(port->id) to avoid potential overflow when casting to a signedint.
This also removes the need for casting the arguments.
nxt_alert(task, "port{%PI,%uD} %d: buf alloc failed; "
"disabling read",
port->pid, port->id, port->socket.fd);
Summary
Phase 3 (Pattern D′) of the graceful-shutdown / graceful-reload plan documented in
roadmap/plan-graceful-shutdown.mdon theroadmapbranch.Audits the five upstream
TODO/FIXMEsites flagged for the silent fall-through on write paths bug family and applies PR nginx#54's contract where it matters. Two are real NULL-deref bugs that needed code fixes; three are documentation-only dispositions.Real fixes — NULL-deref on transient port mem_pool OOM
Both sites in
src/nxt_port_socket.chadif (b == NULL) {}with an empty body, then dereferencedb->mem.posimmediately after. Strict improvement: crash → orderly teardown.src/nxt_port_socket.c:746—nxt_port_read_handlersrc/nxt_port_socket.c:889—nxt_port_queue_read_handlerSame shape, plus
nxt_atomic_fetch_add(&queue->nitems, -1)to balance the+1at function entry (:830), then queuenxt_port_error_handlerand return.Recovery via timer-driven retry was rejected as out of P3 scope; terminal teardown is the strict improvement over the pre-P3 NULL-deref.
Documentation-only dispositions
src/nxt_port_socket.c:1369nxt_port_error_handler/* TODO */src/nxt_router.c:5895,5917nxt_router_get_mmap_handler// FIXMEmarkers on best-effortRPC_ERRORsends(void)cast matching the convention at the end of the same function. Function isvoid, delivery failures surface inport->socket.error, port-side teardown fires, peer-side RPC timeout is the backstop.Tests
No new tests. The NULL-deref sites cannot be reliably reproduced from Python without malloc-failure injection (
LD_PRELOADshim or AddressSanitizer with allocator hooks); ASan validation is recommended in CI follow-up.The two
nxt_assert(b != NULL)lines after the OOM-teardown blocks provide deterministic in-tree documentation of the invariant in lieu of a behavioural test — a future regression that re-introduces an emptyif (b == NULL) {}body wouldnxt_asserton the next iteration.Verification
Independence
This PR does not depend on PR #7 (P1) or PR #11 (P2). It branches off
masterand the touched files (nxt_port_socket.c,nxt_router.c:5895-5927) are disjoint from both. Any merge order works.Out of scope
nitemsaccounting on suspend-message error paths innxt_port_queue_read_handler(caught during this PR's audit but not introduced by it; filed under the port IPC leak discussion in PR fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths #8).roadmap/plan-graceful-shutdown.mdPattern D — that's P1/P2/P4/P5/P6/P7.Refs
roadmap/plan-graceful-shutdown.mdP3 (Pattern D′)Generated by Claude Code