fix(port): plug NULL-deref in port read handlers under buf-alloc OOM by andypost · Pull Request #19 · andypost/unit

andypost · 2026-05-08T10:45:04Z

Summary

Plugs three TODO/FIXME placeholders in src/nxt_port_socket.c that left the port read paths with a latent NULL-pointer crash under sustained malloc() pressure. Identified during the P3 (Pattern D′) audit on roadmap/plan-graceful-shutdown.md but classified as IPC-resilience, not write-path D′ — split out of PR #16 into its own PR.

What changes

Site A — `nxt_port_read_handler` (line 745-750)

Master had:

for ( ;; ) {
    b = nxt_port_buf_alloc(port);

    if (nxt_slow_path(b == NULL)) {
        /* TODO: disable event for some time */
    }

    iov[1].iov_base = b->mem.pos;     /* SIGSEGV if b == NULL */
    ...
}

When nxt_port_buf_alloc()'s nxt_buf_mem_alloc() failed (port->free_bufs empty + mp_alloc returned NULL), the empty TODO body fell through and dereferenced b at line 755. Worker SIGSEGV under sustained memory pressure on the port read side.

New behaviour: log at NXT_LOG_INFO and nxt_fd_event_block_read() the port socket, then return. The read event stays in engine data structures so a subsequent nxt_fd_event_enable_read() (when buf pressure abates) can revive it without a fresh nxt_fd_event_add(). Matches the original TODO's "disable event for some time" intent without inventing a self-recovery timer in this PR.

Site B — `nxt_port_queue_read_handler` (line 889-893)

Identical NULL-deref shape after the queue dequeue sub-block. Same fix, plus the existing pattern at lines 884-887 of decrementing queue->nitems on early return — preserved here:

nxt_atomic_fetch_add(&queue->nitems, -1);
nxt_fd_event_block_read(task->thread->engine, &port->socket);
return;

The two log strings distinguish the message-mode ("read-buf alloc failed; blocking read event") from queue-mode ("... blocking queue read event") so operators can tell which path tripped.

Site C — `nxt_port_error_handler` (line 1345)

A bare /* TODO */ with no actionable fix path attached. The handler already does the correct cleanup (close fds, run completion handlers, walk the message queue). Comment removed for clarity; no behavioural change.

Out of scope

Auto-rearm of the read event when port->mem_pool buf pressure abates. Today, after buf-alloc OOM the port stays inactive until external action (timer, restart, manual reload). Wiring an enable-on-buf-free callback would be a multi-file refactor of nxt_port_buf_completion + the engine's wakeup discipline; non-trivial and orthogonal to plugging the crash. Tracked as a follow-up under the IPC-hardening tier-2 plan (roadmap/plan-ipc-hardening.md, PR docs(roadmap): plans for IPC hardening pass + malloc-injection harness #9).
Deterministic regression test for malloc-failure cases. Needs the LD_PRELOAD allocator-failure injection harness designed in roadmap/plan-malloc-injection.md (PR docs(roadmap): plans for IPC hardening pass + malloc-injection harness #9). Once that harness lands, a test that exercises nxt_port_buf_alloc() returning NULL N times in a row becomes possible. Today the path is review-verified only.

Tests

./configure --tests
make -j$(nproc)                                  # clean build, no warnings
python3 -m pytest test/test_idle_close_wait.py \
                  test/test_static.py \
                  test/test_configuration.py     # 18 pass, 33 skip
                                                 # skips: missing
                                                 # python module(s);
                                                 # unrelated.

The configuration test suite drives heavy port-IPC traffic via the control API (controller ↔ router ↔ main) on every conf write — no regression in normal-path behaviour observed.

Independence

Does not depend on PR #7 (P1 signal split), PR #11 (P2 listener drain), or PR #16 (P3 write-path D′). Branched off master; touches one file.

Generated by Claude Code

Fixes three TODO/FIXME placeholders in src/nxt_port_socket.c that left port-read paths with a latent NULL-pointer crash under sustained malloc pressure. Identified during the P3 (Pattern D′) audit on roadmap/plan-graceful-shutdown.md but classified as IPC-resilience, not write-path D′ — split into its own PR. Site A — nxt_port_read_handler (line 745-750) --------------------------------------------- The receive loop was: for ( ;; ) { b = nxt_port_buf_alloc(port); if (nxt_slow_path(b == NULL)) { /* TODO: disable event for some time */ } iov[1].iov_base = b->mem.pos; /* SIGSEGV if b == NULL */ ... } When nxt_port_buf_alloc()'s nxt_buf_mem_alloc() failed (port->free_bufs empty + mp_alloc returned NULL) the empty TODO body fell through and dereferenced b at line 755. Worker SIGSEGV under sustained memory pressure on the port read side. New behaviour: log at INFO via nxt_socket_error_level conventions and nxt_fd_event_block_read() the port socket, then return. The read event stays in engine data structures so a subsequent call to nxt_fd_event_enable_read() (when buf pressure abates) can revive it without a fresh nxt_fd_event_add(). This matches the agent-recommended "disable event for some time" intent of the original TODO without needing to invent a new self-recovery timer in this PR. Recovery wiring (auto-rearm when buf-pool drains) is intentionally deferred — see "Out of scope" below. Site B — nxt_port_queue_read_handler (line 889-893) --------------------------------------------------- Identical NULL-deref shape after the queue dequeue sub-block. Same fix, plus the existing pattern at lines 884-887 of decrementing queue->nitems on early return — preserved here: nxt_atomic_fetch_add(&queue->nitems, -1); nxt_fd_event_block_read(task->thread->engine, &port->socket); return; The two log lines distinguish the message-mode ("read-buf alloc failed; blocking read event") from queue-mode ("blocking queue read event") so operators can tell which path tripped. Site C — nxt_port_error_handler (line 1345) ------------------------------------------- A bare /* TODO */ with no actionable fix path attached. The handler itself already does the correct cleanup (close fds, run completion handlers, walk the message queue). Comment removed for clarity; no behavioural change. Scope notes ----------- This PR is the read-side IPC-resilience portion of the P3 plan. The write-path D′ contract (the named scope of P3) ships separately in PR #16 — different code path, different bug class, no overlap. Out of scope ------------ - Auto-rearm of the read event when port->mem_pool buf pressure abates. Today, after buf-alloc OOM the port stays inactive until external action (timer, restart, manual reload). Wiring an enable-on-buf-free callback would be a multi-file refactor of nxt_port_buf_completion + the engine's wakeup discipline; non-trivial and orthogonal to plugging the crash. Tracked as a follow-up under the IPC-hardening tier 2 plan (roadmap/plan-ipc-hardening.md, PR #9). - The deterministic regression test for malloc-failure cases — needs the LD_PRELOAD allocator-failure injection harness designed in roadmap/plan-malloc-injection.md (PR #9). Once that harness lands, a test that exercises nxt_port_buf_alloc() returning NULL N times in a row becomes possible; today the path is review-verified only. Tests ----- ./configure --tests make -j$(nproc) # clean build python3 -m pytest test/test_idle_close_wait.py test/test_static.py \ test/test_configuration.py # 18 pass, 33 skip # skips: missing # python module(s); # unrelated to this # change. The configuration test suite drives heavy port-IPC traffic via the control API (controller ↔ router ↔ main) on every conf write — no regression in normal-path behaviour observed. Independence ------------ Does not depend on PR #7 (P1 signal split), PR #11 (P2 listener drain), or PR #16 (P3 write-path D′). Branched off master; touches one file.

gemini-code-assist

Code Review

This pull request implements error handling for buffer allocation failures in port read handlers by logging the event and blocking further read events. Feedback highlights a critical issue in nxt_port_queue_read_handler where messages are lost if allocation fails after dequeuing, and suggests increasing the log level to reflect the severity of halting IPC.

gemini-code-assist · 2026-05-08T10:47:00Z

        if (nxt_slow_path(b == NULL)) {
-            /* TODO: disable event for some time */
+            nxt_log(task, NXT_LOG_INFO,
+                    "port{%d,%d} %d: read-buf alloc failed; "
+                    "blocking queue read event",
+                    (int) port->pid, (int) port->id, port->socket.fd);
+
+            nxt_atomic_fetch_add(&queue->nitems, -1);
+            nxt_fd_event_block_read(task->thread->engine, &port->socket);
+            return;
        }


In nxt_port_queue_read_handler, if nxt_port_buf_alloc(port) fails at line 895, any message already dequeued from the shared memory queue at line 841 (via nxt_port_queue_recv) will be lost. This is because the message data is currently held only in the local qmsg buffer, which is discarded when the function returns. To avoid losing IPC messages during memory pressure, consider allocating the buffer before consuming the message from the shared queue. Additionally, using NXT_LOG_INFO for a condition that effectively halts IPC for the process might be too low; NXT_LOG_WARN or NXT_LOG_ERR would be more appropriate to signal the severity of the resource exhaustion.

gemini-code-assist Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(port): plug NULL-deref in port read handlers under buf-alloc OOM#19

fix(port): plug NULL-deref in port read handlers under buf-alloc OOM#19
andypost wants to merge 1 commit into
masterfrom
p3b-port-buf-alloc

andypost commented May 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andypost commented May 8, 2026

Summary

What changes

Site A — nxt_port_read_handler (line 745-750)

Site B — nxt_port_queue_read_handler (line 889-893)

Site C — nxt_port_error_handler (line 1345)

Out of scope

Tests

Independence

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Site A — `nxt_port_read_handler` (line 745-750)

Site B — `nxt_port_queue_read_handler` (line 889-893)

Site C — `nxt_port_error_handler` (line 1345)