
fix(port): plug NULL-deref in port read handlers under buf-alloc OOM #19

Draft
andypost wants to merge 1 commit into master from p3b-port-buf-alloc

Conversation

@andypost (Owner) commented May 8, 2026

Summary

Plugs three TODO/FIXME placeholders in src/nxt_port_socket.c that left the port read paths with a latent NULL-pointer crash under sustained malloc() pressure. Identified during the P3 (Pattern D′) audit on roadmap/plan-graceful-shutdown.md but classified as IPC-resilience, not write-path D′ — split out of PR #16 into its own PR.

What changes

Site A — nxt_port_read_handler (lines 745-750)

Master had:

for ( ;; ) {
    b = nxt_port_buf_alloc(port);

    if (nxt_slow_path(b == NULL)) {
        /* TODO: disable event for some time */
    }

    iov[1].iov_base = b->mem.pos;     /* SIGSEGV if b == NULL */
    ...
}

When nxt_port_buf_alloc()'s nxt_buf_mem_alloc() failed (port->free_bufs empty + mp_alloc returned NULL), the empty TODO body fell through and dereferenced b at line 755. Worker SIGSEGV under sustained memory pressure on the port read side.

New behaviour: log at NXT_LOG_INFO, call nxt_fd_event_block_read() on the port socket, then return. The read event stays in the engine's data structures, so a subsequent nxt_fd_event_enable_read() (when buf pressure abates) can revive it without a fresh nxt_fd_event_add(). This matches the original TODO's "disable event for some time" intent without inventing a self-recovery timer in this PR.

Site B — nxt_port_queue_read_handler (lines 889-893)

Identical NULL-deref shape after the queue dequeue sub-block. Same fix, plus the existing pattern at lines 884-887 of decrementing queue->nitems on early return — preserved here:

nxt_atomic_fetch_add(&queue->nitems, -1);
nxt_fd_event_block_read(task->thread->engine, &port->socket);
return;

The two log strings distinguish the message-mode path ("read-buf alloc failed; blocking read event") from the queue-mode path ("... blocking queue read event") so operators can tell which path tripped.

Site C — nxt_port_error_handler (line 1345)

A bare /* TODO */ with no actionable fix path attached. The handler already does the correct cleanup (close fds, run completion handlers, walk the message queue). Comment removed for clarity; no behavioural change.

Out of scope

- Auto-rearm of the read event when port->mem_pool buf pressure abates — tracked as a follow-up under the IPC-hardening tier 2 plan (roadmap/plan-ipc-hardening.md, PR #9).
- A deterministic regression test for the malloc-failure path — needs the LD_PRELOAD allocator-failure injection harness designed in roadmap/plan-malloc-injection.md (PR #9).

Tests

./configure --tests
make -j$(nproc)                                  # clean build, no warnings
python3 -m pytest test/test_idle_close_wait.py \
                  test/test_static.py \
                  test/test_configuration.py     # 18 pass, 33 skip
                                                 # skips: missing
                                                 # python module(s);
                                                 # unrelated.

The configuration test suite drives heavy port-IPC traffic via the control API (controller ↔ router ↔ main) on every conf write — no regression in normal-path behaviour observed.

Independence

Does not depend on PR #7 (P1 signal split), PR #11 (P2 listener drain), or PR #16 (P3 write-path D′). Branched off master; touches one file.


Generated by Claude Code



Fixes three TODO/FIXME placeholders in src/nxt_port_socket.c that left
port-read paths with a latent NULL-pointer crash under sustained
malloc pressure.  Identified during the P3 (Pattern D′) audit on
roadmap/plan-graceful-shutdown.md but classified as IPC-resilience,
not write-path D′ — split into its own PR.

Site A — nxt_port_read_handler (lines 745-750)
----------------------------------------------

The receive loop was:

    for ( ;; ) {
        b = nxt_port_buf_alloc(port);

        if (nxt_slow_path(b == NULL)) {
            /* TODO: disable event for some time */
        }

        iov[1].iov_base = b->mem.pos;     /* SIGSEGV if b == NULL */
        ...
    }

When nxt_port_buf_alloc()'s nxt_buf_mem_alloc() failed (port->free_bufs
empty + mp_alloc returned NULL) the empty TODO body fell through and
dereferenced b at line 755.  Worker SIGSEGV under sustained memory
pressure on the port read side.

New behaviour: log at NXT_LOG_INFO, call nxt_fd_event_block_read() on
the port socket, then return.  The read
event stays in engine data structures so a subsequent call to
nxt_fd_event_enable_read() (when buf pressure abates) can revive it
without a fresh nxt_fd_event_add().  This matches the original TODO's
"disable event for some time" intent without inventing a new
self-recovery timer in this PR.

Recovery wiring (auto-rearm when buf-pool drains) is intentionally
deferred — see "Out of scope" below.

Site B — nxt_port_queue_read_handler (lines 889-893)
----------------------------------------------------

Identical NULL-deref shape after the queue dequeue sub-block.  Same
fix, plus the existing pattern at lines 884-887 of decrementing
queue->nitems on early return — preserved here:

    nxt_atomic_fetch_add(&queue->nitems, -1);
    nxt_fd_event_block_read(task->thread->engine, &port->socket);
    return;

The two log lines distinguish the message-mode path ("read-buf alloc
failed; blocking read event") from the queue-mode path ("blocking
queue read event") so operators can tell which path tripped.

Site C — nxt_port_error_handler (line 1345)
-------------------------------------------

A bare /* TODO */ with no actionable fix path attached.  The handler
itself already does the correct cleanup (close fds, run completion
handlers, walk the message queue).  Comment removed for clarity; no
behavioural change.

Scope notes
-----------

This PR is the read-side IPC-resilience portion of the P3 plan.
The write-path D′ contract (the named scope of P3) ships separately
in PR #16 — different code path, different bug class, no overlap.

Out of scope
------------

- Auto-rearm of the read event when port->mem_pool buf pressure
  abates.  Today, after buf-alloc OOM the port stays inactive until
  external action (timer, restart, manual reload).  Wiring an
  enable-on-buf-free callback would be a multi-file refactor of
  nxt_port_buf_completion + the engine's wakeup discipline; non-trivial
  and orthogonal to plugging the crash.  Tracked as a follow-up under
  the IPC-hardening tier 2 plan (roadmap/plan-ipc-hardening.md, PR #9).
- The deterministic regression test for malloc-failure cases — needs
  the LD_PRELOAD allocator-failure injection harness designed in
  roadmap/plan-malloc-injection.md (PR #9).  Once that harness lands,
  a test that exercises nxt_port_buf_alloc() returning NULL N times
  in a row becomes possible; today the path is review-verified only.

Tests
-----

./configure --tests
make -j$(nproc)                                  # clean build
python3 -m pytest test/test_idle_close_wait.py test/test_static.py \
                  test/test_configuration.py     # 18 pass, 33 skip
                                                 # skips: missing
                                                 # python module(s);
                                                 # unrelated to this
                                                 # change.

The configuration test suite drives heavy port-IPC traffic via the
control API (controller ↔ router ↔ main) on every conf write — no
regression in normal-path behaviour observed.

Independence
------------

Does not depend on PR #7 (P1 signal split), PR #11 (P2 listener drain),
or PR #16 (P3 write-path D′).  Branched off master; touches one file.

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements error handling for buffer allocation failures in port read handlers by logging the event and blocking further read events. Feedback highlights a critical issue in nxt_port_queue_read_handler where messages are lost if allocation fails after dequeuing, and suggests increasing the log level to reflect the severity of halting IPC.

Comment thread: src/nxt_port_socket.c, lines 897 to 906
    if (nxt_slow_path(b == NULL)) {
        /* TODO: disable event for some time */
        nxt_log(task, NXT_LOG_INFO,
                "port{%d,%d} %d: read-buf alloc failed; "
                "blocking queue read event",
                (int) port->pid, (int) port->id, port->socket.fd);

        nxt_atomic_fetch_add(&queue->nitems, -1);
        nxt_fd_event_block_read(task->thread->engine, &port->socket);
        return;
    }

Severity: high

In nxt_port_queue_read_handler, if nxt_port_buf_alloc(port) fails at line 895, any message already dequeued from the shared memory queue at line 841 (via nxt_port_queue_recv) will be lost. This is because the message data is currently held only in the local qmsg buffer, which is discarded when the function returns. To avoid losing IPC messages during memory pressure, consider allocating the buffer before consuming the message from the shared queue. Additionally, using NXT_LOG_INFO for a condition that effectively halts IPC for the process might be too low; NXT_LOG_WARN or NXT_LOG_ERR would be more appropriate to signal the severity of the resource exhaustion.
