fix(lifecycle): close engine-teardown TODOs (P4 partial)#15

Open
andypost wants to merge 1 commit into master from p4-engine-teardown
@andypost andypost commented May 8, 2026

Summary

Phase 4 of the graceful-shutdown / graceful-reload plan documented in roadmap/plan-graceful-shutdown.md on the roadmap branch.

Closes two of the three P4 TODOs. The third (src/nxt_lib.c:149 "stop engines") is deferred because closing it cleanly needs an ABI signature change to the NXT_EXPORT void nxt_lib_stop(void) public API, which deserves its own focused PR with explicit external-embedder review.

What changes

src/nxt_event_engine.c:459 — "free timers"

The TODO sat after the engine's underlying-event subsystem free and asked for the timer subsystem's heap to be released.

An audit of nxt_timer_t users showed every timer is embedded inside a containing struct (nxt_runtime_t::timer, nxt_listen_event_t::idle_timer, request/connection timers in nxt_h1proto.c and nxt_router.c, etc.) — walking the rbtree and calling nxt_free() on each node would corrupt the embedding subsystem's heap.

The only allocation owned by nxt_timers_t itself is the changes batching array allocated in nxt_timers_init via nxt_malloc. That is what gets freed here.

nxt_free(engine->timers.changes);
engine->timers.changes = NULL;

The accompanying code comment cites upstream PR nginx/unit#334 for the queue-traversal pattern that would have been used if rbtree traversal were appropriate.
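The ownership argument above can be illustrated with a minimal sketch. All `sketch_*` names and field layouts below are hypothetical stand-ins, not the real `nxt_timer_t` / `nxt_timers_t` definitions, and plain `malloc`/`free` stand in for `nxt_malloc`/`nxt_free`. The point is only that an embedded timer has no allocation of its own, so teardown of the timer set should release the batching array and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a timer; the real nxt_timer_t differs. */
typedef struct {
    int  pending;                 /* placeholder for tree node and state */
} sketch_timer_t;

/* Hypothetical stand-in for the timer set. */
typedef struct {
    int     *changes;             /* batching array owned by the timer set */
    size_t  mchanges;             /* capacity of the batching array */
} sketch_timers_t;

/*
 * A timer embedded in a containing struct: its storage belongs to the
 * container, so calling free() on &lev->idle_timer would be invalid.
 */
typedef struct {
    int             fd;
    sketch_timer_t  idle_timer;
} sketch_listen_event_t;

/* Allocate the only heap block the timer set itself owns. */
int
sketch_timers_init(sketch_timers_t *ts, size_t mchanges)
{
    assert(ts != NULL);

    ts->changes = malloc(mchanges * sizeof(int));
    if (ts->changes == NULL) {
        return -1;
    }

    ts->mchanges = mchanges;
    return 0;
}

/* Release that block and clear the pointer so a second call is harmless. */
void
sketch_timers_free(sketch_timers_t *ts)
{
    assert(ts != NULL);

    free(ts->changes);
    ts->changes = NULL;
    ts->mchanges = 0;
}
```

Clearing the pointer after the free mirrors the two lines in the PR and keeps a repeated teardown from double-freeing.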

src/nxt_unit.c:6014-6019 — alert level FIXME

The FIXME asked for nxt_unit_warn → nxt_unit_alert "after router graceful shutdown is implemented". libunit already tracks the relevant lifecycle state in ctx_impl->online (see nxt_unit_quit at src/nxt_unit.c:5805 where online is set to 0 when the quit decision is made), so the gate is satisfiable today:

  • online (steady state OR deferred graceful drain) → alert (peer should still be reachable; sendmsg failure is a real problem).
  • !online (nxt_unit_quit has flipped the context out of service) → warn (peer going away by design; EPIPE/ECONNRESET-class errors are expected and would otherwise spam the log).
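As a rough illustration of the two cases above, the gate reduces to a single test on the online flag. The types and the function name here are hypothetical and only model the decision; the actual change lives in the sendmsg error path of src/nxt_unit.c and calls the real logging macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical log levels standing in for the alert and warn macros. */
typedef enum {
    SKETCH_LOG_ALERT,
    SKETCH_LOG_WARN
} sketch_log_level_t;

/* Hypothetical context: only the flag relevant to the gate is modelled. */
typedef struct {
    int  online;    /* nonzero while in service; 0 once quit is decided */
} sketch_ctx_t;

/*
 * Pick the log level for a failed sendmsg():
 *   online  -> alert (the peer should still be reachable)
 *   !online -> warn  (the peer is going away by design)
 */
sketch_log_level_t
sketch_sendmsg_error_level(const sketch_ctx_t *ctx)
{
    assert(ctx != NULL);

    return ctx->online ? SKETCH_LOG_ALERT : SKETCH_LOG_WARN;
}
```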

Pitfall caught during audit (commit message documents it)

A first cut of this gate also tested ctx_impl->quit_param != NXT_QUIT_GRACEFUL as a conjunct, on the assumption that quit_param tracks shutdown state. It does not — the field is initialised to NXT_QUIT_GRACEFUL in nxt_unit_ctx_init (src/nxt_unit.c:697) as the default "intended quit semantics" and is re-asserted to GRACEFUL inside nxt_unit_quit's NORMAL branch for broadcast purposes, so it is GRACEFUL at steady state too.

Including it in the gate would have routed every steady-state sendmsg failure to warn instead of alert — exactly the opposite of the FIXME's intent. The unambiguous flag is ctx_impl->online. The inline comment in this PR documents the pitfall so a future audit does not re-introduce the same bug.
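The pitfall can be reproduced in a small model. The enum, struct, and function names below are hypothetical; the only facts taken from the description are that quit_param already holds the graceful value at steady state and that online is the flag that actually changes on quit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical quit modes; the values are not those of the real enum. */
typedef enum {
    SKETCH_QUIT_NORMAL,
    SKETCH_QUIT_GRACEFUL
} sketch_quit_t;

typedef struct {
    int            online;
    sketch_quit_t  quit_param;
} sketch_ctx_t;

/* Steady state as described above: in service, quit_param is GRACEFUL. */
void
sketch_ctx_steady_state(sketch_ctx_t *ctx)
{
    assert(ctx != NULL);

    ctx->online = 1;
    ctx->quit_param = SKETCH_QUIT_GRACEFUL;
}

/* First cut: the extra conjunct is false at steady state, so no alert. */
int
sketch_first_cut_is_alert(const sketch_ctx_t *ctx)
{
    return ctx->online && ctx->quit_param != SKETCH_QUIT_GRACEFUL;
}

/* Final gate: online alone decides. */
int
sketch_final_is_alert(const sketch_ctx_t *ctx)
{
    return ctx->online != 0;
}
```

In this model the first-cut gate never selects alert for a context in steady state, which is the misrouting the paragraph above describes.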

Verification

./configure --tests --modules=python && ./configure python --config=python3-config && make -j$(nproc)   # clean
python3 -m pytest test/test_python_application.py            # 40 pass, 6 skip

Independence

This PR does not depend on PR #7, #11, or #12. It branches off master and the touched files (nxt_event_engine.c, nxt_unit.c) are disjoint from the other lifecycle PRs. Any merge order works — the nxt_unit.c online-only gate works regardless of whether P1's quit_param plumbing is in place, since online is already maintained on plain master.

Out of scope

  • src/nxt_lib.c:149 "stop engines" — deferred. Function is currently NXT_EXPORT and effectively dead inside the daemon (grep finds no in-tree caller), but its signature is part of the public libunit ABI. A clean implementation needs to walk rt->engines, which means either taking rt as a parameter (signature break) or stashing rt in a global (worse). Either choice deserves its own PR with explicit ABI discussion.
  • Anything else in roadmap/plan-graceful-shutdown.md — that's P1/P2/P3/P5/P6/P7.
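The trade-off behind the deferral can be sketched as follows. Everything here is hypothetical: the `sketch_*` names, the singly linked engine list (the real rt->engines container is not modelled), and the `stopped` flag that stands in for whatever stopping an engine would involve. The sketch only contrasts the two shapes named above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical engine and runtime; a plain linked list for brevity. */
typedef struct sketch_engine_s  sketch_engine_t;

struct sketch_engine_s {
    sketch_engine_t  *next;
    int              stopped;
};

typedef struct {
    sketch_engine_t  *engines;
} sketch_runtime_t;

/*
 * Option A: take the runtime as a parameter. Explicit and testable,
 * but no longer matches a void(void) exported signature.
 */
size_t
sketch_lib_stop_rt(sketch_runtime_t *rt)
{
    size_t           n;
    sketch_engine_t  *e;

    assert(rt != NULL);

    n = 0;
    for (e = rt->engines; e != NULL; e = e->next) {
        e->stopped = 1;
        n++;
    }

    return n;
}

/*
 * Option B: keep void(void) and reach the runtime through a global.
 * The signature survives, at the cost of hidden state that must be
 * set before the call.
 */
static sketch_runtime_t  *sketch_global_rt;

void
sketch_lib_set_runtime(sketch_runtime_t *rt)
{
    sketch_global_rt = rt;
}

void
sketch_lib_stop(void)
{
    if (sketch_global_rt != NULL) {
        (void) sketch_lib_stop_rt(sketch_global_rt);
    }
}
```

Option B silently does nothing when the global was never set, which is one reason the description calls it the worse choice.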

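Refs

  • roadmap/plan-graceful-shutdown.md P4
  • upstream nginx/unit PR nginx/unit#334 (queue-traversal pattern)
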
Generated by Claude Code

Closes two of the three P4 TODOs from
roadmap/plan-graceful-shutdown.md.  The third (nxt_lib.c:149
"stop engines") is deferred because closing it cleanly requires
an ABI signature change to NXT_EXPORT void nxt_lib_stop(void);
that is out of scope for an incremental lifecycle PR -- it will
land separately with explicit ABI / external-embedder review.

src/nxt_event_engine.c:459 -- "free timers"
-------------------------------------------
The TODO sat after the engine's underlying-event subsystem free
and asked for the timer subsystem's heap to be released.  An
audit of nxt_timer_t users showed every timer is embedded inside
a containing struct (nxt_runtime_t::timer, nxt_listen_event_t::
idle_timer, request and connection timers in nxt_h1proto.c and
nxt_router.c, etc.) -- walking the rbtree and calling nxt_free
on each node would corrupt the embedding subsystem's heap.

The only allocation owned by nxt_timers_t itself is the
`changes` batching array allocated in nxt_timers_init via
nxt_malloc.  That is what gets freed here.  Comment cites
upstream PR nginx#334 for the queue-traversal pattern we
would have used if rbtree traversal were appropriate.

src/nxt_unit.c:6014-6019 -- alert level FIXME
---------------------------------------------
The FIXME asked for nxt_unit_warn -> nxt_unit_alert "after
router graceful shutdown is implemented".  libunit already
tracks the relevant lifecycle state in ctx_impl->online (see
nxt_unit_quit at src/nxt_unit.c:5805 where online is set to 0
when the quit decision is made), so the gate is satisfiable
today:

  * online (steady state OR deferred graceful drain) -> alert
    (peer should still be reachable; sendmsg failure is a real
    problem).
  * !online (nxt_unit_quit has flipped the context out of
    service) -> warn (peer going away by design; EPIPE/
    ECONNRESET-class errors are expected and would otherwise
    spam the log).

A first cut of this gate also tested ctx_impl->quit_param !=
NXT_QUIT_GRACEFUL as a conjunct, on the assumption that
quit_param tracks shutdown state.  It does not -- the field is
initialised to NXT_QUIT_GRACEFUL in nxt_unit_ctx_init
(src/nxt_unit.c:697) as the default "intended quit semantics"
and is re-asserted to GRACEFUL inside nxt_unit_quit's NORMAL
branch for broadcast purposes, so it is GRACEFUL at steady
state too.  Including it in the gate would have routed every
steady-state sendmsg failure to warn instead of alert, exactly
the opposite of the FIXME's intent.  The unambiguous flag is
ctx_impl->online; that is what the merged version uses.  The
inline comment documents the pitfall so a future audit does not
re-introduce the same bug.

Verified
--------
  ./configure --tests --modules=python && ./configure python \
    --config=python3-config && make -j$(nproc)              # clean
  python3 -m pytest test/test_python_application.py
    => 40 passed, 6 skipped

Deferred to a follow-up
-----------------------
src/nxt_lib.c:149 -- "stop engines".  The function nxt_lib_stop
is currently NXT_EXPORT and effectively dead inside the daemon
(grep finds no in-tree caller), but its signature is part of
the public libunit ABI.  A clean implementation needs to walk
rt->engines, which means either taking rt as a parameter
(signature break) or stashing rt in a global (worse).  Either
choice deserves its own PR with explicit ABI discussion.

Refs
----
- roadmap/plan-graceful-shutdown.md P4
- upstream nginx/unit PR nginx#334 (queue-traversal pattern)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements memory cleanup for the timer subsystem's batching array in nxt_event_engine_free and refines error logging in nxt_unit_sendmsg. The logging logic now distinguishes between steady-state operations, where sendmsg failures are treated as alerts, and shutdown phases, where they are downgraded to warnings to avoid log spam. I have no feedback to provide as no review comments were submitted for assessment.