fix(lifecycle): close engine-teardown TODOs (P4 partial) by andypost · Pull Request #15 · andypost/unit

andypost · 2026-05-08T10:30:51Z

Summary

Phase 4 of the graceful-shutdown / graceful-reload plan documented in roadmap/plan-graceful-shutdown.md on the roadmap branch.

Closes two of the three P4 TODOs. The third (src/nxt_lib.c:149 "stop engines") is deferred because closing it cleanly needs an ABI signature change to the NXT_EXPORT void nxt_lib_stop(void) public API, which deserves its own focused PR with explicit external-embedder review.

What changes

`src/nxt_event_engine.c:459` — "free timers"

The TODO sat after the engine's underlying-event subsystem free and asked for the timer subsystem's heap to be released.

An audit of nxt_timer_t users showed every timer is embedded inside a containing struct (nxt_runtime_t::timer, nxt_listen_event_t::idle_timer, request/connection timers in nxt_h1proto.c and nxt_router.c, etc.) — walking the rbtree and calling nxt_free() on each node would corrupt the embedding subsystem's heap.

The only allocation owned by nxt_timers_t itself is the changes batching array allocated in nxt_timers_init via nxt_malloc. That is what gets freed here.

nxt_free(engine->timers.changes);
engine->timers.changes = NULL;

Cites upstream PR nginx/unit#334 for the queue-traversal pattern that would have been used if rbtree traversal were appropriate.

`src/nxt_unit.c:6014-6019` — alert level FIXME

The FIXME asked for nxt_unit_warn → nxt_unit_alert "after router graceful shutdown is implemented". libunit already tracks the relevant lifecycle state in ctx_impl->online (see nxt_unit_quit at src/nxt_unit.c:5805 where online is set to 0 when the quit decision is made), so the gate is satisfiable today:

online (steady state OR deferred graceful drain) → alert (peer should still be reachable; sendmsg failure is a real problem).
!online (nxt_unit_quit has flipped the context out of service) → warn (peer going away by design; EPIPE/ECONNRESET-class errors are expected and would otherwise spam the log).

Pitfall caught during audit (commit message documents it)

A first cut of this gate also tested ctx_impl->quit_param != NXT_QUIT_GRACEFUL as a conjunct, on the assumption that quit_param tracks shutdown state. It does not — the field is initialised to NXT_QUIT_GRACEFUL in nxt_unit_ctx_init (src/nxt_unit.c:697) as the default "intended quit semantics" and is re-asserted to GRACEFUL inside nxt_unit_quit's NORMAL branch for broadcast purposes, so it is GRACEFUL at steady state too.

Including it in the gate would have routed every steady-state sendmsg failure to warn instead of alert — exactly the opposite of the FIXME's intent. The unambiguous flag is ctx_impl->online. The inline comment in this PR documents the pitfall so a future audit does not re-introduce the same bug.

Verification

./configure --tests --modules=python && ./configure python --config=python3-config && make -j$(nproc)
                                                                               # clean
python3 -m pytest test/test_python_application.py            # 40 pass, 6 skip

Independence

This PR does not depend on PR #7, #11, or #12. It branches off master and the touched files (nxt_event_engine.c, nxt_unit.c) are disjoint from the other lifecycle PRs. Any merge order works — the nxt_unit.c online-only gate works regardless of whether P1's quit_param plumbing is in place, since online is already maintained on plain master.

Out of scope

src/nxt_lib.c:149 "stop engines" — deferred. Function is currently NXT_EXPORT and effectively dead inside the daemon (grep finds no in-tree caller), but its signature is part of the public libunit ABI. A clean implementation needs to walk rt->engines, which means either taking rt as a parameter (signature break) or stashing rt in a global (worse). Either choice deserves its own PR with explicit ABI discussion.
Anything else in roadmap/plan-graceful-shutdown.md — that's P1/P2/P3/P5/P6/P7.

Refs

roadmap/plan-graceful-shutdown.md P4
Upstream nginx/unit PR Fixed nxt_runtime_close_idle_connections(). nginx/unit#334 (queue-traversal pattern referenced in the comment)

Generated by Claude Code

Closes two of the three P4 TODOs from roadmap/plan-graceful-shutdown.md. The third (nxt_lib.c:149 "stop engines") is deferred because closing it cleanly requires an ABI signature change to NXT_EXPORT void nxt_lib_stop(void); that is out of scope for an incremental lifecycle PR -- it will land separately with explicit ABI / external-embedder review. src/nxt_event_engine.c:459 -- "free timers" ------------------------------------------- The TODO sat after the engine's underlying-event subsystem free and asked for the timer subsystem's heap to be released. An audit of nxt_timer_t users showed every timer is embedded inside a containing struct (nxt_runtime_t::timer, nxt_listen_event_t:: idle_timer, request and connection timers in nxt_h1proto.c and nxt_router.c, etc.) -- walking the rbtree and calling nxt_free on each node would corrupt the embedding subsystem's heap. The only allocation owned by nxt_timers_t itself is the `changes` batching array allocated in nxt_timers_init via nxt_malloc. That is what gets freed here. Comment cites upstream PR nginx#334 for the queue-traversal pattern we would have used if rbtree traversal were appropriate. src/nxt_unit.c:6014-6019 -- alert level FIXME --------------------------------------------- The FIXME asked for nxt_unit_warn -> nxt_unit_alert "after router graceful shutdown is implemented". libunit already tracks the relevant lifecycle state in ctx_impl->online (see nxt_unit_quit at src/nxt_unit.c:5805 where online is set to 0 when the quit decision is made), so the gate is satisfiable today: * online (steady state OR deferred graceful drain) -> alert (peer should still be reachable; sendmsg failure is a real problem). * !online (nxt_unit_quit has flipped the context out of service) -> warn (peer going away by design; EPIPE/ ECONNRESET-class errors are expected and would otherwise spam the log). A first cut of this gate also tested ctx_impl->quit_param != NXT_QUIT_GRACEFUL as a conjunct, on the assumption that quit_param tracks shutdown state. It does not -- the field is initialised to NXT_QUIT_GRACEFUL in nxt_unit_ctx_init (src/nxt_unit.c:697) as the default "intended quit semantics" and is re-asserted to GRACEFUL inside nxt_unit_quit's NORMAL branch for broadcast purposes, so it is GRACEFUL at steady state too. Including it in the gate would have routed every steady-state sendmsg failure to warn instead of alert, exactly the opposite of the FIXME's intent. The unambiguous flag is ctx_impl->online; that is what the merged version uses. The inline comment documents the pitfall so a future audit does not re-introduce the same bug. Verified -------- ./configure --tests --modules=python && ./configure python \ --config=python3-config && make -j$(nproc) # clean python3 -m pytest test/test_python_application.py => 40 passed, 6 skipped Deferred to a follow-up ----------------------- src/nxt_lib.c:149 -- "stop engines". The function nxt_lib_stop is currently NXT_EXPORT and effectively dead inside the daemon (grep finds no in-tree caller), but its signature is part of the public libunit ABI. A clean implementation needs to walk rt->engines, which means either taking rt as a parameter (signature break) or stashing rt in a global (worse). Either choice deserves its own PR with explicit ABI discussion. Refs ---- - roadmap/plan-graceful-shutdown.md P4 - upstream nginx/unit PR nginx#334 (queue-traversal pattern)

gemini-code-assist

Code Review

This pull request implements memory cleanup for the timer subsystem's batching array in nxt_event_engine_free and refines error logging in nxt_unit_sendmsg. The logging logic now distinguishes between steady-state operations, where sendmsg failures are treated as alerts, and shutdown phases, where they are downgraded to warnings to avoid log spam. I have no feedback to provide as no review comments were submitted for assessment.

gemini-code-assist Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lifecycle): close engine-teardown TODOs (P4 partial)#15

fix(lifecycle): close engine-teardown TODOs (P4 partial)#15
andypost wants to merge 1 commit into
masterfrom
p4-engine-teardown

andypost commented May 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andypost commented May 8, 2026

Summary

What changes

src/nxt_event_engine.c:459 — "free timers"

src/nxt_unit.c:6014-6019 — alert level FIXME

Pitfall caught during audit (commit message documents it)

Verification

Independence

Out of scope

Refs

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`src/nxt_event_engine.c:459` — "free timers"

`src/nxt_unit.c:6014-6019` — alert level FIXME