fix(lifecycle): close engine-teardown TODOs (P4 partial)#15
Open
andypost wants to merge 1 commit into
Open
Conversation
Closes two of the three P4 TODOs from roadmap/plan-graceful-shutdown.md. The third (nxt_lib.c:149 "stop engines") is deferred because closing it cleanly requires an ABI signature change to NXT_EXPORT void nxt_lib_stop(void); that is out of scope for an incremental lifecycle PR -- it will land separately with explicit ABI / external-embedder review. src/nxt_event_engine.c:459 -- "free timers" ------------------------------------------- The TODO sat after the engine's underlying-event subsystem free and asked for the timer subsystem's heap to be released. An audit of nxt_timer_t users showed every timer is embedded inside a containing struct (nxt_runtime_t::timer, nxt_listen_event_t:: idle_timer, request and connection timers in nxt_h1proto.c and nxt_router.c, etc.) -- walking the rbtree and calling nxt_free on each node would corrupt the embedding subsystem's heap. The only allocation owned by nxt_timers_t itself is the `changes` batching array allocated in nxt_timers_init via nxt_malloc. That is what gets freed here. Comment cites upstream PR nginx#334 for the queue-traversal pattern we would have used if rbtree traversal were appropriate. src/nxt_unit.c:6014-6019 -- alert level FIXME --------------------------------------------- The FIXME asked for nxt_unit_warn -> nxt_unit_alert "after router graceful shutdown is implemented". libunit already tracks the relevant lifecycle state in ctx_impl->online (see nxt_unit_quit at src/nxt_unit.c:5805 where online is set to 0 when the quit decision is made), so the gate is satisfiable today: * online (steady state OR deferred graceful drain) -> alert (peer should still be reachable; sendmsg failure is a real problem). * !online (nxt_unit_quit has flipped the context out of service) -> warn (peer going away by design; EPIPE/ ECONNRESET-class errors are expected and would otherwise spam the log). A first cut of this gate also tested ctx_impl->quit_param != NXT_QUIT_GRACEFUL as a conjunct, on the assumption that quit_param tracks shutdown state. It does not -- the field is initialised to NXT_QUIT_GRACEFUL in nxt_unit_ctx_init (src/nxt_unit.c:697) as the default "intended quit semantics" and is re-asserted to GRACEFUL inside nxt_unit_quit's NORMAL branch for broadcast purposes, so it is GRACEFUL at steady state too. Including it in the gate would have routed every steady-state sendmsg failure to warn instead of alert, exactly the opposite of the FIXME's intent. The unambiguous flag is ctx_impl->online; that is what the merged version uses. The inline comment documents the pitfall so a future audit does not re-introduce the same bug. Verified -------- ./configure --tests --modules=python && ./configure python \ --config=python3-config && make -j$(nproc) # clean python3 -m pytest test/test_python_application.py => 40 passed, 6 skipped Deferred to a follow-up ----------------------- src/nxt_lib.c:149 -- "stop engines". The function nxt_lib_stop is currently NXT_EXPORT and effectively dead inside the daemon (grep finds no in-tree caller), but its signature is part of the public libunit ABI. A clean implementation needs to walk rt->engines, which means either taking rt as a parameter (signature break) or stashing rt in a global (worse). Either choice deserves its own PR with explicit ABI discussion. Refs ---- - roadmap/plan-graceful-shutdown.md P4 - upstream nginx/unit PR nginx#334 (queue-traversal pattern)
There was a problem hiding this comment.
Code Review
This pull request implements memory cleanup for the timer subsystem's batching array in nxt_event_engine_free and refines error logging in nxt_unit_sendmsg. The logging logic now distinguishes between steady-state operations, where sendmsg failures are treated as alerts, and shutdown phases, where they are downgraded to warnings to avoid log spam. I have no feedback to provide as no review comments were submitted for assessment.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 4 of the graceful-shutdown / graceful-reload plan documented in
roadmap/plan-graceful-shutdown.mdon theroadmapbranch.Closes two of the three P4 TODOs. The third (
src/nxt_lib.c:149"stop engines") is deferred because closing it cleanly needs an ABI signature change to theNXT_EXPORT void nxt_lib_stop(void)public API, which deserves its own focused PR with explicit external-embedder review.What changes
src/nxt_event_engine.c:459— "free timers"The TODO sat after the engine's underlying-event subsystem free and asked for the timer subsystem's heap to be released.
An audit of
nxt_timer_tusers showed every timer is embedded inside a containing struct (nxt_runtime_t::timer,nxt_listen_event_t::idle_timer, request/connection timers innxt_h1proto.candnxt_router.c, etc.) — walking the rbtree and callingnxt_free()on each node would corrupt the embedding subsystem's heap.The only allocation owned by
nxt_timers_titself is thechangesbatching array allocated innxt_timers_initvianxt_malloc. That is what gets freed here.Cites upstream PR
nginx/unit#334for the queue-traversal pattern that would have been used if rbtree traversal were appropriate.src/nxt_unit.c:6014-6019— alert level FIXMEThe FIXME asked for
nxt_unit_warn→nxt_unit_alert"after router graceful shutdown is implemented". libunit already tracks the relevant lifecycle state inctx_impl->online(seenxt_unit_quitatsrc/nxt_unit.c:5805whereonlineis set to 0 when the quit decision is made), so the gate is satisfiable today:online(steady state OR deferred graceful drain) → alert (peer should still be reachable; sendmsg failure is a real problem).!online(nxt_unit_quithas flipped the context out of service) → warn (peer going away by design; EPIPE/ECONNRESET-class errors are expected and would otherwise spam the log).Pitfall caught during audit (commit message documents it)
A first cut of this gate also tested
ctx_impl->quit_param != NXT_QUIT_GRACEFULas a conjunct, on the assumption thatquit_paramtracks shutdown state. It does not — the field is initialised toNXT_QUIT_GRACEFULinnxt_unit_ctx_init(src/nxt_unit.c:697) as the default "intended quit semantics" and is re-asserted toGRACEFULinsidenxt_unit_quit's NORMAL branch for broadcast purposes, so it isGRACEFULat steady state too.Including it in the gate would have routed every steady-state sendmsg failure to warn instead of alert — exactly the opposite of the FIXME's intent. The unambiguous flag is
ctx_impl->online. The inline comment in this PR documents the pitfall so a future audit does not re-introduce the same bug.Verification
Independence
This PR does not depend on PR #7, #11, or #12. It branches off
masterand the touched files (nxt_event_engine.c,nxt_unit.c) are disjoint from the other lifecycle PRs. Any merge order works — thenxt_unit.conline-only gate works regardless of whether P1'squit_paramplumbing is in place, sinceonlineis already maintained on plain master.Out of scope
src/nxt_lib.c:149"stop engines" — deferred. Function is currentlyNXT_EXPORTand effectively dead inside the daemon (grep finds no in-tree caller), but its signature is part of the public libunit ABI. A clean implementation needs to walkrt->engines, which means either takingrtas a parameter (signature break) or stashingrtin a global (worse). Either choice deserves its own PR with explicit ABI discussion.roadmap/plan-graceful-shutdown.md— that's P1/P2/P3/P5/P6/P7.Refs
roadmap/plan-graceful-shutdown.mdP4Generated by Claude Code