os/kernel/task: Fix for multiple exit scenario via task_exithook()#7288
os/kernel/task: Fix for multiple exit scenario via task_exithook()#7288rish-sg wants to merge 1 commit intoSamsung:masterfrom
Conversation
| * that bit is set, then just exit doing nothing more.. | ||
| */ | ||
|
|
||
| #ifdef CONFIG_SMP |
There was a problem hiding this comment.
The SMP conditional is necessary? What about context switching here in non smp?
There was a problem hiding this comment.
Got you. Need to make sure pre-emption is disabled in non-SMP case too. Condition removed.
PR #7288 — os/kernel/task: Fix for multiple exit scenario via task_exithook()
Repository: Base → Head: HEAD Commit: Scope: Review of the 🔎 Review Summary
Final Verdict: ❗ Request Changes 🚨 Must-Fix Issues1. The new guard still does not atomically check and set
|
| Item | Details |
|---|---|
| Location | os/kernel/task/task_exithook.c:558 |
| Severity | High |
| Type | Functional correctness |
Problem
The patch moves the early check into a critical section, but it still releases that critical section before the flag is set. The first guard runs at task_exithook.c:558-566, while the setter is still deferred until task_exithook.c:616-623 after dbg_save_termination_info(), optional task_atexit() / task_onexit(), and task_recover(). That leaves the same race window the PR description is trying to close: a second CPU can pass the early guard before the first CPU sets TCB_FLAG_EXIT_PROCESSING.
Impact
- Two concurrent exit paths can still both execute the teardown body for the same TCB.
- Duplicate execution here is exactly the class of failure the PR is trying to prevent: repeated
task_signalparent(),task_exitwakeup(),group_leave(), andsig_cleanup()on one task. - Because
task_terminate.c:230-236already documents thattask_exithook()may be called twice, leaving this race open means the asserted double-exit path is still credible after the patch.
Evidence
os/kernel/task/task_exithook.c:558— enters the critical section only to test the flag.os/kernel/task/task_exithook.c:566— drops the critical section before any state is changed.os/kernel/task/task_exithook.c:623— setsTCB_FLAG_EXIT_PROCESSINGmuch later, after non-trivial cleanup work has already run.os/kernel/task/task_terminate.c:236—task_exithook()is a shared termination path that can be reached more than once for the same task.
Required Fix
Files to edit:
os/kernel/task/task_exithook.c—task_exithook
Change outline:
- Keep the first
enter_critical_section()/leave_critical_section()pair. - Perform the
TCB_FLAG_EXIT_PROCESSINGtest and set in that same critical section before returning to normal execution. - Keep the later critical section only for the smaller region that truly needs it, instead of using it as the place where the flag is first published.
Example patch:
diff --git a/os/kernel/task/task_exithook.c b/os/kernel/task/task_exithook.c
@@
flags = enter_critical_section();
if ((tcb->flags & TCB_FLAG_EXIT_PROCESSING) != 0) {
leave_critical_section(flags);
return;
}
+ tcb->flags |= TCB_FLAG_EXIT_PROCESSING;
leave_critical_section(flags);
@@
- flags = enter_critical_section();
- tcb->flags |= TCB_FLAG_EXIT_PROCESSING;
+ flags = enter_critical_section();Inference:
- The exact placement of the later signal-handling critical section may still need minor adjustment after runtime validation, but the flag itself needs to be published in the first guarded block, not at line 623.
Validation Method
- Reproduce the original SMP stress case that drives concurrent exit/terminate handling for the same task.
- Add temporary tracing around the early-return path and the main cleanup body, then confirm only one CPU reaches the teardown body once the flag is set early.
- Verify the original assert does not reappear under repeated SMP stress.
2. The patch now holds the exit path inside enter_critical_section() across heavyweight teardown on all builds
| Item | Details |
|---|---|
| Location | os/kernel/task/task_exithook.c:616 |
| Severity | High |
| Type | Stability / Safety |
Problem
The new enter_critical_section() at task_exithook.c:616 is unconditional and is not released until task_exithook.c:667. In this tree, enter_critical_section() maps to irqsave() when CONFIG_IRQCOUNT is off (os/include/tinyara/irq.h:142-145), so this change widens the IRQ-disabled region on non-SMP builds too. The widened region now includes task_flushstreams(), binary_manager_remove_binlist(), group_leave(), and sig_cleanup(). group_leave() explicitly documents that no special precautions are required (os/kernel/group/group_leave.c:356-358) and can cascade into group_release() and binfmt_exit() (os/kernel/group/group_leave.c:404-412), which is far heavier than the original SMP-only signal/wakeup protection.
Impact
- Exit now runs with interrupts disabled across stream flushing and task-group release paths that were previously outside the critical section on non-SMP builds.
- This increases latency and watchdog risk in a hot kernel teardown path.
- It also changes locking expectations around teardown helpers that were not designed to run inside a global IRQ-disabled window, creating new regression risk unrelated to the original race.
Evidence
os/kernel/task/task_exithook.c:616— unconditionalenter_critical_section().os/kernel/task/task_exithook.c:645/650/658/664— stream flushing, binary-list removal, group teardown, and signal cleanup all now run before interrupts are restored.os/include/tinyara/irq.h:144— on common builds this is justirqsave(), not an SMP-only helper.os/kernel/group/group_leave.c:404and:412— the same IRQ-disabled region can reachgroup_release()andbinfmt_exit().
Required Fix
Files to edit:
os/kernel/task/task_exithook.c—task_exithook
Change outline:
- Restore the earlier behavior where the wider teardown path is not covered by a global critical section on non-SMP builds.
- Keep only the minimum IRQ-off region needed for the atomic exit-processing flag update and the specific signal/wakeup state that truly requires critical protection.
- Move
task_flushstreams(),binary_manager_remove_binlist(), andgroup_leave()back outside the IRQ-disabled region.
Example patch:
diff --git a/os/kernel/task/task_exithook.c b/os/kernel/task/task_exithook.c
@@
- flags = enter_critical_section();
- task_signalparent(tcb, status);
- task_exitwakeup(tcb, status);
- if (!nonblocking) {
- task_flushstreams(tcb);
- }
-#if defined(CONFIG_BINARY_MANAGER) && defined(CONFIG_APP_BINARY_SEPARATION)
- binary_manager_remove_binlist(tcb);
-#endif
-#ifdef HAVE_TASK_GROUP
- group_leave(tcb);
-#endif
-#ifndef CONFIG_DISABLE_SIGNALS
- sig_cleanup(tcb);
-#endif
- leave_critical_section(flags);
+ flags = enter_critical_section();
+ task_signalparent(tcb, status);
+ task_exitwakeup(tcb, status);
+#ifndef CONFIG_DISABLE_SIGNALS
+ sig_cleanup(tcb);
+#endif
+ leave_critical_section(flags);
+
+ if (!nonblocking) {
+ task_flushstreams(tcb);
+ }
+#if defined(CONFIG_BINARY_MANAGER) && defined(CONFIG_APP_BINARY_SEPARATION)
+ binary_manager_remove_binlist(tcb);
+#endif
+#ifdef HAVE_TASK_GROUP
+ group_leave(tcb);
+#endifInference:
- The exact minimum protected set may need a quick re-check against the signal/waitpid contract, but the current all-the-way-to-
group_leave()region is too broad for the problem being solved.
Validation Method
- Build at least one UP configuration and one SMP configuration after narrowing the critical section.
- Re-run the reported exit stress scenario and confirm both the original double-exit assert and any new watchdog / latency regressions stay absent.
- Check that exit-path logging shows interrupts are restored before
group_leave()/ binary teardown runs.
🟡 Nice-to-Have Improvements
1. Add a targeted regression test or stress harness for the duplicate-exit race
| Item | Details |
|---|---|
| Location | os/kernel/task/task_exithook.c:549 |
| Severity | Medium |
| Type | Validation |
Problem
This PR fixes a timing-sensitive SMP race, but the change does not add any deterministic regression coverage for the failure it describes.
Impact
- Future edits can reopen the same window without an obvious functional diff.
- Static review can catch obvious ordering mistakes, but it is weak against scheduler-sensitive regressions.
Recommended Action
- Add an SMP-focused stress test or kernel self-test that repeatedly forces concurrent exit/terminate handling for the same task.
- Capture a simple pass/fail signal such as "cleanup body entered once" or "no duplicate SIGCHLD / waitpid wakeup observed".
Expected Output
- A repeatable stress artifact or test log showing the race scenario can be exercised many times without the original assert or duplicate teardown symptoms.
✅ Notable Improvements
Notable Improvements
✔ The PR is addressing a real kernel-level race instead of a cosmetic issue
- The problem statement matches the code structure in the merge base: the old check was at the start of
task_exithook()and the flag was only set at the very end. - Focusing on the
TCB_FLAG_EXIT_PROCESSINGcontract is the right place to harden this path.
✔ The unloaded-binary termination path is now at least aligned with the same exit-processing flag
os/kernel/task/task_terminate_unloaded.c:148-152now marksTCB_FLAG_EXIT_PROCESSING, which is directionally consistent with the normal exit path.- That keeps the dedicated unload path closer to the same "single teardown" invariant as the regular task termination flow.
🧾 Final Assessment
Must-Fix Summary
- The main race is still present because the flag check and the flag set are still separated by a long unprotected window.
- The patch also broadens the IRQ-disabled region around exit teardown in a way that can create new regressions outside the original bug.
Nice-to-Have Summary
- Add a targeted SMP regression/stress test once the ordering fix is corrected.
Residual Risk / Test Gap
- Static review only; I did not run a kernel build or board-level stress workload in this review pass.
- Any conclusion about timing behavior is based on the checked-out code paths and surrounding TizenRT teardown contracts, not on live runtime traces.
📌 Final Verdict
❗ Request Changes
The PR identifies the right race, but the current patch does not actually make the exit-processing guard atomic and it widens the IRQ-disabled teardown window in the process. I would not merge this as-is; the flag publication needs to move into the first guarded block, and the broad critical section needs to be narrowed before re-review.
120574f to
660b442
Compare
|
Yes, we have checked this, there should be a single task termination handler. Similar migration was done on Nuttx and we have found the commit: We need to evaluate all the related code changes and then take a decision which will take time to validate. |
Thank you for checking. I think we need to check task_terminate, task_exithook, task_delete, pthread_exit, and exit. |
| * processing. | ||
| */ | ||
|
|
||
| tcb->flags |= TCB_FLAG_EXIT_PROCESSING; |
There was a problem hiding this comment.
if Setting it here, We must ensure that the exit hook sequent is performed.
but I think we should consider the case below.
- the thread is entering terminate itself and pass the line.
- the context switching is occured.
- other thread is call task_delete or pthread_cancel.
-> TCB_FLAG_EXIT_PROCESSING is setted, skip the task_exithook.
-> pull it out of readytorun queue.
=> the task_exithook sequence will not perfored.
There was a problem hiding this comment.
I agree with your point. For now, we have moved the setting of TCB_FLAG_EXIT_PROCESSING to the lower critical_section() of task_exithook(), this will ensure that the exithook sequence is performed at least once.
But it just reduces the race window and not completely closes it. We have tested it for the issue observed, currently it is working fine.
660b442 to
301b6d9
Compare
Description: This patch reduces the window for multiple exit of task via task_exithook(), in SMP case. ---------------------------------------------------------------------- AS-IS: TCB_FLAG_EXIT_PROCESSING is bit set in tcb->flag to prevent multiple exit. But the checker and setter code are not protected and are being called at different places. Checker: at the start of the function to return early if the flag is set. Setter: flag is set at the end of task_exithook(). This difference in exection of checker and setter opens a window of race where one process sets it, but another process has already passed the checker point. In this race, both the processes may attempt to kill the same task, which leads to a multiple-exit scenario. ---------------------------------------------------------------------- LOGS: log_csi_config: enable: 0 wifi_csi_mq_close: MQ cltask_sigchild: task: csifw_task_Main, tg_nchildren = 1 task_sigchild: task: task_manager, tg_nchildren = 0 security level: 0 =========================================================== Assertion details =========================================================== Assertion failed CPU0 at file: task/task_exithook.c line 339 task: task_manager pid: 17 print_assert_detail: Assert location (PC) : 0x0e024eff check_assert_location: Code asserted in normal thread! =========================================================== Asserted task's stack details =========================================================== check_sp_corruption: Current SP is User Thread SP: 60154490 ---------------------------------------------------------------------- Fix: We move the setter code to the beginning of critical section of task_exithook() and protect the checker within critical section to ensure atomicity. ---------------------------------------------------------------------- Signed-off-by: Rishabh Singh <ris.singh@samsung.com>
301b6d9 to
871eb2a
Compare
The nuttx changes is including task_recover into enter_critical_section. Could you please check we also need to change? |
Description:
This patch prevents multiple exit of task via task_exithook(), in SMP case.
AS-IS:
TCB_FLAG_EXIT_PROCESSING is bit set in tcb->flag to prevent multiple exit. But the checker and setter code are not protected and are being called at different places.
Checker: at the start of the function to return early if the flag is set. Setter: flag is set at the end of task_exithook(). This difference in exection of checker and setter opens a window of race where one process sets it, but another process has already passed the checker point.
In this race, both the processes may attempt to kill the same task, which leads to a multiple-exit scenario.
LOGS:
log_csi_config: enable: 0
wifi_csi_mq_close: MQ cltask_sigchild: task: csifw_task_Main, tg_nchildren = 1 task_sigchild: task: task_manager, tg_nchildren = 0
security level: 0
=========================================================== Assertion details
=========================================================== Assertion failed CPU0 at file: task/task_exithook.c line 339 task: task_manager pid: 17 print_assert_detail: Assert location (PC) : 0x0e024eff check_assert_location: Code asserted in normal thread! =========================================================== Asserted task's stack details
=========================================================== check_sp_corruption: Current SP is User Thread SP: 60154490
Fix: We move the setter code to the beginning of task_exithook() (just after the checker) and protect them within critical section to ensure atomicity.