fix(action_status_bridge): retry deferred fault reports instead of dropping them#472
Merged
Merged
Conversation
…opping them When a failing action's terminal status reaches the bridge before the FaultManager report service is discovered (e.g. a latched transient_local status delivered during the startup discovery window), the fire-and-forget report() is dropped. The bridge previously committed last_reported_state_ unconditionally, so the dropped high-severity fault was never retried: the latched status is delivered to a subscription only once, and a rescan skips already-subscribed topics, so no second message ever re-drives the report. Track the latest observed state (desired) separately from what the FaultManager was actually told (reported), and commit the transition only once the report is deliverable (a null reporter is the unit-test seam; a real reporter must have a discovered service). reconcile_pending() re-attempts undelivered transitions on the existing rescan tick, so a report dropped during the discovery window is delivered as soon as the service appears. The "Action ... -> fault" log now fires only on actual delivery rather than after a silent drop. Closes #471
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a startup discovery race in ros2_medkit_action_status_bridge where a terminal (latched) ABORTED/CANCELED action status could be observed before the FaultManager report_fault service was discovered, causing the bridge to “commit” the failure locally even though FaultReporter dropped the report—permanently losing the fault.
Changes:
- Split “latest observed” per-action state (
desired_state_) from “successfully delivered to FaultManager” state (last_reported_state_), and only commit after delivery via a newreconcile()path. - Add
reconcile_pending()on the existing rescan tick to retry deferred raises/heals once the service becomes ready. - Add a regression unit test ensuring deferred faults remain pending (not committed/dropped) when the FaultManager service is unavailable; update README to document the behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/ros2_medkit_action_status_bridge/src/action_status_bridge_node.cpp |
Implements desired-vs-reported reconciliation and rescan-driven retries for deferred fault delivery. |
src/ros2_medkit_action_status_bridge/include/ros2_medkit_action_status_bridge/action_status_bridge_node.hpp |
Documents the new reconcile model and adds DesiredState / desired_state_ tracking. |
src/ros2_medkit_action_status_bridge/test/test_action_status_bridge.cpp |
Adds a regression test covering “service unavailable => fault stays pending” behavior. |
src/ros2_medkit_action_status_bridge/README.md |
Documents deferred delivery + retry semantics during startup discovery. |
SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueue flaked under AddressSanitizer: enqueue_event() spun for a fixed budget (20 x 10 ms) after publishing a FaultEvent, but under sanitizer slowdown on_fault_event - which snapshots the owning entity - could run after that budget. The test then cleared the cache before the snapshot was taken, so the replayed event omitted x-medkit and the assertion failed. Add SSEFaultHandler::events_received() (a monotonic processed-event counter, mirroring connected_clients()) and make enqueue_event() spin until the handler has actually consumed the event before returning. The entity snapshot is then always taken while the App is still in the cache, independent of execution speed. No product behavior change; entity resolution stays synchronous in on_fault_event.
mfaferek93
reviewed
Jun 23, 2026
…t retry timer A fault deferred because the FaultManager service was not yet discovered was only retried on the slow rescan tick (default 2s). The FaultManager captures the freeze-frame snapshot when it receives the report, so that latency makes the snapshot stale - a late fault is reported against context that has already moved on. Deliver as soon as the report channel is available instead. - Add a dedicated fast retry timer (retry_period_sec, default 50 ms), decoupled from the discovery rescan and armed only while a delivery is pending, so a deferred report lands within one short tick of the service appearing. The happy path (service discovered) still reports synchronously in the status callback. - Hold state_mutex_ across the reconcile gate, report() and commit so the status callback and the retry timer cannot double-report under a multi-threaded executor; report()/report_passed() are async so the section stays short. - Gate the prune-vanished heal on service readiness and warn when a vanished action's heal cannot be delivered, rather than dropping it silently. - Tests: deferred raise and heal are delivered once the service is ready, and the retry pass keeps retrying without dropping while the service is down. - Document the delivery-latency model and its level-triggered boundaries in the README; add retry_period_sec to config and the parameter table.
mfaferek93
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Summary
Two fixes (see commits):
1. action_status_bridge: don't lose / promptly deliver faults when the FaultManager service isn't discovered yet (closes #471)
The bridge silently lost the first fault of a failing action when the FaultManager's
report_faultservice was not yet discovered at the instant the bridge observed the terminal status (startup discovery window; exposed by DDS timing - failing on Humble, passing on Jazzy).Root cause: on the
healthy -> failedtransitionapply_message()called the fire-and-forgetreporter->report()and then committedlast_reported_state_ = kFailedunconditionally. If the service was not ready,FaultReporterdropped the request - but the bridge had already recorded the action as reported. The latched (transient_local) status is delivered to a subscription exactly once and a rescan skips already-subscribed topics, so the dropped report was never retried and the high-severity fault was lost permanently.Fix:
desired_state_) separately from what the FaultManager was told (last_reported_state_); deliver viareconcile()only when they differ and the report is deliverable (service ready; null reporter = unit-test seam), committing only after delivery.retry_period_sec, default 50 ms, armed only while pending), decoupled from the slow discovery rescan, so it lands within one short tick of the service appearing. The executor is deliberately not blocked. The README documents this delivery-latency model and its level-triggered boundaries.reconcile()holdsstate_mutex_across gate -> report -> commit so the status callback and retry timer can't double-report under a multi-threaded executor.FaultReporterfire-and-forget contract is unchanged.2. gateway: fix flaky SSE handler test under ASan
SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueueflaked under AddressSanitizer.enqueue_event()spun for a fixed budget after publishing; under sanitizer slowdown the handler'son_fault_event(which snapshots the owning entity) could run after that budget, so the test cleared the cache before the snapshot was taken and the replayed event omittedx-medkit. AddedSSEFaultHandler::events_received()(a monotonic counter, likeconnected_clients()) and madeenqueue_event()wait until the handler has actually consumed the event. Test-only timing fix; no product behavior change.Issue
Type
Testing
test_integrationlaunch test passes (raise + heal); linters +clang-tidyclean. CIPixi (humble)(originally failing) and Humble/Jazzy/Lyrical/TSan green on the prior commits.test_sse_fault_handler(19 tests) passes; the snapshot test passes 30/30 in a repeat loop;clang-format+clang-tidyclean on the changed files.Checklist