Skip to content

[Kilted] Don't pause recorder after stop#2410

Closed
brow1633 wants to merge 1 commit intoros2:kiltedfrom
brow1633:kilted
Closed

[Kilted] Don't pause recorder after stop#2410
brow1633 wants to merge 1 commit intoros2:kiltedfrom
brow1633:kilted

Conversation

@brow1633
Copy link
Copy Markdown

@brow1633 brow1633 commented Apr 22, 2026

Description

Removes pause from stop, as in #2313 from rolling.

Fixes # 2409

Is this user-facing behavior change?

Yes, but should fix a regression.

Did you use Generative AI?

Claude Opus 4.6 assisted with triage.

Additional Information

Can be back ported to Jazzy as well.

Signed-off-by: Ethan Brown <ebrown@overland.ai>
Copy link
Copy Markdown
Contributor

@MichaelOrlov MichaelOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pause was introduced on purpose to mitigate undefined behaviour.

The problem is that the subscription's callbacks are propagated to the executor and may still be in the executor's queue after finishing the stop() operation.
Using a pause was to ensure that even if callbacks from deleted subscriptions would be called, they wouldn't propagate and call the deleted writer.
Of course pause is not completely fixing the undefined behaviour, but removing it will make it worse and more often will cause unexpected crashes.

The proper fix is to use a new API disable_callbacks as in the #2313, but unfortunately, we can't backport it.

@brow1633
Copy link
Copy Markdown
Author

This pause was introduced on purpose to mitigate undefined behaviour.

The problem is that the subscription's callbacks are propagated to the executor and may still be in the executor's queue after finishing the stop() operation. Using a pause was to ensure that even if callbacks from deleted subscriptions would be called, they wouldn't propagate and call the deleted writer. Of course pause is not completely fixing the undefined behaviour, but removing it will make it worse and more often will cause unexpected crashes.

The proper fix is to use a new API disable_callbacks as in the #2313, but unfortunately, we can't backport it.

Ah, understood. In that case, can we reset the paused_ state when calling record()? Having pause here (and a lack of a reset in record()) changed behavior. More info: #2409

@MichaelOrlov
Copy link
Copy Markdown
Contributor

MichaelOrlov commented Apr 23, 2026

As regards

In that case, can we reset the paused_ state when calling record()?

Unfortunately, not because in this case we will break another feature start-paused. We do have a dedicated CLI option --start-paused initialized here

paused_(record_options.start_paused),

The workaround would be to call resume() API or service call explicitly after the stop() and before the next record() API calls.

Update: or even better to call resume() API after the second record() call to avoid undefined behavior.

@brow1633
Copy link
Copy Markdown
Author

brow1633 commented Apr 23, 2026 via email

@MichaelOrlov
Copy link
Copy Markdown
Contributor

@brow1633

As an anecdote, due this change, we lost several terabytes of data in a single day.

Sorry about this. To make a proper fix, I had to make significant changes on a RCLCPP layer. It was a fundamental problem in the ROS 2 executor design.

@MichaelOrlov
Copy link
Copy Markdown
Contributor

@brow1633 I created a follow-up PR #2412 to outline this behavior as a known limitation in the README file.

@MichaelOrlov
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants