[jazzy] Reduce flakiness in rosbag2 recorder end-to-end tests (backport #2370) #2415

Open
mergify[bot] wants to merge 1 commit into jazzy from mergify/bp/jazzy/pr-2370

mergify[bot] commented Apr 23, 2026

Description

Reduced flakiness in rosbag2 recorder tests by strengthening PublicationManager::wait_for_matched().

Please see the root-cause analysis (RCA) in the related issue #2369 (comment).

The helper PublicationManager::wait_for_matched() now waits on ROS graph events rather than relying solely on fixed sleep polling. It still preserves the existing API and boolean return behavior, but uses a hybrid loop:
• check the publisher’s current subscription count
• wait for a graph change
• re-check periodically with a short bounded wait slice
This improves readiness detection for recorder subscriptions that appear asynchronously during topic discovery, while still handling cases where graph notifications are delayed or not sufficient on their own.
No test call sites were changed. Existing recorder tests continue to use pub_manager.wait_for_matched(...), but they now benefit from the stronger synchronization automatically.

Is this a user-facing behavior change?

No.

Did you use Generative AI?

Yes. Codex gpt-5.4

Additional Information

Can be backported.


This is an automatic backport of pull request #2370 done by [Mergify](https://mergify.com).

* Reduce flakiness in rosbag2 recorder end-to-end tests

Reduced flakiness in rosbag2 recorder tests by strengthening
PublicationManager::wait_for_matched().

The helper `PublicationManager::wait_for_matched()` now waits on ROS
graph events instead of relying only on fixed sleep-polling.
It still preserves the existing API and boolean return behavior, but
uses a hybrid loop:
• check the publisher’s current subscription count
• wait for a graph change
• re-check periodically with a short bounded wait slice
This improves readiness detection for recorder subscriptions that appear
asynchronously during topic discovery, while still handling cases where
graph notifications are delayed or not sufficient on their own.
No test call sites were changed. Existing recorder tests continue to use
pub_manager.wait_for_matched(...), but they now benefit from the
stronger synchronization automatically.

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

* Fix a possible negative remaining time in wait_for_matched(..)

Refactor logic to check deadline before calculating remaining time

Signed-off-by: Michael Orlov <morlovmr@gmail.com>

---------

Signed-off-by: Michael Orlov <morlovmr@gmail.com>
(cherry picked from commit 86fbdb7)
@MichaelOrlov changed the title from "Reduce flakiness in rosbag2 recorder end-to-end tests (backport #2370)" to "[jazzy] Reduce flakiness in rosbag2 recorder end-to-end tests (backport #2370)" on Apr 23, 2026
Contributor

@MichaelOrlov MichaelOrlov left a comment

LGTM with green CI.

@MichaelOrlov
Contributor

Pulls: #2415
Gist: https://gist.githubusercontent.com/MichaelOrlov/813fadd393a87f4c9079851bb6963b42/raw/2606e75177e3d07f3c2cc156001ac9ab67f5e432/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_transport
TEST args: --packages-above rosbag2_test_common rosbag2_transport
ROS Distro: jazzy
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/19064

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@MichaelOrlov
Contributor

The RHEL build was automatically restarted and passed (green).

  • Linux-rhel Build Status

@MichaelOrlov
Contributor

@mjcarroll @cottsay The Windows build on Jazzy consistently fails in multiple jobs, with an error message saying that rmw_connextdds can't serialize the message.
It seems the Windows CI build has been broken on the baseline for a while (I have seen this failure for at least the last 3-4 days).
Is this a known issue?

[ RUN      ] RosBag2PlayTestFixture.burst_with_false_preconditions
[ERROR] [1776977464.035681700] [rmw_connextdds]: Failed to serialize rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_ sample: Not enough memory in the buffer stream
[ERROR] [1776977464.035968800] [rmw_connextdds]: failed to write message to DDS
[ERROR] [1776977464.036026000] [rmw_connextdds]: failed to publish discovery sample
[ERROR] [1776977464.036058600] [rmw_connextdds]: failed to publish discovery sample
[ERROR] [1776977464.036081500] [rmw_connextdds]: failed to update graph for node
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE|PROCESS SAMPLE] PRESWriterHistoryDriver_serializeSample:FAILED TO SERIALIZE | Sample in topic ros_discovery_info with type 'rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_' and encapsulation ID 1
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE] PRESWriterHistoryDriver_initializeSample:FAILED TO SERIALIZE | Sample with sequence number (0, 1) in encapsulation 1.
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE] WriterHistoryMemoryPlugin_addEntryToSession:FAILED TO INITIALIZE | Session sample
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE] WriterHistoryMemoryPlugin_addEntryToSessions:FAILED TO ADD | Entry to session with ID 0
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE] WriterHistoryMemoryPlugin_getEntry:FAILED TO ADD | Virtual sample to sessions
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE|ADD TO WRITER QUEUE] WriterHistoryMemoryPlugin_addSample:FAILED TO GET | Entry
ERROR [0x0101CB47,0xFDF91636,0xE3F71A8B:0x80000003{Entity=DW,Topic=ros_discovery_info,Type=rmw_dds_common::msg::dds_::ParticipantEntitiesInfo_,Domain=147}|WRITE] PRESPsWriter_writeInternal:!collator addWrite
unknown file: error: C++ exception with description "failed to initialize rcl node: failed to write message to DDS, at C:\ci\ws\src\ros2\rmw_connextdds\rmw_connextdds_common\src\ndds\dds_api_ndds.cpp:778, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:253" thrown in the test fixture's constructor.

@MichaelOrlov
Contributor

@mjcarroll @cottsay @claraberendsen friendly ping here for comments ^^^.

@MichaelOrlov
Contributor

Attempting to run the Windows CI build one more time:

  • Windows Build Status
