Composition test failures have gotten persistent #86
If someone wants to investigate this using the packet.net machine for testing and doesn't have SSH access to it, let me know.
From a quick check of what the tests are outputting when they time out, it seems regex matching might be part of the issue.
But more concerning is that the talker might not start in some cases: https://ci.ros2.org/job/ci_linux-aarch64/1095/testReport/junit/(root)/projectroot/test_api_pubsub_composition__rmw_fastrtps_cpp/
There's also a chance that the connection has been made but the output just isn't flushed. I'll keep this investigation as a background task (a rough sketch of the output-matching step is below).
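For context on why both symptoms look the same from the test's point of view: the composition tests pass only if the launched processes print output matching a set of expected regexes before a timeout. The sketch below is illustrative, not the actual launch_testing code; the command, the regex, and the 30 s timeout are all made-up stand-ins. If the talker never starts, or if it prints but its stdout stays buffered, the match never happens and the test times out either way.

```python
import re
import subprocess

# Placeholder command and pattern; the real tests drive the composition demos
# through the launch/launch_testing machinery with their own expected regexes.
cmd = ["api_composition"]                         # hypothetical stand-in executable
expected = re.compile(r"Published message #\d+")  # hypothetical pattern

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, universal_newlines=True)
try:
    # Collect whatever the process prints for up to 30 seconds.
    output, _ = proc.communicate(timeout=30)
except subprocess.TimeoutExpired:
    proc.kill()
    output, _ = proc.communicate()

if expected.search(output):
    print("expected output matched")
else:
    # Either the talker never published, or it did publish and the text never
    # reached us (e.g. stdout was block-buffered and not flushed in time).
    print("no match before timeout; captured output:\n" + output)
```

A harness like this can't distinguish "process never connected" from "process connected but its output stayed in a buffer", which is why the jobs with output captured are useful here.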
Also not passing on Windows release jobs even with 10 retries (https://ci.ros2.org/view/nightly/job/nightly_win_rel/746/) or on OS X release. This is looking like a regression. Here is another job with output captured, and the processes don't seem to be connecting.
I inferred the wrong date for when this started happening. Since around Jan 11 they have been failing as individual tests because of ros2/demos#223, but the date the tests started persistently failing in general was Jan 5 (first OS X debug failures: https://ci.ros2.org/view/nightly/job/nightly_osx_debug/717/, first ARM debug failures: https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_debug/323/, first Windows debug failures: https://ci.ros2.org/view/nightly/job/nightly_win_deb/742/). These are the PRs from around that time: https://github.com/search?p=1&q=user%3Aros2+user%3Aament+merged%3A2018-01-03..2018-01-06&type=Issues&utf8=%E2%9C%93 (no commits on fastrtps from what I can see). The only change to rmw_fastrtps was https://github.com/ros2/rmw_fastrtps/pull/183/files.
I was going off the dates of the commits, which isn't sufficient for their workflow. Between Jan 4 and Jan 5 all of these commits changed in fastrtps: eProsima/Fast-DDS@adb0014...3e225d6. Reverting fastrtps back to what was used on Jan 4 "fixes" the composition tests (still using …). The commit hashes to compare between were inferred from the difference of the ros2.repos files between the nightly builds (output using …); see the sketch below. I will try to narrow it down a bit further and then notify them of the regression.
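For reference, the repos comparison boils down to loading the ros2.repos file archived by each nightly and reporting the repositories whose pinned versions differ. A minimal sketch, assuming the two files have been saved locally (the filenames are placeholders) and follow the usual `repositories: <name>: {type, url, version}` layout:

```python
import yaml  # PyYAML

# Placeholder filenames; in practice these would be the ros2.repos files
# archived by the two nightly jobs being compared.
with open("ros2.repos.jan4") as f:
    old = yaml.safe_load(f)["repositories"]
with open("ros2.repos.jan5") as f:
    new = yaml.safe_load(f)["repositories"]

for name in sorted(set(old) | set(new)):
    old_ver = old.get(name, {}).get("version")
    new_ver = new.get(name, {}).get("version")
    if old_ver != new_ver:
        # e.g. eProsima/fastrtps: adb0014... -> 3e225d6...
        print("{}: {} -> {}".format(name, old_ver, new_ver))
```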
I've narrowed it down to changes in this range: https://github.com/eProsima/Fast-RTPS/compare/adb0014...f67cb58?w=1 (https://ci.ros2.org/job/ci_linux-aarch64/1179/ vs https://ci.ros2.org/job/ci_linux-aarch64/1180/). I opened eProsima/Fast-DDS#200. Also for posterity, I used the Jan 4 fastrtps to check whether ros2/ros1_bridge#105 or #98 were caused by the same regression, and they both still occur with the Jan 4 fastrtps (bridge tested locally using https://ci.ros2.org/view/packaging/job/packaging_linux/930/, weak nodes tested with https://ci.ros2.org/job/ci_linux/4098/).
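The narrowing above was done by comparing CI builds with different pinned fastrtps versions, not by bisecting locally, but for anyone reproducing this on the packet.net machine a standard bisection over the same range would look roughly like the sketch below. The checkout path and the test script are hypothetical; the only requirement is that the script exits non-zero when the composition test fails.

```python
import subprocess

repo = "src/eProsima/fastrtps"          # placeholder checkout path
good, bad = "adb0014", "f67cb58"        # range from the compare link above
test_cmd = "./run_composition_test.sh"  # hypothetical build-and-test script

# git bisect start takes the bad revision first, then the good one.
subprocess.run(["git", "-C", repo, "bisect", "start", bad, good], check=True)
# git bisect run rebuilds/tests each candidate commit and uses the exit code
# to decide which half of the range to keep.
subprocess.run(["git", "-C", repo, "bisect", "run", test_cmd], check=True)
subprocess.run(["git", "-C", repo, "bisect", "reset"], check=True)
```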
There's an open PR on fastrtps that appears to resolve the deadlocking.
Note to self: that PR does not resolve flakiness in the bridge tests: https://ci.ros2.org/job/ci_packaging_linux/67/testReport/junit/(root)/projectroot/test_dynamic_bridge__rmw_fastrtps_cpp/ (the ROS 2 listener still doesn't receive anything).
I asked eProsima for an update on eProsima/Fast-DDS#192 and they said they'd look at it this week.
eProsima have proposed a subset of the fixes in eProsima/Fast-DDS@17e717c and asked us to test it. Full CI:
edit: wrong tab. ☕ 😵
eProsima commented on eProsima/Fast-DDS#200 that they merged the branch with the fix. @ros2/team: composition failures are no longer expected in CI on any platform.
Since around https://ci.ros2.org/job/nightly_linux-aarch64_debug/329/ (Jan 11 2018) the composition tests on the debug ARM jobs haven't been passing in nightlies, even with 10 test reruns.
The composition tests are known to be flaky, but there might be something deeper going on here if they can't pass even after 10 repeats, and if the test behaviour changed so drastically around that time.
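As a rough sanity check on why repeated failure across reruns points past ordinary flakiness: if each attempt were an independent coin flip, even a badly flaky test should pass at least once in 10 tries. The per-run pass rates below are made-up illustrative numbers, not measurements from these jobs:

```python
# Chance that a test fails every one of 10 independent attempts when each
# attempt passes with probability p. The values of p are illustrative only.
for p in (0.9, 0.5, 0.2):
    print("per-run pass rate {:.0%}: P(fail all 10) = {:.2e}".format(p, (1 - p) ** 10))
```

Even with only a 20% per-run pass rate, failing all 10 attempts happens only about 11% of the time, so failing every nightly on multiple platforms looks much more like a deterministic regression than the usual flakiness.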