
Composition test failures have gotten persistent #86

Closed
dhood opened this issue Feb 26, 2018 · 13 comments
Labels
in progress Actively being worked on (Kanban column)

Comments

@dhood
Member

dhood commented Feb 26, 2018

Since around https://ci.ros2.org/job/nightly_linux-aarch64_debug/329/ (Jan 11 2018) the composition tests on debug ARM jobs haven't been passing in nightlies even with 10 test reruns.

The composition tests are known to be flaky, but there might be something deeper going on here if they can't even pass after 10 repeats, especially since the test behaviour changed so drastically around that time.

@nuclearsandwich
Member

If someone wants to investigate this using the packet.net machine for testing and doesn't have ssh access to it, let me know.

@dhood
Member Author

dhood commented Feb 26, 2018

From a quick check of what the tests output when they time out, it seems regex matching might be part of the issue:

https://ci.ros2.org/job/ci_linux-aarch64/1095/testReport/junit/(root)/projectroot/test_api_srv_composition_client_first__rmw_fastrtps_cpp/

../../../../build/composition/test_api_srv_composition_client_first__rmw_fastrtps_cpp_Debug.py (test_api_composition) pid 40715: ['/home/rosbuild/ci_scripts/ws/build/composition/api_composition'] (all > console, InMemoryHandler: test_api_composition)
(load_client_component) pid 40716: ['/home/rosbuild/ci_scripts/ws/build/composition/api_composition_cli', 'composition', 'composition::Client'] (stderr > stdout, all > console)
(load_server_component) pid 40717: ['/home/rosbuild/ci_scripts/ws/build/composition/api_composition_cli', 'composition', 'composition::Server', '--delay', '5000'] (stderr > stdout, all > console)
[load_client_component] [INFO] [api_composition_cli]: Sending request...
[load_client_component] [INFO] [api_composition_cli]: Waiting for response...
[test_api_composition] [INFO] [api_composition]: Load library /home/rosbuild/ci_scripts/ws/build/composition/libclient_component.so
[test_api_composition] [INFO] [api_composition]: Instantiate class composition::Client
[load_client_component] [INFO] [api_composition_cli]: Result of load_node: success = true
(load_client_component) rc 0
[test_api_composition] [INFO] [Client]: Service not available after waiting
[test_api_composition] [INFO] [api_composition]: Load library /home/rosbuild/ci_scripts/ws/build/composition/libclient_component.so
[test_api_composition] [INFO] [api_composition]: Instantiate class composition::Client
[load_server_component] [INFO] [api_composition_cli]: Sending request...
[load_server_component] [INFO] [api_composition_cli]: Waiting for response...
[load_server_component] [INFO] [api_composition_cli]: Result of load_node: success = true
(load_server_component) rc 0
[test_api_composition] [INFO] [api_composition]: Load library /home/rosbuild/ci_scripts/ws/build/composition/libserver_component.so
[test_api_composition] [INFO] [api_composition]: Instantiate class composition::Server
[test_api_composition] [INFO] [Client]: Service not available after waiting
[test_api_composition] [INFO] [Server]: Incoming request: [a: 2, b: 3]
[test_api_composition] [INFO] [Client]: Got result: [5]
[test_api_composition] signal_handler(2)
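
For reference, the matching these launch tests do is roughly like the sketch below (the pattern and function here are hypothetical, not the actual test code): each expected line becomes a regex, and the test only passes once every pattern has matched a line of the captured output, so an overly strict pattern can keep the test timing out even though the expected result was printed.

import re

# Hypothetical expected pattern for the client-first service test; the real
# test defines its own expected output (and may be stricter about prefixes
# or ordering).
EXPECTED_PATTERNS = [
    r"Got result: \[5\]",
]

def output_satisfied(captured_output, patterns=EXPECTED_PATTERNS):
    # The test only passes once every expected pattern has matched some line
    # of the captured console output; if a pattern never matches, the launch
    # times out and the test is reported as failed.
    lines = captured_output.splitlines()
    return all(any(re.search(p, line) for line in lines) for p in patterns)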

But more concerning is that the talker might not start in some cases: https://ci.ros2.org/job/ci_linux-aarch64/1095/testReport/junit/(root)/projectroot/test_api_pubsub_composition__rmw_fastrtps_cpp/

(load_talker_component) pid 40660: ['/home/rosbuild/ci_scripts/ws/build/composition/api_composition_cli', 'composition', 'composition::Talker'] (stderr > stdout, all > console)
(load_listener_component) pid 40661: ['/home/rosbuild/ci_scripts/ws/build/composition/api_composition_cli', 'composition', 'composition::Listener'] (stderr > stdout, all > console)
[load_talker_component] [INFO] [api_composition_cli]: Sending request...
[load_talker_component] [INFO] [api_composition_cli]: Waiting for response...
[load_listener_component] [INFO] [api_composition_cli]: Sending request...
[load_listener_component] [INFO] [api_composition_cli]: Waiting for response...
[load_talker_component] [INFO] [api_composition_cli]: Result of load_node: success = true
[load_listener_component] [INFO] [api_composition_cli]: Result of load_node: success = true
<no prints before timeout>

There's a chance that the connection has been made but the output just isn't being flushed. I'll keep this investigation as a background task.
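
As a local repro idea outside the component-loading machinery, something like the sketch below (rclpy instead of the C++ components, arbitrary topic name, assumes a recent rclpy) would at least show whether pub/sub between two nodes composed in one process connects and whether its output gets flushed; it's only an illustrative sketch, not the actual test.

import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String

class Talker(Node):
    def __init__(self):
        super().__init__('talker')
        # Assumes a recent rclpy where an int history depth is accepted as the QoS argument.
        self.pub = self.create_publisher(String, 'chatter', 10)
        self.count = 0
        self.timer = self.create_timer(1.0, self.on_timer)

    def on_timer(self):
        msg = String()
        msg.data = 'Hello World: {}'.format(self.count)
        self.count += 1
        self.get_logger().info('Publishing: "{}"'.format(msg.data))
        self.pub.publish(msg)

class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        self.sub = self.create_subscription(String, 'chatter', self.on_msg, 10)

    def on_msg(self, msg):
        self.get_logger().info('I heard: "{}"'.format(msg.data))

def main():
    rclpy.init()
    executor = SingleThreadedExecutor()
    talker = Talker()
    listener = Listener()
    executor.add_node(talker)
    executor.add_node(listener)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()

if __name__ == '__main__':
    main()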

@dhood
Member Author

dhood commented Mar 10, 2018

Also not passing on Windows release jobs even with 10 retries: https://ci.ros2.org/view/nightly/job/nightly_win_rel/746/

or Windows debug: https://ci.ros2.org/view/nightly/job/nightly_win_deb/808/#showFailuresLink

or OSX debug: https://ci.ros2.org/view/nightly/job/nightly_osx_debug/781/testReport/junit/(root)/projectroot/test_api_srv_composition_client_first__rmw_fastrtps_cpp/

or OSX release: https://ci.ros2.org/view/nightly/job/nightly_osx_release/779/

Looking like a regression.

Here's another job with the output captured, and things don't seem to be connecting: https://ci.ros2.org/job/ci_linux-aarch64/1131/ (the listener doesn't connect, and the clients don't find the server)

@dhood dhood changed the title aarch64 debug composition test failures Composition test failures have gotten persistent Mar 10, 2018
@dhood
Member Author

dhood commented Mar 12, 2018

I inferred the wrong date for when this started happening. Since around Jan 11 they have been failing as individual tests because of ros2/demos#223, but the date that the tests started persistently failing in general was Jan 5:

  • first OSX debug failures: https://ci.ros2.org/view/nightly/job/nightly_osx_debug/717/
  • first ARM debug failures: https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_debug/323/
  • first Windows debug failures: https://ci.ros2.org/view/nightly/job/nightly_win_deb/742/

These are the PRs merged around that time: https://github.com/search?p=1&q=user%3Aros2+user%3Aament+merged%3A2018-01-03..2018-01-06&type=Issues&utf8=%E2%9C%93 (no commits on fastrtps from what I can see). The only change to rmw_fastrtps was https://github.com/ros2/rmw_fastrtps/pull/183/files

@dhood
Member Author

dhood commented Mar 13, 2018

> no commits on fastrtps from what I can see

I was going off the dates of the commits, which isn't sufficient for their workflow. Between Jan 4 and Jan 5 all of these commits changed in fastrtps: eProsima/Fast-DDS@adb0014...3e225d6

Reverting fastrtps back to what was used on Jan 4 "fixes" the composition tests (still using --retest-until-pass, so they might still be flaky): https://ci.ros2.org/job/ci_linux-aarch64/1179/

The commit hashes to compare were inferred from the difference in the ros2.repos files between the nightly builds (output by vcs export --exact in the logs): https://www.diffchecker.com/sjYcPvHB
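
For posterity, the comparison amounts to something like the following sketch (the file names are hypothetical; each file is the vcs export --exact output from one nightly): load both exports and print the repositories whose pinned versions differ.

import yaml  # PyYAML

def load_versions(path):
    # 'vcs export --exact' output is YAML with a top-level 'repositories' mapping.
    with open(path) as f:
        repos = yaml.safe_load(f)['repositories']
    return {name: entry.get('version') for name, entry in repos.items()}

# Hypothetical file names for the exports captured from the two nightlies.
before = load_versions('ros2.repos.jan4')
after = load_versions('ros2.repos.jan5')

for name in sorted(set(before) | set(after)):
    if before.get(name) != after.get(name):
        print('{}: {} -> {}'.format(name, before.get(name), after.get(name)))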

I will try to narrow it down a bit further and then notify them of the regression.

@dhood dhood added the in progress Actively being worked on (Kanban column) label Mar 13, 2018
@dhood
Member Author

dhood commented Mar 16, 2018

I've narrowed it down to changes in this range: https://github.com/eProsima/Fast-RTPS/compare/adb0014...f67cb58?w=1 (https://ci.ros2.org/job/ci_linux-aarch64/1179/ vs https://ci.ros2.org/job/ci_linux-aarch64/1180/)

I opened eProsima/Fast-DDS#200

Also for posterity: I used the Jan 4 fastrtps to check if ros2/ros1_bridge#105 or #98 were caused by the same regression, and they both still occur with the Jan 4 fastrtps. (Bridge tested locally using https://ci.ros2.org/view/packaging/job/packaging_linux/930/, weak nodes tested with https://ci.ros2.org/job/ci_linux/4098/)

@dhood
Member Author

dhood commented Mar 20, 2018

There's an open PR on fastrtps that appears to resolve the deadlocking: using eProsima/Fast-DDS#192, the tests passed 50 times in a row on ARM debug.
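
For anyone wanting to reproduce that kind of stress run locally, the idea is simply to repeat the one test until it fails; the sketch below uses a placeholder command rather than the actual CI mechanism.

import subprocess
import sys

# Placeholder command: substitute the actual composition test invocation.
TEST_CMD = ['python3', 'test_api_pubsub_composition__rmw_fastrtps_cpp_Debug.py']
RUNS = 50

for i in range(RUNS):
    # Run the test once; a non-zero return code counts as a failure.
    result = subprocess.run(TEST_CMD)
    if result.returncode != 0:
        print('Failed on run {} of {}'.format(i + 1, RUNS))
        sys.exit(1)
print('Passed {} consecutive runs'.format(RUNS))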

@dhood
Member Author

dhood commented Mar 20, 2018

Full CI testing on that fastrtps PR in case it introduced any other regressions:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@dhood
Member Author

dhood commented Mar 21, 2018

Note to self: that PR does not resolve flakiness in the bridge tests: https://ci.ros2.org/job/ci_packaging_linux/67/testReport/junit/(root)/projectroot/test_dynamic_bridge__rmw_fastrtps_cpp/ (ROS 2 listener still doesn't receive anything)

@dhood
Member Author

dhood commented Apr 3, 2018

I asked eProsima for an update on eProsima/Fast-DDS#192 and they said they'd look at it this week.

@dhood
Member Author

dhood commented Apr 3, 2018

eProsima have proposed a subset of the fixes in eProsima/Fast-DDS@17e717c and asked us to test it.

Full CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Repeating composition tests on ARM debug: Build Status

@nuclearsandwich
Member

nuclearsandwich commented Apr 4, 2018

edit: wrong tab. ☕ 😵

@dhood
Member Author

dhood commented Apr 10, 2018

eProsima commented on eProsima/Fast-DDS#200 that they merged the branch with the fix.

@ros2/team Composition failures are no longer expected in CI on any platform.
