
Unreliable services #1152

Closed

daisukes opened this issue Jun 2, 2020 · 17 comments

@daisukes
Contributor

daisukes commented Jun 2, 2020

Bug report

Related to ros-navigation/navigation2#1788

Required Info:

  • Operating System:
    • Ubuntu 20.04 (Docker container)
  • Installation type:
    • Foxy, source build (4772a2a9b926126567008a0d83c5c7323d5c684b)
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • Navigation 2 (master, 13d6629356e897e1717f2fd9579d807bd3c8cfc6)

Steps to reproduce the issue

# Launch tb3 simulation and provide an initial location to start lifecycle
$ ros2 launch nav2_bringup tb3_simulation_launch.py

# Then call service like
$ ros2 service call /global_costmap/clear_entirely_global_costmap nav2_msgs/srv/ClearEntireCostmap "{}"

Expected behavior

Service calls should receive a response.

Actual behavior

Service calls sometimes receive no response.

Across the three environments described here, it sometimes succeeds and sometimes fails. I'm not sure, but it seems to depend on CPU speed. I found this issue while trying to use the clear costmap services.

Additional information

I have not checked all services in Navigation2, but at least some of the services do not respond.

@fujitatomoya
Collaborator

Service calls sometimes receive no response.

Could you define the problem you want to fix?

  • The service is available, but not responding?
  • It takes a long time to bring up the service?

You mention reliability, so: after making sure the service is available, it sometimes does not respond? Right?

@daisukes
Contributor Author

daisukes commented Jun 2, 2020

Yes. After making sure the service is available, it sometimes does not respond.

Specifically, I want the clear costmap services to work reliably.

@fujitatomoya
Collaborator

@daisukes

Thanks. I would like to see the stack trace or a core dump from when this hang actually happens, just to make sure whether it occurs in the ROS 2 core stack or in the user implementation (in this case, the clear costmap services). Can you reproduce it with another service example, such as https://github.com/ros2/demos/tree/master/demo_nodes_cpp/src/services?
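For reference, assuming the demo_nodes_cpp package is installed, the stock service demo can be exercised like this (a quick sanity check; not part of the original report):

# terminal 1: start the demo service server
$ ros2 run demo_nodes_cpp add_two_ints_server

# terminal 2: send a request and wait for the response
$ ros2 run demo_nodes_cpp add_two_ints_client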

@daisukes
Contributor Author

daisukes commented Jun 2, 2020

After making sure the navigation2 system was working, I ran the small program listed below to call a service (/global_costmap/clear_entirely_global_costmap), stopped it after waiting 15 seconds, and got the backtrace.

I also ran this snippet with a 3-second timeout hundreds of times. It succeeds randomly: about 5% of the time on my PC with a Core i7-9750H @ 2.60GHz, and about 70% of the time on a server with a Core i9-7920X @ 2.90GHz.

The demo example works perfectly.

$ ros2 run --prefix 'gdb -ex run --args' service_test service_test
...
Starting program: /home/developer/ros2_ws/install/service_test/lib/service_test/service_test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6760700 (LWP 15611)]
[New Thread 0x7ffff5f5f700 (LWP 15612)]
[New Thread 0x7ffff575e700 (LWP 15613)]
[New Thread 0x7ffff4f5d700 (LWP 15614)]
[New Thread 0x7ffff475c700 (LWP 15615)]
[New Thread 0x7ffff3e9b700 (LWP 15616)]
[New Thread 0x7ffff369a700 (LWP 15617)]
[New Thread 0x7ffff2d6c700 (LWP 15618)]
^C
Thread 1 "service_test" received signal SIGINT, Interrupt.
futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
    at ../sysdeps/nptl/futex-internal.h:320
320	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
    at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7fffffff45e0, clockid=<optimized out>, mutex=0x5555556d7dd8, cond=0x5555556d7e00)
    at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=0x5555556d7e00, mutex=0x5555556d7dd8, abstime=0x7fffffff45e0) at pthread_cond_wait.c:656
#3  0x00007ffff7cc6a7f in __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#4  0x00007ffff7ccc9c4 in std::cv_status std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#5  0x00007ffff7ccab80 in std::cv_status std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#6  0x00007ffff7d351fe in bool std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#7  0x00007ffff7d33d7d in bool std::condition_variable::wait_for<long, std::ratio<1l, 1000000000l>, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#8  0x00007ffff7d3373c in rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#9  0x00007ffff7cbf0fe in rclcpp::ClientBase::wait_for_service_nanoseconds(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
   from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#10 0x000055555556da4c in bool rclcpp::ClientBase::wait_for_service<long, std::ratio<1l, 1000l> >(std::chrono::duration<long, std::ratio<1l, 1000l> >) ()
#11 0x0000555555569ec0 in main ()
#include "nav2_costmap_2d/clear_costmap_service.hpp"

#include <chrono>
#include <cstdlib>
#include <memory>

using namespace std::chrono_literals;

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  std::shared_ptr<rclcpp::Node> node = rclcpp::Node::make_shared("clear_costmap_client");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>("/global_costmap/clear_entirely_global_costmap");
  auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();

  if (!client->wait_for_service()) { // change timeout for repeat
    if (!rclcpp::ok()) {
      RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Interrupted while waiting for the service. Exiting.");
      return 0;
    }
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "service not available.");
    return 0;
  }

  auto result = client->async_send_request(request);
  // Wait for the result.
  if (rclcpp::spin_until_future_complete(node, result) ==
    rclcpp::executor::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Succeeded");
  } else {
    RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed");
  }

  rclcpp::shutdown();
  return 0;
}
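For the repeated runs with a 3-second timeout mentioned above, presumably both blocking calls were bounded; a minimal sketch of that variant (the 3s values come from the comment above, the rest is illustrative):

  if (!client->wait_for_service(3s)) {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "service not available within 3s.");
    return 0;
  }

  auto result = client->async_send_request(request);
  // spin_until_future_complete also accepts a timeout; it returns TIMEOUT
  // if the response future is not ready within the given duration.
  if (rclcpp::spin_until_future_complete(node, result, 3s) ==
    rclcpp::executor::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Succeeded");
  } else {
    RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed or timed out");
  }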

@fujitatomoya
Collaborator

ros2/ros2#922 looks the same as this problem.

@hidmic
Contributor

hidmic commented Jun 2, 2020

Indeed. @daisukes, do you mind trying ros2/rmw_fastrtps#390? It should mitigate the issue significantly.

@daisukes
Contributor Author

daisukes commented Jun 2, 2020

@fujitatomoya @hidmic
Thank you for the information.
I tried that branch of rmw_fastrtps; however, it did not improve things much:
15 successes out of 213 trials, about 7%.

I found something, though I am still not sure how it relates to the problem.
It looks like whether calls succeed is determined when the node is created.
If you keep using a node that fails, I could not get a single success by calling wait_for_service multiple times, even after recreating the service clients (see the sketch below).
However, if you happen to get a node that works, it keeps succeeding.
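Roughly, the retry pattern looks like this sketch (illustrative only): recreating the client on the same node does not help, while a freshly created node may work.

  // retrying wait_for_service with new clients on the SAME node:
  // once the first wait fails, later attempts fail too.
  for (int i = 0; i < 5; ++i) {
    auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
      "/global_costmap/clear_entirely_global_costmap");
    if (client->wait_for_service(3s)) {
      RCLCPP_INFO(node->get_logger(), "service available on attempt %d", i);
      break;
    }
  }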

@hidmic
Contributor

hidmic commented Jun 2, 2020

FYI @MiguelCompany

@fujitatomoya
Collaborator

According to #1152 (comment), that is just the client stack: a thread waiting for the event that signals service availability. So from the requester's point of view, the service (apparently) IS NOT ready, and the rest is the trace triggered by the SIGINT signal.

Maybe we could check the service client/server debug logs to see the difference? Any thoughts?

@daisukes
Contributor Author

daisukes commented Jun 3, 2020

Requests sent from the client are sometimes taken by the server, but sometimes not.

planner_server_19898_1591165395685.log
service_test_20379_1591165406625.log

The service_test program creates a node and a client, then calls async_send_request without wait_for_service.
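In other words, the client side reduces to roughly this sketch (reconstructed for illustration; the 500 ms bound matches the wait timeout visible in the logs below):

  auto node = rclcpp::Node::make_shared("clear_costmap_client");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
    "/global_costmap/clear_entirely_global_costmap");
  auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();

  // send immediately, without calling wait_for_service() first
  auto result = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, result, 500ms) ==
    rclcpp::executor::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Succeeded");
  } else {
    RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed");
  }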

success case

client

[DEBUG] [1591165419.638036511] [rcl]: Client initialized
[DEBUG] [1591165419.638107866] [rcl]: Client sending service request
[DEBUG] [1591165419.638214040] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165419.638345176] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165419.638375476] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.638397164] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638415441] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638495935] [rcl]: Waiting with timeout: 0s + 499806226ns
[DEBUG] [1591165419.638511993] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721666039] [rcl]: Client in wait set is ready
[DEBUG] [1591165419.721823218] [rcl]: Client taking service response
[DEBUG] [1591165419.722010101] [rcl]: Client take response succeeded: true
[INFO] [1591165419.722545187] [rclcpp]: 0 Successed

server

[DEBUG] [1591165419.711462786] [rcl]: Subscription taking message
[DEBUG] [1591165419.711502554] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165419.711638796] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711654606] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.711665458] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.711718538] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711729317] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.720513910] [rcl]: Service in wait set is ready
[DEBUG] [1591165419.720638429] [rcl]: Service server taking service request
[DEBUG] [1591165419.720692414] [rcl]: Service take request succeeded: true
[INFO] [1591165419.720715644] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[DEBUG] [1591165419.721430447] [rcl]: Sending service response
[DEBUG] [1591165419.721573135] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721592867] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721614984] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.721699911] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721718055] [rcl]: Timeout calculated based on next scheduled timer: false

failure case

client

[DEBUG] [1591165420.264981778] [rcl]: Client initialized
[DEBUG] [1591165420.265018811] [rcl]: Client sending service request
[DEBUG] [1591165420.265092068] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165420.265168123] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165420.265181556] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.265192436] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265199093] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265296772] [rcl]: Waiting with timeout: 0s + 499847515ns
[DEBUG] [1591165420.265308321] [rcl]: Timeout calculated based on next scheduled timer: false
[ERROR] [1591165420.765323394] [rclcpp]: 0 Failed

server

[DEBUG] [1591165420.261266585] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.261353890] [rcl]: Subscription taking message
[DEBUG] [1591165420.261383022] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.261468070] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261477170] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.261484102] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.261507740] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261516860] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.281158735] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.281240421] [rcl]: Subscription taking message
[DEBUG] [1591165420.281267092] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.281324523] [rcl]: Waiting without timeout
[DEBUG] [1591165420.281331477] [rcl]: Timeout calculated based on next scheduled timer: false

@daisukes
Contributor Author

daisukes commented Jun 3, 2020

It works well if I use Fast-RTPS v1.10.0 instead of 2.0.x, which was changed this April:
ros2/ros2@c7c115a

@hidmic
Contributor

hidmic commented Jun 3, 2020

Requests sent from the client are sometimes taken by the server, but sometimes not.

Interesting.

The service_test program creates a node and a client, then calls async_send_request without wait_for_service.

I agree that a client hanging isn't a nice way to fail a service call, but if you don't wait for the service server, the chances of the response future ever becoming ready are pretty low.
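For illustration, a non-blocking readiness check before sending could look like this sketch (service_is_ready() is the non-blocking counterpart of wait_for_service()):

  // skip the request entirely if the server has not been discovered yet
  if (!client->service_is_ready()) {
    RCLCPP_WARN(rclcpp::get_logger("rclcpp"), "server not ready, not sending request");
    return 0;
  }
  auto result = client->async_send_request(request);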

@sloretz
Contributor

sloretz commented Jun 18, 2020

Closing in favor of ros2/rmw_fastrtps#392 since these appear to be duplicates, and rmw_fastrtps is a more specific repo.

@sloretz closed this as completed Jun 18, 2020
@omertal88

Hello @daisukes,

Did you eventually solve it? I'm struggling with the same issue (I think it's the same). Some services just don't receive or respond to calls...
I'm running ROS 2 Galactic on Ubuntu 20.04 with Docker as well.

@daisukes
Contributor Author

daisukes commented Nov 9, 2021

Hello @omertal88,

I have been using the ros:galactic image from Docker Hub and have had no problems so far.

I remember that we used to use Fast-RTPS (Fast-DDS) on Foxy and needed to downgrade it until the bug was fixed.

The default DDS was changed from Fast-DDS in Foxy to CycloneDDS in Galactic:
https://discourse.ros.org/t/community-feedback-default-rmw-for-ros-2-galactic/16450

Have you checked different DDS implementations?
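For reference, the RMW implementation can typically be switched per process via the RMW_IMPLEMENTATION environment variable (assuming the corresponding RMW package is installed):

# assuming rmw_cyclonedds_cpp is installed
$ export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
$ ros2 launch nav2_bringup tb3_simulation_launch.py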

@omertal88

Wow, so I changed the RMW to use CycloneDDS and the issues were gone!
The thing is, I have some devices that run micro-ROS, which only supports Fast-RTPS. I hope I won't experience any issues with those devices.
Also, I couldn't find documentation on how to configure discovery for CycloneDDS (I need to use unicast with devices that are connected wirelessly).
Do you happen to know how?

@omertal88

@daisukes So in fact there are issues when calling services from CycloneDDS to Fast-RTPS: the response isn't returned.
I found this open bug: ros2/rmw_cyclonedds#184, and it seems there hasn't been any progress on it.
I may try to convert my services to simple publishers + subscribers, along the lines of the sketch below.
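A minimal sketch of that workaround on the caller side, assuming a hypothetical pair of topics (clear_costmap/request and clear_costmap/response, both invented names) standing in for the service:

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/empty.hpp"

// Caller side: publish a trigger instead of sending a service request,
// and subscribe to an acknowledgement topic instead of awaiting a response.
int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("clear_costmap_pubsub_client");

  auto request_pub = node->create_publisher<std_msgs::msg::Empty>(
    "clear_costmap/request", 10);   // hypothetical topic name
  auto logger = node->get_logger();
  auto response_sub = node->create_subscription<std_msgs::msg::Empty>(
    "clear_costmap/response", 10,   // hypothetical topic name
    [logger](std_msgs::msg::Empty::SharedPtr) {
      RCLCPP_INFO(logger, "server acknowledged the request");
      rclcpp::shutdown();  // stop spinning once acknowledged
    });

  // note: if the server's subscription has not matched yet,
  // this message can be lost, so discovery still matters here
  request_pub->publish(std_msgs::msg::Empty());
  rclcpp::spin(node);  // runs until the acknowledgement arrives or Ctrl-C
  return 0;
}

Note that this still relies on discovery matching the publisher and subscription, so it sidesteps the service layer rather than discovery itself.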
