`spin_some` easily deadlocks action clients #2451

asasine · 2024-03-14T16:52:50Z

Bug report

When running an rclcpp_action::Client<T> in using spin_some() and a rclcpp::Rate, if the server you connect to publishes feedback at a faster rate than your rclcpp::Rate, your client waits for the goal to be accepted forever, effectively becoming deadlocked.

Required Info:

Operating System:
- Ubuntu 20.04
Installation type:
- Binaries
Version or commit hash:
- Galactic
DDS implementation:
- Cyclone DDS
Client library (if applicable):
- rclcpp

Note: based on my understanding of the root cause (see Additional information), I believe this bug exists in rolling too, but I've only tested on Galactic as that's what I have access to at the moment.

Steps to reproduce issue

The below server and client reproduce 100% of the time on my machine when running over loopback. Compilation settings shouldn't matter. Once built, they can be run with ros2 run ... or directly with ./path/to/binary. I'm using action_tutorials_interfaces/action/Fibonacci, but this should repro with any action interface.

Server

#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>

int main(int argc, char *argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("fast_feedback_server");
  using Action = action_tutorials_interfaces::action::Fibonacci;
  std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> goal_handle = nullptr;

  auto server = rclcpp_action::create_server<Action>(
      node, "fast_feedback",
      [&](const rclcpp_action::GoalUUID &, std::shared_ptr<const Action::Goal>)
      { return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE; },
      [&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>>)
      {
        RCLCPP_INFO(node->get_logger(), "Cancellation requested, accepting.");
        goal_handle = nullptr;
        return rclcpp_action::CancelResponse::ACCEPT;
      },
      [&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> gh)
      {
        RCLCPP_INFO(node->get_logger(), "Goal was accepted");
        goal_handle = gh;
      });

  rclcpp::Rate rate(100.0);  // 100 Hz
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  RCLCPP_INFO(node->get_logger(), "Server spinning, ready to accept goals.");
  while (rclcpp::ok())
  {
    executor.spin_some();

    // NOTE: this server intentionally does not send a result
    if (goal_handle)
    {
      goal_handle->publish_feedback(std::make_unique<Action::Feedback>());
    }

    rate.sleep();
  }

  return 0;
}

Notes:

The server spins at 100 Hz.
Feedback is published at the same rate.
The goal is never completed as it's not relevant to this bug.

Client

#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>

int main(int argc, char *argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("fast_feedback_client");
  using Action = action_tutorials_interfaces::action::Fibonacci;
  using Client = rclcpp_action::Client<Action>;
  auto client = rclcpp_action::create_client<Action>(node, "fast_feedback");
  if (!client->wait_for_action_server())
  {
    RCLCPP_ERROR(node->get_logger(), "Action server not available after waiting");
    return 1;
  }

  // NOTE: no feedback_callback is set
  Client::SendGoalOptions send_goal_options;
  bool goal_response = false;
  send_goal_options.goal_response_callback = [&](std::shared_future<Client::GoalHandle::SharedPtr>)
  {
    RCLCPP_INFO(node->get_logger(), "Goal response.");
    goal_response = true;
  };

  RCLCPP_INFO(node->get_logger(), "Sending goal.");
  auto future = client->async_send_goal(Action::Goal(), send_goal_options);
  rclcpp::Rate rate(10.0);  // 10 Hz, 10% of the server's rate
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  while (rclcpp::ok() && !goal_response)
  {
    RCLCPP_INFO_THROTTLE(node->get_logger(), *node->get_clock(), 10000, "Waiting for goal response.");
    executor.spin_some();
    rate.sleep();
  }

  RCLCPP_INFO_STREAM(node->get_logger(), "Goal response received: " << std::boolalpha << goal_response);
  return 0;
}

Notes

The client's rate in this example is 10% of the server's rate to make the repro foolproof for different machines/network conditions, but I've seen the bug reproduce reliably when the client is >90% of the server's rate, and even repro most of the time when they have the same rate.
No feedback callback is set on SendGoalOptions as the deadlock is not dependent on one.
The client exits once the goal is accepted since that's all that's relevant to this bug.

Expected behavior

The action client will accept the goal, perhaps after working through a short backlog of feedback up to the queue depth in the feedback subscription's QoS.

Actual behavior

The server has accepted and started working on the goal but the client never sees that. The client continuously logs Waiting for goal response. in the repro and debug logs have Received feedback for unknown goal. Ignoring.... Debug logs show Client in wait set is ready, which I presume to be the goal response, but it's never taken, so the client deadlocks.

Additional information

I did some digging and think I have tracked down the root cause, some other contributing factors, and some related impacts.

rclcpp_action::Client<T> derives from rclcpp_action::ClientBase derives from rclcpp::Waitable, and these are scheduled for work in an rclcpp::Executor as a single executable, not multiple executables of their constituent parts. When using rclcpp::Executor::spin_some, which only collects work from ready entities once per call, the rclcpp::Waitable is scheduled once, regardless of the amount of work it has ready. This should be fine, but the rclcpp::ClientBase::take_data implementation only yields a single executable per call and therefore the subsequent execute invoked by rclcpp::Executor only performs one thing on the client rather than everything that's ready.

rclcpp/rclcpp_action/src/client.cpp

Lines 550 to 687 in 5e14a28

    
           std::shared_ptr<void> 
        
           ClientBase::take_data() 
        
           { 
        
             if (pimpl_->is_feedback_ready) { 
        
               std::shared_ptr<void> feedback_message = this->create_feedback_message(); 
        
               rcl_ret_t ret = rcl_action_take_feedback( 
        
                 pimpl_->client_handle.get(), feedback_message.get()); 
        
               return std::static_pointer_cast<void>( 
        
                 std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>( 
        
                   ret, feedback_message)); 
        
             } else if (pimpl_->is_status_ready) { 
        
               std::shared_ptr<void> status_message = this->create_status_message(); 
        
               rcl_ret_t ret = rcl_action_take_status( 
        
                 pimpl_->client_handle.get(), status_message.get()); 
        
               return std::static_pointer_cast<void>( 
        
                 std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>( 
        
                   ret, status_message)); 
        
             } else if (pimpl_->is_goal_response_ready) { 
        
               rmw_request_id_t response_header; 
        
               std::shared_ptr<void> goal_response = this->create_goal_response(); 
        
               rcl_ret_t ret = rcl_action_take_goal_response( 
        
                 pimpl_->client_handle.get(), &response_header, goal_response.get()); 
        
               return std::static_pointer_cast<void>( 
        
                 std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( 
        
                   ret, response_header, goal_response)); 
        
             } else if (pimpl_->is_result_response_ready) { 
        
               rmw_request_id_t response_header; 
        
               std::shared_ptr<void> result_response = this->create_result_response(); 
        
               rcl_ret_t ret = rcl_action_take_result_response( 
        
                 pimpl_->client_handle.get(), &response_header, result_response.get()); 
        
               return std::static_pointer_cast<void>( 
        
                 std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( 
        
                   ret, response_header, result_response)); 
        
             } else if (pimpl_->is_cancel_response_ready) { 
        
               rmw_request_id_t response_header; 
        
               std::shared_ptr<void> cancel_response = this->create_cancel_response(); 
        
               rcl_ret_t ret = rcl_action_take_cancel_response( 
        
                 pimpl_->client_handle.get(), &response_header, cancel_response.get()); 
        
               return std::static_pointer_cast<void>( 
        
                 std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( 
        
                   ret, response_header, cancel_response)); 
        
             } else { 
        
               throw std::runtime_error("Taking data from action client but nothing is ready"); 
        
             } 
        
           } 
        
           std::shared_ptr<void> 
        
           ClientBase::take_data_by_entity_id(size_t id) 
        
           { 
        
             // Mark as ready the entity from which we want to take data 
        
             switch (static_cast<EntityType>(id)) { 
        
               case EntityType::GoalClient: 
        
                 pimpl_->is_goal_response_ready = true; 
        
                 break; 
        
               case EntityType::ResultClient: 
        
                 pimpl_->is_result_response_ready = true; 
        
                 break; 
        
               case EntityType::CancelClient: 
        
                 pimpl_->is_cancel_response_ready = true; 
        
                 break; 
        
               case EntityType::FeedbackSubscription: 
        
                 pimpl_->is_feedback_ready = true; 
        
                 break; 
        
               case EntityType::StatusSubscription: 
        
                 pimpl_->is_status_ready = true; 
        
                 break; 
        
             } 
        
             return take_data(); 
        
           } 
        
           void 
        
           ClientBase::execute(std::shared_ptr<void> & data) 
        
           { 
        
             if (!data) { 
        
               throw std::runtime_error("'data' is empty"); 
        
             } 
        
             if (pimpl_->is_feedback_ready) { 
        
               auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data); 
        
               auto ret = std::get<0>(*shared_ptr); 
        
               pimpl_->is_feedback_ready = false; 
        
               if (RCL_RET_OK == ret) { 
        
                 auto feedback_message = std::get<1>(*shared_ptr); 
        
                 this->handle_feedback_message(feedback_message); 
        
               } else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { 
        
                 rclcpp::exceptions::throw_from_rcl_error(ret, "error taking feedback"); 
        
               } 
        
             } else if (pimpl_->is_status_ready) { 
        
               auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data); 
        
               auto ret = std::get<0>(*shared_ptr); 
        
               pimpl_->is_status_ready = false; 
        
               if (RCL_RET_OK == ret) { 
        
                 auto status_message = std::get<1>(*shared_ptr); 
        
                 this->handle_status_message(status_message); 
        
               } else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { 
        
                 rclcpp::exceptions::throw_from_rcl_error(ret, "error taking status"); 
        
               } 
        
             } else if (pimpl_->is_goal_response_ready) { 
        
               auto shared_ptr = std::static_pointer_cast< 
        
                 std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); 
        
               auto ret = std::get<0>(*shared_ptr); 
        
               pimpl_->is_goal_response_ready = false; 
        
               if (RCL_RET_OK == ret) { 
        
                 auto response_header = std::get<1>(*shared_ptr); 
        
                 auto goal_response = std::get<2>(*shared_ptr); 
        
                 this->handle_goal_response(response_header, goal_response); 
        
               } else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { 
        
                 rclcpp::exceptions::throw_from_rcl_error(ret, "error taking goal response"); 
        
               } 
        
             } else if (pimpl_->is_result_response_ready) { 
        
               auto shared_ptr = std::static_pointer_cast< 
        
                 std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); 
        
               auto ret = std::get<0>(*shared_ptr); 
        
               pimpl_->is_result_response_ready = false; 
        
               if (RCL_RET_OK == ret) { 
        
                 auto response_header = std::get<1>(*shared_ptr); 
        
                 auto result_response = std::get<2>(*shared_ptr); 
        
                 this->handle_result_response(response_header, result_response); 
        
               } else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { 
        
                 rclcpp::exceptions::throw_from_rcl_error(ret, "error taking result response"); 
        
               } 
        
             } else if (pimpl_->is_cancel_response_ready) { 
        
               auto shared_ptr = std::static_pointer_cast< 
        
                 std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); 
        
               auto ret = std::get<0>(*shared_ptr); 
        
               pimpl_->is_cancel_response_ready = false; 
        
               if (RCL_RET_OK == ret) { 
        
                 auto response_header = std::get<1>(*shared_ptr); 
        
                 auto cancel_response = std::get<2>(*shared_ptr); 
        
                 this->handle_cancel_response(response_header, cancel_response); 
        
               } else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { 
        
                 rclcpp::exceptions::throw_from_rcl_error(ret, "error taking cancel response"); 
        
               } 
        
             } else { 
        
               throw std::runtime_error("Executing action client but nothing is ready"); 
        
             } 
        
           }

In most scenarios, this is fine, but it deadlocks in this specific scenario: a server publishing feedback faster than the client is spinning (specifically, faster than it calls rclcpp::Executor::wait_for_work). This can happen with any of the spin* implementations but happens most readily with spin_some. Why? Because rclcpp::ClientBase::take_data only returns the first ready executable and it prioritizes feedback over responses. The implementation of rclcpp::ClientBase::take_data and execute take and execute in this order: feedback, status, goal response, result response, cancel response.

When you combine the implementation of rclcpp::Waitable that only returns execution data representing a single unit of work, the implicit prioritization mechanism built into ClientBase, and an action server whose feedback topic publishes faster than the spin rate of the client, you get a client that always has ready feedback for an unknown goal, ironically never executing the goal response that feedback is for.

Some ancillary effects of this: the goal response is never taken, the client never creates the goal handle, and therefore cancellation isn't easy. Cancellation can be done through rclcpp_action::Client<T>::async_cancel_all_goals but that may have adverse side effects depending on your scenario. Additionally, the client never becomes "result aware" for the goal since that occurs when the goal response is processed, so result responses are permanently lost if they occur during the deadlock. Feedback is also always dropped because it's for an unknown goal.

The text was updated successfully, but these errors were encountered:

mauropasse · 2024-03-15T12:35:33Z

The issue can be fixed by re-ordering ClientBase::execute and ClientBase::take_data() such as:

if (pimpl_->is_feedback_ready) { ... }
else if (pimpl_->is_status_ready) {...}

go as the last conditions. So it'd be for both take_data and execute:

if (pimpl_->is_goal_response_ready) { }
else if (pimpl_->is_result_response_ready) {}
else if (pimpl_->is_cancel_response_ready) {}
else if (pimpl_->is_feedback_ready) {}
else if (pimpl_->is_status_ready) {}

Otherwise feedback & status will always have priority over server responses.
Another alternative is using the EventsExecutor instead of the SingleThreadedExecutor, since the former always executes events in the order in which they were produced (no feedback over result priority).

asasine · 2024-03-15T13:52:16Z

I think reordering would work.

What of the fact that rclcpp::Waitable has to be taken multiple times to execute, unlike other executables like subscription/clients/etc? Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

EventsExecutor might work if event.num_events corresponds to the sum of the return values of the various get_number_of_ready_*() methods on rclcpp::Waitable. I haven't dug deep enough to know if it does yet.

rclcpp/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp

Lines 341 to 344 in 5e14a28

    
           for (size_t i = 0; i < event.num_events; i++) { 
        
             auto data = waitable->take_data_by_entity_id(event.waitable_data); 
        
             waitable->execute(data); 
        
           }

mauropasse · 2024-03-15T14:09:03Z

Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

The idea is that one single execute() of the executor, executes a single piece of work. If a rclcpp::Waitable has many events, it has to execute them separately.
When calling executor->spin_once(), we expect to execute a single event and not all from a waitable.

PD: I've already tested with the EventsExecutor and it works.
Nevertheless I think this has to be fixed for the SingleThreadedExecutor in rolling.

See ros2#2451

* Fixes for intra-process Actions * Fixes for Clang builds * Fix deadlock * Server to store results until client requests them * Fix feedback/result data race See ros2#2451 * Add missing mutex * Check return value of intra_process_action_send --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

// See ros2#2451 (comment)

* Fixes for intra-process Actions * Fixes for Clang builds * Fix deadlock * Server to store results until client requests them * Fix feedback/result data race See ros2#2451 * Add missing mutex * Check return value of intra_process_action_send --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

* Fixes for intra-process actions (#144) * Fixes for intra-process Actions * Fixes for Clang builds * Fix deadlock * Server to store results until client requests them * Fix feedback/result data race See ros2#2451 * Add missing mutex * Check return value of intra_process_action_send --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com> * Fix IPC Actions data race (#147) * Check if goal was sent through IPC before send responses * Add intra_process_action_server_is_available API to intra-process Client --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com> * Fix data race in Actions: Part 2 (#148) * Fix data race in Actions: Part 2 * Fix warning - copy elision --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com> * fix: Fixed race condition in action server between is_ready and take"… (ros2#2531) * fix: Fixed race condition in action server between is_ready and take" (ros2#2495) Some background information: is_ready, take_data and execute data may be called from different threads in any order. The code in the old state expected them to be called in series, without interruption. This lead to multiple race conditions, as the state of the pimpl objects was altered by the three functions in a non thread safe way. Co-authored-by: William Woodall <william@osrfoundation.org> Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com> * fix: added workaround for call to double calls to take_data This adds a workaround for a known bug in the executor in iron. Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com> --------- Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com> Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com> Co-authored-by: William Woodall <william@osrfoundation.org> --------- Signed-off-by: Janosch Machowinski <J.Machowinski@cellumation.com> Co-authored-by: Mauro Passerino <mpasserino@irobot.com> Co-authored-by: jmachowinski <jmachowinski@users.noreply.github.com> Co-authored-by: Janosch Machowinski <J.Machowinski@cellumation.com> Co-authored-by: William Woodall <william@osrfoundation.org>

- ros2#2451

* Always publish inter-process on TRANSIENT_LOCAL pubs (#152) * Backport upstream Action fixes - ros2#2531 * Fix actions feedback race - ros2#2451 * Fix data race in Actions: Part 3 --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

* Always publish inter-process on TRANSIENT_LOCAL pubs (#152) * Fix actions feedback race - ros2#2451 * Fix data race in Actions: Part 3 --------- Co-authored-by: Mauro Passerino <mpasserino@irobot.com>

- See ros2#2451 Signed-off-by: Mauro Passerino <mpasserino@irobot.com>

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Mar 20, 2024

Fix feedback/result data race

1bacd1e

See ros2#2451

fujitatomoya mentioned this issue Apr 1, 2024

rclcpp_action: take and execute service entities in priority. #2471

Closed

fujitatomoya mentioned this issue Jul 2, 2024

Action: rclcpp_action::Client goal response and feedback order not in send order #2571

Closed

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 9, 2024

Fix feedback data race

85910c9

// See ros2#2451 (comment)

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 9, 2024

Fix feedback data race

2a36625

// See ros2#2451 (comment)

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 22, 2024

Fix actions feedback race

589885b

- ros2#2451

mauropasse mentioned this issue Aug 22, 2024

Humble backport and new fixes irobot-ros/rclcpp#154

Merged

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 23, 2024

Fix actions feedback race

9bc4a4b

- ros2#2451

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 23, 2024

Fix actions feedback race

f5c6193

- ros2#2451

mauropasse mentioned this issue Aug 23, 2024

Mauro/irobot iron fixes irobot-ros/rclcpp#155

Merged

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 27, 2024

Fix actions feedback race

33d1e2a

- See ros2#2451 Signed-off-by: Mauro Passerino <mpasserino@irobot.com>

mauropasse pushed a commit to mauropasse/rclcpp that referenced this issue Aug 27, 2024

Fix actions feedback race

eb57f71

- See ros2#2451 Signed-off-by: Mauro Passerino <mpasserino@irobot.com>

mauropasse mentioned this issue Aug 27, 2024

Fix actions feedback race #2612

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`spin_some` easily deadlocks action clients #2451

`spin_some` easily deadlocks action clients #2451

asasine commented Mar 14, 2024

mauropasse commented Mar 15, 2024

asasine commented Mar 15, 2024 •

edited

Loading

mauropasse commented Mar 15, 2024 •

edited

Loading

spin_some easily deadlocks action clients #2451

spin_some easily deadlocks action clients #2451

Comments

asasine commented Mar 14, 2024

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

mauropasse commented Mar 15, 2024

asasine commented Mar 15, 2024 • edited Loading

mauropasse commented Mar 15, 2024 • edited Loading

`spin_some` easily deadlocks action clients #2451

`spin_some` easily deadlocks action clients #2451

asasine commented Mar 15, 2024 •

edited

Loading

mauropasse commented Mar 15, 2024 •

edited

Loading