
fuse -> ROS 2 fuse_core: Fix Async #293

Conversation


@methylDragon methylDragon commented Dec 3, 2022

See: #276

Description

There are several outstanding issues with the async model used in fuse_core:

  1. There's a chance for the destructor to be called before the internal spinner thread reaches the spin() call, so the executor->cancel() call lands before spinning starts, gets swallowed, and the executor spins forever (see the sketch after this list).
  2. If a new callback is added before the internal spinner attaches the waitable, the executor only wakes up once even though there are actually two waitables to execute. That is, despite two guard condition triggers for two waitables, the executor only services one waitable.
  3. For some reason, using an internal executor with 2 or more threads causes the waitables to never get picked up. I suspect this is a consequence of (2): the waitable that adds the callback queue never gets served, so the guard condition never gets added to the waitset.
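A minimal, self-contained reduction of issue (1), outside of fuse_core, just to show how an early cancel() can be lost (the executor and thread here are illustrative, not the fuse_core members):

```cpp
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto executor = std::make_shared<rclcpp::executors::MultiThreadedExecutor>();

  std::thread spinner([executor]() {
    // If cancel() below wins the race and runs first, this spin() starts with
    // no pending cancellation and never returns.
    executor->spin();
  });

  executor->cancel();  // may fire before spin() has started
  spinner.join();      // in that case this join blocks forever
  rclcpp::shutdown();
  return 0;
}
```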

I'm not entirely sure that's what's happening with (2), but regardless, the solutions implemented in this PR fix both issues.

The solution is to keep triggering the guard condition when the callback queues are hooked, until the callback queues are actually serviced (a rough sketch follows).
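A minimal sketch of that idea, assuming the waitable is backed by an rclcpp::GuardCondition and that a hypothetical queue_serviced flag is flipped once the executor has run the hooked callbacks:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <rclcpp/rclcpp.hpp>

// Hypothetical flag, set from the waitable's execute path once the hooked
// callback queue has actually been serviced by the executor.
std::atomic_bool queue_serviced{false};

// Keep re-triggering the guard condition until the queue is confirmed
// serviced; a single trigger can be lost if it fires before the waitable has
// been added to the executor's wait set.
void keep_triggering(rclcpp::GuardCondition & guard_condition)
{
  while (!queue_serviced.load()) {
    guard_condition.trigger();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```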

I also edited the tests to be a lot stricter; they now test for multiple callback additions and initializations.
The tests now feature async classes with an executor thread count of 0 (which means the number of threads equals the number of CPU cores), as troublesome a case as possible, to ensure the solutions are as robust as possible.

I had to do this because a lot of the issues seem to be race condition related and come up extremely rarely.

Notes

There was an alternative solution I explored (which also worked): using the timeout argument of the multi-threaded executor. This removes any chance of the multi-threaded executor deadlocking on waitables, but at the cost of free-spinning at the rate of the timeout.

If we encounter deadlock or threading issues down the line, we might have to use that solution instead.
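For reference, a minimal sketch of that alternative (the node name and the 10 ms timeout are illustrative, not values from this PR):

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("async_timeout_demo");

  // Give the multi-threaded executor a wait timeout so a missed guard
  // condition trigger delays work by at most the timeout, at the cost of the
  // executor waking up at that rate even when idle.
  rclcpp::executors::MultiThreadedExecutor executor(
    rclcpp::ExecutorOptions(),
    /*number_of_threads=*/0,        // 0 -> one thread per CPU core
    /*yield_before_execute=*/false,
    /*timeout=*/std::chrono::milliseconds(10));

  executor.add_node(node);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}
```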

  • Sometimes initialize() still deadlocks (we get stuck waiting for the initialized_ atomic to become true). I can't reproduce it reliably, though... is there a chance for callbacks to get dropped?
  • I'm also not thrilled about all the random sleep_for calls dotted around the init, but they're the only way to avoid the warning printouts most of the time.

Update

EDIT: I pulled out all the stops and used all the solutions I found, all at once. I don't think I understand the problem well enough to come up with an elegant solution: all the "good practices" (like locking in threads that notify condition variables, or relying on condition variable waits instead of sleeps) result in deadlocks or weird behavior, such as atomic flags not getting set even though the block that sets them ostensibly runs (???).

The solution I've devised combines condition variables with manual guard condition triggers (only in initialize()!), plus a lower-frequency executor wakeup (a rough sketch follows the list below).

  • The manual guard condition triggers should resolve the problem where the executor misses either the first callback, or the callback that adds the callback_queue to the node's waitables.
  • The lower-frequency executor wakeup then resolves the issue where, for some reason, the initialization callback doesn't run.
  • Finally,
    • I had to remove the lock acquisition in the initialization callback because, for some reason, it causes a deadlock.
    • I had to add a small sleep because otherwise the atomic sometimes doesn't get set even though the callback runs (?!).
    • And I had to throttle the tests a tiny bit, because, again for reasons I don't understand, spinning up and shutting down multi-threaded executors too rapidly (sequentially) causes something to break.
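A rough sketch of the initialize() wait described above (all names are illustrative and do not necessarily match the fuse_core members; the 10 ms period stands in for the lower-frequency wakeup):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

#include <rclcpp/rclcpp.hpp>

// Wait for the initialization callback to flip `initialized`, re-triggering
// the guard condition on every wakeup so that a missed trigger can only delay
// initialization by one wait period instead of deadlocking forever.
void wait_for_initialization(
  rclcpp::GuardCondition & guard_condition,
  std::mutex & mutex,
  std::condition_variable & cv,
  const bool & initialized)
{
  std::unique_lock<std::mutex> lock(mutex);
  while (!initialized) {
    guard_condition.trigger();                         // manual nudge
    cv.wait_for(lock, std::chrono::milliseconds(10));  // low-frequency wakeup
  }
}
```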

On the other hand, I'm able to spin up and down a massive number of async instances sequentially now, where before it would randomly block indefinitely, so this might just be good enough (I updated the tests to check for that case specifically). I have yet to see it fail, but I'm not confident that it never will, since I never understood why it failed in the first place.

Some refinement might be helpful, though; maybe I'm just not seeing something obvious.

Pinging @svwilliams for visibility.

@methylDragon methylDragon requested a review from sloretz December 3, 2022 05:00
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch from 7c4a644 to 74126eb on December 3, 2022 05:01
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch 4 times, most recently from 32d4aa5 to 6e6e09d on December 3, 2022 06:36
@methylDragon

@ros-pull-request-builder retest this please

Signed-off-by: methylDragon <methylDragon@gmail.com>
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch from 6e6e09d to a316052 on December 3, 2022 06:54
@methylDragon

@ros-pull-request-builder retest this please


methylDragon commented Dec 3, 2022

Build Status

(The only tests that should fail are the fuse_loss linters)

@@ -112,23 +112,31 @@ void CallbackAdapter::addCallback(const std::shared_ptr<CallbackWrapperBase> & c
{
std::lock_guard<std::mutex> lock(queue_mutex_);
callback_queue_.push_back(callback);

// NOTE(CH3): Do we need to keep triggering guard conditions if the queue size is > 1?
I'm pretty sure the answer is yes, and this is a good catch.

IIRC the waitable interface assumes that the waitable will become ready by something that will wake the wait set. Guard conditions wake the wait set, but they only wake once no matter how many times they're triggered.

I think a good solution would be to trigger the guard condition in take_data(). If after taking data the queue is not empty, then trigger the guard condition again so the executor will wake for the next callback in the queue.
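A minimal sketch of that suggestion, as a fragment of the waitable, assuming the pending callbacks live in callback_queue_ guarded by queue_mutex_ (as in the hunk above) and that the waitable owns a hypothetical guard condition member gc_:

```cpp
// Sketch only; member names and the return convention may differ from the
// actual fuse_core CallbackAdapter.
std::shared_ptr<void> CallbackAdapter::take_data()
{
  std::shared_ptr<CallbackWrapperBase> callback;
  {
    std::lock_guard<std::mutex> lock(queue_mutex_);
    if (!callback_queue_.empty()) {
      callback = callback_queue_.front();
      callback_queue_.pop_front();
    }
    // If more callbacks remain, re-trigger the guard condition so the executor
    // wakes up again for the next one instead of stalling until a new trigger.
    if (!callback_queue_.empty()) {
      gc_.trigger();
    }
  }
  return std::static_pointer_cast<void>(callback);
}
```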

@methylDragon

Closing in favor of #294

@methylDragon methylDragon deleted the rolling-fuse-core-async branch December 7, 2022 19:40