
fuse -> ROS 2 fuse_core: Fix Async #293

Conversation


@methylDragon methylDragon commented Dec 3, 2022

See: #276

Description

There are several outstanding issues with the async model used in fuse_core:

  1. There's a chance for the destructor to be called before the internal spinner thread reaches the spin() call, so the executor->cancel() call lands before spinning starts, gets swallowed, and the executor spins forever (see the sketch after this list).
  2. If a new callback is added before the internal spinner attaches the waitable, the executor only wakes up once even though there are actually two waitables to execute. That is, despite two guard condition triggers for two waitables, the executor only services one waitable.
  3. For some reason, using an internal executor with 2 or more threads causes the waitables to never get picked up. I suspect this is a consequence of (2): the waitable that adds the callback queue never gets served, so the guard condition never gets added to the waitset.
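A minimal, self-contained reduction of issue (1), outside of fuse_core, just to show how an early cancel() can be lost (the executor and thread here are illustrative, not the fuse_core members):

```cpp
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto executor = std::make_shared<rclcpp::executors::MultiThreadedExecutor>();

  std::thread spinner([executor]() {
    // If cancel() below wins the race and runs first, this spin() starts with
    // no pending cancellation and never returns.
    executor->spin();
  });

  executor->cancel();  // may fire before spin() has started
  spinner.join();      // in that case this join blocks forever
  rclcpp::shutdown();
  return 0;
}
```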

I'm not entirely sure that's what's happening with (2), but regardless, the solutions implemented in this PR fix both issues.

The solution is to keep triggering the guard condition when the callback queues are hooked, until the callback queues are actually serviced (a rough sketch follows).
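A minimal sketch of that idea, assuming the waitable is backed by an rclcpp::GuardCondition and that a hypothetical queue_serviced flag is flipped once the executor has run the hooked callbacks:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <rclcpp/rclcpp.hpp>

// Hypothetical flag, set from the waitable's execute path once the hooked
// callback queue has actually been serviced by the executor.
std::atomic_bool queue_serviced{false};

// Keep re-triggering the guard condition until the queue is confirmed
// serviced; a single trigger can be lost if it fires before the waitable has
// been added to the executor's wait set.
void keep_triggering(rclcpp::GuardCondition & guard_condition)
{
  while (!queue_serviced.load()) {
    guard_condition.trigger();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```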

I also edited the tests to be a lot stricter; they now test for multiple callback additions and initializations.
The tests now feature async classes with an executor thread count of 0 (which means the number of threads equals the number of CPU cores), as troublesome a case as possible, to ensure the solutions are as robust as possible.

I had to do this because a lot of the issues seem to be race condition related and come up extremely rarely.

Notes

There was an alternative solution I explored (which also worked): using the timeout argument of the multi-threaded executor. This removes any chance of the multi-threaded executor deadlocking on waitables, but at the cost of free-spinning at the rate of the timeout.

If we encounter deadlock or threading issues down the line, we might have to use that solution instead.
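For reference, a minimal sketch of that alternative (the node name and the 10 ms timeout are illustrative, not values from this PR):

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("async_timeout_demo");

  // Give the multi-threaded executor a wait timeout so a missed guard
  // condition trigger delays work by at most the timeout, at the cost of the
  // executor waking up at that rate even when idle.
  rclcpp::executors::MultiThreadedExecutor executor(
    rclcpp::ExecutorOptions(),
    /*number_of_threads=*/0,        // 0 -> one thread per CPU core
    /*yield_before_execute=*/false,
    /*timeout=*/std::chrono::milliseconds(10));

  executor.add_node(node);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}
```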

  • Sometimes initialize() still deadlocks (we get stuck waiting for the initialized_ atomic to become true). I can't reproduce it reliably, though... is there a chance for callbacks to get dropped?
  • I'm also not thrilled about all the random sleep_for calls dotted around the init, but they're the only way to avoid the warning printouts most of the time.

Update

EDIT: I pulled out all the stops and used all the solutions I found, all at once. I don't think I understand the problem well enough to come up with an elegant solution: all the "good practices" (like locking in threads that notify condition variables, or relying on condition variable waits instead of sleeps) result in deadlocks or weird behavior, such as atomic flags not getting set even though the block that sets them ostensibly runs (???).

The solution I've devised combines condition variables with manual guard condition triggers (only in initialize()!), plus a lower-frequency executor wakeup (a rough sketch follows the list below).

  • The manual guard condition triggers should resolve the problem where the executor misses either the first callback, or the callback that adds the callback_queue to the node's waitables.
  • The lower-frequency executor wakeup then resolves the issue where, for some reason, the initialization callback doesn't run.
  • Finally,
    • I had to remove the lock acquisition in the initialization callback because, for some reason, it causes a deadlock.
    • I had to add a small sleep because otherwise the atomic sometimes doesn't get set even though the callback runs (?!).
    • And I had to throttle the tests a tiny bit, because, again for reasons I don't understand, spinning up and shutting down multi-threaded executors too rapidly (sequentially) causes something to break.
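A rough sketch of the initialize() wait described above (all names are illustrative and do not necessarily match the fuse_core members; the 10 ms period stands in for the lower-frequency wakeup):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

#include <rclcpp/rclcpp.hpp>

// Wait for the initialization callback to flip `initialized`, re-triggering
// the guard condition on every wakeup so that a missed trigger can only delay
// initialization by one wait period instead of deadlocking forever.
void wait_for_initialization(
  rclcpp::GuardCondition & guard_condition,
  std::mutex & mutex,
  std::condition_variable & cv,
  const bool & initialized)
{
  std::unique_lock<std::mutex> lock(mutex);
  while (!initialized) {
    guard_condition.trigger();                         // manual nudge
    cv.wait_for(lock, std::chrono::milliseconds(10));  // low-frequency wakeup
  }
}
```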

On the other hand, I'm able to spin up and down a massive number of async instances sequentially now, where before it would randomly block indefinitely, so this might just be good enough (I updated the tests to check for that case specifically). I have yet to see it fail, but I'm not confident that it never will, since I never understood why it failed in the first place.

Some refinement might be helpful, though; maybe I'm just not seeing something obvious.

Pinging @svwilliams for visibility.

@methylDragon methylDragon requested a review from sloretz December 3, 2022 05:00
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch from 7c4a644 to 74126eb on December 3, 2022 05:01
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch 4 times, most recently from 32d4aa5 to 6e6e09d on December 3, 2022 06:36
@methylDragon

@ros-pull-request-builder retest this please

Signed-off-by: methylDragon <methylDragon@gmail.com>
@methylDragon methylDragon force-pushed the rolling-fuse-core-async branch from 6e6e09d to a316052 on December 3, 2022 06:54
@methylDragon

@ros-pull-request-builder retest this please


methylDragon commented Dec 3, 2022

Build Status

(The only tests that should fail are the fuse_loss linters)

@@ -112,23 +112,31 @@ void CallbackAdapter::addCallback(const std::shared_ptr<CallbackWrapperBase> & c
{
std::lock_guard<std::mutex> lock(queue_mutex_);
callback_queue_.push_back(callback);

// NOTE(CH3): Do we need to keep triggering guard conditions if the queue size is > 1?
I'm pretty sure the answer is yes, and this is a good catch.

IIRC the waitable interface assumes that the waitable will become ready by something that will wake the wait set. Guard conditions wake the wait set, but they only wake once no matter how many times they're triggered.

I think a good solution would be to trigger the guard condition in take_data(). If after taking data the queue is not empty, then trigger the guard condition again so the executor will wake for the next callback in the queue.
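A minimal sketch of that suggestion, as a fragment of the waitable, assuming the pending callbacks live in callback_queue_ guarded by queue_mutex_ (as in the hunk above) and that the waitable owns a hypothetical guard condition member gc_:

```cpp
// Sketch only; member names and the return convention may differ from the
// actual fuse_core CallbackAdapter.
std::shared_ptr<void> CallbackAdapter::take_data()
{
  std::shared_ptr<CallbackWrapperBase> callback;
  {
    std::lock_guard<std::mutex> lock(queue_mutex_);
    if (!callback_queue_.empty()) {
      callback = callback_queue_.front();
      callback_queue_.pop_front();
    }
    // If more callbacks remain, re-trigger the guard condition so the executor
    // wakes up again for the next one instead of stalling until a new trigger.
    if (!callback_queue_.empty()) {
      gc_.trigger();
    }
  }
  return std::static_pointer_cast<void>(callback);
}
```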

@methylDragon

Closing in favor of #294

@methylDragon methylDragon deleted the rolling-fuse-core-async branch December 7, 2022 19:40