
fix unprotected access to task_queue_.begin()->first in Scheduler #1084

Closed

Conversation

@Trisfald (Contributor) commented Dec 6, 2017

Description

This PR fixes a concurrency issue in the Scheduler component. Passing a reference to task_queue_.begin()->first into wait_until is problematic because reads of that timeout are not synchronized with the writes performed by Scheduler::schedule.
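For context, here is a minimal sketch of the pattern in question. This is not the actual licode source; it is pieced together from the code referenced in the review below, and details such as the empty-queue wait and the stop/notify plumbing are assumptions.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>

// Simplified, self-contained sketch of the pattern (not the real Scheduler):
// schedule() inserts tasks keyed by their due time, and worker threads running
// serviceQueue() wait on a reference into the queue itself.
class SchedulerSketch {
 public:
  using Function = std::function<void()>;
  using Clock = std::chrono::system_clock;

  void schedule(Function f, Clock::time_point when) {
    std::unique_lock<std::mutex> lock(new_task_mutex_);
    task_queue_.emplace(when, std::move(f));
    new_task_scheduled_.notify_one();
  }

  void serviceQueue() {
    std::unique_lock<std::mutex> lock(new_task_mutex_);
    while (!stop_requested_) {
      // Sleep until there is something in the queue.
      while (!stop_requested_ && task_queue_.empty()) {
        new_task_scheduled_.wait(lock);
      }
      // Sleep until the earliest task is due. The timeout argument is a
      // reference to begin()->first, i.e. a reference into the queue that
      // other threads may be mutating while the lock is released.
      while (!stop_requested_ && !task_queue_.empty() &&
             new_task_scheduled_.wait_until(lock, task_queue_.begin()->first) !=
                 std::cv_status::timeout) {
      }
      if (stop_requested_ || task_queue_.empty()) {
        continue;
      }
      Function f = task_queue_.begin()->second;
      task_queue_.erase(task_queue_.begin());
      lock.unlock();
      f();  // run the task without holding the lock
      lock.lock();
    }
  }

  void stop() {
    std::unique_lock<std::mutex> lock(new_task_mutex_);
    stop_requested_ = true;
    new_task_scheduled_.notify_all();
  }

 private:
  std::mutex new_task_mutex_;
  std::condition_variable new_task_scheduled_;
  std::multimap<Clock::time_point, Function> task_queue_;
  bool stop_requested_ = false;
};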

[] It needs and includes Unit Tests

Changes in Client or Server public APIs

[] It includes documentation for these changes in /doc.

@jcague (Contributor) commented Dec 11, 2017

sorry @Trisfald, I don't see the issue with the previous code. Why is that reference problematic? The lock will be reacquired regardless of the reason the wait ends, so any access to that timeout will be safe, and the reference points to the timeout itself and not to the task queue, so it shouldn't be an issue since we don't change timeout values. I might be wrong, so please show me a possible theoretical path where it could happen.

@Trisfald (Contributor, Author)

Hello! While running licode under Helgrind (a tool from the Valgrind family), the analyzer kept complaining about concurrent access to the timeout. So I took a first look, made this simple change, and the tool was happy.
Looking more in depth, however, I think you are right: it should be safe as it is. The task_queue_ might indeed change during the wait_until call, but since you are taking a reference, the timeout is always the same object. I thought the problem was reference invalidation in the container, but I checked today and multimap should not invalidate anything on insert. It is probably a Helgrind false positive!
To sum up: the only thing this PR brings is support for containers other than multimap (like vector or deque, or adapters such as priority_queue), whose references are invalidated by a new insert.
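A quick standalone check of that multimap property (illustration only, not licode code):

#include <cassert>
#include <chrono>
#include <map>

// std::multimap is node-based, so inserting new elements does not invalidate
// references to existing ones. The reference taken from begin()->first keeps
// referring to the same key even if an earlier task is inserted; what changes
// is only which element begin() points to.
int main() {
  using Clock = std::chrono::system_clock;
  std::multimap<Clock::time_point, int> queue;
  const auto now = Clock::now();
  queue.emplace(now + std::chrono::seconds(10), 1);

  const Clock::time_point& timeout = queue.begin()->first;  // like the argument to wait_until

  queue.emplace(now + std::chrono::seconds(1), 2);  // a new, earlier task arrives

  assert(timeout == now + std::chrono::seconds(10));              // reference still valid and unchanged
  assert(queue.begin()->first == now + std::chrono::seconds(1));  // but begin() now points to the new task
  return 0;
}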

@Trisfald (Contributor, Author)

I may have found a problematic execution path:

1. Two threads are running serviceQueue.
2. Thread A reaches wait_until and, inside it, releases the lock and waits. Thread B does the same.
3. Thread B is notified before thread A, acquires the lock, executes task_queue_.erase(task_queue_.begin());, and unlocks.
4. Thread A wakes up, acquires the lock and, still inside wait_until, uses the reference to the timeout, but the element has been deleted.

Sorry if the example is a bit contrived; a minimal standalone illustration of step 4 follows.
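The core of the problem, reduced to a few lines (illustration only, not licode code):

#include <chrono>
#include <map>

// Erasing the begin() element destroys the node that holds the key, so a
// reference previously taken from begin()->first is left dangling. This is
// exactly the reference Thread A is still holding inside wait_until in step 4.
int main() {
  using Clock = std::chrono::system_clock;
  std::multimap<Clock::time_point, int> queue;
  queue.emplace(Clock::now(), 1);

  const Clock::time_point& timeout = queue.begin()->first;  // what Thread A's wait_until holds
  queue.erase(queue.begin());                               // what Thread B does while A is asleep

  // Any read through `timeout` from here on is a use-after-free; wait_until
  // performs such a read when it decides between timeout and no_timeout.
  // (Intentionally not dereferenced here, to keep this program well defined.)
  (void)timeout;
  return 0;
}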

@jcague (Contributor) commented Dec 12, 2017

there's no chance that two threads would be waiting in wait_until, because Scheduler uses just one thread with io_service.

@Trisfald (Contributor, Author)

I see, with 1 thread it's all right. Yet printing n_threads_servicing_queue in Scheduler.cpp gives me 2. Maybe kNumThreadsPerScheduler in ThreadPool.h should be set to 1?

@jcague (Contributor) commented Dec 13, 2017

aww, you're right! I don't know why we set it to 2. Given the current code we should use 1 thread instead; otherwise we should do some things differently there, probably not only the reference to begin()->first. Do you agree to iterating on this PR to make this function thread-safe if more changes are needed?

@Trisfald (Contributor, Author)

Yes, sure! With this change it should already be thread-safe. I'll look into it a bit more with a threading analyzer to see if there are other issues.
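(For reference, the kind of run I mean is something along these lines; the test binary name is an assumption about the local build:)

valgrind --tool=helgrind ./tests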

@jcague (Contributor) left a comment

I think we would need to update current_timeout_ inside the while loop, and we could probably make it exist only inside the serviceQueue() function, not as a member of the class, but I may be wrong.

     while (!stop_requested_ && !task_queue_.empty() &&
-           new_task_scheduled_.wait_until(lock, task_queue_.begin()->first) != std::cv_status::timeout) {
+           new_task_scheduled_.wait_until(lock, current_timeout_) != std::cv_status::timeout) {
@jcague (Contributor):

We're not updating current_timeout_ within the loop; could that be an issue? What if someone adds an earlier task to the scheduler?

@Trisfald (Contributor, Author):

Yes, that might be an issue. I don't know how we can correctly update the timeout for all threads when a new task is added. Maybe re-setting it in the loop and doing a notify_all on insert? Roughly what I have in mind is sketched below.
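An untested sketch of that idea, building on the names in this diff (the schedule() body shown here is an assumption):

// In Scheduler::serviceQueue(): recompute the timeout every time we come back
// from wait_until, so a newly inserted earlier task is picked up.
while (!stop_requested_ && !task_queue_.empty()) {
  current_timeout_ = task_queue_.begin()->first;  // refreshed inside the loop, under the lock
  if (new_task_scheduled_.wait_until(lock, current_timeout_) == std::cv_status::timeout) {
    break;  // the front task is due
  }
  // Spurious wake-up or a new task was scheduled: loop and recompute the timeout.
}

// In Scheduler::schedule(): wake every worker so they all re-read the timeout.
//   task_queue_.emplace(when, std::move(f));
//   new_task_scheduled_.notify_all();   // notify_all instead of notify_one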

     while (!stop_requested_ && !task_queue_.empty() &&
-           new_task_scheduled_.wait_until(lock, task_queue_.begin()->first) != std::cv_status::timeout) {
+           new_task_scheduled_.wait_until(lock, current_timeout_) != std::cv_status::timeout) {
+      boost::thread::yield();
@jcague (Contributor):

not sure why we need to call yield() here?

@Trisfald (Contributor, Author):

To give up our CPU time slice, since we are going to wait again.

@@ -43,6 +45,10 @@ void Scheduler::serviceQueue() {
       Function f = task_queue_.begin()->second;
       task_queue_.erase(task_queue_.begin());
 
+      if (!task_queue_.empty()) {
+        current_timeout_ = task_queue_.begin()->first;
@jcague (Contributor):

we wouldn't need to set current_timeout_ here, am I right?

@Trisfald (Contributor, Author):

I'm starting to think we should still set it for the current worker, but also signal the change to all the other workers so they can update their timeout. But perhaps I'm overthinking it?

@Trisfald (Contributor, Author)

I see current_timeout_ as a property connected to the queue (of which we have 1), not as a property of the workers executing serviceQueue() (of which we have n). So I think it's better to keep it near the queue in the form of a class member.

@jcague (Contributor) commented Dec 15, 2017

@Trisfald I ran an experiment to see what happens inside Scheduler in the example you mentioned before (two threads accessing the same begin()->first reference), and I don't see any issue there. I have a branch in my local repo: master...jcague:test_scheduler_unprotected_access. I added some logs to see exactly what happens in those cases and found no issues; it seems like there is no access to an invalid reference in any case. Can you check whether it's the same execution path you thought about?
I ran it with ./tests --gtest_filter=*execute_a_simple_task*

@Trisfald (Contributor, Author)

@jcague I ran your experiment a bunch of times (with 1 task added to the queue for simplicity) and got this output:

Added 1 task
Test starts waiting
7f82e5c51700 Waiting! 2017-12-15 15:28:00 // Step 1 - Thread 1 is waiting
7f82e5450700 Waiting! 2017-12-15 15:28:00 // Step 2 - Thread 2 is waiting
7f82e5450700 Executing! 00                // Step 3 - Thread 2 executes the task
7f82e5450700 Removed task from queue      // Step 4 - Thread 2 removes the task from the queue
7f82e5c51700 Executing! 01                // Step 5 - Thread 1 wakes up and finds the queue empty
Test stops waiting

This is the problematic execution path I was talking about. Now everything works correctly (almost?) every time. In fact, I have seen a segfault in Scheduler only once since I started working with Licode (some others might have gone under the radar).

I think the problem is between steps 4 and 5. My standard library's implementation of wait_until has this piece of code:

	auto __s = chrono::time_point_cast<chrono::seconds>(__atime);
	auto __ns = chrono::duration_cast<chrono::nanoseconds>(__atime - __s);

	__gthread_time_t __ts =
	  {
	    static_cast<std::time_t>(__s.time_since_epoch().count()),
	    static_cast<long>(__ns.count())
	  };

	// The call below releases the lock for the duration of the wait and
	// reacquires it afterwards.
	__gthread_cond_timedwait(&_M_cond, __lock.mutex()->native_handle(),
				 &__ts);

	// __atime is the reference to begin()->first of the queue. If the other
	// thread managed to wake up and erase that element while the lock was
	// released, this final comparison reads an already-destroyed object.
	// Very rare, but not impossible.
	return (__clock_t::now() < __atime
		? cv_status::no_timeout : cv_status::timeout);
      }

However, it's possible I'm totally mistaken and that crash happened for another reason altogether.

@jcague (Contributor) commented Mar 19, 2019

This issue was finally fixed by #1345

@jcague closed this Mar 19, 2019