Skip to content
This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Don't hold mutex until release cv in cv_wait #519

Closed
wants to merge 1 commit into from

Conversation

tuxoko
Copy link
Contributor

@tuxoko tuxoko commented Jan 7, 2016

If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen tuxoko@gmail.com

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 7, 2016

Fix openzfs/zfs#4166 and second part of openzfs/zfs#4106

@dweeezil
Copy link
Contributor

dweeezil commented Jan 7, 2016

I will try my existing tests with this.

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 7, 2016

Update: Change ASSERT to remove if and modify __cv_timedwait_hires.

@behlendorf
Copy link
Contributor

@tuxoko nice! This isn't nearly as disruptive as I'd feared. The ref counts nicely ensure the memory remains valid while cv_wait (and friends) finish up. Moving the mutex to the end of the function removes the lock inversion. We've always relied on the caller not destroying the mutex prematurely so there no new concern there. That said, this kind of thing can be subtle so we'll definitely want to stress the new code... and it sounds like @dweeezil is already on that. Awesome, thanks guys!

@dweeezil
Copy link
Contributor

dweeezil commented Jan 7, 2016

Looks good, see openzfs/zfs#4106 (comment).

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 7, 2016

The splat complains a lot in openzfs/zfs#4173
I'll need to look into it.

If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
@dweeezil
Copy link
Contributor

@tuxoko What are your thoughts on this so far? I didn't look at the buildbot errors but I can easily get an assert from condvar:broadcast1 in debug builds because mutex is not held. Is this what the bots are triggering? So far, I've not seen the assert in testing with zfs, however, it would seem that zio_wait() might be able to trigger it and also possibly other places.

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 11, 2016 via email

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 11, 2016

@dweeezil
Now I got to my desktop, I can explain more clearly.
The problem in splat code is that the waker (cv_broadcast, cv_signal) doing free (cv_destroy and mutex_destroy). This type of construct is always wrong, because you don't know when the waiter will release the mutex.

@behlendorf
Copy link
Contributor

@tuxoko @dweeezil the updated fix to include the test cases LGTM. Either of you have reservations about merging this?

@tuxoko
Copy link
Contributor Author

tuxoko commented Jan 12, 2016

@behlendorf
I'd say go for it.

@dweeezil
Copy link
Contributor

@behlendorf No problems during high stress testing. LGTM.

@behlendorf
Copy link
Contributor

Great, thanks for quick reply. Merged as:

e843553 Don't hold mutex until release cv in cv_wait

@behlendorf behlendorf closed this Jan 12, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants