Don't hold mutex until release cv in cv_wait #519
Fix openzfs/zfs#4166 and second part of openzfs/zfs#4106 |
I will try my existing tests with this. |
Update: Change |
@tuxoko nice! This isn't nearly as disruptive as I'd feared. The ref counts nicely ensure the memory remains valid while cv_wait (and friends) finish up. Moving the mutex to the end of the function removes the lock inversion. We've always relied on the caller not destroying the mutex prematurely, so there's no new concern there. That said, this kind of thing can be subtle, so we'll definitely want to stress the new code... and it sounds like @dweeezil is already on that. Awesome, thanks guys! |
Looks good, see openzfs/zfs#4106 (comment). |
The splat complains a lot in openzfs/zfs#4173 |
If a thread holds the mutex while calling cv_destroy, it may end up waiting on a thread in cv_wait. The waiter would wake up and try to acquire the same mutex, causing a deadlock.

We solve this by moving the mutex_enter to the bottom of cv_wait, so that the waiter releases the cv first, allowing cv_destroy to succeed and giving its caller a chance to free the mutex.

This creates a race condition on cv_mutex. We use xchg to set and check it so that the race cannot harm us. As a result, cv_mutex debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call mutex_destroy while test threads are still holding the mutex. We use kthread_stop to make sure the threads have exited before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
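The ordering described in the commit message can be sketched in userspace with pthreads and C11 atomics. This is only an illustration under stated assumptions, not the SPL code: `sketch_cv_t`, its fields, and every `sketch_*` and `demo_*` name are hypothetical, `atomic_exchange` stands in for the kernel's xchg, and a spin on a reference count stands in for whatever cv_destroy actually waits on. The point it tries to show is that the waiter drops its reference on the cv before retaking the caller's mutex, so a destroyer that still holds that mutex is not blocked.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel condvar; not the SPL kcondvar_t. */
typedef struct {
    pthread_mutex_t lock;               /* protects gen */
    pthread_cond_t cond;
    unsigned gen;                       /* bumped on every broadcast */
    atomic_int refs;                    /* waiters still using the cv */
    _Atomic(pthread_mutex_t *) owner;   /* best-effort debug record of the mutex */
} sketch_cv_t;

static void sketch_cv_init(sketch_cv_t *cv)
{
    pthread_mutex_init(&cv->lock, NULL);
    pthread_cond_init(&cv->cond, NULL);
    cv->gen = 0;
    atomic_init(&cv->refs, 0);
    atomic_init(&cv->owner, NULL);
}

/* Returns 1 if the debug check saw a consistent mutex, 0 otherwise. */
static int sketch_cv_wait(sketch_cv_t *cv, pthread_mutex_t *mp)
{
    /* Set and check in one atomic step, in the spirit of xchg. */
    pthread_mutex_t *prev = atomic_exchange(&cv->owner, mp);
    int consistent = (prev == NULL || prev == mp);

    atomic_fetch_add(&cv->refs, 1);
    pthread_mutex_lock(&cv->lock);
    unsigned gen = cv->gen;
    pthread_mutex_unlock(mp);           /* caller's mutex is dropped while asleep */
    while (cv->gen == gen)
        pthread_cond_wait(&cv->cond, &cv->lock);
    pthread_mutex_unlock(&cv->lock);

    atomic_fetch_sub(&cv->refs, 1);     /* release the cv first ... */
    pthread_mutex_lock(mp);             /* ... then retake the caller's mutex */
    return consistent;
}

static void sketch_cv_broadcast(sketch_cv_t *cv)
{
    pthread_mutex_lock(&cv->lock);
    cv->gen++;
    pthread_cond_broadcast(&cv->cond);
    pthread_mutex_unlock(&cv->lock);
}

/* May be called with the caller's mutex held: waiters drop refs without it. */
static void sketch_cv_destroy(sketch_cv_t *cv)
{
    while (atomic_load(&cv->refs) != 0)
        sched_yield();
    pthread_cond_destroy(&cv->cond);
    pthread_mutex_destroy(&cv->lock);
}

static sketch_cv_t demo_cv;
static pthread_mutex_t demo_mp = PTHREAD_MUTEX_INITIALIZER;
static atomic_int demo_started, demo_woke;

static void *demo_waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_mp);
    atomic_store(&demo_started, 1);
    int ok = sketch_cv_wait(&demo_cv, &demo_mp);
    atomic_store(&demo_woke, ok);
    pthread_mutex_unlock(&demo_mp);
    return NULL;
}

/* Destroys the cv while holding demo_mp; returns the waiter's debug result. */
int sketch_demo(void)
{
    pthread_t t;
    sketch_cv_init(&demo_cv);
    int rc = pthread_create(&t, NULL, demo_waiter, NULL);
    assert(rc == 0);
    (void)rc;
    while (!atomic_load(&demo_started))
        sched_yield();
    pthread_mutex_lock(&demo_mp);       /* succeeds once the waiter is inside the wait */
    sketch_cv_broadcast(&demo_cv);
    sketch_cv_destroy(&demo_cv);        /* still holding demo_mp here */
    pthread_mutex_unlock(&demo_mp);
    pthread_join(t, NULL);
    return atomic_load(&demo_woke);
}
```

If the last two statements of `sketch_cv_wait` were swapped, so that the mutex is retaken before the reference is dropped, `sketch_demo` would hang: the waiter would block on `demo_mp` while the destroyer spins on `refs`, which is the deadlock the commit message describes. The `owner` field is written by exchange without holding any lock, which is why the check in this sketch can only be best-effort.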
@tuxoko What are your thoughts on this so far? I didn't look at the buildbot errors but I can easily get an assert from condvar:broadcast1 in debug builds because mutex is not held. Is this what the bots are triggering? So far, I've not seen the assert in testing with zfs, however, it would seem that |
I've already fixed that. It's a preexisting race in splat; my patch just
makes it occur almost every time.
zio_wait is absolutely safe from this. Whether such a race exists, I'm
not 100 percent sure.
|
@dweeezil |
@behlendorf |
@behlendorf No problems during high stress testing. LGTM. |
Great, thanks for the quick reply. Merged as: e843553 Don't hold mutex until release cv in cv_wait |