Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ztest: zthr cancel/resume race in spa_export_common() #7744

Closed
behlendorf opened this issue Jul 25, 2018 · 3 comments
Closed

ztest: zthr cancel/resume race in spa_export_common() #7744

behlendorf opened this issue Jul 25, 2018 · 3 comments
Labels
Component: Test Suite Indicates an issue with the test framework or a test case

Comments

@behlendorf
Copy link
Contributor

System information

Type Version/Name
Distribution Name Any
Distribution Version Any
Linux Kernel Any
Architecture x86_64
ZFS Version zfs-0.7.0-1481-g3a549dc7a1
SPL Version zfs-0.7.0-1481-g3a549dc7a1

Describe the problem you're observing

When running ztest the ASSERT(!t->zthr_cancel); in zthr_resume() may be hit. This is caused by two threads concurrently executing ztest_spa_create_destroy(). The first one cancels then resumes the zthr due to the non-zero reference count caused by the other thread attempting the same operation. Stack traces below.

Describe how to reproduce the problem

Locally run zloop.sh and eventually you'll hit this case. Increasing the frequency of the ztest_spa_create_destroy() tests may make it more likely. This is most commonly observed by buildbot.

Include any warning/errors/backtraces from the system logs

Full logs:

http://build.zfsonlinux.org/artifacts/branch/master/zfs-0.7.0-1480-ge106a7b/CentOS-7-x86_64/ztest/ztest-20180725T173335.tar.xz

Relevant stacks.

Thread 552 (Thread 0x7f18da086700 (LWP 25422)):
#0  0x00007f18e5b84995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthrea
#1  0x00007f18e6d36e95 in cv_wait (cv=cv@entry=0x2b91d28, mp=mp@entry=0x2b91cf8)
#2  0x00007f18e6eabadb in zthr_cancel (t=t@entry=0x2b91cf0) at ../../module/zfs/z
#3  0x00007f18e6de1568 in spa_async_suspend (spa=spa@entry=0x29e0310) at ../../mo
#4  0x00007f18e6de6f52 in spa_export_common (pool=pool@entry=0x621980 <ztest_opts
#5  0x00007f18e6de7231 in spa_destroy (pool=pool@entry=0x621980 <ztest_opts> "zte
#6  0x000000000040e5c3 in ztest_spa_create_destroy (zd=<optimized out>, id=<optim
#7  0x0000000000409cbc in ztest_execute (test=<optimized out>, zi=0x620d00 <ztest
#8  0x000000000041101b in ztest_thread (arg=0x0) at ztest.c:6590
#9  0x00007f18e5b80e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f18e58aabad in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f189cdfa700 (LWP 26716)):
#0  0x00007f18e5b84995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthrea
#1  0x00007f18e6d36e95 in cv_wait (cv=cv@entry=0x2c32538, mp=mp@entry=0x2c32508)
#2  0x00007f18e6eab96b in zthr_procedure (arg=0x2c32500) at ../../module/zfs/zthr
#3  0x00007f18e5b80e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f18e58aabad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f18da108700 (LWP 25423)):
#0  0x00007f18e57e2277 in raise () from /lib64/libc.so.6
#1  0x00007f18e57e3968 in abort () from /lib64/libc.so.6
#2  0x00007f18e6d35863 in libspl_assert (line=276, func=0x7f18e6f41591 <__FUNCTIO
#3  0x00007f18e6eabba0 in zthr_resume (t=t@entry=0x2b91cf0) at ../../module/zfs/z
#4  0x00007f18e6de1a98 in spa_async_resume (spa=spa@entry=0x29e0310) at ../../mod
#5  0x00007f18e6de6fd5 in spa_export_common (pool=pool@entry=0x621980 <ztest_opts
#6  0x00007f18e6de7231 in spa_destroy (pool=pool@entry=0x621980 <ztest_opts> "zte
#7  0x000000000040e5c3 in ztest_spa_create_destroy (zd=<optimized out>, id=<optim
#8  0x0000000000409cbc in ztest_execute (test=<optimized out>, zi=0x620d00 <ztest
#9  0x000000000041101b in ztest_thread (arg=0x1) at ztest.c:6590
#10 0x00007f18e5b80e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f18e58aabad in clone () from /lib64/libc.so.6

@sdimitro when you have a chance would you please look at this. There are lots of ways to handle this but I thought you might have a preferred fix.

@behlendorf behlendorf added the Component: Test Suite Indicates an issue with the test framework or a test case label Jul 25, 2018
@sdimitro
Copy link
Contributor

Fixed by #8229

@shartse
Copy link
Contributor

shartse commented Jul 9, 2019

I seem to be hitting this, or something similar again while running zloop on master. Let me know if you need more information. These are some of the stacks I saw:

Thread 477 (Thread 0x7fb46d134700 (LWP 23132)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fb4b1802023 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55ff30f5c698) at ../nptl/pthread_mutex_lock.c:78
#2  0x00007fb4b1df3f71 in mutex_enter (mp=mp@entry=0x55ff30f5c698) at kernel.c:208
#3  0x00007fb4b1f37bf0 in zthr_resume (t=0x55ff30f5c660) at ../../module/zfs/zthr.c:402
#4  0x00007fb4b1e816e3 in spa_async_resume (spa=spa@entry=0x55ff30c53720) at ../../module/zfs/spa.c:7993
#5  0x00007fb4b1e86876 in spa_export_common (pool=<optimized out>, new_state=2, oldconfig=0x0, force=B_FALSE, hardforce=B_FALSE) at ../../module/zfs/spa.c:6233
#6  0x000055ff2ed6977d in ztest_spa_create_destroy (zd=<optimized out>, id=<optimized out>) at ztest.c:2722
#7  0x000055ff2ed64b13 in ztest_execute (test=<optimized out>, zi=0x55ff2ef7e340 <ztest_info+512>, id=17) at ztest.c:6672
#8  0x000055ff2ed6d165 in ztest_thread (arg=0x11) at ztest.c:6719
#9  0x00007fb4b17ff6db in start_thread (arg=0x7fb46d134700) at pthread_create.c:463
#10 0x00007fb4b152888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:9
Thread 464 (Thread 0x7fb46d937700 (LWP 23114)):
#0  0x00007fb4b18059f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55ff30f5c6f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55ff30f5c668, cond=0x55ff30f5c6c8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x55ff30f5c6c8, mutex=mutex@entry=0x55ff30f5c668) at pthread_cond_wait.c:655
#3  0x00007fb4b1df43c9 in cv_wait (cv=cv@entry=0x55ff30f5c6c8, mp=mp@entry=0x55ff30f5c668) at kernel.c:339
#4  0x00007fb4b1f37bab in zthr_cancel (t=0x55ff30f5c660) at ../../module/zfs/zthr.c:385
#5  0x00007fb4b1e81278 in spa_async_suspend (spa=spa@entry=0x55ff30c53720) at ../../module/zfs/spa.c:7967
#6  0x00007fb4b1e867ee in spa_export_common (pool=<optimized out>, new_state=2, oldconfig=0x0, force=B_FALSE, hardforce=B_FALSE) at ../../module/zfs/spa.c:6205
#7  0x000055ff2ed6977d in ztest_spa_create_destroy (zd=<optimized out>, id=<optimized out>) at ztest.c:2722
#8  0x000055ff2ed64b13 in ztest_execute (test=<optimized out>, zi=0x55ff2ef7e340 <ztest_info+512>, id=1) at ztest.c:6672
#9  0x000055ff2ed6d165 in ztest_thread (arg=0x1) at ztest.c:6719
#10 0x00007fb4b17ff6db in start_thread (arg=0x7fb46d937700) at pthread_create.c:463
#11 0x00007fb4b152888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@sdimitro
Copy link
Contributor

Finished diagnosing the issue. It is not exactly the same as it situation gets more involved with how async threads overall (not just zthrs) interact with the rest of the spa_export() code path.

For any future reference, I've issued this bug -> #9015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Test Suite Indicates an issue with the test framework or a test case
Projects
None yet
Development

No branches or pull requests

3 participants