System hangs (bad RIP value) when disk used in pool is removed (zfs-0.6.5.1) #3821
Hello,

I created a new zpool with one disk and started copying files. Then I unplugged the disk, and the system hung instead of suspending I/O. After a few tests I was able to capture this call trace:

![call_trace](https://cloud.githubusercontent.com/assets/7203173/10047534/f71f532a-620e-11e5-8b30-2708e6cf6942.jpg)

On version 0.6.4 everything works well. With the latest ZoL it is impossible to reach the suspended-I/O state because the system always hangs.
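A minimal reproduction sketch, assuming a scratch disk at /dev/sdb and a pool named tank (the sysfs delete hook stands in for physically pulling a SATA/SCSI disk):

```sh
# Create a single-disk pool and generate sustained writes
zpool create tank /dev/sdb
cp -r /usr /tank &

# Simulate hot-removal of the disk; physically unplugging it has the
# same effect
echo 1 > /sys/block/sdb/device/delete

# Expected: ZFS suspends I/O on the pool; on 0.6.5.1 the system hangs
# here instead
zpool status tank
```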
Comments

@ab-oe if you're able to reproduce this, could you try setting the module option spl_taskq_thread_dynamic=0?
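For reference, a typical way to set this SPL module parameter; taskqs are created when the module loads, so setting it at load time is the reliable route (the modprobe.d file name is just a convention):

```sh
# Persist the setting for the next time the spl module loads
echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf

# Or pass it by hand when loading the module
modprobe spl spl_taskq_thread_dynamic=0

# Confirm what the running module picked up
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic
```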
@behlendorf unfortunately, setting spl_taskq_thread_dynamic to 0 doesn't resolve this issue. The system hung just like before.
I performed a bisection and found that the issue with the missing I/O suspend was introduced in b39c22b: the z_null_int thread takes 100% of a CPU and the system is barely responsive, but it still works. I captured a call trace in that state as well.
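For anyone repeating the bisection, it looks roughly like this in a ZoL checkout (tag names assumed to match the repository's release tags):

```sh
git bisect start
git bisect bad zfs-0.6.5.1    # first release showing the hang
git bisect good zfs-0.6.4     # last known-good release
# at each step: rebuild, reload the modules, rerun the unplug test, then
git bisect good               # or: git bisect bad
git bisect reset              # when finished
```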
@ab-oe thanks for posting the debugging. This definitely looks like a duplicate of #3652, and it's clear that the z_null_int thread is getting blocked spinning on the taskq spin lock. Thanks for bisecting the change, that's helpful. Have you tried reverting b39c22b and setting spl_taskq_thread_dynamic to 0? Does that resolve the issue?
@behlendorf yes, it works with b39c22b reverted; it works even if spl_taskq_thread_dynamic is set to 1. I haven't yet found the commit that causes the immediate system hang when a disk is removed.
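Reverting the suspect commit for a test build is a one-liner in a source checkout; the modules then need to be rebuilt and reloaded:

```sh
git revert --no-edit b39c22b
./autogen.sh && ./configure && make && make install
# reload the spl/zfs modules (or reboot) and repeat the hot-unplug test
```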
Fix proposed in #3833.
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed side effect of making vdev_disk_io_start() synchronous for certain I/Os, which in turn allowed vdev_disk_io_start() to re-dispatch zios, resulting in RCU stalls when a disk was removed from the system. Additionally, this could negatively impact performance and explains the performance regressions reported in both #3829 and #3780. This patch resolves the issue by making the blocking behavior dependent on a 'wait' flag being passed rather than overloading the passed bio flags. Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to non-rotational devices, where there is no benefit to queuing to aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652 Issue #3780 Issue #3785 Issue #3817 Issue #3821 Issue #3829 Issue #3832 Issue #3870
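A quick way to see which devices the restricted READ_SYNC/WRITE_SYNC behavior applies to is the block layer's rotational flag (the device name here is an example):

```sh
# ROTA: 1 = rotational (flags not set), 0 = non-rotational/SSD (flags set)
lsblk -d -o NAME,ROTA
cat /sys/block/sda/queue/rotational   # per-device sysfs view
```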
Resolved by 5592404, which will be cherry-picked into the 0.6.5.2 release.
@behlendorf thank you. Commit 5592404 fixes this issue.