zpool remove cache device blocks pool with no device I/O #11635
Comments
I have reproduced this problem on another server running zfs 2.0.3. It has been >10 minutes since I ran
And here is the end of the kernel debug buffer, which is not growing in size:
At this point I can find no indication of forward progress being made and I will likely have to reset this system.
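For anyone trying to capture the same diagnostics, this is roughly how they can be gathered (a sketch; it assumes the ZFS kernel modules are loaded and sysrq is enabled, and the tail lengths are arbitrary):

# End of the ZFS kernel debug buffer quoted above
tail -n 200 /proc/spl/kstat/zfs/dbgmsg

# Dump stack traces of all blocked (D-state) tasks into the kernel log,
# which is where the deadlock-looking traces come from
echo w > /proc/sysrq-trigger
dmesg | tail -n 300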
We were able to get the original command in this issue
I think I have a similar issue. Here is another data point:
Looking at just the traces, it looks like some form of deadlock?
Thanks a lot for your work! Same problem here, on different servers, with different kernels/distros and versions of ZFS. Last time, on Saturday, I had it on a PowerEdge R730xd with Ubuntu 20.10 and git ZFS (commit: 891568c). No errors in dmesg, everything keeps working, but the ZFS pool gets stuck. Funny thing:
and after
To add to my previous report, I also saw the same phenomenon as @Gelma, where the cache device was initially installed via a
I successfully removed several 100% used cache devices from a zpool under heavy load after upgrading to zfs 2.0.5 without any problem. Many thanks for fixing this.
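For reference, one way to see how full the cache devices are before attempting the removal (a sketch; home1 is the pool name used elsewhere in this issue):

# Per-vdev sizes and allocation, including the cache (L2ARC) devices
zpool list -v home1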
System information
Describe the problem you're observing
Removing a cache device causes the underlying pool to reproducibly lock up for at least a few minutes until Pacemaker fences off the server.
Describe how to reproduce the problem
zpool remove home1 nvme9n1p1
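A minimal way to watch the pool while the removal hangs (a sketch; the 5-second interval is arbitrary, and iostat is the same sysstat tool referenced in the logs section below):

# In a second shell, watch I/O on the pool's member devices;
# during the hang they all drop to zero
iostat -x 5

# Per-vdev view from the ZFS side
zpool iostat -v home1 5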
Include any warning/errors/backtraces from the system logs
While waiting for Pacemaker to STONITH, iostat reports no activity on any of the pool devices, and a system reset is required to restore access to the pool. After a reboot all of the original devices are still present, and a second attempt at zpool remove results in the same behavior. However, I was able to run "zpool offline", and I have deliberately faulted the 4 cache devices below to prepare them for physical removal from this system.
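A sketch of that workaround, using the same pool and device names as the reproduction step above (zpool offline -f puts the device into a faulted state):

# Take the cache device offline instead of removing it
zpool offline home1 nvme9n1p1

# Or force it into a faulted state before physically pulling it
zpool offline -f home1 nvme9n1p1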
This has also been seen by others, as reported here.
Note: this is a fairly large pool, and the 4 devices to be removed are the same model (Intel 7.6 TB DC P4610 NVMe) as the 2 remaining cache devices.