Unable to import pool, mirror pool gets corrupted #10942
Comments
Is it always the same vdev (e.g. mirror-1) that reports corruption when this happens? Is your setup that each of the two servers has one local and one remote disk in each vdev, so when you forcibly power cycle the one hosting the pool, the "remote" disks all become unavailable? In general (note that this is my vague understanding and may be wrong) I would expect writes to be considered "succeeded" when they're written to all disks in a mirror vdev, so getting corruption on only one disk of a pair surprises me somewhat, depending on how "forced" the shutdown was, and I'd further expect any sort of mangling to be limited to "oops we're rolling back to the last stable transaction on disk", not "ooh this disk is mangled I can't make progress." It looks like the error you're getting, 52/EBADE, is used internally to mean a checksum error, and the log says it tried rolling back several transaction groups and still got checksum errors trying to load the metaslab list from that disk.
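For reference, that rollback behaviour can also be driven by hand with the recovery-mode import flags. A hedged sketch, with "tank" standing in for the real pool name:

# Read-only, recovery-mode import that rewinds to an earlier transaction group
zpool import -o readonly=on -F tank
# Dry run of a more aggressive (extreme) rewind; -n only reports what would happen
zpool import -F -X -n tank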
No, it differs: sometimes it is mirror-1, sometimes mirror-3, seemingly at random.
Yes
Yes, a scrub repairs some errors on these disks.
Yes, each mirror vdev is created with one local and one remote disk. I think a quite important piece of information is that on ZFS 0.7 with kernel 4.4 I cannot reproduce such corruption :(
I have tested release ZFS 0.7.13 and it seems that the corruption does not happen on this version. The first ZFS version where the corruption occurs is ZFS 0.8-rc1.
@arturpzol I suspect this may be your issue. When the device doesn't support a write cache, ZFS won't issue cache flush commands, since it was told there's no cache to flush.
However, it sounds like some iSCSI target may misreport this. ZFS will only disable cache flushes when it issues a flush to the device and it returns a "not supported" error. I'm not sure exactly what LIO does, but it's possible that with the 4.4 kernel it accepts the command even though it claims not to support it. I'd suggest testing zfs-0.8.4 using the 4.4 kernel if you haven't already. If you're comfortable building from source, you can apply the following patch to log a console message when ZFS determines that the device doesn't support cache flushing and disables flushes.

diff --git a/module/os/linux/zfs/vdev_disk.c b/module/os/linux/zfs/vdev_disk.c
index e6e7df3..f99baa2 100644
--- a/module/os/linux/zfs/vdev_disk.c
+++ b/module/os/linux/zfs/vdev_disk.c
@@ -621,8 +621,10 @@ BIO_END_IO_PROTO(vdev_disk_io_flush_completion, bio, error)
 	zio->io_error = -error;
 #endif

-	if (zio->io_error && (zio->io_error == EOPNOTSUPP))
+	if (zio->io_error && (zio->io_error == EOPNOTSUPP)) {
+		printk(KERN_WARNING "ZFS: Disabling cache flushes");
 		zio->io_vd->vdev_nowritecache = B_TRUE;
+	}

 	bio_put(bio);
 	ASSERT3S(zio->io_error, >=, 0);
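Independent of the patch, it may be worth cross-checking what the initiator-side block layer and the iSCSI LUN itself report about the write cache. A rough sketch, with sdX as a placeholder device (sdparm may need to be installed):

# What the Linux block layer believes (newer kernels): "write back" or "write through"
cat /sys/block/sdX/queue/write_cache
# What the SCSI device / iSCSI target advertises in its caching mode page
sdparm --get=WCE /dev/sdX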
@behlendorf I have tested:
so I think the kernel version doesn't matter. Additionally, cache flushes are enabled on all my environments. I took the liberty of modifying your patch:
and the kernel log shows:
I also tested ZFS 0.8 with vol_request_sync=1, but without any change. Is there anything else I can check?
Yes, it does seem that way, though it's a little surprising since this code hasn't changed in a long time. The next thing I'd suggest checking is that the cache flushes themselves are being issued successfully. This is done by the vdev_disk_io_flush() call; its result can be logged like this:

	error = vdev_disk_io_flush(vd->vd_bdev);
+	zfs_dbgmsg("vdev_disk_io_flush(%s) = %d", vd->vd_path, error);
	if (error == 0) {

The internal log can be read by dumping the
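As a hedged aside, if the internal debug log appears empty on a release build, it usually has to be enabled first via a module parameter (double-check the parameter name exists on your version):

# Enable zfs_dbgmsg() output at runtime
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable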
@behlendorf with debugging enabled I can see that the flush is executed each time without error:
I have noticed one important point. After a hard shutdown, when the node comes back, an automatic resilver is performed (because the remote iSCSI disks are back), and if I manually run a scrub after the resilver has finished, the corruption does not occur. In order to trigger the corruption I need to shut down the nodes one after the other, so an example of the full scenario is:
so now if I power on node A and its disks come back, the pool can be successfully imported. @behlendorf as you wrote, the code for flushing hasn't changed in a long time, so maybe the resilver (rewritten in ZFS 0.8-rc1) is not complete and some data is lost.
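One thing worth ruling out in that scenario is failing over before the automatic resilver has fully completed; a quick sketch for checking that, with "tank" as a placeholder pool name (zpool wait only exists in OpenZFS 2.0 and later):

# Show the current scan (resilver/scrub) state
zpool status tank | grep -A 1 'scan:'
# Block until any in-progress resilver finishes (2.0+ only)
zpool wait -t resilver tank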
@behlendorf I have reproduced the issue with shared storage: two SAS JBODs and two nodes in the cluster. The mirrored vdevs were created with disks from both SAS JBODs. To trigger the corruption, one JBOD and the node with the active pool were powered off, and so on. I wanted to eliminate the iSCSI initiator and target layer (remote disks), so the issue can also be reproduced with a plain hardware connection.
@behlendorf I did. The issue can be simulated with the following script:
The test takes a few loops (e.g. with ZFS 0.8.3) to reach the corruption, with one of the errors below:
Another case:
Of course, if we plug the removed devices back in, the corruption disappears:
but on real hardware, when disks are damaged, plugging the disks back in is not possible. Similar bugs are reported in #10161 and #10910. I tried to disable features: @behlendorf, @ahrens do you see any change in this commit that could lead to such corruption (e.g. resilver, synchronization or something else)?
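For anyone who wants to experiment without a two-node cluster, below is a rough, illustrative sketch of the alternating-failure pattern using file vdevs. It is not the original reproduction script: the real setups used iSCSI or SAS disks and forced power-offs, so clean exports and file-backed vdevs are only an approximation, and the pool name, paths, sizes and loop count are arbitrary.

#!/bin/bash
# Alternate which half of a mirror is missing across imports, resilvering in
# between, roughly mimicking the failover scenario described above.
set -x
truncate -s 1G /var/tmp/d0 /var/tmp/d1
zpool create -f testpool mirror /var/tmp/d0 /var/tmp/d1
zpool export testpool

for i in $(seq 1 20); do
    # "Node A dies": import with only d1 present and keep writing
    mv /var/tmp/d0 /var/tmp/d0.hidden
    zpool import -d /var/tmp testpool
    dd if=/dev/urandom of=/testpool/a.$i bs=1M count=64

    # "Node A returns": its half reappears and is resilvered
    mv /var/tmp/d0.hidden /var/tmp/d0
    zpool online testpool /var/tmp/d0
    zpool wait -t resilver testpool 2>/dev/null || sleep 10   # zpool wait is 2.0+ only

    # "Node B dies": re-import with only d0 present, then swap the roles back
    zpool export testpool
    mv /var/tmp/d1 /var/tmp/d1.hidden
    zpool import -d /var/tmp testpool
    dd if=/dev/urandom of=/testpool/b.$i bs=1M count=64

    mv /var/tmp/d1.hidden /var/tmp/d1
    zpool online testpool /var/tmp/d1
    zpool wait -t resilver testpool 2>/dev/null || sleep 10
    zpool export testpool
done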
@behlendorf, @ahrens is there a chance you could have a look into this issue? I tried to partially revert commit a1d477c, but it has a lot of dependencies and a revert is not possible in an easy way. When PR #6900 was in progress, some comments about corruption were posted and a new PR, openzfs/openzfs#561, was proposed. I tried it, but unfortunately the corruption also occurs with that PR.
@arturpzol thanks for bisecting this to narrow it down. I'll try to find some time to investigate.
I tried to initialize the pool using
@behlendorf did you have a chance to look into this issue? |
I have done some research, and if
Maybe some part of the space map is saved with wrong shifting, or something silently overwrites part of the space map. As written above, this issue was introduced in a1d477c. @behlendorf are you able to look into it or suggest what can be checked next?
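As a hedged suggestion for digging into the space map theory, zdb can dump metaslab and space map information for comparison before and after a failover ("tank" is a placeholder; the amount of detail added by repeating -m varies between versions, and -e is needed for a pool that is not imported):

# Metaslab / space map summary for each vdev of an imported pool
zdb -m tank
# Same for an exported or non-importable pool
zdb -e -m tank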
I reproduced this issue with @arturpzol's script. It seems that the code related to the mutex removal from the RT (range tree) introduced this issue in a1d477c. Here's the code that fixes this issue. For test purposes I dropped locking the vd->vdev_dtl_lock mutex to avoid a deadlock.
|
@arturpzol @arko-pl thank you for doing the hard work of identifying the offending commit and isolating the problematic code! Based on your findings and test case I've opened PR #11218 with a fix similar to the one above (but with locking). It's held up well in my local testing using your test case; however, there's nothing quite like the real thing. If it wouldn't be too much to ask, would you mind verifying the fix in your test environment?
Great find, thanks! @behlendorf how do the missing DTL entries cause the pool to not be importable? Without the DTL entries that are encountered as a result of the ZIL claims, shouldn't we try both sides of the mirror? Unless there's some block that is not checksummed and we are blindly trusting what we get from disk?
@ahrens the heart of the problem is that the missing DTL entries cause any resilvering which happens before So the issue is really more the use of
Interesting. It would be good to preserve that reasoning somewhere - in a comment or the commit message. Maybe we shouldn't be doing resilvering while loading (before SPA_LOAD_NONE), which would be another way of addressing the problem. That should speed up the import as well. We can also make the changes proposed here if you like.
@behlendorf thank you very much for the fix. It looks promising; so far the tests have been stable for several hours, without any problem importing the pool. I will let you know the full results after the tests on a few different environments.
@behlendorf after a few days of tests on a few different environments, the fix seems to be stable. I think this bug can be closed. Thank you.
System information
Describe the problem you're observing
I have a mirrored pool which is created with local and remote disks. The remote disks are connected via iSCSI (Open-iSCSI as initiator and LIO as target). The nodes are configured in a cluster. When the node which has the active pool is forcibly shut down, sometimes the pool cannot be imported on the second node:
When the node which was shut down is powered on again and the pool is imported with all disks, everything works correctly (the pool can be imported).
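For reference, the layout is of roughly this shape; the device names below are placeholders, not my real paths:

# Each mirror vdev pairs one local disk with one remote iSCSI disk
zpool create tank \
    mirror LOCAL_DISK_0 REMOTE_ISCSI_DISK_0 \
    mirror LOCAL_DISK_1 REMOTE_ISCSI_DISK_1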
I have tested a few kernels (4.19, 5.7) and ZFS versions (0.8.3, 2.0.0-rc1_71_g51de0fc), on real hardware and on virtual machines, but in all cases the corruption can be reproduced. On ZFS 0.7 with kernel 4.4 I cannot reproduce this issue.
One of the ZFS parameters may have an impact on this, but it is set to 0:
Write cache (WCE bit) on all disks is disabled:
It looks like the mirror vdevs are not fully synchronized. Is that possible?
I have a 100% reproducible scenario so I can debug it deeper. If possible, please suggest anything I should check.
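To check the synchronization question above, a scrub forces every copy in each mirror to be read and verified; for example, with "tank" as a placeholder pool name:

zpool scrub tank
zpool status -v tank    # check the "scan:" line and the CKSUM column once it completes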
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs