Disk Failure causes offline pool with enabled multihost #7709
Comments
By default, multihost is configured to automatically suspend the pool if it can no longer write to any of the disks. This is the only completely safe behavior, since once the writes stop the pool could be imported by the failover system. It looks like your disk failed in such a way that all of the disks were unreachable for 5 seconds. You can use the
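Quick check (my addition, not from the thread): whether MMP is actually enabled on a pool can be confirmed via the multihost pool property; the pool name dpool01 is taken from the kernel log further down.

# Confirm that MMP (multihost) is enabled on the pool
zpool get multihost dpool01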
I think there may be a bug or non-optimal tuning when multihost is enabled, see #7045. Setting /sys/module/zfs/parameters/zfs_multihost_interval to 2000 allowed me to issue a 'zpool scrub zpool' on a healthy pool instead of getting it suspended due to failed multihost writes.
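A minimal sketch of that workaround (my commands, not the commenter's; the modprobe.d file name is arbitrary): raise the MMP write interval at runtime, optionally persist it as a module option, then retry the scrub.

# Raise the MMP write interval to 2000 ms on the running system
echo 2000 > /sys/module/zfs/parameters/zfs_multihost_interval
# Optionally persist it across reboots (file name is just an example)
echo "options zfs zfs_multihost_interval=2000" > /etc/modprobe.d/zfs-mmp.conf
# Retry the scrub ('zpool' is the pool name used in the comment above)
zpool scrub zpool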
This popped up on zfs-discuss and may or may not be related:
We've seen similar problems on occasion, and have worked around them by increasing the timeout value. However, this ticket and the recent discussion on the list make me think the 5s timeout is too short for most real hardware problems. Any kind of SCSI or PCI bus reset, network timeout for iSCSI, or even TLER will exceed this limit. It probably makes sense to increase the default to cover common hardware timeout retries. My understanding is that TLER is 5s, so moving up to the 7-10s range would probably avoid this. The other possibility is that, if/when the previously-submitted writes complete, the kernel can retry the MMP uberblock scanning process to see whether any uberblocks have been modified by a different node before re-acquiring the device. In that case the error would mostly be informative, indicating a serious I/O error, and the system could recover once the error is gone.
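For context on where the 5s figure comes from: when zfs_multihost_fail_intervals > 0, the pool is suspended once zfs_multihost_fail_intervals * zfs_multihost_interval milliseconds pass without a successful MMP write. If I recall the 0.7.x defaults correctly (5 intervals of 1000 ms), that works out to exactly 5 seconds. The window on a live system can be computed like this:

# Suspension window in milliseconds = interval * fail_intervals (when fail_intervals > 0)
interval_ms=$(cat /sys/module/zfs/parameters/zfs_multihost_interval)
fail_intervals=$(cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals)
echo $(( interval_ms * fail_intervals ))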
I agree the default zfs_multihost_fail_intervals is too low. For the cases I've seen, just one more second would have been enough. I wish I had more examples, but the ones I have all fit with your suggestion. My intent is to increase the default in the 0.7 stable branch once I've arrived at a value I can justify and we've tested it a bit internally. For master, re-scanning the uberblocks would be good; it's just a matter of finding the time.
When multihost is enabled and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, however, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but it does not depend on ub_mmp_delay: it is zfs_multihost_fail_intervals * zfs_multihost_interval. If zfs_multihost_fail_intervals == 0 on that node, there is no such well-defined safe duration, and the importing host cannot tell whether mmp_delay is high due to I/O delays or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. ZTS tests are added to verify the new functionality.

In addition, it makes a few other improvements:
* Sets mmp_fail_intervals to 10 by default, so that a brief, temporary interruption of I/O does not result in MMP suspending the pool. (issue openzfs#7709)
* Updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner when the pool is idle, since it is not limited to the granularity of ub_timestamp (1 second).
* Issues MMP writes immediately when zfs_multihost_interval is changed, so remote hosts see the updated value as soon as possible.
* Fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension.
* Reports nanoseconds remaining in the activity test via /proc/spl/kstat/zfs/<pool>/activity_test (during a tryimport, where the test is normally performed, the pool name is $import).

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
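A small, hedged illustration of the observability pieces mentioned above (the multihost_history kstat already exists today; the activity_test kstat is only present on builds carrying this patch):

# Per-pool MMP write history, including mmp_delay and, with this patch, the MMP config
cat /proc/spl/kstat/zfs/dpool01/multihost_history
# Remaining activity-test time during a tryimport on a patched build; the pool name
# is literally $import while the test runs, so quote it for the shell
cat '/proc/spl/kstat/zfs/$import/activity_test'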
Is there any plan to backport these MMP fixes to 0.7.x? I suppose that also raises a separate question: is there any plan to make another 0.7.x release at all?
To reply to my own comment: I guess I was confused because the #8495 patch was never landed for a 0.7.x release.
System information

Type | Version/Name
---- | ------------
Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10
Architecture | x86_64
ZFS Version | 0.7.9
SPL Version | 0.7.9
Describe the problem you're observing
This is a 2-node setup with 1 shared SAS JBOD. After a disk failure, the pool went offline due to the enabled multihost setting.
Include any warning/errors/backtraces from the system logs
-- snip --
[136990.068305] sd 1:0:7:0: [sdh] tag#1 Sense Key : Recovered Error [current]
[136990.100575] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Write error - recovered with auto reallocation
[139351.836808] sd 1:0:7:0: [sdh] tag#14 Sense Key : Recovered Error [current]
[139351.869630] sd 1:0:7:0: [sdh] tag#14 ASC=0xc <<vendor>>ASCQ=0x81
[139351.898490] sd 1:0:7:0: [sdh] tag#15 Sense Key : Recovered Error [current]
[139351.931521] sd 1:0:7:0: [sdh] tag#15 ASC=0xc <<vendor>>ASCQ=0x81
[139354.339299] sd 1:0:7:0: [sdh] tag#23 Sense Key : Recovered Error [current]
[139354.372275] sd 1:0:7:0: [sdh] tag#23 Add. Sense: Peripheral device write fault
[139363.873230] sd 1:0:7:0: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[139363.911655] sd 1:0:7:0: [sdh] tag#1 Sense Key : Aborted Command [current]
[139363.943773] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Peripheral device write fault
[139363.977121] sd 1:0:7:0: [sdh] tag#1 CDB: Write(10) 2a 00 24 78 4e d0 00 00 10 00
[139364.011925] blk_update_request: I/O error, dev sdh, sector 611864272
[139374.130283] sd 1:0:7:0: [sdh] tag#37 Sense Key : Recovered Error [current]
[139374.149168] sd 1:0:7:0: [sdh] tag#51 Sense Key : Recovered Error [current]
[139374.149171] sd 1:0:7:0: [sdh] tag#51 Add. Sense: Peripheral device write fault
[139374.230303] sd 1:0:7:0: [sdh] tag#37 Add. Sense: Peripheral device write fault
[139413.127556] WARNING: MMP writes to pool 'dpool01' have not succeeded in over 5s; suspending pool
[139413.168742] WARNING: Pool 'dpool01' has encountered an uncorrectable I/O failure and has been suspended.
--snip--
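Not part of the original report, but worth noting for anyone who ends up in this state: once the underlying disk problem has been dealt with (and assuming the failover node has not imported the pool in the meantime), I/O on a suspended pool can be resumed with zpool clear.

# Inspect the suspended pool
zpool status dpool01
# Resume I/O after the device failure has been addressed
zpool clear dpool01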