Disk Failure causes offline pool with enabled multihost #7709

Closed
jserviceorg opened this issue Jul 12, 2018 · 7 comments
Labels
Type: Documentation (Indicates a requested change to the documentation)


@jserviceorg

System information

Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10
Architecture | x86_64
ZFS Version | 0.7.9
SPL Version | 0.7.9

Describe the problem you're observing

This is a 2-node setup with 1 shared SAS JBOD. After a disk failure, the pool went offline due to the enabled multihost setting.

Include any warning/errors/backtraces from the system logs

-- snip --
[136990.068305] sd 1:0:7:0: [sdh] tag#1 Sense Key : Recovered Error [current]
[136990.100575] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Write error - recovered with auto reallocation
[139351.836808] sd 1:0:7:0: [sdh] tag#14 Sense Key : Recovered Error [current]
[139351.869630] sd 1:0:7:0: [sdh] tag#14 ASC=0xc <>ASCQ=0x81
[139351.898490] sd 1:0:7:0: [sdh] tag#15 Sense Key : Recovered Error [current]
[139351.931521] sd 1:0:7:0: [sdh] tag#15 ASC=0xc <>ASCQ=0x81
[139354.339299] sd 1:0:7:0: [sdh] tag#23 Sense Key : Recovered Error [current]
[139354.372275] sd 1:0:7:0: [sdh] tag#23 Add. Sense: Peripheral device write fault
[139363.873230] sd 1:0:7:0: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[139363.911655] sd 1:0:7:0: [sdh] tag#1 Sense Key : Aborted Command [current]
[139363.943773] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Peripheral device write fault
[139363.977121] sd 1:0:7:0: [sdh] tag#1 CDB: Write(10) 2a 00 24 78 4e d0 00 00 10 00
[139364.011925] blk_update_request: I/O error, dev sdh, sector 611864272
[139374.130283] sd 1:0:7:0: [sdh] tag#37 Sense Key : Recovered Error [current]
[139374.149168] sd 1:0:7:0: [sdh] tag#51 Sense Key : Recovered Error [current]
[139374.149171] sd 1:0:7:0: [sdh] tag#51 Add. Sense: Peripheral device write fault
[139374.230303] sd 1:0:7:0: [sdh] tag#37 Add. Sense: Peripheral device write fault

[139413.127556] WARNING: MMP writes to pool 'dpool01' have not succeeded in over 5s; suspending pool
[139413.168742] WARNING: Pool 'dpool01' has encountered an uncorrectable I/O failure and has been suspended.
--snip--

@behlendorf
Contributor

By default, multihost is configured to automatically suspend the pool if it can no longer write to any of the disks. This is the only completely safe behavior, since once the writes stop the pool could be imported by the failover system. It looks like your disk failed in such a way that all of the disks were unreachable for 5 seconds.

You can use the zfs_multihost_fail_intervals module parameter to control the automatic suspend behavior. But you'll need to be absolutely sure any failover software you're running won't attempt to import the pool on the other system.

       zfs_multihost_fail_intervals (uint)
                   Controls the behavior of the pool when multihost write
                   failures are detected.

                   When zfs_multihost_fail_intervals = 0 then multihost write
                   failures are ignored.  The failures will still be reported
                   to the ZED which depending on its configuration may take
                   action such as suspending the pool or offlining a device.

                   When zfs_multihost_fail_intervals > 0 then sequential
                   multihost write failures will cause the pool to be
                   suspended.  This occurs when zfs_multihost_fail_intervals *
                   zfs_multihost_interval milliseconds have passed since the
                   last successful multihost write.  This guarantees the
                   activity test will see multihost writes if the pool is
                   imported.

                   Default value: 5.
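
In other words, the suspension window is just the product of the two tunables. A rough sketch of that arithmetic (not ZFS code; it assumes the 0.7.x defaults of zfs_multihost_interval = 1000 ms and zfs_multihost_fail_intervals = 5, which is where the 5s in the log above comes from):

    # Illustration only, not ZFS code. With the defaults noted above, the pool
    # suspends once 5 seconds pass without a successful MMP write, matching
    # the "over 5s" message in the log.

    def mmp_suspend_window_ms(multihost_interval_ms=1000, fail_intervals=5):
        """Milliseconds without a successful MMP write before auto-suspend.
        fail_intervals == 0 disables auto-suspend entirely."""
        if fail_intervals == 0:
            return None  # failures are ignored; only the ZED may react
        return fail_intervals * multihost_interval_ms

    print(mmp_suspend_window_ms())          # 5000 -> the 5s window in the log
    print(mmp_suspend_window_ms(1000, 10))  # 10000 -> rides out brief outages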

behlendorf added the Type: Documentation label on Jul 12, 2018
@gerardba

I think there may be a bug or non-optimal tuning when multihost is enabled, see #7045

Setting /sys/module/zfs/parameters/zfs_multihost_interval to 2000 let me issue a 'zpool scrub zpool' on a healthy pool without it getting suspended due to failed multihost writes.
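
For anyone wanting to try the same workaround, a rough sketch of applying it (requires root; the sysfs and modprobe.d paths are the standard ones, adjust the value to taste):

    # Sketch of the workaround above; run as root. The sysfs write takes
    # effect immediately, the modprobe.d entry makes it persistent.
    from pathlib import Path

    INTERVAL_MS = 2000  # value reported to help above

    # Live change for the currently loaded zfs module.
    Path("/sys/module/zfs/parameters/zfs_multihost_interval").write_text(
        f"{INTERVAL_MS}\n"
    )

    # Persist across reboots/module reloads. NOTE: this overwrites any
    # existing /etc/modprobe.d/zfs.conf; merge by hand if you already have one.
    Path("/etc/modprobe.d/zfs.conf").write_text(
        f"options zfs zfs_multihost_interval={INTERVAL_MS}\n"
    )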

@tonyhutter
Contributor

This popped up on zfs-discuss and may or may not be related:
http://list.zfsonlinux.org/pipermail/zfs-discuss/2019-January/033095.html

@adilger
Contributor

adilger commented Jan 11, 2019

We've seen similar problems on occasion, and have worked around them by increasing the timeout value. However, this ticket and the recent discussion on the list make me think the 5s timeout is too short for most real hardware problems. Any kind of SCSI or PCI bus reset, network timeout for iSCSI, and even TLER will exceed this limit.

It probably makes sense to increase the default to cover common hardware timeout retries. My understanding is that TLER is 5s, so moving up to the 7-10s range would probably avoid this.

The other possibility is that if/when the previously-submitted writes complete, the kernel could retry the MMP uberblock scanning process to see whether any uberblocks were modified by a different node before re-acquiring the device. In that case the error would mostly be informative, indicating a serious I/O error, but the system could recover once it has cleared.
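
Sketched as a decision, the recovery path being proposed might look like this (illustrative only; none of these names exist in ZFS, and the uberblock re-scan result is passed in as a plain boolean):

    # Illustrative sketch of the proposed recovery path; not ZFS code.

    def after_stalled_mmp_writes_complete(pool, other_host_wrote_uberblocks):
        """Decide whether a suspended pool can resume once its delayed MMP
        writes finally land.

        other_host_wrote_uberblocks: result of re-scanning the MMP uberblocks
        to see whether another node modified them during the stall.
        """
        if other_host_wrote_uberblocks:
            # Another host may have imported the pool; staying suspended is
            # the only safe option.
            return f"pool '{pool}' remains suspended"
        # The stall was a local I/O problem only; report it and carry on
        # instead of requiring a manual recovery.
        return f"pool '{pool}' resumes; the I/O error is logged as informational"

    print(after_stalled_mmp_writes_complete("dpool01", False))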

@ofaaland
Contributor

I agree the default zfs_multihost_fail_intervals is too low. For the cases I've seen, just one more second would have been enough. I wish I had more examples, but the ones I do have all fit with your suggestion. My intent is to increase the default in the 0.7 stable branch once I've arrived at a value I can justify and we've tested a bit internally.

For master, re-scanning the uberblocks would be good; it's just a matter of finding the time.

ofaaland added a commit to ofaaland/zfs that referenced this issue Jan 29, 2019
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, however, is not enough
information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

  zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

ZTS tests are added to verify the new functionality.

In addition, it makes a few other improvements:
* Set mmp_fail_intervals to 10 by default so that a brief, temporary
  interruption of I/O does not result in MMP suspending the pool.
  (issue openzfs#7709)
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* It reports nanoseconds remaining in the activity test via
  /proc/spl/kstat/zfs/<pool>/activity_test (during a tryimport,
  where the test is normally performed, the pool name is $import)

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
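
The activity-test bound that commit message describes works out to roughly the following (a sketch of the decision only, reusing the names from the message; this is not the actual import code):

    # Sketch of the importing host's reasoning once ub_mmp_config carries the
    # exporting host's settings, per the commit message above. Not ZFS code.

    def activity_test_lower_bound_ms(fail_intervals, interval_ms):
        """fail_intervals / interval_ms: the settings recorded by the host
        that last imported the pool."""
        if fail_intervals > 0:
            # Well-defined safe minimum: that host suspends itself after this
            # long without a successful MMP write, so waiting at least this
            # long is enough to detect activity.
            return fail_intervals * interval_ms
        # fail_intervals == 0: no well-defined bound; the importer must fall
        # back to a heuristic based on ub_mmp_delay, which is why the test
        # could previously run far longer than necessary.
        return None

    print(activity_test_lower_bound_ms(10, 1000))  # 10000 ms with the new default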
ofaaland added a commit to ofaaland/zfs that referenced this issue Feb 7, 2019
(Same commit message as the Jan 29, 2019 commit above.)
ofaaland added a commit to ofaaland/zfs that referenced this issue Feb 27, 2019
(Same commit message as the Jan 29, 2019 commit above.)
@adilger
Contributor

adilger commented Oct 15, 2019

Is there any plan to backport these MMP fixes to 0.7.x? I guess that also raises the separate question of whether there is any plan to make another 0.7.x release at all.

@adilger
Contributor

adilger commented Oct 15, 2019

To reply to my own comment:

I guess I was confused because the #8495 patch was never landed for a 0.7.x release.
