Disk Failure causes offline pool with enabled multihost #7709

Closed
jserviceorg opened this issue Jul 12, 2018 · 7 comments
Labels
Type: Documentation (Indicates a requested change to the documentation)


@jserviceorg

System information

Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10
Architecture | x86_64
ZFS Version | 0.7.9
SPL Version | 0.7.9

Describe the problem you're observing

This is a 2-node setup with 1 shared SAS JBOD. After a disk failure, the pool went offline due to the enabled multihost setting.

Include any warning/errors/backtraces from the system logs

-- snip --
[136990.068305] sd 1:0:7:0: [sdh] tag#1 Sense Key : Recovered Error [current]
[136990.100575] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Write error - recovered with auto reallocation
[139351.836808] sd 1:0:7:0: [sdh] tag#14 Sense Key : Recovered Error [current]
[139351.869630] sd 1:0:7:0: [sdh] tag#14 ASC=0xc <>ASCQ=0x81
[139351.898490] sd 1:0:7:0: [sdh] tag#15 Sense Key : Recovered Error [current]
[139351.931521] sd 1:0:7:0: [sdh] tag#15 ASC=0xc <>ASCQ=0x81
[139354.339299] sd 1:0:7:0: [sdh] tag#23 Sense Key : Recovered Error [current]
[139354.372275] sd 1:0:7:0: [sdh] tag#23 Add. Sense: Peripheral device write fault
[139363.873230] sd 1:0:7:0: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[139363.911655] sd 1:0:7:0: [sdh] tag#1 Sense Key : Aborted Command [current]
[139363.943773] sd 1:0:7:0: [sdh] tag#1 Add. Sense: Peripheral device write fault
[139363.977121] sd 1:0:7:0: [sdh] tag#1 CDB: Write(10) 2a 00 24 78 4e d0 00 00 10 00
[139364.011925] blk_update_request: I/O error, dev sdh, sector 611864272
[139374.130283] sd 1:0:7:0: [sdh] tag#37 Sense Key : Recovered Error [current]
[139374.149168] sd 1:0:7:0: [sdh] tag#51 Sense Key : Recovered Error [current]
[139374.149171] sd 1:0:7:0: [sdh] tag#51 Add. Sense: Peripheral device write fault
[139374.230303] sd 1:0:7:0: [sdh] tag#37 Add. Sense: Peripheral device write fault

[139413.127556] WARNING: MMP writes to pool 'dpool01' have not succeeded in over 5s; suspending pool
[139413.168742] WARNING: Pool 'dpool01' has encountered an uncorrectable I/O failure and has been suspended.
--snip--

@behlendorf
Contributor

By default, multihost is configured to automatically suspend the pool if it can no longer write to any of the disks. This is the only completely safe behavior, since once the writes stop the pool could be imported by the failover system. It looks like your disk failed in such a way that all of the disks were unreachable for 5 seconds.

You can use the zfs_multihost_fail_intervals module parameter to control the automatic suspend behavior. But you'll need to be absolutely sure any failover software you're running won't attempt to import the pool on the other system.

       zfs_multihost_fail_intervals (uint)
                   Controls the behavior of the pool when multihost write
                   failures are detected.

                   When zfs_multihost_fail_intervals = 0 then multihost write
                   failures are ignored.  The failures will still be reported
                   to the ZED which depending on its configuration may take
                   action such as suspending the pool or offlining a device.

                   When zfs_multihost_fail_intervals > 0 then sequential
                   multihost write failures will cause the pool to be
                   suspended.  This occurs when zfs_multihost_fail_intervals *
                   zfs_multihost_interval milliseconds have passed since the
                   last successful multihost write.  This guarantees the
                   activity test will see multihost writes if the pool is
                   imported.

                   Default value: 5.
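
In other words, the suspension window is just the product of the two tunables. A rough sketch of that arithmetic (not ZFS code; it assumes the 0.7.x defaults of zfs_multihost_interval = 1000 ms and zfs_multihost_fail_intervals = 5, which is where the 5s in the log above comes from):

    # Illustration only, not ZFS code. With the defaults noted above, the pool
    # suspends once 5 seconds pass without a successful MMP write, matching
    # the "over 5s" message in the log.

    def mmp_suspend_window_ms(multihost_interval_ms=1000, fail_intervals=5):
        """Milliseconds without a successful MMP write before auto-suspend.
        fail_intervals == 0 disables auto-suspend entirely."""
        if fail_intervals == 0:
            return None  # failures are ignored; only the ZED may react
        return fail_intervals * multihost_interval_ms

    print(mmp_suspend_window_ms())          # 5000 -> the 5s window in the log
    print(mmp_suspend_window_ms(1000, 10))  # 10000 -> rides out brief outages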

behlendorf added the Type: Documentation label on Jul 12, 2018
@gerardba

I think there may be a bug or non-optimal tuning when multihost is enabled, see #7045

Setting /sys/module/zfs/parameters/zfs_multihost_interval to 2000 let me issue a 'zpool scrub zpool' on a healthy pool without it getting suspended due to failed multihost writes.
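
For anyone wanting to try the same workaround, a rough sketch of applying it (requires root; the sysfs and modprobe.d paths are the standard ones, adjust the value to taste):

    # Sketch of the workaround above; run as root. The sysfs write takes
    # effect immediately, the modprobe.d entry makes it persistent.
    from pathlib import Path

    INTERVAL_MS = 2000  # value reported to help above

    # Live change for the currently loaded zfs module.
    Path("/sys/module/zfs/parameters/zfs_multihost_interval").write_text(
        f"{INTERVAL_MS}\n"
    )

    # Persist across reboots/module reloads. NOTE: this overwrites any
    # existing /etc/modprobe.d/zfs.conf; merge by hand if you already have one.
    Path("/etc/modprobe.d/zfs.conf").write_text(
        f"options zfs zfs_multihost_interval={INTERVAL_MS}\n"
    )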

@tonyhutter
Contributor

This popped up on zfs-discuss and may or may not be related:
http://list.zfsonlinux.org/pipermail/zfs-discuss/2019-January/033095.html

@adilger
Contributor

adilger commented Jan 11, 2019

We've seen similar problems on occasion, and have worked around them by increasing the timeout value. However, this ticket and the recent discussion on the list make me think the 5s timeout is too short for most real hardware problems. Any kind of SCSI or PCI bus reset, network timeout for iSCSI, and even TLER will exceed this limit.

It probably makes sense to increase the default to cover common hardware timeout retries. My understanding is that TLER is 5s, so moving up to the 7-10s range would probably avoid this.

The other possibility is that if/when the previously-submitted writes complete, the kernel could retry the MMP uberblock scanning process to see whether any uberblocks were modified by a different node before re-acquiring the device. In that case the error would mostly be informative, indicating a serious I/O error, but the system could recover once it has cleared.
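
Sketched as a decision, the recovery path being proposed might look like this (illustrative only; none of these names exist in ZFS, and the uberblock re-scan result is passed in as a plain boolean):

    # Illustrative sketch of the proposed recovery path; not ZFS code.

    def after_stalled_mmp_writes_complete(pool, other_host_wrote_uberblocks):
        """Decide whether a suspended pool can resume once its delayed MMP
        writes finally land.

        other_host_wrote_uberblocks: result of re-scanning the MMP uberblocks
        to see whether another node modified them during the stall.
        """
        if other_host_wrote_uberblocks:
            # Another host may have imported the pool; staying suspended is
            # the only safe option.
            return f"pool '{pool}' remains suspended"
        # The stall was a local I/O problem only; report it and carry on
        # instead of requiring a manual recovery.
        return f"pool '{pool}' resumes; the I/O error is logged as informational"

    print(after_stalled_mmp_writes_complete("dpool01", False))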

@ofaaland
Contributor

I agree the default zfs_multihost_fail_intervals is too low. For the cases I've seen, just one more second would have been enough. I wish I had more examples, but the ones I do have all fit with your suggestion. My intent is to increase the default in the 0.7 stable branch once I've arrived at a value I can justify and we've tested a bit internally.

For master, re-scanning the uberblocks would be good; it's just a matter of finding the time.

ofaaland added a commit to ofaaland/zfs that referenced this issue Jan 29, 2019
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, however, is not enough
information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

  zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

ZTS tests are added to verify the new functionality.

In addition, it makes a few other improvements:
* Set mmp_fail_intervals to 10 by default so that a brief, temporary
  interruption of I/O does not result in MMP suspending the pool.
  (issue openzfs#7709)
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* It reports nanoseconds remaining in the activity test via
  /proc/spl/kstat/zfs/<pool>/activity_test (during a tryimport,
  where the test is normally performed, the pool name is $import)

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
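
The activity-test bound that commit message describes works out to roughly the following (a sketch of the decision only, reusing the names from the message; this is not the actual import code):

    # Sketch of the importing host's reasoning once ub_mmp_config carries the
    # exporting host's settings, per the commit message above. Not ZFS code.

    def activity_test_lower_bound_ms(fail_intervals, interval_ms):
        """fail_intervals / interval_ms: the settings recorded by the host
        that last imported the pool."""
        if fail_intervals > 0:
            # Well-defined safe minimum: that host suspends itself after this
            # long without a successful MMP write, so waiting at least this
            # long is enough to detect activity.
            return fail_intervals * interval_ms
        # fail_intervals == 0: no well-defined bound; the importer must fall
        # back to a heuristic based on ub_mmp_delay, which is why the test
        # could previously run far longer than necessary.
        return None

    print(activity_test_lower_bound_ms(10, 1000))  # 10000 ms with the new default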
ofaaland added a commit to ofaaland/zfs that referenced this issue Feb 7, 2019
(Same commit message as the Jan 29, 2019 commit above.)
ofaaland added a commit to ofaaland/zfs that referenced this issue Feb 27, 2019
(Same commit message as the Jan 29, 2019 commit above.)
@adilger
Contributor

adilger commented Oct 15, 2019

Is there any plan to backport these MMP fixes to 0.7.x? I guess that also raises the separate question of whether there is any plan to make another 0.7.x release at all.

@adilger
Contributor

adilger commented Oct 15, 2019

To reply to my own comment:

I guess I was confused because the #8495 patch was never landed for a 0.7.x release.
