
Endless resilvering after hitting unrecoverable data error #2867

Closed
jwittlincohen opened this issue Nov 5, 2014 · 5 comments

Comments

@jwittlincohen
Contributor

Summary of issue: A resilver operation that completes successfully but results in one or more unrecoverable data errors causes ZFS to begin a new resilver (from 0%) every time the pool is imported. I was able to resolve the situation by deleting the corrupted file and the snapshots that referenced it, running zpool clear, and then initiating a scrub, which completed successfully (a rough command sequence is sketched below).
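
For reference, the recovery sequence amounted to roughly the following commands. The dataset, snapshot, and file names below are placeholders, not the actual ones from this pool:

    # list the files and snapshots flagged as permanently corrupted
    zpool status -v data
    # destroy every snapshot that still references the damaged file (names are examples)
    zfs destroy data/videos@2014-10-15
    zfs destroy data/videos@2014-10-31
    # remove the damaged file and restore a pristine copy from the source disk
    rm /data/videos/example.mkv
    cp /mnt/source/videos/example.mkv /data/videos/
    # clear the error counters and verify the pool with a full scrub
    zpool clear data
    zpool scrub data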

Detailed description: At the time I began the resilver, all disks in the array had passed an extended SMART test and were functioning normally. I initiated a resilver to replace one of the disks in the array (I initially planned to use a 12-drive RAID-Z2, but discovered that this is an inefficient configuration with advanced-format disks: roughly 10% overhead compared to 4.5% with 13 disks. I therefore used a temporary drive until I could obtain a 13th drive of the same make and model as the others). Approximately 80% into the resilver, I was alerted to an unrecoverable data error. The resilver completed successfully but reported 2 data errors affecting a single video file and a snapshot that referenced it.

The resilver completed with the following output:

A ZFS pool has finished resilvering:

eid: 284
host: storage-server
time: 2014-10-31 15:44:34-0400
pool: data
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 19h13m with 2 errors on Fri Oct 31 15:44:34 2014
config:

    NAME                                           STATE     READ WRITE CKSUM
    data                                           DEGRADED     0     0    10
      raidz2-0                                     DEGRADED     0     0    20
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES    ONLINE       0     0     0
        replacing-12                               OFFLINE      0     0     0
          ata-ST4000DM000-1F2168_Z301YC31          OFFLINE      0     0     0
          ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list

After rebooting the system, a new resilver began, this time reporting the same 2 data errors as well as 3 checksum errors that were corrected on Disk 8.

A ZFS pool has finished resilvering:

eid: 182
host: storage-server
time: 2014-11-01 10:09:44-0400
pool: data
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 12h40m with 2 errors on Sat Nov 1 10:09:44 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     6
      raidz2-0                                   ONLINE       0     0    12
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     3
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list

I thought that deleting the impacted files might resolve the issue, so I deleted the corrupt file and all snapshots that referenced it, restored a pristine copy from the source disk, and rebooted. Again, ZFS initiated a resilver, with the following result.

A ZFS pool has finished resilvering:

eid: 62
host: storage-server
time: 2014-11-02 01:28:34-0400
pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 1.69T in 13h9m with 0 errors on Sun Nov 2 01:28:34 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     0
      raidz2-0                                   ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     3
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: No known data errors

This time there were 3 checksum errors on Disk 1 but no further data errors. After researching the issue online and discussing it in #zfsonlinux, I followed the recommendation to run a zpool clear followed by a scrub. The scrub reported no errors, and I was then able to reboot without triggering another resilver. The pool has been running smoothly since. However, I am obviously concerned about the data loss. I'm glad ZFS alerted me to the issue, and I am checking my RAM, SAS cables, and controllers as possible culprits.

A ZFS pool has finished scrubbing:

eid: 64
host: storage-server
time: 2014-11-02 16:02:01-0500
pool: data
state: ONLINE
scan: scrub repaired 0 in 15h29m with 0 errors on Sun Nov 2 16:02:01 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     0
      raidz2-0                                   ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: No known data errors

root@storage-server:/home/jason# sudo zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
data  47.2T  23.5T  23.8T  49%  1.00x  ONLINE  -

Relevant output from zpool history:
2014-10-30.20:31:39 zpool replace -f data ata-ST4000DM000-1F2168_Z301YC31 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B
2014-10-31.15:45:45 zpool detach data ata-ST4000DM000-1F2168_Z301YC31
2014-11-02.01:32:24 zpool clear data
2014-11-02.01:32:38 zpool scrub data
2014-11-02.22:01:26 zpool export data
2014-11-02.22:06:21 zpool import -d /dev/disk/by-id -aN

If you need any further information, do not hesitate to ask.

@behlendorf
Contributor

Thanks for the detailed issue report.

@jwittlincohen
Contributor Author

Post mortem: This likely isn't relevant to the bug, but I figure I should update the report with the cause of the data error. My most recently purchased SAS controller (a Supermicro AOC-SAS2LP-MV8) was apparently defective. I was able to link every known checksum error to a drive attached to that controller; no drive on any other controller showed errors. In addition, the controller randomly dropped a drive under moderate load. I swapped the defective controller for an identical model (SAS2LP) I had purchased a few months earlier. Since then I have written another 1 TB of data to the pool, performed 44 TB of reads, and completed a scrub, with zero checksum errors, drive drops, or I/O errors.
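
In case it helps anyone doing the same kind of correlation, this is roughly how the drives reporting checksum errors can be traced back to a controller (device names here are illustrative, not from my system):

    # list the drives that accumulated checksum errors
    zpool status -v data
    # the by-path symlink names include the PCI address of the host controller,
    # so resolving each erroring by-id name to its by-path entry identifies the card
    ls -l /dev/disk/by-path/
    # alternatively, list the SCSI hosts (controllers) and the drivers they use
    lsscsi -H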

@FransUrbo
Contributor

Just as a reference, my #2602 was probably also due to a broken controller of the same make and model... I haven't replaced my controllers yet, but everything indicates that this controller is just crap...

@jwittlincohen
Contributor Author

jwittlincohen commented Apr 11, 2017

Post-post-mortem: I replaced both Supermicro AOC-SAS2LP-MV8 controllers with LSI 9211-8i controllers flashed with IT firmware in Oct/Nov 2014. I scrub my ~50TB pool twice a month and have not had a single checksum error since (approximately 2.5 years). The AOC-SAS2LP-MV8 should be avoided!
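
The twice-monthly scrub is nothing more than a cron entry along these lines (the schedule, binary path, and pool name are illustrative):

    # /etc/cron.d/zfs-scrub -- scrub the pool at 03:00 on the 1st and 15th of each month
    0 3 1,15 * * root /sbin/zpool scrub data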

@behlendorf
Contributor

@jwittlincohen thanks for the follow-up! Let's close this old issue out; we have separate issues open for the problem of the resilver restarting itself.
