
Endless resilvering after hitting unrecoverable data error #2867

Closed
jwittlincohen opened this issue Nov 5, 2014 · 5 comments

Comments

@jwittlincohen
Contributor

Summary of issue: A resilver operation that completes successfully but results in one or more unrecoverable data errors causes ZFS to begin a new resilver (from 0%) every time the pool is imported. I was able to resolve the situation by deleting the corrupted file and the snapshots that referenced it, running zpool clear, and then initiating a scrub, which completed successfully (a rough command sequence is sketched below).
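
For reference, the recovery sequence amounted to roughly the following commands. The dataset, snapshot, and file names below are placeholders, not the actual ones from this pool:

    # list the files and snapshots flagged as permanently corrupted
    zpool status -v data
    # destroy every snapshot that still references the damaged file (names are examples)
    zfs destroy data/videos@2014-10-15
    zfs destroy data/videos@2014-10-31
    # remove the damaged file and restore a pristine copy from the source disk
    rm /data/videos/example.mkv
    cp /mnt/source/videos/example.mkv /data/videos/
    # clear the error counters and verify the pool with a full scrub
    zpool clear data
    zpool scrub data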

Detailed description: At the time I began the resilver, all disks in the array had passed an extended SMART test and were functioning normally. I initiated a resilver to replace one of the disks in the array (I initially planned to use a 12-drive RAID-Z2, but discovered that this is an inefficient configuration with advanced-format disks: roughly 10% overhead compared to 4.5% with 13 disks. I therefore used a temporary drive until I could obtain a 13th drive of the same make and model as the others). Approximately 80% into the resilver, I was alerted to an unrecoverable data error. The resilver completed successfully but reported 2 data errors affecting a single video file and a snapshot that referenced it.

The resilver completed with the following output:

A ZFS pool has finished resilvering:

eid: 284
host: storage-server
time: 2014-10-31 15:44:34-0400
pool: data
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 19h13m with 2 errors on Fri Oct 31 15:44:34 2014
config:

    NAME                                           STATE     READ WRITE CKSUM
    data                                           DEGRADED     0     0    10
      raidz2-0                                     DEGRADED     0     0    20
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES    ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES    ONLINE       0     0     0
        replacing-12                               OFFLINE      0     0     0
          ata-ST4000DM000-1F2168_Z301YC31          OFFLINE      0     0     0
          ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list

After rebooting the system, a new resilver began, this time reporting the same 2 data errors as well as 3 checksum errors that were corrected on Disk 8.

A ZFS pool has finished resilvering:

eid: 182
host: storage-server
time: 2014-11-01 10:09:44-0400
pool: data
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 12h40m with 2 errors on Sat Nov 1 10:09:44 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     6
      raidz2-0                                   ONLINE       0     0    12
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     3
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list

I thought that deleting the impacted files might resolve the issue, so I deleted the corrupt file and all snapshots that referenced it, restored a pristine copy from the source disk, and rebooted. Again, ZFS initiated a resilver, with the following result.

A ZFS pool has finished resilvering:

eid: 62
host: storage-server
time: 2014-11-02 01:28:34-0400
pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 1.69T in 13h9m with 0 errors on Sun Nov 2 01:28:34 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     0
      raidz2-0                                   ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     3
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: No known data errors

This time there were 3 checksum errors on Disk 1 but no further data errors. After researching the issue online and discussing it in #zfsonlinux, I followed the recommendation to run a zpool clear followed by a scrub. The scrub reported no errors, and I was then able to reboot without triggering another resilver. The pool has been running smoothly since. However, I am obviously concerned about the data loss. I'm glad ZFS alerted me to the issue, and I am checking my RAM, SAS cables, and controllers as possible culprits.

A ZFS pool has finished scrubbing:

eid: 64
host: storage-server
time: 2014-11-02 16:02:01-0500
pool: data
state: ONLINE
scan: scrub repaired 0 in 15h29m with 0 errors on Sun Nov 2 16:02:01 2014
config:

    NAME                                         STATE     READ WRITE CKSUM
    data                                         ONLINE       0     0     0
      raidz2-0                                   ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGU92PB  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1G7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GNS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPNTS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPRDS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGYPJ7S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1T8S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGXG3NS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY1J9S  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY45ZS  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGUNGES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK1334PCGY7GES  ONLINE       0     0     0
        ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B  ONLINE       0     0     0

errors: No known data errors

root@storage-server:/home/jason# sudo zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
data  47.2T  23.5T  23.8T  49%  1.00x  ONLINE  -

Relevant output from zpool history:
2014-10-30.20:31:39 zpool replace -f data ata-ST4000DM000-1F2168_Z301YC31 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B
2014-10-31.15:45:45 zpool detach data ata-ST4000DM000-1F2168_Z301YC31
2014-11-02.01:32:24 zpool clear data
2014-11-02.01:32:38 zpool scrub data
2014-11-02.22:01:26 zpool export data
2014-11-02.22:06:21 zpool import -d /dev/disk/by-id -aN

If you need any further information, do not hesitate to ask.

@behlendorf
Contributor

Thanks for the detailed issue report.

@jwittlincohen
Contributor Author

Post mortem: This likely isn't relevant to the bug, but I figure I should update the report with the cause of the data error. My most recently purchased SAS controller (a Supermicro AOC-SAS2LP-MV8) was apparently defective. I was able to link every known checksum error to a drive attached to that controller; no drive on any other controller showed errors. In addition, the controller randomly dropped a drive under moderate load. I swapped the defective controller for an identical model (SAS2LP) I had purchased a few months earlier. Since then I have written another 1 TB of data to the pool, performed 44 TB of reads, and completed a scrub, with zero checksum errors, drive drops, or I/O errors.
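
In case it helps anyone doing the same kind of correlation, this is roughly how the drives reporting checksum errors can be traced back to a controller (device names here are illustrative, not from my system):

    # list the drives that accumulated checksum errors
    zpool status -v data
    # the by-path symlink names include the PCI address of the host controller,
    # so resolving each erroring by-id name to its by-path entry identifies the card
    ls -l /dev/disk/by-path/
    # alternatively, list the SCSI hosts (controllers) and the drivers they use
    lsscsi -H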

@FransUrbo
Contributor

Just as a reference, my #2602 was probably also due to a broken controller of the same make and model... I haven't replaced my controllers yet, but everything indicates that this controller is just crap...

@jwittlincohen
Contributor Author

jwittlincohen commented Apr 11, 2017

Post-post-mortem: I replaced both Supermicro AOC-SAS2LP-MV8 controllers with LSI 9211-8i controllers flashed with IT firmware in Oct/Nov 2014. I scrub my ~50TB pool twice a month and have not had a single checksum error since (approximately 2.5 years). The AOC-SAS2LP-MV8 should be avoided!
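
The twice-monthly scrub is nothing more than a cron entry along these lines (the schedule, binary path, and pool name are illustrative):

    # /etc/cron.d/zfs-scrub -- scrub the pool at 03:00 on the 1st and 15th of each month
    0 3 1,15 * * root /sbin/zpool scrub data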

@behlendorf
Contributor

@jwittlincohen thanks for the follow-up! Let's close this old issue out; we have separate issues open for the problem of the resilver restarting itself.
