Endless resilvering after hitting unrecoverable data error #2867
Comments
Thanks for the detailed issue report.
Post mortem: This likely isn't relevant to the bug itself, but I figure I should update the report with the cause of the data error. My most recently purchased SAS controller (a Supermicro AOC-SAS2LP-MV8) was apparently defective. Every known checksum error traced to a drive attached to that controller; drives on the other controllers showed none. In addition, the defective controller randomly dropped a drive under moderate load. I swapped it for an identical model (SAS2LP) I had purchased a few months earlier. I have since written another 1 TB of data to the pool, read 44 TB back, and completed a scrub with zero checksum errors, drive drops, or I/O errors.
Just as a reference, my #2602 is (or was) probably also due to a broken controller of the same make and model. I haven't replaced my controllers yet, but everything indicates that this controller is simply bad hardware.
Post-post-mortem: I replaced both Supermicro AOC-SAS2LP-MV8 controllers with LSI 9211-8i controllers flashed with IT firmware in Oct/Nov 2014. I scrub my ~50 TB pool twice a month and have not seen a checksum error since (approximately 2.5 years). The AOC-SAS2LP-MV8 should be avoided!
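For anyone wanting to set up the same routine, a minimal cron sketch for a twice-monthly scrub (the file path, schedule, and pool name are just illustrative, not a prescription):
# /etc/cron.d/zfs-scrub (hypothetical file) -- scrub the pool at 02:00 on the 1st and 15th
0 2 1,15 * * root /sbin/zpool scrub data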
@jwittlincohen thanks for the follow-up! Let's close this old issue out; we have separate issues open for the problem of the resilver restarting itself.
Summary of issue: A resilver that runs to completion but leaves one or more unrecoverable data errors causes ZFS to begin a new resilver (from 0%) every time the pool is imported. I was able to resolve the situation by deleting the corrupted file and the snapshots that referenced it, running zpool clear, and then completing a clean scrub.
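For reference, the recovery sequence looked roughly like the following; the file path and snapshot name below are placeholders, not the actual names from my pool:
zpool status -v data                      # list the files affected by the permanent errors
zfs destroy data/media@snap-placeholder   # destroy each snapshot still referencing the bad file
rm /data/media/placeholder-video.mkv      # remove the damaged file itself
zpool clear data                          # reset the pool's error counters
zpool scrub data                          # run a scrub to confirm the pool comes back clean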
Detailed description: At the time I began the resilver, all disks in the array had passed an extended SMART test and were functioning normally. I initiated the resilver to replace one of the disks in the array: I had originally planned a 12-drive RAID-Z2, but discovered that this is an inefficient width with advanced-format (4K-sector) disks (roughly 10% allocation overhead versus about 4.5% with 13 disks), so I used a temporary drive until I could obtain a 13th drive matching the same make and model as the others. Approximately 80% of the way into the resilver, I was alerted to an unrecoverable data error. The resilver completed but reported 2 data errors affecting a single video file and a snapshot that referenced it.
The resilver completed with the following output:
A ZFS pool has finished resilvering:
eid: 284
host: storage-server
time: 2014-10-31 15:44:34-0400
pool: data
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 19h13m with 2 errors on Fri Oct 31 15:44:34 2014
config:
errors: 2 data errors, use '-v' for a list
After rebooting the system, a new resilver began, this time reporting the same 2 data errors as well as 3 checksum errors that were corrected on Disk 8.
A ZFS pool has finished resilvering:
eid: 182
host: storage-server
time: 2014-11-01 10:09:44-0400
pool: data
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 1.74T in 12h40m with 2 errors on Sat Nov 1 10:09:44 2014
config:
errors: 2 data errors, use '-v' for a list
I thought that deleting the affected file might resolve the issue, so I deleted the corrupt file and all snapshots that referenced it, restored a pristine copy from the source disk, and rebooted. Once again, ZFS initiated a resilver, with the following result.
A ZFS pool has finished resilvering:
eid: 62
host: storage-server
time: 2014-11-02 01:28:34-0400
pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 1.69T in 13h9m with 0 errors on Sun Nov 2 01:28:34 2014
config:
errors: No known data errors
This time there were 3 checksum errors on Disk 1 but no further data errors. After researching the issue online and discussing it in #zfsonlinux, I followed the recommendation to run zpool clear followed by a scrub. The scrub reported no errors, and I was then able to reboot without triggering another resilver. The pool has been running smoothly since. However, I am obviously very concerned about the data loss. I'm glad ZFS alerted me to the issue, and I am checking my RAM, SAS cables, and controllers as possible culprits (a rough sketch of those checks follows the scrub report below).
A ZFS pool has finished scrubbing:
eid: 64
host: storage-server
time: 2014-11-02 16:02:01-0500
pool: data
state: ONLINE
scan: scrub repaired 0 in 15h29m with 0 errors on Sun Nov 2 16:02:01 2014
config:
errors: No known data errors
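For completeness, these are roughly the checks I'm running on the hardware; the device names are placeholders:
smartctl -t long /dev/sdX                 # extended SMART self-test on each suspect drive
smartctl -a /dev/sdX                      # review attributes and self-test results afterwards
dmesg | grep -iE 'sas|ata|i/o error'      # look for link resets or transport errors
memtester 1024 3                          # userspace RAM test: 1024 MB, 3 passes (memtest86+ is more thorough)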
root@storage-server:/home/jason# sudo zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
data 47.2T 23.5T 23.8T 49% 1.00x ONLINE -
Relevant output from zpool history:
2014-10-30.20:31:39 zpool replace -f data ata-ST4000DM000-1F2168_Z301YC31 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK2334PCGBTJ5B
2014-10-31.15:45:45 zpool detach data ata-ST4000DM000-1F2168_Z301YC31
2014-11-02.01:32:24 zpool clear data
2014-11-02.01:32:38 zpool scrub data
2014-11-02.22:01:26 zpool export data
2014-11-02.22:06:21 zpool import -d /dev/disk/by-id -aN
If you need any further information, do not hesitate to ask.