Corrupted block "out of nowhere", no cksm errors #10019
Yes, that's right. Since a permanent error was reported, the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data, so none of them show a CKSUM error. Unfortunately, it probably wouldn't be very fruitful to try and debug what caused a one-off event like this. But I agree, it's troubling and it would be good to understand exactly what happened. Definitely let us know the results of the scrub.
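For reference, the reported permanent error and any related low-level error events can usually be inspected with the stock tooling; a minimal sketch, with tank as a placeholder pool name:

# List permanent errors along with the affected dataset/snapshot and file
zpool status -v tank

# Dump the recent ZFS event log, which may include the error event
# that caused the block to be flagged
zpool events -v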
That's definitely worrisome.
Are you using aes-gcm?
In the meantime, IRC was able to help me check my ECC counts.
Well, would've loved to blame faulty RAM for this.
Oh, I guess I naively expected all of them to get a bad boy point in some column there. Is there any info from zfs available on what actually makes it mark that block as faulty then?
Everything on the pool is using aes-256-ccm currently, except for a handful of test datasets, including one that lies next to the faulty dataset, but not the faulty dataset itself. They did get snapped, replicated and pruned closely together by the backup software though.
Since the faulty dataset uses aes-256-ccm, I think it's unlikely that your corruption is caused by the new GCM routines added in #9749.
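For reference, which suite a given dataset uses can be checked via the encryption property; a minimal sketch with placeholder pool/dataset names:

# Show the encryption suite (e.g. aes-256-ccm, aes-256-gcm) for every dataset
zfs get -r encryption tank

# Or just for the dataset in question
zfs get encryption tank/faulty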
I think it should, in cases where there is no physical disk to blame, at least account the error in the raidz vdev node statistics to show that this vdev has some problem, simply to show where an operation that should have succeeded (in this case: reading the referenced data from that vdev) failed. The problem could certainly be in the block that points to the inaccessible data, but as the checksum of that checked out, we should assume the problem lies where the checksum failed. Otherwise we couldn't rely on the health of on-disk data anymore and should concentrate on running around in circles while screaming, at least until someone found a (preferably the) bug that writes garbage with an intact checksum to the pool.
@GregorKopka actually, now that you mention it, that is the expected behavior. So I'm not sure why it wasn't reported; that may be a bug and should be investigated.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Stale Bot should not close defects. |
We used to have a volunteer that closed tickets when they went inactive, but they were told it was too heavy-handed. Now we've got a stale bot that decides based on extremely heavy-handed rules. Perfection.
The bugs basically fix themselves this way :P
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Bad bot. |
System information
(a few days old git master, built with make deb-dkms; possibly one of the final versions of my then-unmerged systemd PR branch, nothing that changes the kernel module anyway)
Describe the problem you're observing
My server, up for 5 days now since the last reboot, found a permanent error in a snapshot created a few hours ago, yet reports no cksum, read or write errors. I have done nothing but let a couple of backup scripts run on small dummy data, so basically a lot of zfs list, snap and destroy action, but usually just empty snapshots.
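For a concrete picture, that churn amounts to something like the following (dataset and snapshot names are placeholders, not the actual scripts):

# Hypothetical sketch of the backup activity: create, list, prune
zfs snapshot tank/dummy@backup-$(date +%Y%m%d-%H%M)
zfs list -t snapshot -o name,used -s creation tank/dummy
zfs destroy tank/dummy@backup-20200301-0000   # prune an older (placeholder) snapshot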
/proc/spl/kstat/zfs/vdev_raidz_bench confirms AVX2 is in use, and my datasets are encrypted. The system is a Xeon 1225v3 with 24GB of ECC RAM, so fairly mature hardware at this point.
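For reference, both the benchmark results and the currently selected raidz implementation can be read directly; a minimal sketch, assuming the zfs module is loaded:

# Per-implementation benchmark results (scalar, sse, avx2, ...)
cat /proc/spl/kstat/zfs/vdev_raidz_bench

# Currently selected raidz implementation (e.g. fastest, avx2)
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl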
Describe how to reproduce the problem
I wish I knew. Or do I?
Include any warning/errors/backtraces from the system logs
Does this indicate one of those block pointers with a valid checksum but invalid contents that other people have fought with?
I started the scrub as a reaction to the error; the last scrub was in early January.
If someone tells me how, I can check the RAM's ECC error counters to see if I've hit two-bit corruption and should consider using my statistical luck by winning the lottery and staying away from lightning strikes, sharks and coconuts in the future.
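For reference, the ECC counters can typically be read through the kernel's EDAC interface; a minimal sketch, assuming the memory controller's EDAC driver is loaded and edac-utils is available (package names vary by distribution):

# Corrected/uncorrected error counts per memory controller via sysfs
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count

# Or, with edac-utils installed
edac-util --report=full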
If you need any other logs or other info, let me know. The zpool history output should be fun to sift through, considering there were snapshots being created and deleted every couple of minutes. I do not know how to poke around in the broken block with zdb; if this issue is of interest, I can follow someone's instructions on that.
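A rough sketch of what that poking around would likely involve (pool, dataset and snapshot names are placeholders; the exact zdb invocation would depend on what a developer actually wants to see):

# Full pool history, including internally logged events, with timestamps
zpool history -il tank

# Dump object metadata (including block pointers) for the affected dataset/snapshot;
# more 'd's means more verbosity, and a trailing object number can narrow the dump
zdb -ddddd tank/faulty@snap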
Given my pool name, my general disinterest in the dataset that holds the snapshot, and the fact that I run git master, I'll just zfs destroy the faulty snapshot and move on with my life if there's no interest in debugging this, although it does feel a little spooky.
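Should it come to that, the cleanup would presumably look something like this (names are placeholders; the entry in zpool status may only disappear after a subsequent scrub or two):

# Destroy the snapshot that holds the corrupted block
zfs destroy tank/faulty@spooky-snap

# Scrub so the error log no longer references the destroyed snapshot
zpool scrub tank
zpool status -v tank   # verify the permanent error entry is gone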