Corrupted block "out of nowhere", no cksm errors #10019

Open
InsanePrawn opened this issue Feb 18, 2020 · 13 comments
Labels
Bot: Not Stale (Override for the stale bot); Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@InsanePrawn
Contributor

InsanePrawn commented Feb 18, 2020

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  9
Linux Kernel          4.19.0-0.bpo.6-amd64
Architecture          AMD64
ZFS Version           zfs-0.8.0-596_g4d5b4a33d
SPL Version           0.8.0-596_g4d5b4a33d

(a few-days-old git master, built with make deb-dkms; possibly one of the final versions of my then-unmerged systemd PR branch, nothing that changes the kernel module anyway)

Describe the problem you're observing

My server, up for 5 days since the last reboot, found a permanent error in a snapshot created a few hours ago, yet reports no cksum, read, or write errors. I have done nothing but let a couple of backup scripts run on small dummy data, so basically a lot of zfs list, snapshot, and destroy activity, usually on empty snapshots.

/proc/spl/kstat/zfs/vdev_raidz_bench confirms AVX2 is in use, my datasets are encrypted.
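(That's just from eyeballing the kstat, roughly:

 ~ % cat /proc/spl/kstat/zfs/vdev_raidz_bench
 ~ % cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl

the second path is from memory and might not be the exact parameter name, but it should show which implementation is actually selected.)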

The system is a Xeon 1225v3 with 24GB of ECC RAM, so fairly mature hardware at this point.

Describe how to reproduce the problem

I wish I knew. Or do I?

Include any warning/errors/backtraces from the system logs

zfs-0.8.0-596_g4d5b4a33d
zfs-kmod-0.8.0-596_g4d5b4a33d


  pool: yolopool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Feb 18 18:36:31 2020
	2.60T scanned at 1.74G/s, 543G issued at 364M/s, 3.86T total
	0B repaired, 13.73% done, 0 days 02:40:01 to go
config:

	NAME                            STATE     READ WRITE CKSUM
	yolopool                        ONLINE       0     0     0
	  raidz2-0                      ONLINE       0     0     0
	    ST2000DL004_S2H7J9EC300853  ONLINE       0     0     0
	    ST2000DL004_S2H7J9EC300956  ONLINE       0     0     0
	    WD20EFRX_WMC4M2885301       ONLINE       0     0     0
	    WD20EFRX_WMC4M2886107       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        yolopool/data/zrepltest3/abc/kindaroot@zrepl_20200218_163544_000:<0x0>

Does this indicate one of those blockpointers with a valid checksum but invalid contents other people have fought with?

I started the scrub as a reaction to the error; the last scrub was in early January.

If someone tells me how, I can check the RAM's ECC error counters to see if I've hit two-bit corruption and should consider cashing in my statistical luck on the lottery while staying away from lightning strikes, sharks and coconuts in the future.

If you need any other logs or other info, let me know. The zpool history output should be fun to sift through, considering snapshots were being created and deleted every couple of minutes.

I do not know how to poke around in the broken block with zdb; if this issue is of interest, I can follow someone's instructions on that.
Given my pool name, my general disinterest in the dataset that holds the snapshot, and the fact that I run git master, I'll just zfs destroy the faulty snapshot and move on with my life if there's no interest in debugging this, although it does feel a little spooky.
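(If anyone wants me to start poking, my naive first attempt would be something like the following; the zdb flags and the object number 0, taken from the <0x0> in the error report, are guesses on my part, not anything I've verified:

 ~ % # dump object/dnode info for the affected snapshot
 ~ % sudo zdb -dddd yolopool/data/zrepltest3/abc/kindaroot@zrepl_20200218_163544_000 0
 ~ % # traverse the pool and verify checksums without modifying anything
 ~ % sudo zdb -bcc yolopool

Happy to run whatever else instead.)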

@behlendorf
Contributor

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

Unfortunately, it probably wouldn't be very fruitful to try to debug what caused a one-off event like this. But I agree, it's troubling and it would be good to understand exactly what happened. Definitely let us know the results of the scrub.

@behlendorf behlendorf added the Type: Question Issue for discussion label Feb 18, 2020
@AttilaFueloep
Contributor

That's definitely worrisome.

my datasets are encrypted.

Are you using aes-gcm?

@InsanePrawn
Contributor Author

Definitely let us know the results of the scrub.

scan: scrub repaired 0B in 0 days 09:29:18 with 0 errors

In the meantime IRC was able to help me check my ECC counts.

 ~ % sudo edac-util
edac-util: No errors to report.
 ~ % sudo edac-util --report=full
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow0:mc#0csrow#0channel#1:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0:csrow1:mc#0csrow#1channel#1:CE:0
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0

Well, would've loved to blame faulty RAM for this.

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

Oh, I guess I naively expected all of them to get a bad boy point in some column there.

Is there any info available from ZFS on what actually made it mark that block as faulty, then? dmesg | grep -i zfs shows nothing relevant at all.
(I'm sure this question has been answered elsewhere, feel free to direct me there.)
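(My uneducated guess would be that zpool events keeps the corresponding ereport, i.e. something like:

 ~ % sudo zpool events -v

and then looking for a checksum/authentication class entry around the snapshot's timestamp; no idea whether this kind of error actually gets logged there, though.)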

@InsanePrawn
Contributor Author

Are you using aes-gcm?

Everything on the pool currently uses aes-256-ccm, except for a handful of test datasets, including one that sits right next to the faulty dataset, but not the faulty dataset itself. They did get snapshotted, replicated and pruned in close succession by the backup software, though.
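(For the record, that's from checking the datasets with roughly:

 ~ % zfs get -r encryption yolopool

which reports aes-256-ccm for the affected dataset.)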

@AttilaFueloep
Contributor

Since the faulty dataset uses aes-256-ccm, I think it's unlikely that your corruption is caused by the new GCM routines added in #9749.

@GregorKopka
Contributor

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

I think it should, in cases where there is no physical disk to blame, at least account the error in the raidz vdev's statistics to show that this vdev has some problems; simply to show where an operation that should have succeeded (in this case: reading the referenced data from that vdev) failed.

The problem could certainly be in the block that points to the inaccessible data, but since its checksum checked out, we should assume the problem lies where the checksum actually failed. Simply because otherwise we couldn't rely on the health of on-disk data anymore and should concentrate on running around in circles while screaming, at least until someone finds a (preferably the) bug that writes garbage with an intact checksum to the pool.

@behlendorf
Contributor

at least account the error in the raidz vdev node statistics to show that this vdev has some problems

@GregorKopka actually, now that you mention it, that is the expected behavior. So I'm not sure why it wasn't reported; that may be a bug and should be investigated.

@behlendorf behlendorf added Type: Defect Incorrect behavior (e.g. crash, hang) and removed Type: Question Issue for discussion labels Dec 22, 2020
@stale

stale bot commented Dec 22, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@GregorKopka
Contributor

Stale Bot should not close defects.

@stale stale bot removed the Status: Stale No recent activity for issue label Dec 24, 2021
@bghira

bghira commented Dec 24, 2021

we used to have a volunteer that closed tickets when they went inactive, but they were told it was too heavy-handed. now we've got a stale bot that decides based on extremely heavy-handed rules. perfection.

@psy0rz

psy0rz commented Jan 4, 2022

we used to have a volunteer that closed tickets when they went inactive, but they were told it was too heavy-handed. now we've got a stale bot that decides based on extremely heavy-handed rules. perfection.

The bugs basically fix themselves this way :P

@stale

stale bot commented Jan 5, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Jan 5, 2023
@GregorKopka
Contributor

Bad bot.

@behlendorf

@behlendorf behlendorf added Bot: Not Stale Override for the stale bot and removed Status: Stale No recent activity for issue labels Mar 16, 2023