Corrupted block "out of nowhere", no cksm errors #10019

Open
InsanePrawn opened this issue Feb 18, 2020 · 13 comments
Labels
Bot: Not Stale (Override for the stale bot); Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@InsanePrawn
Contributor

InsanePrawn commented Feb 18, 2020

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  9
Linux Kernel          4.19.0-0.bpo.6-amd64
Architecture          AMD64
ZFS Version           zfs-0.8.0-596_g4d5b4a33d
SPL Version           0.8.0-596_g4d5b4a33d

(a few-days-old git master, built with make deb-dkms; possibly one of the final versions of my then-unmerged systemd PR branch, nothing that changes the kernel module anyway)

Describe the problem you're observing

My server, up for 5 days since the last reboot, found a permanent error in a snapshot created a few hours ago, yet reports no cksum, read, or write errors. I have done nothing but let a couple of backup scripts run on small dummy data, so basically a lot of zfs list, snapshot, and destroy activity, usually on empty snapshots.

/proc/spl/kstat/zfs/vdev_raidz_bench confirms AVX2 is in use, my datasets are encrypted.
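(That's just from eyeballing the kstat, roughly:

 ~ % cat /proc/spl/kstat/zfs/vdev_raidz_bench
 ~ % cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl

the second path is from memory and might not be the exact parameter name, but it should show which implementation is actually selected.)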

The system is a Xeon 1225v3 with 24GB of ECC RAM, so fairly mature hardware at this point.

Describe how to reproduce the problem

I wish I knew. Or do I?

Include any warning/errors/backtraces from the system logs

zfs-0.8.0-596_g4d5b4a33d
zfs-kmod-0.8.0-596_g4d5b4a33d


  pool: yolopool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Feb 18 18:36:31 2020
	2.60T scanned at 1.74G/s, 543G issued at 364M/s, 3.86T total
	0B repaired, 13.73% done, 0 days 02:40:01 to go
config:

	NAME                            STATE     READ WRITE CKSUM
	yolopool                        ONLINE       0     0     0
	  raidz2-0                      ONLINE       0     0     0
	    ST2000DL004_S2H7J9EC300853  ONLINE       0     0     0
	    ST2000DL004_S2H7J9EC300956  ONLINE       0     0     0
	    WD20EFRX_WMC4M2885301       ONLINE       0     0     0
	    WD20EFRX_WMC4M2886107       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        yolopool/data/zrepltest3/abc/kindaroot@zrepl_20200218_163544_000:<0x0>

Does this indicate one of those blockpointers with a valid checksum but invalid contents other people have fought with?

I started the scrub as a reaction to the error; the last scrub was in early January.

If someone tells me how, I can check the RAM's ECC error counters to see if I've hit two-bit corruption and should consider cashing in my statistical luck on the lottery while staying away from lightning strikes, sharks and coconuts in the future.

If you need any other logs or other info, let me know. The zpool history output should be fun to sift through, considering snapshots were being created and deleted every couple of minutes.

I do not know how to poke around in the broken block with zdb; if this issue is of interest, I can follow someone's instructions on that.
Given my pool name, my general disinterest in the dataset that holds the snapshot, and the fact that I run git master, I'll just zfs destroy the faulty snapshot and move on with my life if there's no interest in debugging this, although it does feel a little spooky.
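(If anyone wants me to start poking, my naive first attempt would be something like the following; the zdb flags and the object number 0, taken from the <0x0> in the error report, are guesses on my part, not anything I've verified:

 ~ % # dump object/dnode info for the affected snapshot
 ~ % sudo zdb -dddd yolopool/data/zrepltest3/abc/kindaroot@zrepl_20200218_163544_000 0
 ~ % # traverse the pool and verify checksums without modifying anything
 ~ % sudo zdb -bcc yolopool

Happy to run whatever else instead.)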

@behlendorf
Contributor

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

Unfortunately, it probably wouldn't be very fruitful to try to debug what caused a one-off event like this. But I agree, it's troubling and it would be good to understand exactly what happened. Definitely let us know the results of the scrub.

@behlendorf behlendorf added the Type: Question Issue for discussion label Feb 18, 2020
@AttilaFueloep
Contributor

That's definitely worrisome.

my datasets are encrypted.

Are you using aes-gcm?

@InsanePrawn
Contributor Author

Definitely let us know the results of the scrub.

scan: scrub repaired 0B in 0 days 09:29:18 with 0 errors

In the meantime IRC was able to help me check my ECC counts.

 ~ % sudo edac-util
edac-util: No errors to report.
 ~ % sudo edac-util --report=full
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow0:mc#0csrow#0channel#1:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0:csrow1:mc#0csrow#1channel#1:CE:0
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0

Well, would've loved to blame faulty RAM for this.

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

Oh, I guess I naively expected all of them to get a bad boy point in some column there.

Is there any info available from ZFS on what actually made it mark that block as faulty, then? dmesg | grep -i zfs shows nothing relevant at all.
(I'm sure this question has been answered elsewhere, feel free to direct me there.)
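(My uneducated guess would be that zpool events keeps the corresponding ereport, i.e. something like:

 ~ % sudo zpool events -v

and then looking for a checksum/authentication class entry around the snapshot's timestamp; no idea whether this kind of error actually gets logged there, though.)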

@InsanePrawn
Contributor Author

Are you using aes-gcm?

Everything on the pool currently uses aes-256-ccm, except for a handful of test datasets, including one that sits right next to the faulty dataset, but not the faulty dataset itself. They did get snapshotted, replicated and pruned in close succession by the backup software, though.
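(For the record, that's from checking the datasets with roughly:

 ~ % zfs get -r encryption yolopool

which reports aes-256-ccm for the affected dataset.)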

@AttilaFueloep
Contributor

Since the faulty dataset uses aes-256-ccm, I think it's unlikely that your corruption is caused by the new GCM routines added in #9749.

@GregorKopka
Contributor

Does this indicate one of those blockpointers with a valid checksum but invalid contents

Yes, that's right. Since a permanent error was reported the correct block couldn't be reconstructed using the parity data. That means ZFS is unable to determine which of the disk vdevs returned invalid data so none of them show a CKSUM error.

I think it should, in cases where there is no physical disk to blame, at least account the error in the raidz vdev's statistics to show that this vdev has some problems; simply to show where an operation that should have succeeded (in this case: reading the referenced data from that vdev) failed.

The problem could certainly be in the block that points to the inaccessible data, but since its checksum checked out, we should assume the problem lies where the checksum actually failed. Simply because otherwise we couldn't rely on the health of on-disk data anymore and should concentrate on running around in circles while screaming, at least until someone finds a (preferably the) bug that writes garbage with an intact checksum to the pool.

@behlendorf
Contributor

at least account the error in the raidz vdev node statistics to show that this vdev has some problems

@GregorKopka actually, now that you mention it, that is the expected behavior. So I'm not sure why it wasn't reported; that may be a bug and should be investigated.

@behlendorf behlendorf added Type: Defect Incorrect behavior (e.g. crash, hang) and removed Type: Question Issue for discussion labels Dec 22, 2020
@stale

stale bot commented Dec 22, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@GregorKopka
Contributor

Stale Bot should not close defects.

@stale stale bot removed the Status: Stale No recent activity for issue label Dec 24, 2021
@bghira

bghira commented Dec 24, 2021

we used to have a volunteer that closed tickets when they went inactive, but they were told it was too heavy-handed. now we've got a stale bot that decides based on extremely heavy-handed rules. perfection.

@psy0rz

psy0rz commented Jan 4, 2022

we used to have a volunteer that closed tickets when they went inactive, but they were told it was too heavy-handed. now we've got a stale bot that decides based on extremely heavy-handed rules. perfection.

The bugs basically fix themselves this way :P

@stale

stale bot commented Jan 5, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Jan 5, 2023
@GregorKopka
Contributor

Bad bot.

@behlendorf

@behlendorf behlendorf added Bot: Not Stale Override for the stale bot and removed Status: Stale No recent activity for issue labels Mar 16, 2023