ZFS corruption related to snapshots post-2.0.x upgrade #12014

jgoerzen · 2021-05-08T13:30:19Z

System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	Buster
Linux Kernel	5.10.0-0.bpo.5-amd64
Architecture	amd64
ZFS Version	2.0.3-1~bpo10+1
SPL Version	2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30

However I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
	152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

It is a laptop
It uses ZFS crypto (the others use LUKS)

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have rules that out.

Describe how to reproduce the problem

I can't at will. I have to wait for a spell.

Include any warning/errors/backtraces from the system logs

See above

Potentially related bugs

I already mentioned permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload #11688 which seems similar, but a scrub doesn't immediately resolve the issue here
A quite similar backtrace also involving arc_buf_destroy is in silent corruption for thousands files gives input/output error but cannot be detected with scrub - at least for openzfs 2.0.0 #11443. The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.
In silent corruption gives input/output error but cannot be detected with scrub, experienced on 0.7.5 and 0.8.3 versions #10697 there are some similar symptoms, but it looks like a different issue to me

The text was updated successfully, but these errors were encountered:

jgoerzen · 2021-05-08T13:44:45Z

Two other interesting tidbits...

When I do the reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W. Several seconds each, and that normally just flies by. After scrubbing, it is back to normal speed of mounting.

The datasets that have snapshot issues vary with each one. Sometimes it's just one, sometimes many. But var is almost always included. (Though its parent, which has almost no activity ever, also is from time to time, so that's odd.)

jstenback · 2021-05-08T20:43:12Z

Same symptoms here, more or less. See also issue #11688.

glueckself · 2021-05-09T01:16:12Z

I also have the symptom with the corrupted snapshots, without kernel panics so far.

So far it only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today, I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still left on 0.8.6 format. I wasn't able to execute zfs list -r -t all <affected dataset> - it displayed cannot iterate filesystems and only a few snapshots (instead of tens it should've). Also, I couldn't destroy the affected snapshots because it said they didn't exist anymore. I couldn't send the dataset with syncoid at all.

On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which the zpool status said that there were no errors. However, the next zpool status, seconds after the first, again said that there were errors. Subsequent scrubs didn't clean the errors.

I've rebooted the server into an Ubuntu 20.10 live with zfs 0.8.4-1ubuntu11 ~~(again, sorry that I haven't noted the version, can add it tomorrow)~~ and after a scrub the errors were gone. Following scrubs haven't detected errors anymore. zfs list -r -t all ... again displayed a large list of snapshots.

The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols). The zvols are used as disks for VMs with ext4 on them. ~~I will verify them tomorrow.~~
EDIT: I checked one of the VM disks, neither fsck nor dpkg -V (verify checksums of all files installed from a package) could find any errors (except mismatching dpkg-checksums of config files I've changed - that is to be expected).

I have two other Ubuntu 21.04 based Systems with zfs-2.0.2-1ubuntu5 which are not affected until now. However, they have their pools already upgraded to 2. All are snapshotted with sanoid and have the datasets encrypted.

~~My next step will be to downgrade zfs back to 0.8.6 on the Debian system and see what happens.~~

EDIT:
More points I've noted while investigating with 0.8.4-1ubuntu11:

Creating new snapshots continued working for affected datasets, however destroying them didn't (right now I have 127 "frequently" (sanoids term for the most often snapshot - in my case 15 minutes) instead of the 10 sanoid is configured to keep.
With 0.8, the destroying of the affected snapshots worked. Scrubbing afterwards didn't find any errors.

EDIT 2:

On 2.0.2 (Ubuntu 21.04 again), sanoid managed to successfully prune (destroy) all remaining snapshots that past their valid-time. A scrub afterwards didn't find any errors. I'll be running the 2.0.2 for a while and see what happens.

dcarosone · 2021-05-21T07:40:16Z

I'm seeing this too, on Ubuntu 21.04, also using zfs encryption

I have znapzend running, and it makes a lot of snapshots. Sometimes, some of them are bad, and can't be used (for example, attempting to send them to a replica destination fails). I now use the skipIntermediates option, and so at least forward progress is made on the next snapshot interval.

In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on .. day?)

Warning! Speculation below:

this may be related to a race condition?
znapzend wakes up and makes recursive snapshots of about 6 first-level child datasets ot rpool (ROOT, home, data, ...) all at the same time (as well as a couple of other pools, some of those still using LUKS for encryption underneath instead).
I have been having trouble with the ubuntu-native zsysd, whch gets stuck at 100% cpu. Normally I get frustrated and just disable it.
However, recently, I have been trying to understand what it's doing and what's going wrong (it tries to collect every dataset and snapshot and property in memory on startup). It seems like this has happened several times in the past few days while I have been letting zsysd run (so more contention for libzfs operations)
Update I haven't seen this again since disabling zsysd .. ~3 weeks and counting.

aerusso · 2021-06-12T23:20:06Z

@jgoerzen Can you

Capture the zpool events -v report when one of these "bad" snapshots is created?
Try to zfs send that snapshot (i.e., to zfs send ... | cat >/dev/null; notice the need to use cat).
Reboot, and try to zfs send the snapshot.

In my case, #11688 (which you already reference), I've discovered that rebooting "heals" the snapshot -- at least using the patchset I mentioned there

jgoerzen · 2021-06-12T23:36:27Z

I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs.

It is interesting to see the discussion today in #11688. The unique factor about the machine that doesn't work for me is that I have encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know.

jgoerzen · 2021-06-21T13:38:09Z

Hello @aerusso,

The problem recurred over the weekend and I noticed it this morning.

Unfortunately, the incident that caused it had already expired out of the zpool events buffer (apparently), as it only went as far back as less than an hour ago. However, I did find this in syslog:

Jun 20 01:17:39 athena zed: eid=34569 class=authentication pool='rpool' bookmark=12680:0:0:98
Jun 20 01:17:39 athena zed: eid=34570 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:242
Jun 20 01:17:40 athena zed: eid=34571 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:261
...
Jun 20 17:17:39 athena zed: eid=37284 class=authentication pool='rpool' bookmark=19942:0:0:98
Jun 20 17:17:39 athena zed: eid=37285 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:242
Jun 20 17:17:40 athena zed: eid=37286 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:261
...
Jun 20 18:17:28 athena zed: eid=37376 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37377 class=authentication pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37378 class=data pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:40 athena zed: eid=37411 class=authentication pool='rpool' bookmark=21923:0:0:0

It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation.

zpool status reported:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:04:12 with 0 errors on Sun Jun 13 00:28:13 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x5c81>:<0x0>
        <0x3188>:<0x0>
        rpool/crypt/debian-1@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        <0x4de6>:<0x0>

Unfortunately I forgot to attempt to do a zfs send before reboot. Those snapshots, though not referenced directly, would have been included in a send -I that would have been issued. From my logs:

Jun 20 18:17:03 athena simplesnapwrap[4740]: Running: /sbin/zfs send -I rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T22:17:02__ rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T23:17:03__
Jun 20 18:17:03 athena simplesnap[2466/simplesnapwrap]: internal error: warning: cannot send 'rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d': Invalid argument

So I think that answers the question.

After a reboot but before a scrub, the zfs send you gave executes fine.

cbreak-black · 2021-07-04T12:28:54Z

I have similar symptoms, on an encrypted single-ssd ubuntu 21.04 boot pool, using stock zfs from ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on reoccurence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM related.

Does this problem happen with slower pools (like hard disk pools?)
Does this problem happen with pools that have redundancy?
Does this problem happen with with pools that don't trim (hard disk pools again?)

aerusso · 2021-07-04T13:16:46Z

@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe).

I have a suspicion that this may be a version of #10737 , which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many zfs sends on my machine, some of which are not finished; SSDs do the send much faster, so are more likely to get deeper into the zfs send before the next command in the pipeline times out; a reboot heals the issue, for me; there's no on disk corruption, as far as I can tell).

I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!)

cbreak-black · 2021-07-04T13:26:04Z

@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again.

InsanePrawn · 2021-07-04T19:16:11Z

Can I join the club too? #10019
Note how it's also at 0x0. Sadly I deleted said snapshot ~~and dataset~~ by now.

aerusso · 2021-07-04T20:21:30Z

@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and neither can github, apparently, either). However, in your report you say this was a "recent git master" and the commit I'm currently betting on being guilty is da92d5c, which was committed in November of the previous year, so I can't use your data point to rule out my theory!

Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just, turn my computer on and browse the web, check mail, etc.) it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this besides "snapshots being created and deleted every couple minutes", it might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM).

InsanePrawn · 2021-07-04T21:02:36Z

Yeah, idk why I didn't go look for the commit in my issue - luckily for us, that server (and pool; it does say yolo, but it's my private server's root pool. it's just that i won't cry much if it breaks; originally due to then unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me after 610eec4

InsanePrawn · 2021-07-04T21:06:19Z

FWIW the dataset the issue appeared on was an empty filesystem (maybe a single small file inside) dataset that had snapshots (without actual fs activity) taken in quick intervals (somewhere between 30s and 5m intervals) in parallel with a few (5-15) other similarly empty datasets.
Edit: These were being snapshotted and replicated by zrepl, probably in a similar manner to what znapzend does.

The pool is a raidz2 on 3.5" spinning SATA disks.
I'm afraid I have nothing more to add in terms of reproduction :/

Edit: Turns out the dataset also still exists, the defective snapshot however does not anymore. I doubt that's helpful?

aerusso · 2021-07-05T02:38:25Z

@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release?)

I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption also is a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated, case).

Getting at zpool events -v showing the error would probably tell us something (see mine).

cbreak-black · 2021-08-11T09:42:29Z

I've since added redundancy to my pool (it's now a mirror with two devices), and disabled autotrim. The snapshot corruption still happens. Still don't know what is causing it. And I also don't know if the corruption happens when creating the snapshot, and only later gets discovered (when I try to zfs send the snapshots), or if snapshots get corrupted some time in between creation and sending.

aerusso · 2021-08-14T19:51:59Z

@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of /tmp/zed.debug.log)?

This will get the output of zpool events -v as it is generated, and will give times, which you can conceivably triangulate with your other logs. There's other information in those logs that is probably useful, too.

I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order.

wohali · 2021-09-01T21:19:05Z

Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see #11688 (comment) for our story.

rincebrain · 2021-09-21T06:43:42Z

Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once.

gamanakis · 2021-09-30T12:41:31Z

@jgoerzen are you using raw send/receive? If yes this is closely related to #12594.

jgoerzen · 2021-09-30T12:51:15Z

@gamanakis Nope, I'm not using raw (-w).

phreaker0 · 2021-10-12T08:41:55Z

it's present in v2.1.1 as well:

Okt 09 01:01:14 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_hourly
Okt 09 01:01:16 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_frequently
Okt 09 01:01:16 tux kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Okt 09 01:01:16 tux kernel: PANIC at arc.c:3836:arc_buf_destroy()
Okt 09 01:01:16 tux kernel: Showing stack for process 435
Okt 09 01:01:16 tux kernel: CPU: 2 PID: 435 Comm: z_rd_int_1 Tainted: P           OE     5.4.0-84-generic #94-Ubuntu
Okt 09 01:01:16 tux kernel: Hardware name: GIGABYTE GB-BNi7HG4-950/MKHM17P-00, BIOS F1 05/24/2016
Okt 09 01:01:16 tux kernel: Call Trace:
Okt 09 01:01:16 tux kernel:  dump_stack+0x6d/0x8b
Okt 09 01:01:16 tux kernel:  spl_dumpstack+0x29/0x2b [spl]
Okt 09 01:01:16 tux kernel:  spl_panic+0xd4/0xfc [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_drain+0xd3/0xe0 [zfs]
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_post+0x234/0x270 [zfs]
Okt 09 01:01:16 tux kernel:  arc_buf_destroy+0xfa/0x100 [zfs]
Okt 09 01:01:16 tux kernel:  arc_read_done+0x251/0x4a0 [zfs]
Okt 09 01:01:16 tux kernel:  zio_done+0x407/0x1050 [zfs]
Okt 09 01:01:16 tux kernel:  zio_execute+0x93/0xf0 [zfs]
Okt 09 01:01:16 tux kernel:  taskq_thread+0x2fb/0x510 [spl]
Okt 09 01:01:16 tux kernel:  ? wake_up_q+0x70/0x70
Okt 09 01:01:16 tux kernel:  ? zio_taskq_member.isra.0.constprop.0+0x60/0x60 [zfs]
Okt 09 01:01:16 tux kernel:  kthread+0x104/0x140
Okt 09 01:01:16 tux kernel:  ? task_done+0xb0/0xb0 [spl]
Okt 09 01:01:16 tux kernel:  ? kthread_park+0x90/0x90
Okt 09 01:01:16 tux kernel:  ret_from_fork+0x1f/0x40

phreaker0 · 2021-10-13T19:04:30Z

@aerusso you wrote that da92d5c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (as revert causes merge conflicts I can't fix myself) so I could test if the machine no longer crashes?

aerusso · 2021-10-14T06:53:32Z

@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688, (which I suspect is the same as this bug).

I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But, If you have a reliable reproducer (or a whole lot of time) you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved.

grahamperrin · 2024-12-28T16:08:25Z

#12014 (comment)

… determine if all the issues around encrypted snapshots have been fixed … if there are still bugs, then I'd really like fresh, clear bug reports. …

@robn link from issue 13755 to some more recent point in this issue 12014?

ckruijntjens · 2024-12-31T16:33:00Z

I Now use a different approach with sanoid/syncoid. I use sendoptions to recusive send the snapshots. Now i don't have errors on the enrypted pool. Its now running 6 days without errors. before i always had errors within 5 days.

I will keep you informed if it keeps running without errors.

ckruijntjens · 2025-01-01T19:31:52Z

Guys,

I can confirm. If i use the following i get encryption errors within the 5 days.

syncoid --recursive

But when i use it like this the errors are not happening.

syncoid --sendoptions="R"

I am using it now to send it in raw mode. So now the errors are not happening anymore.

AndrewJDR · 2025-01-01T23:19:17Z

Just throwing this out there for those of you doing automated tests (@HankB and @Germano0) and are trying to reduce the amount of time it takes to reproduce the issue:
Since there's now some evidence that the bug is exacerbated by the manual walking of the dataset tree that syncoid performs during transfers, it may make sense to use more complex (deeper and broader) dataset hierarchies in your testcases. And perhaps when generating writes to the datasets, distribute those writes to files strewn throughout that entire dataset hierarchy...

HankB · 2025-01-02T04:29:16Z

it may make sense to use more complex (deeper and broader) dataset hierarchies in your testcases.

Agreed.

The script I wrote eventually goes 5 deep for nested data sets with a total of 35 in my present test.

distribute those writes to files strewn throughout that entire dataset hierarchy

And writes files throughout the pool (using find -type f and skipping randomly from 0-20 files through the resulting list on each pass. There is a mix of compressible (test) files and in-compressible (random) files.

I loop each pass with 750s sleep in between with the syncoid command and modify script in their own loops. Since the syncoid and modify script take different times to complete, that results in un-synchronized overlap between the two. sanoid runs on a 15 minute schedule and includes frequent snapshots.

I'm open to other suggestions, but in the interest of reproducible results, I can't make too many changes at this point. I'd suggest that someone interested in additional strategies try them either on available H/W or VMs and this can reduce the overall time to get an answer.

best,

Edit: I'm adding a daily scrub to the testing process (manually invoked.) It occurs to me that the corruption could be happening and just not detected.

MichaelKirgus · 2025-02-04T20:11:12Z

I also have the symptom with the corrupted snapshots, but with one kernel panic 😒:
kernel: PANIC at arc.c:3685:arc_buf_destroy()
Its an TrueNAS Scale system (6.6.44-production+truenas, zfs-2.2.99-1) with 2 pools that are using AES-256-GCM.
All child datasets are encrypted.
No READ WRITE CKSUM errors from discs, 1 pool 2 x RAIDZ3 8 wide and the other pool 2 x MIRROR 3 wide.
The system is more of capable with 512 GB ECC RAM and AMD EPYC 3451.
I use periodic snapshots and the remote replication feature that sends zfs snapshots to an remote system which can take a few hours if there are a lot of changes.
After a few days the pool is marked "unhealthy" with the "characteristic" :<0x1> at a bunch of snapshots.
Yesterday I started an scrub and the system crashed.
Today I scrubbed the pool again and now the errors are gone...but i dont think it lasts long...

Sadly the system with this specific RAIDZ3 and encryption configuration is quite new.
How can I assist to solve the problem?
Its an productive system, but maybe i can collect some interesting data to dig deeper.

Its an very nasty bug, its corrupts the snapshots (or other metadata) slowly and at some time there it forces you to scrub and (forcefully sadly reboot) the system...

HankB · 2025-02-05T00:30:09Z

On Tue, Feb 4, 2025 at 2:11 PM Michael Kirgus ***@***.***> wrote: How can I assist to solve the problem?

Hi Michael, I don't know what data to collect that would be helpful. My focus is on identifying steps that reliably reproduce the issue in an environment that can be shared. (I can reproduce this on my laptop in a matter of hours but don't care to share the contents of my storage with the world.) To that end I'm working on scripts to populate a pool with synthetic data and subsequently modify it and then use `syncoid` to send the pool to other storage. It has been a little frustrating and it is taking a lot of time. I can usually provoke the corruption within two days but when no corruption occurs, I need to run a *lot* longer to insure the corruption is not occurring. My most recent test was with the kernel and ZFS versions listed in this issue and that ran for over 10 days and over 1100 `syncoid` invocations and did not result in corruption. At the moment I'm revisiting my methods to see if I can improve on the ability to reveal the issue. If you have any ideas for conditions that tend to cause this (if you're experiencing the same issue) Feel free to share them with me. You can review my efforts at https://github.com/HankB/provoke_ZFS_corruption. Feel free to contribute with informative issues or PRs. Better yet, if you have any spare H/W available, try duplicating my tests (or adapting the synthetic data and processing to your typical work flow.) I'm not a kernel dev or ZFS expert by any stretch of the imagination so don't be afraid to suggest something because you think I already thought of it. best,

…

Message ID: ***@***.***>

-- Beautiful Sunny Winfield

dcarosone · 2025-02-05T02:39:31Z

it seems plausible that the issue is some kind of lock or resource contention, so setting up conditions that provoke or exacerbate that presumably may make it easier to reproduce.

I know both of these were true in my pool, back when i saw the problem

new recursive snapshots taken while the previous is still sending; try a short snapshot interval, limited send bandwidth
multiple subtrees getting snapshots; test/a test/b test/c all getting "independent" snapshots and sends, in the same pool

Also perhaps artificially slow pool storage via vm config.

HankB · 2025-02-05T05:04:19Z

On Tue, Feb 4, 2025 at 8:40 PM Daniel Carosone ***@***.***> wrote: it seems plausible that the issue is some kind of lock or resource contention, so setting up conditions that provoke or exacerbate that presumably may make it easier to reproduce.

I agree.

I know both of these were true in my pool, back when i saw the problem - new recursive snapshots taken while the previous is still sending; try a short snapshot interval, limited send bandwidth - multiple subtrees getting snapshots; test/a test/b test/c all getting "independent" snapshots and sends, in the same pool Also perhaps artificially slow pool storage via vm config.

I'm running on bare metal. I'd be delighted if someone would pick up the mantle on testing on a VM. With my latest test I have reduced the time to provoke corruption to less than a day. Results can be viewed at https://hankb.github.io/provoke_ZFS_corruption/tests/2025-02-03_methodology/results/ The only potential contention I can spot is a 'stir' (fairly heavy disk I/O to a lot of files) completing about 2s before the next `sanoid` run. I think that 2 seconds is not long enough to flush all data to the drive. I suppose what I should do is stop the timer that normally starts the sanoid snapshots and kick sanoid off manually (or perhaps not use `sanoid` and just create snapshots directly.) Tomorrow. best,

…

Message ID: ***@***.***>

-- Beautiful Sunny Winfield

siilike · 2025-02-05T09:43:43Z

I have one pool that hasn't had this issue in 6 days (and from what I can see from the logs didn't have in the previous "restart cycle"). The only difference other than it having less disk I/O is that it has recordsize 128k while the others have recordsize 1M.

I suggest taking snapshots every minute instead and why not also run syncoid and stir concurrently and continuously.

siilike · 2025-02-05T09:51:31Z

I finally found my old tests under another ticket: #11688 (comment)

Among other things I wrote:

There is a strong correlation with how much data is written to the dataset: after creating a third dataset and moving some write-heavy apps there the situation has slightly improved.
Usually it goes bad after the 2nd snapshot -- after deleting some old snapshots it works fine for one round of replication, and once another snapshot is taken one of them is not possible to replicate. This also applies when replication is not done between the snapshots.
No relation to taking snapshots while a send is in progress.
Dataset is random, but mostly happens to the same ones (write-heavy or at least some writes)

And there is a test script that quickly produces the issue:

#!/bin/sh

for i in $(seq 1 100)
do
        echo $i
        zfs snapshot data2/test@$i
        zfs send data2/test@$i | cat > /dev/null
done

for i in $(seq 101 200)
do
        echo $i
        dd if=/dev/zero of=test bs=1 count=1 2>/dev/null
        zfs snapshot data2/test@$i
        zfs send data2/test@$i | cat > /dev/null
done

What I don't remember, though, is whether the script was run on a freshly restarted and scrubbed dataset or not.

HankB · 2025-02-05T15:29:29Z

@siilike Thanks for the comments and link to the other issue.

I suggest taking snapshots every minute instead and why not also run syncoid and stir concurrently and continuously.

I agree and that's what I intend to explore with my next test. I also plan to modify the scripts to halt further operations as soon as corruption is identified.

And there is a test script that quickly produces the issue:

I'll give that a try as well. I particularly like sending the data to /dev/null. I also note that this does not require syncoid or sanoid.

Sieboldianus · 2025-02-05T16:14:16Z

When setting up syncoid/sanoid, I also observed issues with fast syncoid pulls and added a wait time of 10 seconds between datasets. Might be totally not correlated, as I am not using --recursive but --sendoptions=Rw. See the script below.

pull_snapshots.sh

#!/bin/bash

################################################################################
#
# Shell script to pull zfs snapshots with syncoid using a list of key:value pairs
# Works with sendoptions=R, where syncoid's --recursive flag cannot be used
#
# IMPORTANT: Use bash, not sh! e.g.: > bash pull_snapshots.sh
#
################################################################################


# Exit as soon as a command fails
set -e

# Accessing an empty variable will yield an error
set -u

FILE_SNAPSHOTS="$1"
# FILE_SNAPSHOTS=snapshots.txt

pull_syncoid() {
    ZFS_SOURCE=$1
    ZFS_TARGET=$2
    # we use </dev/null to redirect standard input for the syncoid (ssh) command
    # which might otherwise consume the while loop's input
    syncoid --no-sync-snap --no-privilege-elevation \
      --sshkey /home/recvuser/.ssh/id_ed25519 \
      --sshconfig=/home/recvuser/.ssh/config \
      --sendoptions=Rw \
      'proxmox':"$ZFS_SOURCE" "$ZFS_TARGET"
    # optionally add --debug above;

}


while IFS=: read -u 3 -r zfs_source zfs_target; do
  # skip commented lines
  [[ "$zfs_source" =~ ^[[:space:]]*# ]] && continue
  # skip empty lines
  [ -z "$zfs_source" ] && continue
  echo "$(date +%Y-%m-%d) $(date +%H:%M) Pulling: $zfs_source -> $zfs_target"

  pull_syncoid "$zfs_source" "$zfs_target";
  # wait 2 seconds between pulls
  sleep 10

done 3< "$FILE_SNAPSHOTS"

This script is called via cron_autostart.sh:

#!/bin/bash

# Exit as soon as a command fails
set -e

# Accessing an empty variable will yield an error
set -u

# To avoid having to type the absolute path to a command
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin"

# If started as root, then re-start as user "mastodon":
if [ "$(id -u)" -eq 0 ]; then
    echo "Running as root. This should never be reached.";
    exit 1
fi

echo "This runs as user $(id -un)";
# prints "recvuser"

TODAY=$(date -I)

if [[ ( -f /home/recvuser/maintenance) ]]; then
    echo "Maintenance mode, aborting autostart."
    exit 1
fi

echo "Starting pull snapshots at ${TODAY}"

cd ~
bash pull_snapshots.sh "snapshots.txt"

echo "Checking Scrub Status.."

zpool status -v

echo "All Done. $(date +%Y-%m-%d) $(date +%H:%M).";
echo "Goodbye.";

# check if scrub is currently running;
# do nothing if it is, shutdown otherwise
if [[ $(zpool status | grep -q "scrub in progress") ||  ( -f /home/recvuser/maintenance) ]]; then
    echo "Scrub still running or maintenance mode, not shutting down"
else
    sudo /sbin/shutdown
fi;

I don't think I observed the error, although I sometimes get

CRITICAL ERROR: ssh -F /home/recvuser/.ssh/config -i /home/recvuser/.ssh/id_ed25519 -S /tmp/syncoid-proxmox-1738319699-9069 proxmox ' zfs send -R -w -I '"'"'tank_hdd/backups'"'"'@'"'"'autosnap_2024-12-01_00:00:03_monthly'"'"' '"'"'tank_hdd/backups'"'"'@'"'"'autosnap_2025-01-31_00:00:37_daily'"'"' | lzop | mbuffer -q -s 128k -m 16M' | mbuffer -q -s 128k -m 16M | lzop -dfc | pv -p -t -e -r -b -s 8429428704 | zfs receive -s -F 'tank/hdd/backups' 2>&1 failed: 256 at /usr/sbin/syncoid line 889.

I thought this was related to an unstable internet connection, as resume always works and neither sending nor receiving pools are corrupted. Might not be the issue described here (?).

MichaelKirgus · 2025-02-05T16:21:55Z

In my case TrueNAS does the replication and snapshotting with his #middleware, but in the end it comes to changes at the metadata and I/O which has to be handled by ZFS and the hardware somehow.
The appearance with syncoid or sanoid are only symptoms of an bug much deeper.
The system creates an recursive snapshot with in my case over 30 datasets with itself 2-3 child datasets...90 datasets.

Dataset is random, but mostly happens to the same ones (write-heavy or at least some writes)

Yes, it is "random"...but i had errors on datasets with static configuration data which are not touched at all and had errors.

The only potential contention I can spot is a 'stir' (fairly heavy disk I/O
to a lot of files) completing about 2s before the next sanoid run

I want to mention that all my pools are SSD only with an LSI HBA...the iowait of the system is always under 5 % or so...

I will try to setup an test environment with a encrypted pool and 2 virtual systems to replicate between each of them.
ZFS is an awesome file system, this needs to be fixed 🤓.

siilike · 2025-02-05T16:37:30Z

I tried to reproduce the issue with the script, but it did not work, so I likely ran it when the pool had gone bad already.

MichaelKirgus · 2025-02-05T19:24:38Z

@siilike What pool settings are used in your environment? I started with this (this is default setting in TrueNAS):
Sync: STANDARD
Compression Level: LZ4
Atime: Off
Deduplication: Off
Checksum: On
Copies: 1
Record Size: 128K

I wonder if the sync property can make an difference.
If i understand it right, ZFS should only "lie" to other processes but hopefully not to itself...

And there is a test script that quickly produces the issue

I executed the test script twice, but it seems not to corrupt any metadata so far.
The system is booted with the debug kernel, hopefully this will help if it crashes afterwards if i scrub...

siilike · 2025-02-05T19:40:56Z

As far as I understand, the issue is that syncoid does something that triggers this bug in a particular pool. Other pools on the same machine are not affected.

The script I posted works for triggering the creation of invalid snapshots in a pool that has already gone bad. But it does not cause the pool to go bad i.e. it does not trigger the actual bug.

MichaelKirgus · 2025-02-05T20:35:35Z

As far as I understand, the issue is that syncoid does something that triggers this bug in a particular pool. Other pools on the same machine are not affected.

The script I posted works for triggering the creation of invalid snapshots in a pool that has already gone bad. But it does not cause the pool to go bad i.e. it does not trigger the actual bug.

I think the root cause is the zfs send in combination with recursive snapshots. If an snapshot that was recursive created and then an child dataset was send. If in this situation one of his parents are send, modified, deleted or (enumerated?), this seems to trigger this metadata corruption.
But it seems only to occur if dataset encryption is enabled.
Does you system uses hardware accelerated encryption via AESNI?

Sadly i am not an ZFS developer, but the productive system was running fine for years without the encryption.
The system has send terabytes of data to an remote location, sometimes this can be an time consuming task.
I think the combination of encryption, specific dataset properties, recursive snapshots and the timing of zfs send can somehow trigger an race condition.

wohali · 2025-02-05T21:52:59Z

As far as I understand, the issue is that syncoid does something that triggers this bug in a particular pool. Other pools on the same machine are not affected.

Just confirming: we saw this issue with pyznap, and never used syncoid or sanoid.

I think the root cause is the zfs send in combination with recursive snapshots.

Just confirming: we never did any recursive snapshots.

Also confirming: new replication to another server without encryption enabled eliminated the problem.

HankB · 2025-02-07T15:08:56Z

Suggestions to run my test operations have borne fruit in that the first test produced corruption in about 15 minutes. Current results at https://hankb.github.io/provoke_ZFS_corruption/ I've also mentioned this in the thread at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Ta136a4d5eb5a9af8-M23a47f6202c782813cda2d84/reproducing-corruption-related-to-native-encryption

jimsalterjrs · 2025-02-07T20:47:19Z

Just confirming: we saw this issue with pyznap, and never used syncoid or sanoid.

Thank you for this; I'm getting the impression that there has been a serious correlation!=causation problem around this bug recently, no matter how many times I remind people that syncoid doesn't hit any APIs, it literally just orchestrates underlying OpenZFS commands! =)

HankB · 2025-02-07T22:38:57Z

I'm getting the impression that there has been a serious correlation!=causation

Mr Salter,
Thanks for the clarification. I've tried to be clear that syncoid is not causing the corruption but rather the pattern of calls is triggering a bug in ZFS.

I'm continuing to use syncoid for my testing because it is convenient and is known to trigger the issue. Most recent tests are not using sanoid because I'm not sure what it will do if invoked back to back to back. Ordinarily it runs at 15 minute intervals. I just invoke zfs snap -r ... with names crafted to be easy to deal with for a script that destroys older snapshots. I'm also time stamping everything in local time to make it easy (for me) to correlate.

I'm running the snapshot command + a process that modifies random files in a tight loop. Another process runs syncoid in a tight loop and a third loop trims snapshots about every minute.

best,

gamanakis · 2025-02-11T10:10:25Z

@HankB I was playing with your scripts and the following occurred:
After a couple of hours, I ran out of space on the VM and was intentionally interrupting and restarting syncoid (in addition to running thrash_stir, thrash_syncoid, trim_snaps scripts in parallel) to see what may happen. Then it started throwing i/o errors and suspending the pool.

Trying to delete the "%recv" snapshot, created by "receive -s" which syncoid uses by default, led to the same errors (of note: no kernel panic). After a couple of scrubs and rebooting the VM the pool was ok. It might be worth noting that syncoid uses "receive -s" by default, which got me into thinking perhaps this may be the culprit.

@wohali from what I can tell pyznap does not use "receive -s" by default. Do you know if you were using this feature?

rbrewer123 · 2025-02-11T12:17:30Z

@gamanakis I experienced the snapshot corruption on an encrypted zfs dataset regularly, and I only used pyznap with mostly-default settings. So it seems unlikely that recv -s is the only way to trigger the issue. I still use pyznap, but I no longer use zfs-encrypted datasets, and I no longer experience corruption.

HankB · 2025-02-11T13:42:02Z

@gamanakis Thank you for your help.

@HankB I was playing with your scripts and the following occurred: ... Then it started throwing i/o errors and suspending the pool.

I've seen the send pool nearly filling and implemented the trim script to manage that. I usually start with the pool ~50% and have not seen that again.

I've looked for and not seen I/O errors. If present that would cast doubt that the problem is a ZFS bug and rather point to I/O errors instead.

After a couple of scrubs and rebooting the VM the pool was ok. ...

I've never bothered to try to recover as my interest is in provoking the corruption reproducibly and quickly.

syncoid uses "receive -s" by default, which got me into thinking perhaps this may be the culprit.

I don't see how something on the receiving end could cause this as it's decoupled from the send. Once the stream leaves the zfs send process there is no interaction except for possible back pressure if the data cannot be written as fast as it can be written. I suppose that could be part of the pattern.

I need to add a script to loop the script that manages snapshots and provide better instructions now that I'm settling on a procedure. I also believe I have provoked the corruption on FreeBSD and need to investigate that and report if that seems likely.

Germano0 · 2025-02-11T15:29:48Z

Is there anything I can do to help you?

HankB · 2025-02-11T16:59:19Z

@Germano0 Yes, probably.

I will, of course, defer to the ZFS and kernel devs for any direction they would like this to take. My desire was to see if there was a way to reproducibly provoke corruption as the first step in tracking down the regression. My repo (https://github.com/HankB/provoke_ZFS_corruption) is a bit rough but provides that capability. I'm working to improve that, particularly the instructions.

The next step is to determine which commit introduced the regression via git bisect. I have not yet started that. Anyone following my process will likely find errors and omissions and things that are just not clear. Those should probably be referred to me at my repo and not here. It would certainly help to have others try to reproduce my results.

I'll repeat my offer to either start another issue relevant to this issue in OpenZFS or continue discussion at my repo if anyone thinks this is unnecessarily cluttering this issue.

wohali · 2025-02-12T06:57:09Z

@wohali from what I can tell pyznap does not use "receive -s" by default. Do you know if you were using this feature?

It was likely enabled, yes. I don't recally for sure, but we move enough data that restarting a full transfer would be very bad.

I agree that I don't see how something on the receiving end could trigger this bug, though.

dcarosone · 2025-02-12T08:52:20Z

When I was seeing it with znapzend, there was no -s. And, yes, it shouldn't matter how it's received, other than perhaps via some timing and back-pressure effect.

jgoerzen added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels May 8, 2021

glueckself mentioned this issue May 9, 2021

permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload #11688

Open

vicsn mentioned this issue May 12, 2021

ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001

Open

lathiat mentioned this issue Sep 26, 2021

ZFS on Linux null pointer dereference #11679

Closed

rincebrain mentioned this issue Sep 29, 2021

Sometimes raw send on encrypted datasets does not work when copying snapshots back #12594

Closed

ickc mentioned this issue Jan 2, 2025

Syncoid keeps corrupting my datasets jimsalterjrs/sanoid#703

Closed

ZFS corruption related to snapshots post-2.0.x upgrade #12014

ZFS corruption related to snapshots post-2.0.x upgrade #12014

Comments

jgoerzen commented May 8, 2021 • edited Loading

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Potentially related bugs

jgoerzen commented May 8, 2021 • edited Loading

jstenback commented May 8, 2021

glueckself commented May 9, 2021 • edited Loading

dcarosone commented May 21, 2021 • edited Loading

aerusso commented Jun 12, 2021

jgoerzen commented Jun 12, 2021

jgoerzen commented Jun 21, 2021

cbreak-black commented Jul 4, 2021

aerusso commented Jul 4, 2021

cbreak-black commented Jul 4, 2021

InsanePrawn commented Jul 4, 2021 • edited Loading

aerusso commented Jul 4, 2021

InsanePrawn commented Jul 4, 2021 • edited Loading

InsanePrawn commented Jul 4, 2021 • edited Loading

aerusso commented Jul 5, 2021

cbreak-black commented Aug 11, 2021

aerusso commented Aug 14, 2021

wohali commented Sep 1, 2021

rincebrain commented Sep 21, 2021

gamanakis commented Sep 30, 2021 • edited Loading

jgoerzen commented Sep 30, 2021

phreaker0 commented Oct 12, 2021

phreaker0 commented Oct 13, 2021

aerusso commented Oct 14, 2021

grahamperrin commented Dec 28, 2024

ckruijntjens commented Dec 31, 2024

ckruijntjens commented Jan 1, 2025

AndrewJDR commented Jan 1, 2025

HankB commented Jan 2, 2025 • edited Loading

MichaelKirgus commented Feb 4, 2025

HankB commented Feb 5, 2025 via email

dcarosone commented Feb 5, 2025

HankB commented Feb 5, 2025 via email

siilike commented Feb 5, 2025

siilike commented Feb 5, 2025

HankB commented Feb 5, 2025

Sieboldianus commented Feb 5, 2025 • edited Loading

MichaelKirgus commented Feb 5, 2025

siilike commented Feb 5, 2025

MichaelKirgus commented Feb 5, 2025

siilike commented Feb 5, 2025

MichaelKirgus commented Feb 5, 2025

wohali commented Feb 5, 2025

HankB commented Feb 7, 2025

jimsalterjrs commented Feb 7, 2025

HankB commented Feb 7, 2025

gamanakis commented Feb 11, 2025 • edited Loading

rbrewer123 commented Feb 11, 2025

HankB commented Feb 11, 2025

Germano0 commented Feb 11, 2025

HankB commented Feb 11, 2025

wohali commented Feb 12, 2025

dcarosone commented Feb 12, 2025 • edited Loading

jgoerzen commented May 8, 2021 •

edited

Loading

jgoerzen commented May 8, 2021 •

edited

Loading

glueckself commented May 9, 2021 •

edited

Loading

dcarosone commented May 21, 2021 •

edited

Loading

InsanePrawn commented Jul 4, 2021 •

edited

Loading

InsanePrawn commented Jul 4, 2021 •

edited

Loading

InsanePrawn commented Jul 4, 2021 •

edited

Loading

gamanakis commented Sep 30, 2021 •

edited

Loading

HankB commented Jan 2, 2025 •

edited

Loading

Sieboldianus commented Feb 5, 2025 •

edited

Loading

gamanakis commented Feb 11, 2025 •

edited

Loading

dcarosone commented Feb 12, 2025 •

edited

Loading