ZFS Receive of encrypted incremental data stream causes a PANIC #13445
Comments
I would suggest trying a 2.1.4-based build, as there were fixes for encrypted send/recv and blocksize changing in the interim. |
I experience the same issue on ZFS 2.1.4 on Debian Bullseye amd64 and on another TrueNAS system. This practically renders my ZFS setup useless because I cannot perform any snapshot based replication and/or backup. Is it something dataset specific that could be fixed? Is there a workaround for this? It's been nearly three months since this has been reported. 😢 |
You can definitely work around it, especially if you know the date of the last snapshot that was successfully sent. The idea is to find all files and folders that were created/modified since the last snapshot was sent, back it up, roll the dataset back, and then put the backed up files back on. Create a new incremental snapshot, and send it. It should work. |
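A rough shell sketch of that procedure; every dataset, snapshot, and host name here is a placeholder assumption, not taken from the thread:

```sh
# Sketch only, assuming pool/data is the source and @last-sent replicated successfully.
# 1. See what changed since the last snapshot that made it to the backup.
zfs diff pool/data@last-sent
# 2. Copy the listed files somewhere outside the pool (rsync/cp by hand), then roll back.
zfs rollback -r pool/data@last-sent
# 3. Put the files back, take a fresh snapshot, and retry the incremental send.
zfs snapshot pool/data@retry
zfs send -w -i pool/data@last-sent pool/data@retry | ssh backuphost zfs recv backuppool/data
```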
Interesting approach! But sadly it wouldn't work for me. I wonder how this is possible: userspace operations causing a literal kernel panic. I know everybody focuses on the issues/bugs they experience themselves, but I think this bug is really, really serious. It crashed a TrueNAS storage server outright (reboot!) and causes a panic message on Linux. 😢 |
The original bug appears to be #13699, people seem to be piling in various other things. I am going to bite my tongue and only say that if you find a way to convince OpenZFS leadership that native encryption bugs are worth more than "I hope someone fixes it", please let me know. |
I get what you are saying... |
I haven't dug into this bug, so I don't know if the flaw in this code is specific to encryption, but it doesn't seem like trying it can make things worse...
…On Fri, Aug 5, 2022 at 4:56 AM Alexander Schreiber ***@***.***> wrote:
> Do you think I could work around it by sending the dataset decrypted instead? I mean, complete an initial sync decrypted, then perform decrypted incremental send? Is it just the encryption that's causing it?
|
I sent an unencrypted (that is, not raw (-w)) version of the dataset and can confirm that sending incremental snapshots to it works flawlessly... Given that I have sent the dataset raw and encrypted twice and I was unable to send incremental snapshots, I conclude the issue lies with the encryption / sending "raw". |
Actually it’s an issue with receiving encrypted streams. The send function works as expected. If you downgraded the receiver to 2.0.2 it would probably receive just fine. |
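For reference, a sketch of the two send modes being compared here (dataset and host names are assumptions); the raw form ships the encrypted blocks untouched, while the non-raw form decrypts on the sender:

```sh
# Raw (encrypted) incremental send -- the mode that trips the receive-side panic here.
zfs send -w -i pool/data@prev pool/data@new | ssh backuphost zfs recv backuppool/data

# Non-raw incremental send -- decrypted on the sender, reported above to work.
zfs send -i pool/data@prev pool/data@new | ssh backuphost zfs recv backuppool/data
```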
I don't think so. #14119 fixes an issue where receiving non-raw, unencrypted incremental sends into an encrypted target without the key loaded was accidentally allowed, causing havoc down the stack. |
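As an aside, a quick hedged way to check the precondition that #14119 enforces; the dataset name is a placeholder:

```sh
# The destination key should be loaded before a non-raw receive into an encrypted target.
zfs get keystatus destpool/encrypted/data    # expect "available"
zfs load-key destpool/encrypted/data         # load it if the status is "unavailable"
```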
I ran into this issue on
The value returned by
On Debian, it's
It's hitting this VERIFY statement, which is only called when raw send is enabled: Line 1783 in b9d9845
EOPNOTSUPP appears to be coming from dnode_set_blksz: Line 1814 in b9d9845
I'll post more details if I can narrow it down to a simple test case, I don't know yet exactly how to reproduce it in isolation. |
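For anyone poking at the same code path, a hedged way to look at the block-size state involved; pool and dataset names are placeholders:

```sh
# Check whether the large_blocks pool feature is enabled/active on each side.
zpool get feature@large_blocks srcpool dstpool
# Check the recordsize of the datasets being sent and received.
zfs get recordsize srcpool/data dstpool/backup/data
```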
Hi, I just wanted to report, I think I'm having the issue. Has there been any progress in tackling this issue? I'm having trouble figuring out how I'm going to make a backup of my main pool, since replication always fails. I can post the dmesg log I get. |
Are you sending an incremental snapshot, or the initial snapshot? If the former, identify which files were changed since the last snapshot, copy them out of the dataset/pool, delete them from the pool, put them back, delete the newest snapshot, and make a new one. Replication should work now. If the latter, create a new dataset alongside the one failing to replicate and copy all of the files from the old dataset to the new one. Delete the old dataset once the data is copied. |
Hold on, I'm confused about your former option. That's the problem. It's an incremental snapshot, a series of them actually. |
I’m venturing a guess that your affected dataset is an older dataset from an older ZFS version. On newer versions of ZFS, some dataset metadata left behind by older versions doesn't seem to play nice with the new versions when sending and receiving. The idea here is to go back as many snapshots as needed to get to the point where the last snapshot sent was still working on the backup. Make a note of every file that's been modified or created on the dataset since the last working snapshot, and copy them out of the pool. Roll back the source dataset to the last snapshot that sent successfully to the backup. Do the same for the destination. Delete all the files on the source that you copied out and that still exist at that point. Copy the files back into the dataset, and create a new snapshot. Sending the snapshot should be successful this time around. I've found this to be an issue that eventually goes away on its own. Just deal with it a couple of times, and it will likely go away for you too. |
So is there a way to detect which version of ZFS a snapshot is from? I don't mind losing the old snapshots; I'm OK with only having relatively recent ones. But I'm on Debian Bullseye, so my ZFS version shouldn't have changed; I think it's 2.0.3. The pool itself appears to be from December 2021, on TrueNAS Core (FreeBSD). I'm not sure what version of ZFS they were using back then. If it helps, here's my syncoid script
|
I don't really think your description of the problem, or how to work around it, is accurate, @cyberpower678. More precisely, if someone who can reproduce this reliably wants to add a few debug prints the next time this happens to them, it's probably pretty straightforward to see why it's getting upset? I have a couple of wild guesses, but without a reproducer I can take home with me, I'm going to get to ask whoever to try things and see how it breaks. e: reading back, no, this is just the same known issue, probably. Go ask Debian to cherrypick #13782 if you want it to work, I think, or run 2.1.7 or newer. |
It’s accurate for me. When I did this, whatever was causing zfs receive to either corrupt the dataset or throw a panic was gone, and the subsequent receive was successful. After 3 more instances of this happening, and doing my tedious workaround, it stopped happening entirely. It's been almost a year since the last occurrence. |
@rincebrain may have a point that I may not be spot on here; I'm laying out what worked for me. The general idea is that if older snapshots made it to the destination just fine, you keep those. Only delete the snapshots that are failing to send/receive successfully. Identify what has changed since the last successful snapshot, roll back to the last working snapshot after copying the changed/added files out, and then completely replace said files (delete and copy back) in the dataset, and create a new snapshot based on the last successful one. Try sending that to the destination. If it still panics, then the only option for now is to simply create a new dataset on the source, move the files to the new dataset, delete the old one, and send the whole thing again. Really annoying, but most effective in the long run, in my experience. |
Or you could just apply the patch that fixes the error on the receiver. |
That would be a great solution. Too bad I'm not familiar with how ZFS works internally to actually fix it myself. :-( So, I just work around issues until they are fixed, or are no longer issues. |
I wasn't suggesting you write a fix. I was suggesting you use 2.1.7 or newer, which have this fixed, or get your distro du jour to apply the relevant fix. |
Hey so just want to make sure I'm understanding correctly, 2.1.7 fixes this right? Debian has that in the backports, along with a newer kernel. I can check that out. Would I have to upgrade my pool too? If there's any way I can get some debug print statements for you guys, I can provide that. The error seems pretty consistent. Always with the one dataset. |
Oh. Whoops. Now that I reread your initial comment, that does make more sense. lol That being said, I haven't dealt with this issue since I initially opened it here. My encrypted incrementals have been replicating without issue. |
You don't need to pull a newer kernel to use the newer version of ZFS; you can just pull the zfs packages from backports. You also shouldn't need to run "zpool upgrade" or otherwise tweak any feature flags. My understanding is that redoing the snapshot would essentially just reroll the dice on whether this triggered, because if I understand the bug correctly, in certain cases it wasn't recording that the dataset used large blocks before attempting to create a large block, so the point at which specific blocks got included in the send stream (plus variation in your local system) could affect whether this broke for you or not. So I would speculate it's conceivably possible for someone to write a zstream command that reordered when various things appear in the send stream so this didn't come up? I'm just spitballing though. |
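A sketch of pulling the newer packages from bullseye-backports, assuming the backports repository still needs to be added:

```sh
# Enable bullseye-backports (zfs-dkms lives in contrib) and install the newer ZFS.
echo "deb http://deb.debian.org/debian bullseye-backports main contrib" | \
    sudo tee /etc/apt/sources.list.d/bullseye-backports.list
sudo apt update
sudo apt install -t bullseye-backports zfs-dkms zfsutils-linux
```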
Pool upgrades shouldn't be needed, but it shouldn't hurt. Though I'm not sure about upgrading ZFS in your particular installation since it's TrueNAS. The latest TrueNAS SCALE is on 2.1.6; zfs --version reports zfs-2.1.6-1. |
Just wanted to report the same issue happened with 2.1.7 on the Debian 6.0.0 kernel. This is the same on the source and destination. The source and destination have an initial snapshot from 12/21/21. The source has many snapshots after that, as it's the primary server. The source was sending a snapshot from 1/30/23, and that's when the kernel panic happened on the destination. So you want me to delete the new snapshots and put the changed files back on top of the old snapshot? That sounds difficult.
|
Hey,
I'm not sure if Debian is going to include the later versions. Hopefully they do. Didn't that stack trace provide more information? I do have the blocksize set to greater than or equal to 1M on all my datasets. Thing is, all the snapshots from when I first created the pool should have had that block size. |
The bug that the first patch, at least, wants to fix is that it wasn't correctly setting the "we can use large blocks here" flag on the dataset in a timely fashion, so then attempting to create a large block on it would, naturally, fail. I would expect Debian to pull in 2.1.9 eventually, though I can't make any promises on what timetable. |
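If someone wants to test that spitballing, one hedged way to see where large blocks land in a stream without touching the receiver is to pipe the send through zstream; the dataset names are assumptions:

```sh
# -v prints each record: WRITE records include their lengths, and the BEGIN record
# lists the stream features (such as large blocks) that the stream claims.
zfs send -w -i pool/data@prev pool/data@new | zstream dump -v | less
```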
I ran into a similar issue replicating from TrueNAS Scale to FreeBSD. Out of nowhere, I had a presumably broken snapshot of the replicated dataset on the FreeBSD side. I was able to resolve the issue by destroying all snapshots of the dataset except for the broken one and by running:
zfs rollback <dataset@broken-snap> # <- rollback to the broken snapshot was the key for me
zfs destroy -r <dataset>
Afterwards, I just issued another replication run and TrueNAS was happily replicating my snapshots to FreeBSD, including the previously destroyed dataset. I know this GitHub issue is about Linux, but maybe it can help someone struggling for days trying to get their replication fixed without the possibility to update to ZFS 2.1.9. Note: before the rollback, I upgraded the OS to a newer version.
A little more info:
TrueNAS version:
TrueNAS zfs version: zfs-2.1.6-1
zfs-kmod-2.1.6-1
FreeBSD zfs version: zfs-2.1.4-FreeBSD_g52bad4f23
zfs-kmod-2.1.4-FreeBSD_g52bad4f23
FreeBSD crash info: # cat /var/crash/info.last
Dump header from device: /dev/vtbd0s1b
Architecture: amd64
Architecture Version: 2
Dump Length: 416559104
Blocksize: 512
Compression: none
Dumptime: 2023-03-08 21:40:01 +0100
Hostname: REDACTED
Magic: FreeBSD Kernel Dump
Version String: FreeBSD 13.1-RELEASE-p3 GENERIC
Panic String: Solaris(panic): zfs: adding existent segment to range tree (offset=276dff5000 size=1000)
Dump Parity: 3924248330
Bounds: 0
Dump Status: good |
Since FreeBSD 13 is running OpenZFS in the tree now, seems perfectly germane to mention. Did you try just the rollback and then receive again, out of curiosity, before destroying the rest? |
Ok cool, well I've destroyed it first before receiving, just to make sure the replication for this particular dataset starts from scratch. |
I ran into the same issue once more. This time I only ran the rollback. To find the broken snapshot, I used https://github.com/bahamas10/zfs-prune-snapshots and started from oldest to newest using the |
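A hedged sketch of walking snapshots oldest-first by hand, in case the helper script isn't available; dataset and snapshot names are assumptions:

```sh
# List snapshots oldest-first so they can be pruned one at a time.
zfs list -H -t snapshot -o name,creation -s creation pool/data
# Dry-run a destroy (-n) with verbose output (-v) before actually removing anything.
zfs destroy -nv pool/data@oldest-snap
```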
Having the same issue replicating a dataset from TrueNAS Scale to PBS.
System information (SENDER)
System information (RECEIVER)
Describe the problem you're observing
During an incremental receive, ZFS caused a panic. No system hang-up; only the dataset is busy.
Describe how to reproduce the problem
It happens when I replicate a new snapshot with a big change (~100 GB), but not always. I can't reliably reproduce it.
Include any warning/errors/backtraces from the system logs
|
OpenZFS 2.1.11 on both sender (TrueNAS) and receiver (Debian); same panic as the OP. So I have to scrap ZFS replication, as we cannot store the data unencrypted and it's too much effort to send all the data again. |
I would love to, but I worked around the issue by reloading the changed data, recreating a snapshot of it, and sending again, which worked around the panic. I haven't had a panic like this lately, but rather a large number of range-tree errors, which I recently fixed/worked around too. |
So right now, my systems are running fine without any panics or ZFS warnings. |
I should be able to try that sometime this week and report back; I've never built ZFS, but it doesn't seem too bad on Debian. I am fortunate in that ZFS on our backup machine is only used for backups, which are not working now anyway, so it doesn't matter if ZFS doesn't run should I mess something up. |
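A rough sketch of building a patched ZFS on Debian; the tag and commit are placeholders, and the full dependency list is in the OpenZFS "Building ZFS" documentation:

```sh
# Fetch the source, check out the release to patch, and apply the fix commit.
git clone https://github.com/openzfs/zfs && cd zfs
git checkout zfs-2.1.12            # placeholder: the release you want to patch
git cherry-pick <fix-commit>       # placeholder: the commit from the relevant PR
# Build and install (kernel module + userland).
sh autogen.sh && ./configure
make -s -j"$(nproc)"
sudo make install && sudo ldconfig && sudo depmod
```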
I applied said patch to current master and first wanted to roll back as I did previously, to have the same conditions. I typed:
zfs rollback -R Backup/Truenas/tank/Data@backup-20230626033001
And got:
kernel:[ 252.381901] PANIC: zfs: adding existent segment to range tree (offset=1e24c8c000 size=1000)
Before the patch, I could in fact roll back. zfs -V shows (if this is correct): zfs-2.2.99-1. Now I also cannot even zpool import the pool anymore, as that panics with the same message. Heck, I can't even boot, as on boot I get the same message. |
@sfatula My patch is supposed to affect only receive. I can't imagine how it could affect the rollback, especially in a way that would mess up the pool space map. I wonder if you have more than one problem, like bad RAM. Unfortunately there is no solution to already corrupted space maps other than a read-only pool import and data evacuation. |
I restored 2.1.11 and the machine boots fine; the pool works fine. Obviously I still can't zfs receive, but everything else works. Not sure what the issue might be then. Any other ideas? |
That’s a bug that exists for many, including myself. You can set ZFS tunables to turn those panics into warnings. The space map isn't corrupted on its own; rather, it's being corrupted by newer versions of ZFS operating on older datasets.
It's not related to your patch but rather something else in newer ZFS versions corrupting older ZFS datasets. This happened to me and I was able to recover from it. It tends to happen with datasets containing large numbers of files. |
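Presumably the tunable meant here is zfs_recover, which turns some otherwise-fatal assertions (including the "adding existent segment to range tree" one) into warnings; a hedged sketch, to be used with care:

```sh
# Flip the running module parameter (takes effect immediately).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
# Make it persistent across reboots.
echo "options zfs zfs_recover=1" | sudo tee /etc/modprobe.d/zfs-recover.conf
```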
Do you have any pointers to when this happened to you, and what you did? "Newer versions eat my data" is a pretty serious statement. |
I can't find any evidence it's ZFS 2.2, but I did a zdb -b and nothing was detected; no leaks, etc. were reported. Would it possibly make sense to try and apply the patch to a ZFS 2.1 version? Maybe I could retry then.
zdb -AAA -b Backup
Traversing all blocks to verify nothing leaked ...
loading concrete vdev 0, metaslab 245 of 246 ...
|
I applied the patch to zfs 2.1.12 (manually, of course) and I was able to get the send/recv to work, whereas it had repeatedly failed previously. Same snapshots, so it would appear that #15039 had a positive effect for me! Thanks @amotin |
@amotin I hope you can get the patch into a future version of TrueNAS SCALE. I just got the exact same error replicating to drives attached to the same machine, so source and destination are encrypted but on the same SCALE machine. So my current SCALE backup strategy isn't working, and there's no good way to apply the patch to SCALE's ZFS. |
@sfatula Yes, that is the plan: https://ixsystems.atlassian.net/browse/NAS-122583 . |
Am I to understand this issue only presents when using raw sends? i.e., if I decrypt and re-encrypt on a send/recv, the issue should not manifest? |
My PR fixes only raw send issues. Non-raw sends do not care about indirect block size and should not be affected. I am not sure the "re-encrypt" part actually works (though in theory I think it could), but if it does, then yes, it should work. |
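A hedged sketch of the decrypt-and-re-encrypt path being asked about (all names are assumptions): a non-raw send is decrypted on the sender, and the receiver re-encrypts it when the stream is received beneath an encrypted parent whose key is loaded:

```sh
# The destination parent is encrypted and its key is loaded;
# the received child inherits that encryption.
zfs load-key -r destpool/encrypted
zfs send -i srcpool/data@prev srcpool/data@new | zfs recv -u destpool/encrypted/data
```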
System information
Describe the problem you're observing
During an incremental receive, ZFS caused a panic and a system hangup.
Describe how to reproduce the problem
It happens randomly.
Include any warning/errors/backtraces from the system logs