PANIC: zfs: adding existent segment to range tree #15030
Got about 12 hours out of it before it got wonky and panicked again. The first few entries seem to be related to
The ZFS panic does not show up until a bit later:
|
I was having similar symptoms on a TrueNAS Scale system. My backup pool of 2 disks got corrupted somehow. Scrubs are clean, but any time I push a particular dataset to it, it panics. I was able to reproduce the panic by manually deleting one snapshot from the dataset. Export and reimport work, but deleting snapshots immediately panics the system and requires a hard reset, since a normal shutdown never completes after a ZFS panic. Since the backup pool is somewhat important to me, I enabled |
I left it running for about a week with ... and I've had no more ... I assume this is from ZFS constantly rewriting spacemaps during the condensing operation. Running the pool in recovery mode for a while eventually removed the duplicated range descriptions during the rewrite operations... or something like that. |
I'm also seeing this on a pool using kernel 6.5-rc6. Here is my zdb output: https://dpaste.com/DL9KS6SQ5 I've set boot variables for |
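The specific boot variables aren't quoted above. As a sketch only, assuming the recovery tunables most often referenced alongside this panic (zfs_recover and zil_replay_disable are real OpenZFS module parameters, but whether they are the ones meant here is an assumption), persistent module options on Linux usually look like this:

```sh
# Assumed tunables; not necessarily the ones the commenter set.
# Equivalent kernel-command-line form: zfs.zfs_recover=1 zfs.zil_replay_disable=1
cat <<'EOF' | sudo tee /etc/modprobe.d/zfs-recovery.conf
options zfs zfs_recover=1
options zfs zil_replay_disable=1
EOF
sudo update-initramfs -u   # or `dracut -f`, depending on the distro
```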
@reefland Is there a way to force spacemap condensing to happen faster to try to resolve this? |
@satmandu - sorry, no idea. No clue how to monitor the spacemap condensing operation - trial and error. Hopefully someone more knowledgeable can answer.
|
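No supported way to force condensing is given in the thread, so this is speculation only: the OpenZFS module parameter zfs_condense_pct controls how bloated an on-disk space map must be relative to its optimal size before it gets condensed, so lowering it should make condensing kick in sooner, at the cost of extra metadata writes.

```sh
# Speculative knob, not a confirmed fix: default is 200 (condense once the
# on-disk space map is roughly 2x its optimal size); lower values condense sooner.
cat /sys/module/zfs/parameters/zfs_condense_pct
echo 150 | sudo tee /sys/module/zfs/parameters/zfs_condense_pct
```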
Thanks. My pool has clean scrubs, but my system still panics if I mount the pool without the special zfs options... but I also haven't had the pool mounted with the special options for more than 16 hours... |
If rewriting the spacemaps removes the assertion, that might suggest you still have lurking excitement eventually, since in theory that should be equally upset by duplicate range insertion, no? My wild guess would be when it threw out the log after you imported the pool and unlocked the datasets with replaying disabled, it no longer had a pending insertion that was invalid. But that's just a wild guess. |
I hit this again... and had to set |
I've had this multiple times on different systems (it has not recurred on the same system). It's just a process I follow now to recover the system. At least it's a workaround. I'm no closer to finding a cause / pattern. |
I destroyed a ZFS dataset (which may have had this issue, because that's where intense writes from Docker were happening) and still have the issue when importing the parent pool. I will copy all the data to a fresh dataset and recreate the old one. If I delete all the data and the pool still has these issues, could someone look at an image of the disk to help figure out how to repair it, even if figuring out how this damage is being created isn't immediately resolved? Would that be useful at all? |
@satmandu - I can't comment on your disk image question. But I've had to destroy a number of ZVOLs (extract data, delete ZVOL, recreate, restore, etc) related to this as well. I've since switched all ZVOLs to forced sync writes. Yeah, performance impact, but I haven't had any ZVOLs get corrupted since. I was frequently getting this after a ZFS panic:
Multiple scrubs were not able to fix it. I'm using CRI / K3s instead of Docker, but we're probably using ZVOLs for similar reasons. |
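The exact commands used above aren't shown; forcing synchronous writes on a zvol generally looks like the following, with placeholder pool/volume names:

```sh
# Placeholder names; forces every write to this zvol to be synchronous.
zfs set sync=always tank/k3s/pvc-example
# Confirm the setting and where it comes from.
zfs get sync tank/k3s/pvc-example
```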
I wonder if we can create a test case of amplifying read/writes to replicate the issue. Or maybe if there is an existing test case we should be trying on our hardware? This sounds like it should be replicable in a test environment where one is reaching the I/O limits of a disk subsystem or a disk. |
I've also encountered this when moving an NVMe drive to another system; I don't know what I did wrong. The only thing I can think of is that I inserted it into the new system while the PSU was plugged in but the PC was off. Forgot to unplug it. I've also noticed I/O errors; I don't think the drive is faulty, but I have not tested it yet. The computer passed memtest. |
I've been encountering this on 2.3.0-rc2 after deleting large files (4+GB each) from a pool with dedup active.
After it occurs,
These continue indefinitely and are only resolved by a reboot, after which a scrub works as normal and doesn't find any problems. Seeing as how a txg sync never happened, though, the pool may be consistent, but that suggests data loss, no? Other pool activity appears to continue while those hung tasks remain hung forever. If it helps at all, there are SHA512 and BLAKE3 ZAPs on that pool, and they're quite big for the pool, so I'm trying to get rid of the old SHA512 data so a big chunk can go away. It was twice as large before I got rid of the SHA256 ZAP that was also there from even earlier. Here's a
|
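For anyone in a similar spot wanting to see how large the dedup tables actually are before pruning them, these read-only commands summarize the DDT (the pool name is a placeholder):

```sh
# "tank" is a placeholder pool name; both commands are read-only.
zpool status -D tank   # DDT entry counts plus on-disk/in-core size summary
zdb -DD tank           # per-checksum-algorithm DDT histograms (-DDD for more)
```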
Confirmed that the delete of two CentOS ISOs (11GB each) that caused the most recent instance of this did not actually get committed to the pool, so must have been in that hung txg. Upon further testing, it happens when trying to remove one of those files, in particular. Reboot after the hang and the files are back. Removed the "safe" one and that was fine. Ran a zpool sync. No problem. Remove the other one, and it immediately panicked exactly as above, and that txg never got committed. Reboot after that, and the problem file is back again. Deleted it, and the same thing happened again. So I can recreate this panic on demand, with next to no load on the system at all. Anything y'all want me to look at specifically? |
Also note that the file appears to be gone after deleting it, even though that txg is uncommitted. It only reappears after reboot and import of the pool. |
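For anyone trying to confirm the same on-demand reproduction, a minimal sketch follows; the pool and file names are placeholders, and since the deletion is what triggered the panic here, only try it on a pool you can afford to hard-reset:

```sh
# Placeholders throughout; mirrors the sequence described above.
rm /tank/isos/CentOS-Stream-9.iso    # the "problem" file
zpool sync tank                      # ask for a txg sync
dmesg | tail -n 50                   # look for "adding existent segment to range tree"
```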
This reliably triggers for me if I try to delete a specific snapshot. Any ideas on how to fix this? I'll try doing a scrub but from the above, it seems like that won't help.
|
Unfortunately, a scrub didn't help. The scrub shows no errors, but the issue persists. |
In case this is useful:
This pool seems to be beyond recovery, so I'll try destroying and rebuilding it. |
Not only does a scrub not find it (since the checksums are "correct")... it can also destroy data from older snapshots, apparently, as well as cause snapshots to be unremovable, file systems that can't be unmounted, and all sorts of other weird stuff. It's very hard to trace the cause, as there may be multiple bugs at play, with multiple different as well as related causes. I do have an easier time eventually reproducing the bug on test pools that have dedup enabled and have any one or more of the following situations on top of it:
And other similar situations.

In short, it seems, at least from my unscientific observations (since I haven't been able to explicitly pinpoint the origin of the failure), that there may be some code in the fast dedup that isn't accounting for the possibility of different compression/checksum/encryption algos, record sizes, etc. Perhaps it's some sort of off-by-one or other length-related miscalculation? I'm of course totally spitballing there.

Also, I've anecdotally encountered this more often when blake3 is the checksum/dedup algo in use. Skein and the SHA variants, in almost the same number of test runs, have had it happen less, but the set of test data that's based on is small enough that it's not THAT strong of an indicator... I just happened to also notice a few other bugs involving blake3, so it seemed worth mentioning.

The test suite being run is not very rigorous, and I have only been able to reproduce the issue without a forced ungraceful halt/panic once, in dedicated testing, among dozens of test batteries that otherwise involved hard-stopping a VM, killing a network connection, or other ways of simulating actual hard system failures.

The thing is, it is similar to behavior I've encountered for years with dedup on ZFS, since the 0.x days, but which has until now usually required some combination of time, high load, and I don't know what other triggers, because it's had a delayed onset of seemingly as much as a year from the time the data was originally written. Or at least I think it's delayed onset, given the data that has been destroyed by the failures. I suppose it could be happening later on, but again there's been no way I'm familiar with to trace its origin. I've NEVER been able to trace it deterministically to more than dedup, of some variety, being in use, and something causing one or more of the txg commit queue tasks to hard lock up. The above list is just me trying to filter out trends...

If there's anything specific I can do to provide helpful data on this, I'm all ears! |
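A rough sketch of the kind of throwaway test setup described above (file-backed vdev, dedup enabled, mixed checksum and recordsize settings); every name and size here is invented, and this is not a confirmed reproducer:

```sh
# Illustrative only; not a confirmed reproducer for this panic.
truncate -s 8G /var/tmp/dduptest.img
sudo zpool create dduptest /var/tmp/dduptest.img

# Datasets with differing checksum/recordsize combinations, dedup on.
sudo zfs create -o dedup=on -o checksum=blake3 -o recordsize=16K dduptest/blake3-16k
sudo zfs create -o dedup=on -o checksum=sha512 -o recordsize=1M  dduptest/sha512-1m

# Then drive bursty, partly synchronous I/O at the datasets (database restores,
# index rebuilds, etc.) and watch dmesg for the range-tree panic.
```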
With this issue, considering it's a silent corruption thing, that is probably the best course of action anyway. But there's no guarantee you won't hit it again, if using the same configuration for everything. 🤷♂️ |
Yikes. I didn't leave the pool around long enough to find out, I destroyed and recreated from backup when I saw the scrub didn't help. My pool is a very standard configuration, essentially the defaults with encryption. I've not changed anything else or enabled deduplication or changed the record size. For now, I'm running |
@Neurrone Could you give us the version of ZFS you encountered this on? What Linux (or BSD) distribution, and how old the zpool was (i.e., roughly what versions of ZFS the pool was used with)? It would be helpful for understanding if this is some old Ubuntu-specific bug from the 2.1 era, or if this is definitely due to a modern ZFS problem. |
FWIW, I've encountered the specific behavior of the posted issue on:
|
@dodexahedron Has this only affected encrypted datasets? And are these datasets new, or do they "go way back", perhaps being products of ...

Also, the ZFS kernel version (NOT just the userspace version) is very helpful for this kind of triage. You can get that with ...

I think that encryption may not be seen as a priority by the big players in the ZFS development space. Issues that only affect encrypted datasets probably do not reflect on the overall status of the rest of the project (or at least I have not seen any such pernicious problems outside encryption). |
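The exact command meant above was cut off; on Linux the kernel-side version can be read like this:

```sh
# Two ways to see the kernel module's ZFS version (not just the userland tools):
zfs version                   # prints both zfs-x.y.z and zfs-kmod-x.y.z
cat /sys/module/zfs/version   # version string of the loaded kernel module
```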
No, I honestly don't think encryption has anything at all to do with it, in the end. Long-winded details interspersed with direct answers to your questions, though:

The last system to encounter this stack trace was running ...

I have not yet seen this stack trace on test machines with rc5 and otherwise identical setups, but I do still consistently see the other bad behavior noted earlier.

Pools on the test machines are never from sends/receives, for these. They are always original pools, and are pretty young by the time they hit this on the test systems. However, a much older pool that also was never received from elsewhere had thrown this stack trace as well, plus the data loss and pool dysfunction noted, which was what started me looking into it and testing with this stack trace in mind, specifically. But it non-deterministically occurs along with other dedup-related issues, which I suspect are probably closely related anyway; this is just a new symptom from a new feature. They're lockups too, but with different stack traces. So it seems like race conditions/concurrency issues, by my guess, and I have a hunch this is the result of damage caused by the other problems in the first place.

I can somewhat reliably recreate them on brand new pools now by running certain workloads on top of them, only ever happening when dedup is in the picture, with very bursty IOP load and a high-ish amount of sync IO from the consumers (iSCSI and NFS serving up VMDKs for VMs on ESXi hosts). The time it takes to hit a problem is variable, but it does usually happen at some point in the next day or two - sometimes much quicker - with the same workload and configuration. The easiest way I've been able to make it happen is to throw a large MSSQL database on top of it and then do something like rebuild all indexes on a 20GB table, which definitely causes a ton of IO, a lot of it sync.

While I have had encrypted datasets hit this problem, I have not bothered to keep testing with encryption, since it also happens without it just as easily. So I think encryption is a red herring here as well. Or, if anything, it's just exacerbating things when it's also in play.

Once a lockup occurs, I've tried to revive the system in various ways, including the most recent attempt, which was writing a 1 to spl_taskq_kick. But that doesn't even result in any changes to any threads - no other visible changes in the state of the machine at all, in fact - so something is definitely hard locked up. It always requires a hard power cycle to successfully reboot.

However, I should note that the extent of the data destruction seems to be significantly less (or at least as far as I can tell) with rc5, but there's still loss of anything written from the time of the lockup forward, of course. And that doesn't even get replayed; I'm thinking it never actually got written to the ZIL in the first place. Upon reboot, it just imports the pool like nothing happened and life goes on until the missing data is noticed (which, with iSCSI on top of zvols, is sometimes easier to notice, given ZFS has no clue whether what was handed to it is valid at a higher level). |
Oh also, no other errors of any kind are indicated by the system, the hardware, or ZFS leading up to this stuff. The first indication will be the stuck IO, and then upon investigation, the hung task once it actually happens, along with inability to run pretty much all zfs/zpool commands, inability to unload the zfs module, inability to even stop other modules that have indirect hard dependencies on it (like scst, serving up those zvols), and other issues stemming from having storage locked up in a way that the rest of the system can't really see. |
@aerusso The pool was created on TrueNAS Scale 24.04 with ZFS 2.2.4-2 in May 2024. I migrated to Proxmox 8.3 and reimported this pool in Dec 2024. In Proxmox, the version of ZFS is zfs-2.2.6-pve1 and zfs-kmod-2.2.6-pve1. |
Would it be helpful to consider another angle of attack: allowing for better recovery if this corruption happens, as suggested in #13995? |
After I recreated the pool that got corrupted, it's happening again, but I have advance warning of it this time because of the script I run daily to check for metaslab corruption. I'm at my wits' end about what else to try, since disabling encryption wouldn't help, judging from the above comments. I'd be happy to invest the time to work with ZFS devs to get to the bottom of this if it is prioritized. Otherwise, I'll be migrating everything off ZFS to BTRFS, since I can't live in fear of my data being corrupted every month, plus the inconvenience of the downtime from recreating the pool from backups, with no fix in sight.
The pool is a mirror of Seagate Exos X20 20TB HDDs with encryption enabled and all other settings unchanged from defaults. No dedup is being used. It is storing bulk media and is a backup target for my VMs. |
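The exact check the daily script performs isn't quoted in this thread. As an illustration only, a read-only zdb sweep wrapped for cron might look like this; the pool name and the choice of flags are assumptions:

```sh
#!/bin/sh
# Illustrative only: pool name and zdb flags are assumptions, not the actual
# script referenced above. "zdb -b" does a read-only block accounting pass
# that walks the pool's space maps and exits non-zero if it finds problems.
POOL=tank
LOG=/var/log/zdb-check.log
if ! zdb -b "$POOL" > "$LOG" 2>&1; then
    logger -t zfs-check "zdb -b failed for pool $POOL; see $LOG"
    exit 1
fi
```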
I noticed the pool has the block cloning feature enabled, but the tunable is set to off by default in Proxmox, so block cloning shouldn't be happening:

> cat /sys/module/zfs/parameters/zfs_bclone_enabled
0

So whatever this is shouldn't involve block cloning. |
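To double-check that no cloned blocks were ever created (rather than just that the tunable is off now), the block-cloning pool properties can be queried; the pool name is a placeholder:

```sh
# "tank" is a placeholder; bcloneused/bclonesaved of 0 means no blocks were
# ever cloned on this pool (properties exist on pools with the block_cloning
# feature, OpenZFS 2.2+).
zpool get bcloneused,bclonesaved,bcloneratio tank
```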
I haven't run into any panics, but running
I hope I don't have any corruption! I have encryption enabled, I regularly create and cull snapshots with Sanoid, and I do NOT have dedup enabled. I haven't run into the bug that OP has. I am hoping perhaps it's a missed case in zdb rather than corruption. zfs-2.2.7-2 |
All of my pools have this issue. Including one that I just created a week ago. This is across 3 different computers. WTH is going on here? |
same here (FreeBSD 14.2-RELEASE):
|
I don't know if that assert failing is definite confirmation of corruption; it seemed to be for my pool when it crashed last month. Hoping a ZFS dev can chime in to confirm either way. |
FWIW: I'm on FreeBSD 13.3-RELEASE-p2. Pool has been the same since... FreeBSD 9? maybe.
Features:
|
Hi to the people who have arrived via @Neurrone's blog post! Please be aware that ... If your pool is running fine, then there's almost certainly no problem.
System information
Describe the problem you're observing
Randomly (not at boot or pool import), the system shows panic messages (see below) on the console. The system seemed to be operational. Upon trying to reboot I was unable to get a clean shutdown. I had several Failed unmounting messages with references to /root and /var/log, along with a failed start of the Journal Service. Eventually it hung at systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID ... and I had to kill the power after waiting ~20 minutes.

Upon reboot, everything seems fine. No issues with pool import, and zpool status was good. After an hour or so I would see the first panic message. Usually within 24 to 48 hours I had to reboot again; it would not be a clean shutdown and required a manual power off.

The PANIC messages I have are very close to #13483; however, that discussion appears to be about pool import issues. I'm not having any issues with pool import, so I'm documenting a new issue, as the panic seems to have a different cause.
Include any warning/errors/backtraces from the system logs
NOTE: System is using ZFS Boot Menu, so this is ZFS on Root, but does not require the commonly used separate BOOT and ROOT pools. This method allows for a single ZFS pool (with encryption).
After reboot, status is good:
I tried to issue a sudo zpool scrub k3s06 and it hung (did not return to the prompt). I opened another SSH session; zpool status did not indicate a scrub was in progress. I checked the console and it had panic messages scrolling on the screen.

I did another unclean shutdown and rebooted, then tried some of the suggestions within #13483.
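The exact settings taken from #13483 aren't shown above. A sketch of how such tunables are typically flipped at runtime on Linux, assuming the pair usually referenced there (zfs_recover, which downgrades this class of panic to a warning, and zil_replay_disable); both revert on reboot:

```sh
# Assumed tunables, based on the #13483 discussion referenced above.
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
echo 1 | sudo tee /sys/module/zfs/parameters/zil_replay_disable
```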
Then I issued another scrub, which completed successfully:
As expected from the tunables, panic messages now became warning messages:
This seemed to run clean as well:
This time I was able to get a clean shutdown. Rebooted without the tunables.
The system has now been running 3+ hours under normal load; no panic messages have been seen yet. Will keep monitoring.