ZFS kernel panic VERIFY3(sa.sa_magic == SA_MAGIC) in zpl_get_file_info() on Ubuntu 22.04 #16276
Comments
I am curious whether you're using Lustre or not, since one of the related reports seemed to be Lustre calling the interface wrong. (I'm assuming not, or you'd have mentioned it, but it seemed like a reasonable question to be explicit about.) This reminds me of a similar illumos bug. Those SA_MAGIC values being similarly wild makes me wonder if it picked up the same sort of wild value from somewhere, e.g. a buffer it shouldn't have read, wrote it out happily, and now reading it back is going "oh no".

I suppose, in theory, if this is reading quota metadata and it's calculated quota data, we could do the same thing I added in the native-encryption failure-to-decrypt-quota-metadata case and just trigger regenerating it. But that stack trace suggests it's finding an insane SA, which I don't think the quota code uses directly; it just reads it to try and figure out how much space it's using, I imagine.

So it would probably be useful to backport the changes to ASSERTs from the future that let you print custom info when they trigger, and then make it print the object it's on when it panics like that, so that you can then go poking around with zdb. Alternately, one imagines asking zdb to walk the relevant dataset will panic in a similar way if you use flags that make zdb attempt similar things, assuming it's an actual on-disk problem and not an in-memory bug; seeing which object id it's complaining about would be informative either way. If it's a dnode update race and Ubuntu hasn't cherry-picked the relevant fixes, some of the fixes around that might be useful.
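To make the "print custom info when the assertion trips" idea concrete, here is a minimal user-space sketch, assuming nothing about the real OpenZFS macros: the macro, struct, magic value, and object number below are all placeholders.

```c
/*
 * Minimal user-space sketch of an assertion that prints caller-supplied
 * context.  VERIFY_MSG, struct sa_hdr, the SA_MAGIC value, and the object
 * number are all placeholders, not the real OpenZFS definitions.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define	SA_MAGIC	0x2F505A	/* placeholder for the real constant */

/* A VERIFY-style macro that also prints a printf-style message. */
#define	VERIFY_MSG(cond, fmt, ...)					\
	do {								\
		if (!(cond)) {						\
			fprintf(stderr, "VERIFY(%s) failed: " fmt "\n",	\
			    #cond, __VA_ARGS__);			\
			abort();					\
		}							\
	} while (0)

struct sa_hdr {				/* simplified stand-in */
	uint32_t sa_magic;
};

static void
check_sa(const struct sa_hdr *sa, uint64_t object)
{
	/* Report the object number alongside the bad magic value. */
	VERIFY_MSG(sa->sa_magic == SA_MAGIC,
	    "object %llu has sa_magic 0x%x",
	    (unsigned long long)object, sa->sa_magic);
}

int
main(void)
{
	struct sa_hdr bad = { .sa_magic = 1446876386 };	/* value from this report */

	check_sa(&bad, 12345);		/* hypothetical object number */
	return (0);
}
```

The point is only that carrying a format string and an object number to the VERIFY site turns an opaque panic into something you can feed to zdb.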
We're not using Lustre; it's all straight NFS (originally v3, now mostly v4). We routinely ask ZFS for all user quota information, so I think the raw quota information must be good. Based on the code, it looks like the object id isn't directly available at the VERIFY3.

The infrequency of this panic is puzzling to me. Regardless of whether this is on-disk corruption or (consistent) in-memory corruption, almost all of the ZFS filesystem activity is NFS activity, and NFS clients should retry any operation that fails because the fileserver crashes and restarts, which should trigger the panic again. I doubt Ubuntu 22.04 has cherry-picked very much into their ZFS, and I can't spot signs of it in their confusing git repository of ZFS.
They've written their own data-loss-inducing patches before, so I wouldn't recommend trusting them not to break things, in general. For debugging, I'd probably smuggle the object id in through one of the fields of the structure that is passed down to that code.
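As a rough sketch of that smuggling idea: the caller stashes the object number in a spare field of the info structure it already passes in, so the callee can report it on failure. The struct, field names, and magic value below are invented for illustration and are not the real OpenZFS types.

```c
/*
 * Sketch of carrying the object number to the callback through a spare
 * field of the info struct.  struct file_info, its fields, and the magic
 * value are invented for illustration, not the real OpenZFS definitions.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct file_info {
	uint64_t fi_uid;
	uint64_t fi_gid;
	uint64_t fi_debug_object;	/* spare field used only for debugging */
};

static int
get_file_info(const void *bonus_data, struct file_info *fi)
{
	uint32_t magic;

	memcpy(&magic, bonus_data, sizeof (magic));
	if (magic != 0x2F505AU) {	/* assumed magic constant */
		/* The object number came along for the ride. */
		fprintf(stderr, "bad magic 0x%x on object %llu\n",
		    magic, (unsigned long long)fi->fi_debug_object);
		return (-1);
	}
	return (0);
}

int
main(void)
{
	uint32_t bogus_bonus = 1682687197;	/* one of the observed bad values */
	struct file_info fi = { .fi_uid = 0, .fi_gid = 0,
	    .fi_debug_object = 4242 };		/* hypothetical object number */

	return (get_file_info(&bogus_bonus, &fi) == 0 ? 0 : 1);
}
```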
If I'm reading the code correctly, using zdb to search for this requires using …
If zdb doesn't panic, or the object isn't the same every time this breaks, my guess would be that something is mangling memory (and panicking before it makes it out to disk). You could try a kASAN kernel and see if something is reaching out and mangling memory it shouldn't be, potentially. I don't recall or see many changes that might be immediately, obviously relevant, though I wouldn't be astonished if this is somehow a race in dnode_sync. You could also try 2.1.15 or 2.2.4 and see if they play any better here, potentially.
Here is what I think I see about what is happening in the code, to try to keep track of this.

One possible theory is that there is a bug in the attribute update code, one that probably only triggers some of the time (perhaps depending on what else is already in memory); a sketch of the check that such a bug would trip is below.
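For reference, here is a simplified user-space paraphrase of what the check around this VERIFY3 appears to do, based on my reading of the code. I'm assuming it tolerates a byte-swapped but otherwise valid header and panics on anything else; the constant and values below are stand-ins, not the real definitions.

```c
/*
 * Simplified user-space paraphrase of the magic check this panic trips.
 * SA_MAGIC_GUESS is a stand-in constant; the byteswap tolerance is my
 * reading of the code, not a verified quote of it.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define	SA_MAGIC_GUESS	0x2F505AU

static uint32_t
bswap32(uint32_t x)
{
	return ((x >> 24) | ((x >> 8) & 0xff00) |
	    ((x << 8) & 0xff0000) | (x << 24));
}

static void
check_bonus_magic(uint32_t magic_from_bonus)
{
	uint32_t magic = magic_from_bonus;

	/* A byte-swapped but otherwise valid header is tolerated... */
	if (magic == bswap32(SA_MAGIC_GUESS))
		magic = SA_MAGIC_GUESS;

	/* ...but stray data from some other buffer trips the VERIFY. */
	if (magic != SA_MAGIC_GUESS) {
		fprintf(stderr,
		    "VERIFY3(sa.sa_magic == SA_MAGIC) failed (0x%x)\n",
		    magic_from_bonus);
		abort();
	}
}

int
main(void)
{
	check_bonus_magic(1682672016);	/* one of the observed bad values */
	return (0);
}
```

If the attribute update path ever wrote stray data from some other buffer into the SA header, this is exactly the check that would notice it later, which would fit the wild sa_magic values seen in the panics.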
Can you try to put together a reproducer? I'd look into this if I can reproduce it locally.
Unfortunately we cannot reproduce this issue under any sort of
controlled conditions (and we've tried). This panic happens only rarely
and is extremely unpredictable. We don't know the trigger conditions,
but they're clearly quite rare, with only a small number of crashes ever
happening and generally months between crashes (I think we haven't had
any since I filed this report).
Mhm, got it... would it be possible for you to try current git master, to rule out the possibility of this already being fixed there?
These servers are our production ZFS fileservers, so we're not in a position to run current git master on them in order to test things.
Ok :) One last question: when you say you tried to figure out a reproducer, did you also try creating the test pool with ZoL 0.6.x or the original OI/OpenSolaris?
All of the pools involved in the crashes are new, only the filesystems themselves are old. Authentic versions of test filesystems probably can't be created now except by pre-2012 or so Solaris installs, and we don't have any OI/OpenSolaris machines any more to even start trying that. We did not try to deliberately make version 4 filesystems, upgrade them, and then test them, since we couldn't find any way to reproduce the crashes on our existing authentic filesystems upgraded from version 4 to version 5. We did try poking at existing filesystems in various ways that did not cause crashes.
System information
Describe the problem you're observing
We operate a number of ZFS fileservers exporting filesystems via NFS; these are the descendants of what were originally Solaris fileservers set up in 2008, with filesystems moved from generation to generation with 'zfs send | zfs recv'. As a result, we wound up with some filesystems that were still on ZFS filesystem version 4. In early December of 2023, we did 'zfs upgrade' to bring all of those to ZFS filesystem version 5. Since then, two fileservers have experienced a ZFS kernel panic due to an assertion failure, one of them once and the other twice. These panics can't be provoked by simply reading the filesystems (we do this regularly for backups, and we did it for testing). We are not using encryption.
It's possible that this is related to #12659, #13144, and/or #13937; however, our situation seems somewhat different from all of these (e.g., the lack of encryption).
Include any warning/errors/backtraces from the system logs
This is from our most recent panic and crash (and automated reboot, because we set ZFS panics to be kernel panics and we reboot on panics), which happened for the second time on the same fileserver. Past panics have had different values for the actual sa.sa_magic field: 1446876386, 1682687197, and 1682672016. The stack traces are otherwise consistent with each other (well, from zpl_get_file_info() backward).