Directories corrupted, commands hanging indefinitely #5346
Some suggestions for you to answer, to increase the chances of a reply to your issue:
- What distro are you running?

Just my $2.02...
Running Arch Linux, zfs-linux 0.6.5.8_4.8.4_1-1 (archzfs-linux), mirrored SanDisk Extreme Pro.
Created with:
@johnramsden
Oh, I think it should be
Yeah, mistyped. Any clue what may have caused this?
@johnramsden
As in I got the inode with
By the way, at this point I had to destroy the dataset; however, in order to figure out what was wrong I first sent the dataset to my FreeNAS box. So I'm not doing the debugging on Linux anymore, and I'm not sure if the commands are completely identical between FreeBSD and Linux.
@johnramsden
It got stuck.
@johnramsden how did you resolve this in the end? I think I'm having the same issue. BTW, I'm on FreeBSD, so if this is the same thing, the issue should probably be punted upstream.
@panta82 no resolution, I remade my pool. I tried looking for an OpenZFS bug tracker but was unable to find one. All I could find were links to the illumos bug tracker. Do you happen to know where the best place to open a new bug report would be?
Nope. I get the feeling the ZFS project is one of those mailing list + IRC operations instead of GitHub + Slack, if you know what I mean. So I guess I am rebuilding my pool. Or dropping ZFS altogether, don't know yet.
@johnramsden Have a look at the OpenZFS participate wiki page. I think you can send the bug/issue you may have discovered to the OpenZFS devs via email: developer@open-zfs.org. Don't forget to include the link to this page.
The author of this issue, @johnramsden, clearly had a corrupted fat ZAP. The underlying issue is how it happened, and the secondary question is how to work around the corruption.

As to the workaround, it should be possible to either roll back and/or destroy the dataset containing the corrupted directory, after, of course, preserving the data as necessary.

As to the cause, AFAIK there aren't any known issues that will cause this type of corruption in a fat ZAP. It does seem we could add some defensive code there.

I'd also suggest that anyone encountering this problem build their ZFS module with debugging enabled. There are a bunch of extra ASSERTs which will be enabled and might help narrow down the corruption.
@dweeezil If you are referring to a property being set, I'm just running defaults.
I haven't seen the problem since, but I'm on the lookout. In the meantime I have built my current setup to be very modular, so it's possible to destroy a dataset if this reoccurs.
I have the exact same issue on FreeBSD ZFS. The ZAP corruption is a zeroed zap_leaf_phys_t{}. If ZFS is compiled with debug, then the zeroed zap_leaf_phys_t{} will hit the assert in zap_get_leaf_byblk(). Without debug, readdir() will loop forever in zap_leaf_lookup_closest().
More details below from both kernels. It's easy enough to avoid the infinite loop in zap_leaf_lookup_closest().

Debug kernel details (hits assert() in zap_get_leaf_byblk()):

```
#12 assfail3 (a=, ...
```

From frame 13:

```
(kgdb) p *$l->l_dbuf
(kgdb) set $zlp = (zap_leaf_phys_t *)0xfffffe0015319000
(kgdb) x/4096xw 0xfffffe0015319000
```

From frame 14:

Non-debug kernel details (spins in zap_leaf_lookup_closest()):

```
#13 zap_leaf_lookup_closest (l=0xfffff801a1aa0500, h=0, cd=0, zeh=0xfffffe08b3a8f508)
```

Variable values from zap_leaf_lookup_closest():

Initial variable values:

Inside the for (chunk = zap_leaf_phys(l)->l_hash[lh]; ...) loop:
Nice find, @baukus.
Are you sure that your hardware is correct? Were there any power-loss issues? I mean, you use SSDs in your pool, and today's SSDs are not all trustworthy. Please check your devices against https://github.com/rkojedzinszky/zfsziltest. If that passes multiple times, your hardware can be trusted. If it fails with even one error, you should not use them any longer.
@rkojedzinszky: SSDs are not involved.
I'm not any closer to finding the root cause. After hacking zdb to neither assert() nor spin on the invalid directory, I get the following, which shows that there are no DVAs allocated for offsets 0x4000 to 0xc000 (the first ZAP leaf block is at offset 0x4000):
Hmm, what exactly does a hole in the ZAP directory "file" do?
There are two reasons:
1) Finding a znode's path is slower than printing any other znode information at verbosity < 5.
2) On a corrupted pool like the one mentioned below, zdb will crash when it tries to determine the znode's path. But with this patch, zdb can still extract useful information from such pools.
openzfs/zfs#5346
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I believe the issue was actually due to corrupted memory. I ended up finding two faulty DIMMs (on a non-ECC desktop system) after running memtest. After replacing my bad memory I haven't had any corruption issues.
@johnramsden Awesome to hear you managed to fix your issues and thanks for checking out the issue so long after it was created :) |
I'm having an issue that, from what I can tell, is related to ZFS; I could be wrong though.

I have a directory that I'm unable to run `ls` on, or rather `ls` never returns. When I run `ls` on the directory with `strace` I get the following; I'm not sure if this is of any help.

I also had a seemingly related problem a few months ago that I "resolved" by renaming a directory and forgetting about it. The problem that occurred last time was that I was unable to delete a directory that seemed to be corrupted; I would end up with `rm -rf` just not returning.

I was able to do the same with this directory by renaming it, but again I was unable to delete it. When I attempt to, the command just hangs. Running `strace` on it I get:

I ran `perf top` and it shows `zap_leaf_lookup_closest` at the top with 90%+ overhead.