CPU stuck when getdents gets run on some directories #4583
Comments
@Marlinc It would appear you've got a corrupted fat zap but the question is how it happened and whether the zap iterator code could handle the situation better. It would appear you've got a directory named "upstart" which is corrupted. If you could find its inode number (with |
@dweeezil This is on my laptop, so something might have gone wrong during a shutdown; I'm not sure. The command is still running, but this is the result in the meantime.
I'll update it when it's done. I changed the owner of the directory to root so that no automated tools running as my user can access the directory and cause my laptop to hang. |
@Marlinc It sounds like zdb is spinning the same way. |
@dweeezil looks like it, it hasn't stopped yet. Are you on IRC by any chance? |
@Marlinc I think in the meantime you're going to have to settle for renaming the directory and living with potential space leakage. You should be able to send/receive the dataset and then destroy the original one if you wanted to eliminate the corruption. That said, it would be interesting to get a better handle on the type of corruption which occurred here. Since it appears to be a normal directory, there shouldn't be a way for a crash to leave it in a corrupted state, unless the SSD lies about its flushes. If you wanted to pursue this further, the next easy thing to try would be to compile a debugging build of ZoL, without even installing it, and then run zdb with asserts enabled to see whether any of them are tripped. |
@dweeezil I'm currently compiling ZFS in a container. Should I enable any extra options while compiling? |
Is there a way to run the userland ZFS commands from within the container? That way I don't have to install them on the host. |
@Marlinc The userland utilities can be run on the host directly from the build area. You can, for example, run zdb as |
@dweeezil I guess I'll have to compile on the host then. I didn't want to install all of the required tools on my host OS. I'll make a clone and do it in that instead. |
@dweeezil okay, I compiled it. What now? |
@Marlinc I was interested as to whether a debug build would help isolate the corruption by tripping an ASSERT. You need to configure with |
@dweeezil It's unfortunately still hanging. |
@dweeezil that didn't do anything either. Don't I have to reload the kernel module for that? Or is it purely user space tools? Anyway, same result as before. I cleaned the build and applied the patch. |
@Marlinc I wasn't able to look into this any more during the week. A couple of patches in the "zdb" branch of my repo might be handy. First would be e2482cd, which with a single "-z" option suppresses the full decode of a zap and with "-zz" suppresses any interpretation of the zap. If you tried with simply a "-z", we might be able to get some interesting stats about the zap. Next is 0959207, which adds "-a" to dump a megabyte of raw data from a zap. This makes it possible to grab a copy of the zap in a file for off-line analysis. I'm not suggesting this yet. Does your system have ECC memory? I have a feeling there was a bit-flip in either a zap leaf header (probably lh_prefix or lh_prefix_len) or in a leaf entry (probably le_cd). Something is causing the iteration to loop infinitely. We could work up a patch to break the loop if necessary, but since your goal is to remove the directory, you'd wind up with leaked space. If you do have ECC memory, then there's potentially a bug here which merits further investigation. Have you tried to |
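As an illustration of what "a patch to break the loop" could mean in principle, here is a minimal Python sketch of an iterator that refuses to revisit an entry. It is a toy model only: the `chain` dictionary, `iterate_entries`, and `CorruptChainError` are hypothetical names invented for this example, and nothing here reflects the real zap on-disk format or the ZFS iterator code.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class CorruptChainError(Exception):
    """Raised when the cursor revisits an entry, meaning the chain loops."""


def iterate_entries(chain: Dict[int, Optional[int]], start: int) -> Iterator[int]:
    """Yield entry ids by following `next` links, refusing to loop forever.

    `chain` maps an entry id to the id of the next entry (None ends the chain).
    Toy stand-in for an on-disk structure; NOT the ZFS zap format.
    """
    seen = set()
    cursor: Optional[int] = start
    while cursor is not None:
        if cursor in seen:
            # A healthy chain never revisits an entry; stop instead of spinning.
            raise CorruptChainError(f"cursor {cursor} visited twice")
        seen.add(cursor)
        yield cursor
        cursor = chain[cursor]


# Hypothetical data: in `corrupt`, a damaged link makes entry 3 point back to 2.
healthy = {1: 2, 2: 3, 3: None}
corrupt = {1: 2, 2: 3, 3: 2}

healthy_result = list(iterate_entries(healthy, 1))


def walk_corrupt() -> Tuple[List[int], Optional[str]]:
    """Walk the corrupt chain, returning what was read before the guard fired."""
    visited: List[int] = []
    try:
        for entry in iterate_entries(corrupt, 1):
            visited.append(entry)
    except CorruptChainError as exc:
        return visited, str(exc)
    return visited, None
```

A guard like this only turns a hang into an error. It does not repair the structure, which matches the caveat above that breaking the loop would still leave leaked space behind.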
My server at home just ran into the exact same issue. |
I just hit this problem on Arch Linux running the zfsonlinux stable release. Here is the output from strace where I tried rm -rf:
If I have snapshots, should I be able to fix this with a rollback? Or do I need to zfs send/receive a backup off my server? I was able to rename the folder from
|
This is an example strace of rm on one of these directories:
From there it just gets stuck, and the CPU executing that system call keeps hanging.