Mounting file system takes forever when there's a lot of file systems #1484
Which distribution are you running and are you using legacy mountpoints?
Debian GNU/Linux tested on Lenny and Squeeze (am about to test on Sid as soon as it's installed) and no legacy.
It sounds like this code is not parallelized. That is not surprising considering that the zvol block device creation code is not parallelized either. It probably is possible to tackle both similarly. Someone just needs to look into it. My idea for parallelizing the zvol code is to create a task queue to handle them asynchronously. In principle, the same task queue could also be used for mounting non-legacy datasets at pool import.
It would be difficult to parallelize it if/when there's a tree that needs to be mounted in order.
The logistics of finding what to parallelize and what not would be a nightmare... I'm currently trying to find real numbers behind my statement, and for my branch (which might be slower than the original one because of the recent changes in sharesmb and shareiscsi). When creating 700+ file systems, it starts out quite good - about 0.5 to a second to create and mount one 'tstXXX' branch (four file systems). I'm going to test with the released code and also with one which disables libshare altogether, but my initial tests (not documented properly) indicate it's the 'create && mount' part that takes time. My looking into libmount (I even have a patch that doesn't work) shows that we can't use the higher-level API - it does not understand a ZFS dataset. It thinks it's a file and tries to loop mount it (which of course fails, since there IS no such file). I'm going to look into the lower-level API after I've got some real numbers and comparisons between the different branches, and see if we can use libmount for locking and mtab updates only and just do the mount ourselves.
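For reference, the 'do the mount ourselves' part is a plain mount(2) call with the dataset name as the source and 'zfs' as the filesystem type; no file or loop device is involved, which is why the higher-level probing goes wrong. A minimal hedged sketch (the dataset and mountpoint names are made up, and the mtab update that real code does under the mtab lock is deliberately left out):

/* Hedged sketch: mount a ZFS dataset directly with mount(2); the dataset
 * and mountpoint names are placeholders, and the mtab bookkeeping is
 * deliberately left out. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

int main(void)
{
	const char *dataset = "tank/tst001";    /* hypothetical dataset */
	const char *mntpt = "/tank/tst001";     /* hypothetical mountpoint */

	/* The zfs kernel module takes the dataset name as the source;
	 * there is no block device and no loop device involved. */
	if (mount(dataset, mntpt, "zfs", MS_NOATIME, NULL) != 0) {
		fprintf(stderr, "mount %s on %s failed: %s\n",
		    dataset, mntpt, strerror(errno));
		return (1);
	}
	return (0);
}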
Fwiw, this is what I get from libmount (v2.20):
Doing an strace on a 'zfs create' command shows that we're spending a long time reading the mtab... A cache is created, but it's not used where it's needed (libzfs_mnttab_find()); instead, it reads the file directly, which with a lot of entries is expensive.
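For illustration (this is not the libzfs code, and the mtab_cache_* names are made up), the idea of a cache here is simply to parse /etc/mtab once with getmntent(3) and answer later 'is this dataset mounted?' lookups from memory instead of re-reading the ~66k file every time:

/* Hedged illustration - not libzfs itself: parse /etc/mtab once and serve
 * later lookups from an in-memory list. */
#include <mntent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct mtab_entry {
	char *special;                      /* e.g. the dataset name */
	char *mountp;
	struct mtab_entry *next;
};

static struct mtab_entry *mtab_cache;

static int mtab_cache_load(void)
{
	FILE *fp = setmntent("/etc/mtab", "r");
	struct mntent *mnt;

	if (fp == NULL)
		return (-1);
	while ((mnt = getmntent(fp)) != NULL) {
		struct mtab_entry *e = malloc(sizeof (*e));

		e->special = strdup(mnt->mnt_fsname);
		e->mountp = strdup(mnt->mnt_dir);
		e->next = mtab_cache;
		mtab_cache = e;
	}
	endmntent(fp);
	return (0);
}

/* A walk over in-memory entries instead of another pass over the mtab file. */
static const char *mtab_cache_find(const char *special)
{
	for (struct mtab_entry *e = mtab_cache; e != NULL; e = e->next)
		if (strcmp(e->special, special) == 0)
			return (e->mountp);
	return (NULL);
}

int main(void)
{
	const char *mp;

	if (mtab_cache_load() != 0)
		return (1);
	mp = mtab_cache_find("tank/tst001");    /* hypothetical dataset */
	printf("tank/tst001: %s\n", mp != NULL ? mp : "not mounted");
	return (0);
}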
I wrote a script a while ago that I think might help think about what can be done in parallel. For your example, if the mountpoints were the same as their names, it would mount them in waves: first wave, second wave, third wave, fourth wave, fifth wave, sixth wave. Note that your mountpoint does not need to have anything to do with its name.
This sounds like a good/working idea, BUT this was in a script. That is, you had a 'mother process' ('The Script (tm)') that kept track of all the file systems, and then just 'forked off' all the mounts in succession, I presume. How did you make sure that the parent fs had finished mounting before you continued with the child? And what did you do about mount failures? So in my example, I'll end up with 100 /sbin/mount and 100 /sbin/mount.zfs. I'll need to wait for these to finish before I spawn another 100 + 100 mounts.... I'm not sure this will be much better. It will, for one, consume a huge amount of memory, possibly without any real speed increase. And technically/code-wise, this will require communication between parent and child processes, so that we don't lose track of any possible mount failure.... So I'll say 'a mess' again :) Besides, each mount needs to run in succession, one at a time, because of mtab updates/locking - only one process may write to the mtab at a time. So these 100+100 processes won't run simultaneously anyway...
Yes indeed, I was able to leverage 'the script' to organize the ordering and have it be the parent process through which the individual mounts could 'communicate' with one another. That's why I thought it would be helpful for thinking about how to organize the mounts, in waves. We decided that continuing after a mount failure would work best for our purposes, and it matched the way zfs mount -a worked. I also considered stopping the whole process on the first failure and returning information about the failure; I don't know which way we should go in ZoL though. And yes, if we did try to track all the mountpoint dependencies and let failures closer to the root stop further mounts, then my script would be a lot easier to write than trying to write it into the project! Sorry about the formatting on that other post; I replied from my e-mail and did not check the newline formatting on GitHub. It would have been more legible like this:
first wave
second wave
third wave
fourth wave
fifth wave
sixth wave
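For what it's worth, the 'wave' a filesystem belongs to can be derived from its mountpoint alone: it is just the depth of the mountpoint path, and everything in wave N can be mounted (in parallel, if desired) once wave N-1 has finished. A rough sketch of that grouping in C, with made-up mountpoints (the real script was PHP):

/* Hedged sketch: order mountpoints into 'waves' by path depth, so a parent
 * always lands in an earlier wave than its children. Mountpoints are
 * made up for the example. */
#include <stdio.h>
#include <stdlib.h>

static int depth(const char *path)
{
	int d = 0;

	for (; *path != '\0'; path++)
		if (*path == '/')
			d++;
	return (d);
}

static int by_depth(const void *a, const void *b)
{
	return (depth(*(const char * const *)a) -
	    depth(*(const char * const *)b));
}

int main(void)
{
	const char *mounts[] = {
		"/tank", "/tank/home", "/tank/home/alice",
		"/tank/home/bob", "/tank/var", "/tank/var/log",
	};
	int n = sizeof (mounts) / sizeof (mounts[0]);
	int wave = 0;

	qsort(mounts, n, sizeof (mounts[0]), by_depth);

	for (int i = 0; i < n; i++) {
		if (depth(mounts[i]) > wave) {
			wave = depth(mounts[i]);
			printf("wave %d:\n", wave);
		}
		/* Everything listed under one wave could be mounted in
		 * parallel; the waves themselves must run in order. */
		printf("  %s\n", mounts[i]);
	}
	return (0);
}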
How many file systems did you try this on, and how much did you gain by using 'The Script'?
We were not going for speed; we were going for a more controlled, visible mount process than zfs mount -a provides. So unfortunately, we did not record the time it took to run the script. A few of the servers it ran on had around 600 file systems. If I were to alter this script to go for speed, I would first put it on GitHub for everyone (the current version is in PHP 😦 and not multithreaded). Would you want its runtime compared to zfs mount -a?
Ah. But that's easy, just add some fprintf()'s and a 'verbose' flag... Unfortunately, I (and this issue) am talking about speed. It takes >3h to mount 700 file systems, and that's just not acceptable! And the comparison would be nice!
@FransUrbo Parallelizing a tree is not very hard. Just have each task mount 1 filesystem and queue any children as tasks. It will recursively iterate down the tree.
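A userspace sketch of that idea (pthreads here, not the actual ZoL/SPL task queue; the dataset tree and do_mount_one() are placeholders): workers pull 'mount this dataset' tasks off a queue, and each task enqueues its children only after its own mount has finished, so parent-before-child ordering is preserved without serializing unrelated subtrees:

/* Hedged sketch, not ZoL code: mount a dataset tree with a worker pool;
 * children are queued only after their parent has been mounted. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct ds {                        /* made-up dataset tree node */
	const char *name;
	struct ds **children;
	int nchildren;
};

struct task { struct ds *d; struct task *next; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static struct task *head;
static int outstanding;            /* tasks queued or still running */

static void enqueue(struct ds *d)
{
	struct task *t = malloc(sizeof (*t));

	t->d = d;
	pthread_mutex_lock(&lock);
	t->next = head;
	head = t;
	outstanding++;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lock);
}

static void do_mount_one(struct ds *d)  /* placeholder for the real mount */
{
	printf("mounting %s\n", d->name);
}

static void *worker(void *arg)
{
	(void) arg;

	for (;;) {
		struct task *t;

		pthread_mutex_lock(&lock);
		while (head == NULL && outstanding > 0)
			pthread_cond_wait(&cv, &lock);
		if (head == NULL) {        /* outstanding == 0: all done */
			pthread_mutex_unlock(&lock);
			return (NULL);
		}
		t = head;
		head = t->next;
		pthread_mutex_unlock(&lock);

		do_mount_one(t->d);
		/* Only now is it safe to queue the children. */
		for (int i = 0; i < t->d->nchildren; i++)
			enqueue(t->d->children[i]);

		pthread_mutex_lock(&lock);
		outstanding--;             /* this task is finished */
		pthread_cond_broadcast(&cv);
		pthread_mutex_unlock(&lock);
		free(t);
	}
}

int main(void)
{
	/* Tiny hand-built tree: tank -> tank/home -> tank/home/alice. */
	struct ds alice = { "tank/home/alice", NULL, 0 };
	struct ds *home_kids[] = { &alice };
	struct ds home = { "tank/home", home_kids, 1 };
	struct ds *tank_kids[] = { &home };
	struct ds tank = { "tank", tank_kids, 1 };
	pthread_t tid[4];

	enqueue(&tank);
	for (int i = 0; i < 4; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(tid[i], NULL);
	return (0);
}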
@ryao that's basically what's done now, and the child still needs to wait for the parent to finish, so it's uncertain if any speed increase would actually take place...
Here are some statistics. It's now obvious that it ISN'T the 'create & mount' part that takes forever (not solely, anyway), but libshare, which is extremely inefficient. There is NO sharing, no fs has the 'share*' options set or changed; these are only simple commands:
The numbers are from time (real / user / sys). 'My branch' is the branch I'm working on and which I'm running on my live server - the latest iscsi, smbfs and crypto parts. This is 'about the same' as the original (at most, only a few minutes differ), so that's what I use in all the following tests.
So what one can see here: it's not my fault :). I thought my SMB/iSCSI additions made the problem worse, but this 'proves' that they didn't. They're not improving the issue, but they're not the big culprit :) But why the numbers don't improve much (other than in the 'create & mount' part) in my restrictive tests, I don't know. It's basically only init making sure that we don't check for availability too often (which I might have done previously). But with libshare disabled altogether, it took about a minute to mount all 700+ filesystems! THAT is ... scary, considering it took over four HOURS to mount them otherwise...
Running
shows that /etc/mtab is opened four times and read 1991 times (!!). And since that file is ~66k (712 lines), that of course takes a while each time... In a umount, it's opened 3 times and read 673 times, and in a mount it's also opened 3 times but read 1329 times...
@FransUrbo OK, I'm alive again. It looks like you've clearly shown that libshare is the issue, but where do we stand on fixing this?
I actually have no idea... I've been vetting this for some time now, and I don't have the foggiest idea of where to even begin. I'm not even 100% sure that it's the mtab reads that are at fault, but it stands to reason - reading it 2000 times, from scratch, is a lot. The only way to make sure it IS the mtab opens/reads is to redesign that part and do the tests again (unless someone is really good at using valgrind or whatever :). One possible way (which is even uglier, I know!!) to deal with this might be to open and read it once (at the beginning) and 'convert' its info to an SQLite db which we can then search through. OR even better, 'convert' it to a hdb (Berkeley DB) db. But if we do that once every single time someone runs 'zfs list ...' etc, it's not absolutely clear that we'll gain any time. It will for sure be slower for a small number of filesystems (which is what most people have)... Maybe not even use mtab for ZFS? Either statically just ignore it, or add a flag to zfs/zpool (etc?) that bypasses that part. If we really need to keep tabs on what FS is mounted via an external system, then use an internal format (hdb or whatever) which only 'we' read/write. Do we really need/want (I don't) the ZFS filesystems to show up in a 'mount' call? We do if any FS down the tree is 'legacy'... As I said, I just don't know. There might be a number of things we could do, but they might be worse, or just not good enough for 'everyone else'...
@FransUrbo Out of curiosity have you tried enabling the mnttab cache? This would drastically reduce the number of accesses to /etc/mtab. The cache was originally disabled to ensure the mount helper was always in sync with the zfs utility. We could probably find a safe way to re-enable the cache. You can enable the cache like this:
diff --git a/cmd/zfs/zfs_main.c b/cmd/zfs/zfs_main.c
index cb5c871..a5be3cf 100644
--- a/cmd/zfs/zfs_main.c
+++ b/cmd/zfs/zfs_main.c
@@ -6395,7 +6395,7 @@ main(int argc, char **argv)
/*
* Run the appropriate command.
*/
- libzfs_mnttab_cache(g_zfs, B_FALSE);
+ libzfs_mnttab_cache(g_zfs, B_TRUE);
if (find_command_idx(cmdname, &i) == 0) {
current_command = &command_table[i];
ret = command_table[i].func(argc - 1, argv + 1);
As you said, enabling the cache would probably speed up mounts, but at a huge cost. With 700+ nested filesystems, it's imperative that we're up to date with the actual state. Besides, the cache isn't used everywhere; some functions simply go directly to mtab (and even locally open, read, rewind and close it).
@FransUrbo Well perhaps the best solution here is to use the cache everywhere and keep it up to date. It was originally disabled out of an abundance of paranoia and expedience early on. We've never really seriously revisited that decision. Keeping the cache up to date should be very doable with something like inotify(7). I think it's worth exploring.
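A rough sketch of the inotify part (an assumed approach, not existing ZoL code; the cache itself is not shown): watch /etc/mtab and mark the in-memory cache stale whenever the file changes, so the next lookup knows to reload it:

/* Hedged sketch: use inotify(7) to notice /etc/mtab changes and mark an
 * in-memory mnttab cache as stale (the cache itself is not shown). */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

static int cache_stale = 1;   /* start stale so the first lookup loads it */

int main(void)
{
	char buf[4096];
	int fd = inotify_init();

	if (fd < 0 || inotify_add_watch(fd, "/etc/mtab",
	    IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF) < 0) {
		perror("inotify");
		return (1);
	}

	for (;;) {
		/* Blocks until mtab changes; a real implementation would poll
		 * this fd alongside its other work, and would re-add the watch
		 * if mtab is replaced by a rename. */
		ssize_t len = read(fd, buf, sizeof (buf));

		if (len <= 0)
			break;
		cache_stale = 1;
		printf("mtab changed, cache marked stale\n");
	}
	close(fd);
	return (0);
}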
I saved 11 minutes on the create (from 290 to 279) and thirteen on the mount (from 504 to 491) with the mtab cache enabled, no other changes than that... So it's still pointless. 504 minutes is still eight and a half hours to mount 700 filesystems! There are places where the cache isn't used at all, if memory serves me right. I looked into adding/rewriting that part, but I could find no way to actually access it (the handle wasn't available/'reachable')...
@FransUrbo shall we close this out? It's my understanding that things are considerably better. There's probably still room for improvement, but this issue has gotten a bit stale.
I think that's ok. It's "reasonably" fast now - no one (can/would) expect 1k+ filesystems to mount, _snap_, like that... I also think we can close all issues related to this... I've even considered closing my PR about this. I still think the idea behind it is valid, but it's not necessary. CREATING 1k+ filesystems, on the other hand, needs more work/investigation, but that can be another issue.
@FransUrbo Sounds good. I'm closing this issue and let's get a new one open for the filesystem creation case. Presumably that's being limited by the sync tasks but it would be good to investigate.
not sure if I have commented in the right place - does my comment on #845 belong here?
well, I do :) - why would it have to take forever?
I'm trying to debug some sharesmb/shareiscsi command speed issues, and created 700 filesystems.
Booting this system 'takes forever', mostly because it needs to fork and wait for 'mount.zfs' to finish.
Looking at lib/libzfs/libzfs_mount.c:do_mount(), it references libmount, which seems to be available in util-linux since version 2.18, which was released (according to the FTP site) 19-Jan-2012.
Current version is 2.23.1 - https://www.kernel.org/pub/linux/utils/util-linux/v2.23/libmount-docs/.
So 'we' really need to investigate this, and possibly add support for libmount...
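For reference, a minimal sketch of what a libmount-based mount might look like (assuming the fstype is set explicitly so libmount doesn't try to probe the dataset name as a regular file - whether that is enough is exactly what needs investigating; the dataset and mountpoint are made up):

/* Hedged sketch, not the ZoL implementation: mount one dataset through
 * libmount's mount context, which also handles mtab locking/updates.
 * Build with: cc sketch.c $(pkg-config --cflags --libs mount) */
#include <stdio.h>
#include <libmount.h>

int main(void)
{
	struct libmnt_context *cxt = mnt_new_context();
	int rc;

	if (cxt == NULL)
		return (1);

	/* Placeholder dataset/mountpoint; the fstype is forced so libmount
	 * doesn't try to guess (and loop-mount a nonexistent 'file'). */
	mnt_context_set_source(cxt, "tank/tst001");
	mnt_context_set_target(cxt, "/tank/tst001");
	mnt_context_set_fstype(cxt, "zfs");
	mnt_context_set_options(cxt, "noatime");

	rc = mnt_context_mount(cxt);
	if (rc != 0)
		fprintf(stderr, "mount failed: rc=%d\n", rc);

	mnt_free_context(cxt);
	return (rc != 0 ? 1 : 0);
}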