
Mounting file system takes forever when there's a lot of file systems #1484

Closed
FransUrbo opened this issue May 30, 2013 · 27 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@FransUrbo
Contributor

I'm trying to debug some sharesmb/shareiscsi command speed issues, and created 700 filesystems.

Booting this system 'takes forever', mostly because it needs to fork and wait for 'mount.zfs' to finish.

Looking at lib/libzfs/libzfs_mount.c:do_mount(), I see it references libmount, which seems to be available in util-linux since version 2.18, which was released (according to the FTP site) on 19-Jan-2012.

Current version is 2.23.1 - https://www.kernel.org/pub/linux/utils/util-linux/v2.23/libmount-docs/.

So 'we' really need to investigate this, and possibly add support for libmount...

@ryao
Contributor

ryao commented May 30, 2013

Which distribution are you running and are you using legacy mountpoints?

@FransUrbo
Contributor Author

Debian GNU/Linux, tested on Lenny and Squeeze (I'm about to test on Sid as soon as it's installed), and no legacy mountpoints.

@ryao
Contributor

ryao commented May 30, 2013

It sounds like this code is not parallelized. That is not surprising considering that the zvol block device creation code is not parallelized either. It probably is possible to tackle both similarly. Someone just needs to look into it.

My idea for parallelizing the zvol code is to create a task queue to handle them asynchronously. In principle, the same task queue could also be used for mounting non-legacy datasets at pool import.
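
To make the shape of that idea concrete, here is a minimal, purely illustrative userland task queue sketched with pthreads. This is not ZoL code and all names are made up; the kernel side already has a real taskq via the SPL, so something like this would only be needed on the libzfs side for mounting at import.

/*
 * Not ZoL code: a minimal userland "task queue" sketch using pthreads,
 * only to illustrate the shape of the idea.  All names are made up.
 */
#include <pthread.h>
#include <stdlib.h>

typedef void (*task_func_t)(void *);

typedef struct task {
    task_func_t t_func;
    void *t_arg;
    struct task *t_next;
} task_t;

typedef struct taskq {
    pthread_mutex_t tq_lock;
    pthread_cond_t tq_cv;
    task_t *tq_head;
    int tq_quit;
} taskq_t;

/* Worker thread: pop tasks and run them until told to quit. */
static void *
taskq_worker(void *arg)
{
    taskq_t *tq = arg;

    pthread_mutex_lock(&tq->tq_lock);
    for (;;) {
        while (tq->tq_head == NULL && !tq->tq_quit)
            pthread_cond_wait(&tq->tq_cv, &tq->tq_lock);
        if (tq->tq_head == NULL)
            break;  /* quitting and drained */

        task_t *t = tq->tq_head;
        tq->tq_head = t->t_next;
        pthread_mutex_unlock(&tq->tq_lock);

        t->t_func(t->t_arg);  /* run the task outside the lock */
        free(t);

        pthread_mutex_lock(&tq->tq_lock);
    }
    pthread_mutex_unlock(&tq->tq_lock);
    return (NULL);
}

/* Queue a task; any worker may pick it up. */
void
taskq_dispatch(taskq_t *tq, task_func_t func, void *arg)
{
    task_t *t = malloc(sizeof (*t));

    t->t_func = func;
    t->t_arg = arg;
    pthread_mutex_lock(&tq->tq_lock);
    t->t_next = tq->tq_head;
    tq->tq_head = t;
    pthread_cond_signal(&tq->tq_cv);
    pthread_mutex_unlock(&tq->tq_lock);
}

/* Start nthreads workers (teardown via tq_quit + broadcast is omitted). */
void
taskq_init(taskq_t *tq, int nthreads)
{
    pthread_mutex_init(&tq->tq_lock, NULL);
    pthread_cond_init(&tq->tq_cv, NULL);
    tq->tq_head = NULL;
    tq->tq_quit = 0;
    while (nthreads-- > 0) {
        pthread_t tid;
        (void) pthread_create(&tid, NULL, taskq_worker, tq);
    }
}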

@FransUrbo
Contributor Author

It would be difficult to parallelize it if/when there's a tree that needs to be mounted in order.

share
share/tests
share/tests/multitst
share/tests/multitst/tst001
share/tests/multitst/tst001/sub001
share/tests/multitst/tst001/sub001/lst001
share/tests/multitst/tst001/sub001/lst002
share/tests/multitst/tst001/sub002
share/tests/multitst/tst001/sub002/lst001
share/tests/multitst/tst001/sub002/lst002
share/tests/multitst/tst002
[...]

The logistics of finding out what to parallelize and what not would be a nightmare...

I'm currently trying to find real numbers to back up my statement, and for my branch (which might be slower than the original one because of the recent changes in sharesmb and shareiscsi). Creating 700+ file systems starts out quite well - about half a second to a second to create and mount one 'tstXXX' branch (four file systems).
But after about 60 branches (about 400 file systems in total), it takes about a minute to create one single file system!

I'm going to test with the released code and also with a build which disables libshare altogether, but my initial tests (not properly documented) indicate it's the 'create && mount' part that takes the time.

I've been looking into libmount (I even have a patch, which doesn't work), and it shows that we can't use the higher-level API - it does not understand a ZFS dataset. It thinks the source is a file and tries to loop-mount it (which of course fails, since there IS no such file).

I'm going to look into the lower-level API after I've got some real numbers and comparisons between the different branches, and see if we can use libmount for locking and mtab updates only and just do the mount ourselves.
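
For reference, here is roughly the shape I have in mind - a hedged sketch, not working code: do the mount(2) call ourselves and use only libmount's low-level update API for the locking and the mtab/utab bookkeeping. The zfs_do_mount_sketch() name and the error handling are made up; build against libmount with `pkg-config --cflags --libs mount`.

/*
 * Sketch only: mount(2) ourselves, libmount only for mtab/utab updates.
 */
#include <stdio.h>
#include <sys/mount.h>
#include <libmount/libmount.h>

int
zfs_do_mount_sketch(const char *dataset, const char *mntpt, const char *opts)
{
    struct libmnt_fs *fs;
    struct libmnt_update *upd;
    int rc;

    /*
     * The actual mount, bypassing libmount's high-level context
     * (which is what tries to loop-mount the dataset).
     */
    if (mount(dataset, mntpt, "zfs", 0, opts) != 0) {
        perror("mount");
        return (-1);
    }

    /* Describe what was just mounted ... */
    fs = mnt_new_fs();
    mnt_fs_set_source(fs, dataset);
    mnt_fs_set_target(fs, mntpt);
    mnt_fs_set_fstype(fs, "zfs");
    mnt_fs_set_options(fs, opts);

    /* ... and let libmount do only the locking and mtab/utab update. */
    upd = mnt_new_update();
    rc = mnt_update_set_fs(upd, 0, NULL, fs);
    if (rc == 0)
        rc = mnt_update_table(upd, NULL);  /* NULL lock: libmount locks itself */

    mnt_free_update(upd);
    mnt_free_fs(fs);
    return (rc);
}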

@FransUrbo
Contributor Author

Fwiw, this is what I get from libmount (v2.20):

root@DebianZFS-Sid64-SCST:~# zfs mount share
libmount: debug mask set to 0xffff.
libmount:      CXT: [0x2357080]: ----> allocate
libmount:    UTILS: mtab: /etc/mtab
libmount:    UTILS: /etc/mtab: irregular/non-writable
libmount:    UTILS: utab: /run/mount/utab
libmount:      CXT: [0x2357080]: mount: preparing
libmount:      CXT: [0x2357080]: use default optmode
libmount:      CXT: [0x2357080]: OPTSMODE: ignore=0, append=0, prepend=1, replace=0, force=0, fstab=1, mtab=1
libmount:      CXT: [0x2357080]: fstab not required -- skip
libmount:      CXT: [0x2357080]: merging mount flags
libmount:      CXT: [0x2357080]: final flags: VFS=00000000 user=00000000
libmount:      CXT: [0x2357080]: mount: evaluating permissions
libmount:      CXT: [0x2357080]: mount: fixing optstr
libmount:      CXT: appling 0x00000000 flags 'atime,dev,exec,rw,suid,nomand'
libmount:      CXT: appling 0x00000000 flags '(null)'
libmount:      CXT: [0x2357080]: fixed options [rc=0]: vfs: 'rw' fs: 'xattr,zfsutil' user: '(null)', optstr: 'rw,xattr,zfsutil'
libmount:      CXT: [0x2357080]: preparing source path
libmount:      CXT: [0x2357080]: srcpath 'share'
libmount:    CACHE: [0x2357280]: alloc
libmount:    CACHE: [0x2357280]: add entry [ 1] (path): share: share
libmount:      CXT: [0x2357080]: trying to setup loopdev for share
libmount:      CXT: [0x2357080]: enabling AUTOCLEAR flag
libmount:      CXT: [0x2357080]: trying to use /dev/loop0
libmount:      CXT: [0x2357080]: failed to setup device
libmount:      CXT: [0x2357080]: mount: preparing failed
libmount:      CXT: [0x2357080]: <---- reset [status=0] ---->
libmount:    CACHE: [0x2357280]: free
libmount:      CXT: [0x2357080]: <---- free
cannot mount 'share': No such device or address

@FransUrbo
Contributor Author

Doing an strace on a 'zfs create' command shows that we're spending a long time reading the mtab...

A cache is created, but it's not used where it's needed (libzfs_mnttab_find()); instead the file is read directly, which is expensive with a lot of entries.
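
To illustrate why this hurts, here is a purely illustrative sketch (not the libzfs code; all names are hypothetical) that reads /etc/mtab once with getmntent(3) and answers every subsequent "is this dataset mounted?" question from memory, instead of re-scanning a ~66k file per lookup:

/*
 * Illustration only: scan the mtab once, answer lookups from memory.
 */
#include <mntent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct mnt_cache_entry {
    char *source;  /* e.g. "share/tests/multitst" */
    char *target;  /* its mountpoint */
    struct mnt_cache_entry *next;
};

/* Read the whole mtab once into a list. */
struct mnt_cache_entry *
mnt_cache_load(const char *mtab)
{
    struct mnt_cache_entry *head = NULL;
    struct mntent *me;
    FILE *fp = setmntent(mtab, "r");

    if (fp == NULL)
        return (NULL);
    while ((me = getmntent(fp)) != NULL) {
        struct mnt_cache_entry *e = malloc(sizeof (*e));
        e->source = strdup(me->mnt_fsname);
        e->target = strdup(me->mnt_dir);
        e->next = head;
        head = e;
    }
    endmntent(fp);
    return (head);
}

/* O(n) walk for clarity; the real libzfs cache uses an AVL tree. */
const char *
mnt_cache_find(struct mnt_cache_entry *head, const char *dataset)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->source, dataset) == 0)
            return (head->target);
    return (NULL);
}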

@stevenburgess

I wrote a script a while ago that I think might help with thinking about what can be mounted in parallel. I organized the file systems based on how many / characters they had in their mountpoint, then grouped them by that number. So we could mount all file systems with 1 slash in their mountpoint at the same time, then all file systems with 2 at the same time, after all the 1s are done (a rough sketch in code follows at the end of this comment).

For your example, if the mountpoints were the same as their names, it would mount like this, with each wave mounted in parallel:

first wave
share

second wave
share/tests

third wave
share/tests/multitst

fourth wave
share/tests/multitst/tst001
share/tests/multitst/tst002

fifth wave
share/tests/multitst/tst001/sub002
share/tests/multitst/tst001/sub001

sixth wave
share/tests/multitst/tst001/sub001/lst001
share/tests/multitst/tst001/sub001/lst002
share/tests/multitst/tst001/sub002/lst001
share/tests/multitst/tst001/sub002/lst002

Note that your mountpoint does not need to have anything to do with your FS name; my program only worked on the mountpoints.
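
Here is what that wave grouping could look like in C (the original script was PHP). It is only a sketch: it assumes the caller already has dataset/mountpoint arrays sorted by mountpoint depth, it shells out to `zfs mount <dataset>` per filesystem, and the error handling is deliberately minimal.

/*
 * Sketch of the "wave" idea: mount everything at one depth in parallel,
 * wait for the whole wave, then go one level deeper.
 */
#include <sys/wait.h>
#include <unistd.h>

/* Depth of a mountpoint == number of '/' characters in it. */
static int
depth(const char *mntpt)
{
    int d = 0;

    for (; *mntpt != '\0'; mntpt++)
        if (*mntpt == '/')
            d++;
    return (d);
}

/* datasets[]/mntpts[] must be ordered by mountpoint depth, shallowest first. */
void
mount_in_waves(const char *datasets[], const char *mntpts[], int n)
{
    int i = 0;

    while (i < n) {
        int wave = depth(mntpts[i]);
        int j;

        /* Fork one `zfs mount` per filesystem in this wave ... */
        for (j = i; j < n && depth(mntpts[j]) == wave; j++) {
            if (fork() == 0) {
                execlp("zfs", "zfs", "mount", datasets[j], (char *)NULL);
                _exit(127);
            }
        }

        /* ... and wait for the whole wave before going deeper. */
        while (wait(NULL) > 0)
            ;
        i = j;
    }
}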

@FransUrbo
Contributor Author

This sounds like a good/workable idea, BUT this was in a script. That is, you had a 'mother process' ('The Script (tm)') that kept track of all the file systems and then just 'forked off' all the mounts in succession, I presume.

How did you make sure that the parent fs had finished mounting, before you continued with the child? And what did you do about mount failures?

So in my example, I'd end up with 100 /sbin/mount and 100 /sbin/mount.zfs processes. I'd need to wait for those to finish before I spawn another 100 + 100 mounts... I'm not sure this will be much better. For one, it will consume a huge amount of memory, possibly without any real speed increase.

And technically/code-wise, this will require communication between the parent and child processes, so that we don't lose track of any possible mount failure... So I'll say 'a mess' again :)

Besides, each mount needs to run in succession, one at a time, because of mtab updates/locking - only one process may write to the mtab at a time. So these 100+100 processes won't run simultaneously anyway...

@stevenburgess

Yes indeed, I was able to leverage 'the script' to organize the ordering and have it be the parent process through which the individual mounts could 'communicate' with one another. That's why I thought it would be helpful for thinking about how to organize the mounts, in waves.

We decided that continuing after a mount failure would work best for our purposes, and matched the way zfs mount -a worked.

I also considered stopping the whole process on the first failure and returning information about the failure; I don't know which way we should go in ZoL though.

And yes, if we did try to track all the mountpoint dependencies and let failures closer to the root stop further mounts, then my script would be a lot easier to write than trying to build it into the project!

Sorry about the formatting on that other post; I replied from my e-mail and did not check the newline formatting on GitHub. It would have been more legible like this:

first wave
share

second wave
share/tests

third wave
share/tests/multitst

fourth wave
share/tests/multitst/tst001
share/tests/multitst/tst002

fifth wave
share/tests/multitst/tst001/sub002
share/tests/multitst/tst001/sub001

sixth wave
share/tests/multitst/tst001/sub001/lst001
share/tests/multitst/tst001/sub001/lst002
share/tests/multitst/tst001/sub002/lst001
share/tests/multitst/tst001/sub002/lst002

@FransUrbo
Contributor Author

How many file systems did you try this on, and how much did you gain by using 'The Script'?

@stevenburgess

We were not going for speed, we were going for a more controlled, visible mount process than zfs mount -a provides. So unfortunately, we did not record the time it took to run the script.

A few of the servers it ran on had around 600 file systems. If I were to alter this script to go for speed, I would first put it on GitHub for everyone (the current version is in PHP 😦 and not multithreaded). Would you want its runtime compared to zfs mount -a?

@FransUrbo
Contributor Author

Ah. But that's easy, just add some fprintf()s and a 'verbose' flag...

Unfortunately, I (and this issue) am talking about speed. It takes >3h to mount 700 file systems, and that's just not acceptable!

And the comparison would be nice!

@ryao
Contributor

ryao commented Jun 2, 2013

@FransUrbo Parallelizing a tree is not very hard. Just have each task mount 1 filesystem and queue any children as tasks. It will recursively iterate down the tree.
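
In pseudo-C it could look something like this, reusing the hypothetical taskq from the pthread sketch further up (taskq_t, task_func_t and taskq_dispatch() come from there, assumed to live in a shared header); mount_one() is a stand-in for the real mount path, and handle lifetime/refcounting is elided. Each task mounts exactly one dataset, then dispatches one task per child, so siblings mount in parallel while every child still starts strictly after its parent has been mounted.

/* Sketch only: one mount task per dataset, children queued after the parent. */
#include <stdlib.h>
#include <libzfs.h>

/* taskq_t, task_func_t and taskq_dispatch() from the earlier sketch's header. */
extern void taskq_dispatch(taskq_t *, task_func_t, void *);
extern void mount_one(zfs_handle_t *);  /* hypothetical mount helper */

typedef struct mount_task {
    taskq_t *mt_tq;
    zfs_handle_t *mt_zhp;  /* dataset to mount */
} mount_task_t;

static void mount_task_func(void *);

/* zfs_iter_filesystems() callback: queue one task per child dataset. */
static int
queue_child(zfs_handle_t *child, void *data)
{
    mount_task_t *parent = data;
    mount_task_t *mt = malloc(sizeof (*mt));

    mt->mt_tq = parent->mt_tq;
    mt->mt_zhp = child;
    taskq_dispatch(mt->mt_tq, mount_task_func, mt);
    return (0);
}

static void
mount_task_func(void *data)
{
    mount_task_t *mt = data;

    mount_one(mt->mt_zhp);  /* parent is mounted first ... */
    (void) zfs_iter_filesystems(mt->mt_zhp, queue_child, mt);  /* ... then children are queued */
    free(mt);
}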

@FransUrbo
Contributor Author

@ryao that's basically what's done now, and the children still need to wait for the parent to finish, so it's uncertain whether any speed increase would actually take place...

@FransUrbo
Contributor Author

Here are some statistics. It's now obvious that it ISN'T the 'create & mount' part that takes forever (not solely, anyway), but libshare, which is extremely inefficient.

There is NO sharing, and no fs has the 'share*' options set or changed; these are only simple commands:

create = script that calls 'zfs create <path/fs>' in 100x3 levels (see earlier posts)
list   = 'zfs list'
umount = 'zfs umount -a'
mount  = 'zfs mount -a'

The numbers are from time (real / user / sys).

'My branch' is the branch I'm working on and running on my live server - the latest iscsi, smbfs and crypto parts. It is 'about the same' as the original (at most, only a few minutes differ), so that's what I use in all the following tests.

Original:
    * Latest git pull
    create:         210m42.084s / 6m11.440s / 69m26.544s
    list:             0m31.733s /  0m0.492s /   0m8.824s
    umount:           1m45.024s /  0m3.396s /   0m6.268s
    mount:          247m40.296s /  3m0.704s /  39m9.200s

w/ my branch + restrictive_retrieve:
   * My iscsi+crypto branch w/ first attempt to improve perf in share{smb,iscsi}
    create:         190m21.672s / 4m52.496s / 56m47.840s
    list:             0m38.944s /  0m0.444s /  0m11.100s
    umount:           1m46.997s /  0m3.088s /  0m10.380s
    mount:          330m27.940s / 2m55.376s / 43m11.576s

w/ my branch + restrictive_retrieve2:
   * My iscsi+crypto branch w/ second attempt to improve perf in share{smb,iscsi}
    create:         177m53.576s /  5m0.260s / 55m22.500s
    list:             0m27.125s /  0m0.340s /   0m7.472s
    umount:           1m31.593s /  0m2.712s /   0m7.340s
    mount:          336m53.791s / 2m42.728s / 43m13.360s

w/ my branch + share{smb,nfs,iscsi} disabled:
    * My iscsi+crypto branch w/ only libshare_{nfs,smb,iscsi}_init() disabled
    create:         202m56.507s /  6m3.120s / 66m24.688s
    list:             0m32.492s /  0m0.576s /   0m9.816s
    umount:           1m44.874s /  0m3.316s /   0m5.416s
    mount:          254m35.382s / 2m42.684s / 39m21.608s

w/ my branch + all libshare disabled:
    * My iscsi+crypto branch w/ all of libshare disabled
    create:           71m7.642s /  2m7.016s / 22m56.880s
    list:             0m17.980s /  0m0.152s /   0m5.824s
    umount:           1m38.414s /  0m3.088s /   0m3.532s
    mount:            1m15.633s /  0m5.132s /  0m32.300s

So as one can see here, it's not my fault :). I thought my SMB/iSCSI additions made the problem worse, but this 'proves' they didn't. They're not improving the issue, but they're not the big culprit :)

But why the numbers don't improve much (other than in the 'create & mount' part) in my restrictive tests, I don't know. It's basically only the init code making sure that we don't check for availability too often (which I might have done previously).

But with all libshare disabled, it took about a minute to mount all 700+ filesystems! THAT is ... scary considering it took over four HOURS to mount them otherwise...

@FransUrbo
Contributor Author

Running

strace -s 3000 -f zfs create share/tests/multitst/tst101

shows that /etc/mtab is opened four times and read 1991 times (!!). And since that file is ~66k (712 lines), that of course takes a while each time...

In a umount it's opened 3 times and read 673 times, and in a mount it's also opened 3 times but read 1329 times...

@FransUrbo FransUrbo mentioned this issue Jun 4, 2013
@behlendorf
Contributor

@FransUrbo OK, I'm alive again. It looks like you've clearly shown that libshare is the issue, but where do we stand on fixing this?

@FransUrbo
Contributor Author

I actually have no idea... I've been vetting this for some time now, and I don't have the foggiest idea where to even begin.

I'm not even 100% sure that it's the mtab reads that are at fault, but it stands to reason - reading it 2000 times, from scratch, is a lot. The only way to make sure it IS the mtab opens/reads is to redesign that part and do the tests again (unless someone is really good at using valgrind or whatever :).
There is a cache, but it's not used everywhere (and can't (easily?) be, because the 'global' that tracks the mtab pointer isn't available everywhere).

One possible way (which is even uglier, I know!!) to deal with this might be to open and read it once (at the beginning) and 'convert' its info to a SQLite db which we can then search through. OR even better, 'convert' it to an hdb (Berkeley DB) database. But if we do that once every single time someone runs 'zfs list ...' etc., it's not absolutely clear that we'll gain any time. It will for sure be slower for a small number of filesystems (which is what most people have)...

Maybe not even use mtab for ZFS? Either just ignore it outright, or add a flag to zfs/zpool (etc?) that bypasses that part. If we really need to keep tabs on which FSes are mounted via an external system, then use an internal format (hdb or whatever) which only 'we' read/write.

Do we really need/want (I don't) the ZFS filesystems to show up in a 'mount' call? We do if any FS down the tree is 'legacy'...

As I said, I just don't know. There might be a number of things we could do, but they might be worse, or just not good enough for 'everyone else'...

@behlendorf
Contributor

@FransUrbo Out of curiosity, have you tried enabling the mnttab cache? It would drastically reduce the number of accesses to /etc/mtab. The cache was originally disabled to ensure the mount helper was always in sync with the zfs utility. We could probably find a safe way to re-enable it. You can enable the cache like this:

diff --git a/cmd/zfs/zfs_main.c b/cmd/zfs/zfs_main.c
index cb5c871..a5be3cf 100644
--- a/cmd/zfs/zfs_main.c
+++ b/cmd/zfs/zfs_main.c
@@ -6395,7 +6395,7 @@ main(int argc, char **argv)
        /*
         * Run the appropriate command.
         */
-       libzfs_mnttab_cache(g_zfs, B_FALSE);
+       libzfs_mnttab_cache(g_zfs, B_TRUE);
        if (find_command_idx(cmdname, &i) == 0) {
                current_command = &command_table[i];
                ret = command_table[i].func(argc - 1, argv + 1);

@FransUrbo
Contributor Author

As you said, enabling the cache would probably speed up mounts, but at a huge cost. With 700+ nested filesystems, it's imperative that we're up to date with the actual state. Besides, the cache isn't used everywhere; some functions simply go directly to the mtab (they even locally open, read, rewind and close it).

@behlendorf
Contributor

@FransUrbo Well, perhaps the best solution here is to use the cache everywhere and keep it up to date. It was originally disabled out of an abundance of paranoia and expedience early on, and we've never really seriously revisited that decision. Keeping the cache up to date should be very doable with something like inotify(7). I think it's worth exploring.

http://www.linuxjournal.com/article/8478
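
Something along these lines - a rough, hypothetical sketch, not proposed code: put an inotify watch on /etc/mtab and only re-read it into the mnttab cache when an event says it changed. Note that /etc/mtab is often replaced via rename(2), which is why the mask includes the self-move/self-delete events.

/* Sketch: invalidate the mnttab cache only when inotify says mtab changed. */
#include <sys/inotify.h>
#include <unistd.h>

/* Returns an inotify fd watching /etc/mtab, or -1 on error. */
int
mtab_watch_init(void)
{
    int fd = inotify_init1(IN_NONBLOCK);

    if (fd < 0)
        return (-1);
    if (inotify_add_watch(fd, "/etc/mtab",
        IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF) < 0) {
        (void) close(fd);
        return (-1);
    }
    return (fd);
}

/* Non-blocking: returns non-zero if the cached copy of mtab is stale. */
int
mtab_cache_stale(int fd)
{
    char buf[4096];
    int stale = 0;

    /* Drain queued events; each one means mtab was touched. */
    while (read(fd, buf, sizeof (buf)) > 0)
        stale = 1;
    return (stale);
}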

@FransUrbo
Contributor Author

I saved 11 minutes on the create (from 290 to 279) and thirteen on the mount (from 504 to 491) with the mtab cache enabled, and no other changes than that... So it's still pointless. 504 minutes is still eight and a half hours to mount 700 filesystems!

There are places where the cache isn't used at all, if memory serves me right. I looked into adding/rewriting that part, but I could find no way to actually access it (the handle wasn't available/'reachable')...

@behlendorf
Contributor

@FransUrbo shall we close this out? It's my understanding that things are considerably better. There's probably still room for improvement, but this issue has gotten a bit stale.

@behlendorf behlendorf added the "Difficulty - Medium" and "Type: Performance" labels and removed the "Type: Feature" label Oct 16, 2014
@behlendorf behlendorf removed this from the 0.6.5 milestone Oct 16, 2014
@FransUrbo
Contributor Author

I think that's ok. It's "reasonably" fast now - no one (can/would) expect 1k+ filesystems to mount, _snap_, just like that...

I also think we can close all issues related to this... I've even considered closing my PR about this. I still think the idea behind it is valid, but it's not necessary.

CREATING 1k+ filesystems on the other hand needs more work/investigation, but that can be another issue.

@behlendorf
Contributor

@FransUrbo Sounds good. I'm closing this issue; let's get a new one opened for the filesystem creation case. Presumably that's being limited by the sync tasks, but it would be good to investigate.

@mailinglists35

not sure if I have commented in the right place - does my comment on #845 belong here?
#845 (comment)

@mailinglists35

no one (can/would) expects 1k+ filesystems to mount, snap, like that...

well, I do :) - why would it have to take forever?
