Very slow metadata reads during writing #1179
What is the output of
Was this ever addressed? I find the same thing. Here is the output from zdb: pool1: Thanks, Steve
@dominikh I am rather late in responding to this, but your issue could be caused by having an ashift=9 pool on an advanced format disk. That would incur read-modify-write overhead inside the disk itself. @cousins I am not sure about your case; I do not have much information about what you are doing to go on.
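For anyone wanting to rule this out, a quick check of the pool's ashift and the drive's sector sizes might look like the following sketch; the pool name and device are placeholders, not taken from the original report:

```sh
# Show the ashift recorded for each vdev in the imported pool's configuration
zdb -C pool1 | grep ashift

# Show the drive's logical and physical sector sizes as reported by the kernel
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
```

An ashift of 9 (512-byte allocations) on a drive reporting a 4096-byte physical sector size is the combination that would trigger the read-modify-write penalty described above.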
It's not an advanced format disk, it's a SAMSUNG HD103SJ, which still uses real 512-byte sectors. The performance penalties I am seeing are way higher than what read-modify-write would incur, too.
Hi Richard,

The current case is that I have two rsyncs going (about 31 TB between the two, so they are running for days), and meanwhile I wanted to check how much data had been copied, so I was running:

du -cbsh /pool2/omg/lab//

The load on the system is between 1.5 and 2.0, with the majority of it coming from ssh (rsync over ssh) and the two rsync processes:

top - 16:37:35 up 24 days, 1:14, 7 users, load average: 1.70, 1.69, 2.02
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

The last time I ran it, it took about 52 minutes to run after 10 TB had been copied. On the source volumes (xfs on a different machine) it takes about 35 minutes to do the whole 31 TB, which is longer than I expected but still quite a bit faster than the ZFS system right now. The 31 TB volumes contain a total of about 16 million files, and so far about 6.6 million files have been copied.

So, FWIW, the XFS system is an old Supermicro Opteron file server (circa 2005) with two Opteron 248 CPUs and 32 GB of RAM, connected to two separate RAID systems via an LSI Ultra320 SCSI HBA. The ZFS system is a new Supermicro system with two Xeon E5-2620 processors and 128 GB of RAM, connected to the drives with LSI SAS controllers.

You mentioned ashift. I used -o ashift=12, and these are Seagate ST4000NM0023 4 TB drives. Please let me know if I should have used something else.

Thanks very much for your help. Let me know if you need any other information.

Steve
@cousins Can you try setting the following module option to see if that helps? If your filesystem is relatively full or fragmented it may help considerably.
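The option itself did not survive the copy here; judging from the later replies, the tunable being discussed is metaslab_debug=1. A minimal sketch of setting it, assuming the parameter is exposed as writable under /sys/module/zfs/parameters on this ZoL build:

```sh
# Keep loaded metaslabs resident instead of unloading them (costs some extra RAM)
echo 1 > /sys/module/zfs/parameters/metaslab_debug

# Or make the setting persistent across reboots via the module options file
echo "options zfs metaslab_debug=1" >> /etc/modprobe.d/zfs.conf
```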
Intriguing... what's the theory as to how enabling this option might improve performance? Better caching? Perhaps it should become something other than a debug option?
I gave it a try but I don't think it helped. This is a fairly new system so I can't imagine it is fragmented, but this pool is getting filled up. Right now it is at 78% full. Here is what I got:

[root@nfs1 lab]# echo " "; echo " Total size: "; time du -cbsh *; echo -n " Number of files: "; time find . -type f | wc -l; echo " "
 Total size:
real    203m41.660s
real    94m49.238s

So 3 hours and 23 minutes to du 14 TB with 11 million files. The find command itself took 95 minutes.

I'm willing to try other things. This system is a beta system; I'm trying to make sure that it is stable and that I'm comfortable with zfsonlinux before we put it into production. That said, I don't really want to lose the data on it, but I'm willing to try just about anything if it makes sense.
@chrisrd One of the known soft spots in ZFS can be the block allocator. For a highly fragmented pool it's possible to have to unload and load multiple metaslabs before finding a good place to write. Setting metaslab_debug=1 prevents ZFS from unloading the old metaslabs, which can improve things in this case by preventing wasted I/O. This comes at the expense of some memory. There's work planned to improve how the allocator handles this case when the pool is full.

@cousins However, since it didn't help, clearly that's not the issue here. One other thing you might try is to increase the zfs_arc_meta_limit module option.
Hi Brian,

Here you go. It looks like the max is being reached. The system has 128 GB, so is ARC size 64 GB? So, set zfs_arc_meta_limit to 32 GB? Can this be done on a running system or do I set it and restart? Just put:

options zfs zfs_arc_meta_limit=34359738368

in /etc/modprobe.d/zfs.conf and reboot?

Thanks,
Steve

[root@nfs1 lab]# cat /proc/spl/kstat/zfs/arcstats
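A sketch of how the runtime question might be answered, assuming zfs_arc_meta_limit is exposed as writable under /sys/module/zfs/parameters on this build (the 34359738368 value is the 32 GiB figure from the comment above):

```sh
# Compare current metadata ARC usage against the metadata limit
grep -E 'arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats

# If the parameter is writable at runtime, raise the limit without a reboot
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_meta_limit

# Make the change persistent across reboots
echo "options zfs zfs_arc_meta_limit=34359738368" >> /etc/modprobe.d/zfs.conf
```

Whether a runtime change takes effect immediately may depend on the ZoL version, so the modprobe.d entry plus a reboot is the safer fallback.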
(Apologies for the ticket topic hijack, but I thought it best to keep the information together for anyone following.) @behlendorf Would metaslab_debug=1 also mean the deferred free list doesn't get compacted until the next mount, and the compact on mount could then take excessive amounts of time? I had metaslab_debug=1 on whilst doing a lot of removes, then the machine was power cycled (possibly due to zfs memory issues), and after the reboot the mount took over 7 hours(!). Would this be an explanation for that behaviour, and why metaslab_debug=1 is a debug option rather than a recommended option for some scenarios? (For anyone interested, my reference for suspecting this behaviour is the 'deferred frees' section of: https://blogs.oracle.com/bonwick/entry/space_maps)
@dweeezil Yep, I have (lots of) xattrs (w/ xattr=dir), but I'm removing the xattrs before unlinking the files (ref: #457 (comment))
...oh yes, I've also previously used stap to confirm that, if I unlink the xattrs first, I don't get into the problematic code in
@chrisrd With code from today's master (and, I suspect, 0.6.1), the hidden directory hangs around in the unlinked set until a remount even if you manually remove all the xattrs. If you actually perform an operation that fetches an xattr during the current mount session, both the file's object and the hidden directory object get stuck in the unlinked set until the next remount. This problem is a side-effect of the various attempts at fixing xattr-deleting deadlocks. I realize now that this has nothing to do with the original poster's issue but, as I mentioned, any postings involving the unlinked set, xattrs, etc. get my attention.
@dominikh Is this still a problem with the latest Git master, or can this be closed?
@FransUrbo Can't tell, I'm not currently using ZoL. Feel free to close and someone will refile the issue if it happens again.
This issue has gotten pretty cluttered. Let's close it out; we can open a new issue if this is reported again.
I'm experiencing the following problem when doing any big sequential writes, such as copying (from a different file system)/generating a big file:
Directory listings (e.g. with ls) and computing disk usage with du get very slow, to the point that tree prints only a couple of lines per second.

Normal reads, such as cat-ing another big file, have perfect performance though (they slow down writes to almost a halt, actually), so the slow reads are limited to metadata.

I have tested this both with a pool on a single SATA disk and with a pool of three SATA disks without redundancy, on different controllers as well. I have tested setting zfs_vdev_max_pending to 1 as well as to a higher value than the default, but neither made a noticeable difference.

Under idle load, using tree causes constant ~2 MB/s reads (according to zpool iostat). During a write, tree causes between 6 and 600 kB/s of reads, not constantly but rather in "impulses": it reads for 1-2 seconds, then pauses for another 1-2 seconds, and so on.

I'd be more than happy to provide any information you need.
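A minimal way to observe the pattern described above might look like the following sketch; the pool name tank and the paths are placeholders, not taken from the original report:

```sh
# Terminal 1: generate a large sequential write to the pool (illustrative file)
dd if=/dev/zero of=/tank/bigfile bs=1M count=100000

# Terminal 2: watch per-second pool I/O while the write is running
zpool iostat tank 1

# Terminal 3: time metadata-heavy traversals during the write
time tree /tank > /dev/null
time du -sh /tank
```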