
zfs send/recv VERY slow for large files (>5GB) #2746

Closed
lintonv opened this issue Sep 29, 2014 · 35 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@lintonv

lintonv commented Sep 29, 2014

I have been using ZFS send/recv over SSH. The speeds are excellent for small files, but for large files (>5GB) they are terrible.

  • File size less than 5 GB: 112 MB/s (excellent, saturating the interface)
  • File size greater than 5 GB: 10 MB/s (very, very slow)

Initially, I thought the bottleneck was SSH and the MTU. But despite enabling jumbo frames (MTU 9000) and trying other transfer tools such as netcat, the transfer speed did not change.

The bottleneck has to be ZFS. After reading the code in zfs_log.c, I was convinced that zfs_immediate_write_sz was the problem; there is a hard-coded limit of 4K in there. But even after tweaking that value, I still saw no performance gain.
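
(For reference, on ZoL builds that expose it, zfs_immediate_write_sz is a runtime module parameter, so it can be changed without rebuilding; the 128K value below is only an example.)

    # assumption: the parameter is exposed under /sys/module/zfs/parameters on this build
    echo 131072 > /sys/module/zfs/parameters/zfs_immediate_write_sz
    cat /sys/module/zfs/parameters/zfs_immediate_write_sz    # verify the new value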

@ryao
Contributor

ryao commented Sep 29, 2014

@lintonv Thanks for letting us know about this; it was previously unknown to us. It might not get immediate attention, but it will be examined. Which version of ZoL are you using, and what are your distribution and kernel versions?

@lintonv
Author

lintonv commented Sep 29, 2014

@ryao Thank you for the response. Here is the info you requested -
ZoL version : 0.6.3
Linux Distro : CentOS 6.3
Linux Kernel : 2.6.32-279.el6.x86_64

I am willing to help with the patch. I just need more insight into what in send or recv may be the bottleneck. I have done a lot of investigation on this, and possible culprits include:

  1. ZIL
  2. Inflight IO size

@behlendorf
Contributor

@lintonv It would be great if you could help narrow this down. I'd suggest starting by running the following tests.

  1. Determine whether it's zfs send or zfs recv that's causing the performance bottleneck. You can do this by sending the stream to /dev/null, either on the local host or on the remote side of the network socket.
  2. Assuming zfs send is the limitation, watch the output of iostat -mx to determine whether the system is IO bound, and top or perf top to see whether it is CPU bound.

That should give a decent idea of where to look next.
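
A concrete way to run those checks might look like this (pool, dataset, snapshot, and host names are placeholders; pv is optional but gives a live rate):

    # 1a. Is zfs send the bottleneck? Stream locally to /dev/null and watch the rate.
    zfs send pool/fs@snap | pv > /dev/null

    # 1b. Is the network path the bottleneck? Push raw data over the same link.
    dd if=/dev/zero bs=1M count=2048 | ssh otherhost 'cat > /dev/null'

    # 1c. Is zfs recv the bottleneck? Feed a pre-generated stream file straight in.
    zfs recv pool/fs_copy < /tmp/stream_file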

@lintonv
Author

lintonv commented Sep 29, 2014

@behlendorf Thank you for your input.

I think the bottleneck is in ZFS send. Here's how I tested that.

  1. I created a send stream for that volume ('zfs send /vol/pvol > send_stream'). This took around 15 minutes.
  2. I then used SCP to copy that stream over to the next machine (the transfer rate was 119MB/sec).
  3. I then piped send_stream into zfs recv ('zfs recv /vol/svol < send_stream'). This completed in approximately 60 seconds.

Here is what I see while trying to transfer a 26G volume. Transferring the same disk.vmdk (26G) using SCP gives 119MB/sec.

With ZFS send/recv, I saw speeds as low as 7MB/sec.

Here are the results of TOP and IOSTAT for that workload -

TOP from send machine (columns: PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) -

919 root 39 19 0 0 0 S 32.2 0.0 14:39.52 spl_kmem_cache/
6992 root 20 0 68536 4320 1740 S 25.9 0.0 0:28.65 ssh
927 root 0 -20 0 0 0 D 4.0 0.0 0:16.22 spl_system_task
6991 root 20 0 123m 1568 1200 D 4.0 0.0 0:03.12 zfs
98 root 20 0 0 0 0 R 3.0 0.0 0:48.45 kswapd0
1643 root 10 -10 0 0 0 S 1.3 0.0 1:01.86 aoe_tx0

IOSTAT from send machine -
avg-cpu: %user %nice %system %iowait %steal %idle
2.83 0.00 5.53 0.00 0.00 91.63

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.0 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.11 0.32 0.32 11.10
etherd!e5000.5 0.00 0.00 348.00 0.00 1.36 0.00 8.00 0.09 0.24 0.24 8.50
etherd!e5000.6 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.12 0.34 0.34 11.60
etherd!e5000.7 0.00 0.00 348.00 0.00 1.36 0.00 8.00 0.11 0.33 0.33 11.50
etherd!e5000.1 0.00 0.00 347.00 0.00 1.36 0.00 8.00 0.11 0.32 0.33 11.30
etherd!e5000.2 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.09 0.25 0.25 8.70
etherd!e5000.3 0.00 0.00 347.00 0.00 1.36 0.00 8.00 0.10 0.29 0.29 10.20
etherd!e5000.4 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.09 0.27 0.27 9.30
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

TOP from recv machine (columns: PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) -

4824 root 20 0 99.9m 8256 3020 S 11.6 0.0 0:25.05 sshd
14826 root 20 0 123m 1528 1160 S 4.6 0.0 0:11.38 zfs
2406 root 38 18 0 0 0 S 1.3 0.0 1:35.22 z_wr_iss/0
2408 root 38 18 0 0 0 S 1.3 0.0 1:35.86 z_wr_iss/2
2409 root 38 18 0 0 0 S 1.3 0.0 1:35.14 z_wr_iss/3
2410 root 38 18 0 0 0 S 1.3 0.0 1:35.75 z_wr_iss/4
2411 root 38 18 0 0 0 S 1.3 0.0 1:34.89 z_wr_iss/5
2112 root 10 -10 0 0 0 S 1.0 0.0 1:30.60 aoe_tx0
2407 root 38 18 0 0 0 S 1.0 0.0 1:35.46 z_wr_iss/1
2486 root 0 -20 0 0 0 S 1.0 0.0 4:16.85 txg_sync
2420 root 39 19 0 0 0 S 0.7 0.0 0:38.39 z_wr_int/3
2424 root 39 19 0 0 0 S 0.7 0.0 0:38.31 z_wr_int/7
2425 root 39 19 0 0 0 S 0.7 0.0 0:38.56 z_wr_int/8
2428 root 39 19 0 0 0 S 0.7 0.0 0:38.44 z_wr_int/11
2430 root 39 19 0 0 0 S 0.7 0.0 0:38.48 z_wr_int/13

IOSTAT from recv machine -

avg-cpu: %user %nice %system %iowait %steal %idle
0.64 0.00 7.68 3.20 0.00 88.48

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.0 0.00 0.00 0.00 1339.00 0.00 10.39 15.89 2.82 2.11 0.20 26.50
etherd!e5000.2 0.00 0.00 0.00 1347.00 0.00 10.42 15.85 2.81 2.08 0.20 27.40
etherd!e5000.3 0.00 0.00 0.00 1347.00 0.00 10.42 15.85 2.69 1.99 0.21 27.70
etherd!e5000.4 0.00 0.00 0.00 1347.00 0.00 10.42 15.84 2.82 2.09 0.22 29.80
etherd!e5000.1 0.00 0.00 0.00 1319.00 0.00 10.23 15.88 2.81 2.13 0.21 27.20
etherd!e5000.5 0.00 0.00 0.00 1326.00 0.00 10.26 15.84 2.81 2.12 0.21 28.30
etherd!e5000.6 0.00 0.00 0.00 1345.00 0.00 10.41 15.85 2.71 2.02 0.22 29.00
etherd!e5000.7 0.00 0.00 0.00 1345.00 0.00 10.41 15.85 2.91 2.16 0.21 28.80
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

@ryao
Contributor

ryao commented Sep 29, 2014

@lintonv Would you examine some of these large files to see if du -sh --apparent-size /path/to/file and du -sh /path/to/file differ? If they do, then the file has holes. This is a known issue that will be fixed in 0.6.4.
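
(For a reproducible test case, a fully sparse file can be created in a second; the path and size below are placeholders.)

    truncate -s 10G /tank/fs/sparse_test             # allocates no blocks, so the file is all holes
    du -sh --apparent-size /tank/fs/sparse_test      # reports 10G
    du -sh /tank/fs/sparse_test                      # reports (nearly) zero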

@lintonv
Author

lintonv commented Sep 29, 2014

@ryao Yes, the sizes differ: the first command shows 50G, the second shows 27G. Could you give me the bug number for that fix? Also, do you have a timeline for when 0.6.4 will be officially released?

By the way, good catch!

@lintonv
Author

lintonv commented Sep 30, 2014

@behlendorf @ryao Do you have any more information on the bug, as to what the root cause is? Is there a patch available? If not, could you give me the bug number?
Thank you both.

@behlendorf
Contributor

It does sound like the issue is on the send side. So, to summarize your results:

  • zfs send -> stream_file, 29MB/s
  • scp stream_file -> network, 119MB/s
  • stream_file -> zfs recv, 443 MB/s

The interesting thing I see on the zfs send side is that the average read request size is very small (4k). Normally I'd say this explains the performance issue, with the small IO sizes being due either to very small files or a highly fragmented pool. But according to iostat the drives aren't 100% utilized as I'd expect, and that's a bit surprising.

Since you've narrowed it down to the zfs send side, I'd be interested to see the IO performance over time. I'd run zfs send on the same dataset as a test and pipe the stream to /dev/null. While the test is running, run iostat -mx 10 in another window; that will give you 10-second samples of the disk performance. Are the reported rMB/s, avgrq-sz, and %util relatively constant for the life of the test?
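
Something like this, in two terminals (dataset and snapshot names are placeholders):

    # terminal 1: stream the snapshot to /dev/null so only the read path is exercised
    zfs send pool/fs@snap > /dev/null

    # terminal 2: 10-second samples of per-device throughput, request size, and utilization
    iostat -mx 10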

Given the information in this bug, it's not clear to me that any of the changes in what will become 0.6.4 will improve this.

@ryao
Contributor

ryao commented Sep 30, 2014

@lintonv @behlendorf The hole_birth feature coming in 0.6.4 is intended to resolve this:

b0bc7a8

It will require zpool upgrade before it takes effect and afterward, it will no longer be possible to use 0.6.3 with the pool. I suggest waiting for the 0.6.4 release.
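
(Once 0.6.4 is running, the feature can be enabled either wholesale or individually; the pool name below is a placeholder. As noted above, 0.6.3 will no longer be able to import the pool afterward.)

    zpool upgrade tank                            # enable all features supported by the running version
    zpool set feature@hole_birth=enabled tank     # or enable just this one feature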

@behlendorf
Contributor

@ryao The hole birth feature will only benefit incremental sends. It doesn't sound like that's the case here. Still it would be interesting to see if there's any benefit when 0.6.4 is tagged (perhaps a month).

@lintonv
Author

lintonv commented Sep 30, 2014

@behlendorf Thank you. How did you get those numbers for zfs recv? 443 MB/s?

Here is the information you requested. I am not attaching the whole output as it is too big, but there is plenty of information below:

avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 2.83 0.03 0.00 97.13

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.30 0.00 1.30 0.00 0.01 9.85 0.00 0.23 0.23 0.03
etherd!e5000.0 0.00 0.00 402.80 0.00 1.57 0.00 8.00 0.12 0.29 0.29 11.76
etherd!e5000.5 0.00 0.00 400.20 0.00 1.56 0.00 8.00 0.12 0.31 0.31 12.43
etherd!e5000.6 0.00 0.00 402.50 0.00 1.57 0.00 8.00 0.12 0.30 0.30 11.90
etherd!e5000.7 0.00 0.00 400.80 0.00 1.57 0.00 8.00 0.11 0.27 0.27 11.00
etherd!e5000.1 0.00 0.00 401.40 0.00 1.57 0.00 8.00 0.12 0.29 0.29 11.60
etherd!e5000.2 0.00 0.00 403.40 0.00 1.58 0.00 8.00 0.12 0.29 0.29 11.75
etherd!e5000.3 0.00 0.00 400.00 0.00 1.56 0.00 8.00 0.12 0.29 0.29 11.64
etherd!e5000.4 0.00 0.00 402.70 0.00 1.57 0.00 8.00 0.12 0.30 0.30 11.89
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 3.96 0.01 0.00 96.03

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.20 0.10 0.60 0.00 0.00 11.43 0.00 0.71 0.71 0.05
etherd!e5000.0 0.00 0.00 353.70 0.00 1.38 0.00 8.00 0.11 0.31 0.31 10.85
etherd!e5000.5 0.00 0.00 350.00 0.00 1.37 0.00 8.00 0.11 0.30 0.30 10.55
etherd!e5000.6 0.00 0.00 354.10 0.00 1.38 0.00 8.00 0.10 0.29 0.29 10.32
etherd!e5000.7 0.00 0.00 351.00 0.00 1.37 0.00 8.00 0.10 0.29 0.29 10.08
etherd!e5000.1 0.00 0.00 350.90 0.00 1.37 0.00 8.00 0.11 0.30 0.30 10.63
etherd!e5000.2 0.00 0.00 354.80 0.00 1.39 0.00 8.00 0.11 0.30 0.30 10.64
etherd!e5000.3 0.00 0.00 350.40 0.00 1.37 0.00 8.00 0.10 0.29 0.29 10.13
etherd!e5000.4 0.00 0.00 355.90 0.00 1.39 0.00 8.00 0.10 0.29 0.29 10.40
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 3.77 0.01 0.00 96.22

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.70 0.00 0.70 0.00 0.01 16.00 0.00 0.71 0.71 0.05
etherd!e5000.0 0.00 0.00 372.40 0.00 1.45 0.00 8.00 0.11 0.30 0.29 10.95
etherd!e5000.5 0.00 0.00 367.50 0.00 1.44 0.00 8.00 0.11 0.30 0.30 10.85
etherd!e5000.6 0.00 0.00 372.30 0.00 1.45 0.00 8.00 0.11 0.30 0.30 11.25
etherd!e5000.7 0.00 0.00 366.90 0.00 1.43 0.00 8.00 0.11 0.30 0.30 11.11
etherd!e5000.1 0.00 0.00 367.70 0.00 1.44 0.00 8.00 0.11 0.30 0.30 10.87
etherd!e5000.2 0.00 0.00 372.10 0.00 1.45 0.00 8.00 0.11 0.28 0.28 10.56
etherd!e5000.3 0.00 0.00 367.30 0.00 1.43 0.00 8.00 0.11 0.29 0.29 10.79
etherd!e5000.4 0.00 0.00 371.40 0.00 1.45 0.00 8.00 0.11 0.29 0.29 10.76
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 3.10 0.01 0.00 96.87

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.30 0.00 0.40 0.00 0.00 14.00 0.00 0.25 0.25 0.01
etherd!e5000.0 0.00 0.00 224.10 0.00 0.88 0.00 8.00 0.07 0.31 0.31 6.94
etherd!e5000.5 0.00 0.00 222.40 0.00 0.87 0.00 8.00 0.07 0.30 0.30 6.65
etherd!e5000.6 0.00 0.00 223.30 0.00 0.87 0.00 8.00 0.07 0.32 0.32 7.14
etherd!e5000.7 0.00 0.00 222.00 0.00 0.87 0.00 8.00 0.07 0.29 0.29 6.51
etherd!e5000.1 0.00 0.00 221.20 0.00 0.86 0.00 8.00 0.07 0.30 0.30 6.68
etherd!e5000.2 0.00 0.00 224.30 0.00 0.88 0.00 8.00 0.07 0.29 0.29 6.52
etherd!e5000.3 0.00 0.00 220.90 0.00 0.86 0.00 8.00 0.06 0.29 0.29 6.32
etherd!e5000.4 0.00 0.00 223.00 0.00 0.87 0.00 8.00 0.07 0.29 0.29 6.50
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 0.01 0.00 99.99

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.20 0.00 0.80 0.00 0.00 10.00 0.00 1.50 0.62 0.05
etherd!e5000.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

@ryao Thank you as well. The additional bug info was helpful for understanding the fix. That looks like a huge fix! Have you run any tests or performance tests to validate it?

@ryao @behlendorf At this point, I am curious to know if you both agree that this will be fixed by the hole_birth feature.

@ryao
Contributor

ryao commented Oct 1, 2014

@ryao The hole birth feature will only benefit incremental sends. It doesn't sound like that's the case here. Still it would be interesting to see if there's any benefit when 0.6.4 is tagged (perhaps a month).

If there are N snapshots, then sending the latest should be equivalent to sending the oldest plus N-1 incremental sends. That said, pull request #2729 might also help.

@ryao
Contributor

ryao commented Oct 1, 2014

@lintonv If you have multiple snapshots and are sending the latest, then it is quite likely that this will be fixed by the hole_birth feature. There is another bug that might also affect you which is fixed by #2729 though. It is possible both are issues in your situation.

@lintonv
Author

lintonv commented Oct 1, 2014

@ryao @behlendorf

Two important questions :

  1. I am in a production environment and run zfs 0.6.3. If I apply the patches (hole_birth and #2729) and upgrade my zpool, what are the consequences? When 0.6.4 officially comes out, will I be able to upgrade to it without issues?
  2. If there are issues, is there an alternative way to do a clean upgrade to 0.6.4, keeping in mind that I might be running 0.6.3 with those patches?

The transfer speed is a major problem for us, but avoiding problems with future upgrades of our pools in a production environment is also a priority.

@behlendorf
Contributor

@lintonv I'd advise against cherry-picking the patches for hole_birth. They have some significant dependencies and it would be easy to accidentally get something wrong. If you need to run with these changes, I think you'd be better off grabbing the source from master and using that until 0.6.4 is tagged.

The master branch sees a significant amount of real-world use and we try very hard to ensure it's always in a stable state. You're also less likely to have issues updating if you stick with the master branch rather than rolling your own thing.

As for what to expect from the hole_birth feature, it's well described in the updated man page. At a high level, this feature will help you if your datasets contain sparse files and you're doing incremental send/recvs. It will also only apply to snapshots created after the feature is enabled. See:

       hole_birth

           GUID                   com.delphix:hole_birth
           READ-ONLY COMPATIBLE   no
           DEPENDENCIES           enabled_txg

           This feature improves performance of incremental sends  ("zfs  send
           -i") and receives for objects with many holes. The most common case
           of hole-filled objects is zvols.

           An incremental send stream from snapshot A to snapshot  B  contains
           information  about every block that changed between A and B. Blocks
           which did not change between those snapshots can be identified  and
           omitted from the stream using a piece of metadata called the 'block
           birth time', but birth times are not  recorded  for  holes  (blocks
           filled  only  with  zeroes).  Since holes created after A cannot be
           distinguished from holes created before A, information about  every
           hole  in  the  entire  filesystem  or  zvol is included in the send
           stream.

           For workloads where holes are rare this is not a problem.  However,
           when incrementally replicating filesystems or zvols with many holes
           (for example a zvol formatted with another  filesystem)  a  lot  of
           time  will  be  spent sending and receiving unnecessary information
           about holes that already exist on the receiving side.

           Once the hole_birth feature has been enabled the block birth  times
           of  all new holes will be recorded. Incremental sends between snap-
           shots created after this feature is enabled will use this new meta-
           data to avoid sending information about holes that already exist on
           the receiving side.

           This feature becomes active as soon as it is enabled and will never
           return to being enabled.
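
You can check the state of the feature on a given pool with (pool name is a placeholder):

    zpool get feature@hole_birth tank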

@dswartz
Contributor

dswartz commented Oct 2, 2014

Brian, is the following a typo?

"This feature becomes active as soon as it is enabled and will never
return to being enabled."

Shouldn't the last word be 'disabled'?

@behlendorf
Contributor

@dswartz Actually that's correct. A feature may be enabled, active, or disabled. See zpool-features(5).

   Feature states
       Features can be in one of three states:

       active
                   This feature's on-disk format changes are in effect on  the
                   pool.  Support  for  this feature is required to import the
                   pool in read-write mode. If this feature is  not  read-only
                   compatible,  support is also required to import the pool in
                   read-only mode (see "Read-only compatibility").

       enabled
                   An administrator has marked this feature as enabled on  the
                   pool,  but  the  feature's  on-disk format changes have not
                   been made yet. The pool can still be imported  by  software
                   that does not support this feature, but changes may be made
                   to the on-disk format at any time which will move the  fea-
                   ture to the active state. Some features may support return-
                   ing to the enabled state after becoming  active.  See  fea-
                   ture-specific documentation for details.

       disabled
                   This  feature's  on-disk  format changes have not been made
                   and will not be made unless an administrator moves the fea-
                   ture to the enabled state. Features cannot be disabled once
                   they have been enabled.

@kpande The birth txg is stored in the block pointer for the hole. I haven't double checked the source but it should be set for any newly created holes.

@lintonv
Author

lintonv commented Oct 3, 2014

@behlendorf @ryao Thank you for the detailed information.

I did some testing on the version of the ZFS code in git, which has both the hole_birth and #2729 fixes. Unfortunately, it does not seem to fix the transfer speeds.

The number of snapshots on the system did not appear to make a difference, nor does it seem to matter whether the send is incremental.

@lintonv
Author

lintonv commented Oct 7, 2014

@ryao Was the hole_birth code tested against files that have holes? The bug I am seeing is easily reproducible.

behlendorf added the Type: Performance (Performance improvement or performance problem) and Difficulty - Medium labels on Oct 7, 2014
@lintonv
Author

lintonv commented Dec 4, 2014

I need to make a few corrections, which will hopefully help get to the root cause -

  1. I am not dealing with ZFS volumes but ZFS filesystems: I am running zfs send/recv on ZFS filesystem snapshots.
  2. I have been wary of the stability of zvols, so I have been using a ZFS filesystem, creating a file on it, and presenting that file to the upper layers (above the zpool) as a volume.

@lintonv
Author

lintonv commented Feb 20, 2015

@behlendorf

Just wanted to update you on some results from further testing and debug this week.

We were using a record size of 4K on the ZFS filesystems we use for send/receive. By increasing this to 128K, we saw the send speed increase from 11 MB/sec to 38 MB/sec.
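
(For reference, recordsize is a per-dataset property and only affects data written after the change; existing files keep their old block size until rewritten. The dataset name below is a placeholder.)

    zfs set recordsize=128K tank/fs
    zfs get recordsize tank/fs      # confirm the new value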

Could you let me know what other parameters below could be affecting the speeds we are seeing? We should be getting at least 120 MB/sec.

Here are the pool parameters :

NAME PROPERTY VALUE SOURCE
ssn-0-12-36 size 5.81T -
ssn-0-12-36 capacity 1% -
ssn-0-12-36 altroot - default
ssn-0-12-36 health ONLINE -
ssn-0-12-36 guid 16967859640226627487 default
ssn-0-12-36 version - default
ssn-0-12-36 bootfs - default
ssn-0-12-36 delegation on default
ssn-0-12-36 autoreplace off default
ssn-0-12-36 cachefile - default
ssn-0-12-36 failmode wait default
ssn-0-12-36 listsnapshots off default
ssn-0-12-36 autoexpand off default
ssn-0-12-36 dedupditto 0 default
ssn-0-12-36 dedupratio 1.00x -
ssn-0-12-36 free 5.72T -
ssn-0-12-36 allocated 91.2G -
ssn-0-12-36 readonly off -
ssn-0-12-36 ashift 12 local
ssn-0-12-36 comment - default
ssn-0-12-36 expandsize 0 -
ssn-0-12-36 freeing 0 default
ssn-0-12-36 fragmentation 0% -
ssn-0-12-36 leaked 0 default
ssn-0-12-36 feature@async_destroy enabled local
ssn-0-12-36 feature@empty_bpobj active local
ssn-0-12-36 feature@lz4_compress active local
ssn-0-12-36 feature@spacemap_histogram active local
ssn-0-12-36 feature@enabled_txg active local
ssn-0-12-36 feature@hole_birth active local
ssn-0-12-36 feature@extensible_dataset enabled local
ssn-0-12-36 feature@embedded_data active local
ssn-0-12-36 feature@bookmarks enabled local

Here is the parameter of one FS in the pool above:

[root@ssn-0-12-36 tmp]# zfs get all ssn-0-12-36/g_2G
NAME PROPERTY VALUE SOURCE
ssn-0-12-36/g_2G type filesystem -
ssn-0-12-36/g_2G creation Fri Feb 20 15:05 2015 -
ssn-0-12-36/g_2G used 162K -
ssn-0-12-36/g_2G available 4.74T -
ssn-0-12-36/g_2G referenced 162K -
ssn-0-12-36/g_2G compressratio 1.00x -
ssn-0-12-36/g_2G mounted yes -
ssn-0-12-36/g_2G quota none default
ssn-0-12-36/g_2G reservation none default
ssn-0-12-36/g_2G recordsize 128K local
ssn-0-12-36/g_2G mountpoint /ssn-0-12-36/g_2G default
ssn-0-12-36/g_2G sharenfs off default
ssn-0-12-36/g_2G checksum on default
ssn-0-12-36/g_2G compression off default
ssn-0-12-36/g_2G atime off local
ssn-0-12-36/g_2G devices on default
ssn-0-12-36/g_2G exec on default
ssn-0-12-36/g_2G setuid on default
ssn-0-12-36/g_2G readonly off default
ssn-0-12-36/g_2G zoned off default
ssn-0-12-36/g_2G snapdir hidden default
ssn-0-12-36/g_2G aclinherit restricted default
ssn-0-12-36/g_2G canmount on default
ssn-0-12-36/g_2G xattr on default
ssn-0-12-36/g_2G copies 1 default
ssn-0-12-36/g_2G version 5 -
ssn-0-12-36/g_2G utf8only off -
ssn-0-12-36/g_2G normalization none -
ssn-0-12-36/g_2G casesensitivity sensitive -
ssn-0-12-36/g_2G vscan off default
ssn-0-12-36/g_2G nbmand off default
ssn-0-12-36/g_2G sharesmb off default
ssn-0-12-36/g_2G refquota none default
ssn-0-12-36/g_2G refreservation none default
ssn-0-12-36/g_2G primarycache metadata local
ssn-0-12-36/g_2G secondarycache metadata local
ssn-0-12-36/g_2G usedbysnapshots 0 -
ssn-0-12-36/g_2G usedbydataset 162K -
ssn-0-12-36/g_2G usedbychildren 0 -
ssn-0-12-36/g_2G usedbyrefreservation 0 -
ssn-0-12-36/g_2G logbias latency default
ssn-0-12-36/g_2G dedup off default
ssn-0-12-36/g_2G mlslabel none default
ssn-0-12-36/g_2G sync standard default
ssn-0-12-36/g_2G refcompressratio 1.00x -
ssn-0-12-36/g_2G written 162K -
ssn-0-12-36/g_2G logicalused 12.5K -
ssn-0-12-36/g_2G logicalreferenced 12.5K -
ssn-0-12-36/g_2G snapdev hidden default
ssn-0-12-36/g_2G acltype off default
ssn-0-12-36/g_2G context none default
ssn-0-12-36/g_2G fscontext none default
ssn-0-12-36/g_2G defcontext none default
ssn-0-12-36/g_2G rootcontext none default
ssn-0-12-36/g_2G relatime off default
ssn-0-12-36/g_2G redundant_metadata all default
ssn-0-12-36/g_2G overlay off default
ssn-0-12-36/g_2G ssn:snap_reserve_percentage 100 local
ssn-0-12-36/g_2G ssn:template 0 local
ssn-0-12-36/g_2G ssn:snap_space_used 0 local
ssn-0-12-36/g_2G ssn:thin 0 local
ssn-0-12-36/g_2G ssn:snap_enabled TRUE local
ssn-0-12-36/g_2G ssn:snap_space_total 2048 local
ssn-0-12-36/g_2G ssn:vol_size 2048 local

@lintonv
Author

lintonv commented Feb 20, 2015

@behlendorf Just FYI, we are using 8 SSDs in a RAID-Z1 setup

@kernelOfTruth
Contributor

@lintonv please take a look at #3171

@lintonv
Author

lintonv commented Mar 16, 2015

@kernelOfTruth Thank you, but the bottleneck I am seeing looks like the ZFS send code. The main problem is that it is a highly serial operation as it reads the blocks from disk and builds the send stream.

I had a ZFS-DISCUSSION thread on this and the Delphix team is working on a prefetch mechanism that will ensure the blocks already exist in cache to mitigate the serial nature of the SEND operation.

Here is a pointer to that thread (look for Prakash Surya's responses):
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/OtWL-gO4y1o

Thoughts, comments?

@lintonv
Author

lintonv commented Apr 20, 2015

The root cause is as identified in the above comment.

But I found a way around the slow ZFS send performance. We increased the number of devices presented to the zpool by partitioning the disks: we use 8 SSDs and created 4 partitions on each, so the zpool sees a total of 32 devices. This reduces the impact of the highly serial ZFS send operation, as multiple threads are created in parallel (one per device) and the stream is built more quickly.

With this, we are saturating our 1GIG link consistently.
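
(One way to confirm the extra vdevs are being driven in parallel during a send is to watch per-device activity; the pool name below is a placeholder.)

    zpool iostat -v tank 5      # per-vdev read throughput, sampled every 5 seconds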

@behlendorf Do you want me to close this issue now?

@behlendorf
Contributor

@lintonv the fix for this from illumos, b738bc5, was merged into ZoL and is part of the 0.6.4 tag. I'm going to close this issue out, but if you have the time it would be great to verify that this change improves things for you even under the old configuration.

@lintonv
Author

lintonv commented Apr 27, 2015

@behlendorf
We reverted to the old configuration and tested with 0.6.4. I saw a 6x speed increase. This is fantastic! Thank you for the fix.

@AndCycle

I just ran into something probably related.

I found a filesystem on my system, used to store huge ISO files, that had accidentally been created with recordsize=16k in the past, so lots of files on it have a 16k dblk. rsync can saturate the bandwidth, but the send speed is as slow as 6MB/s under 0.6.4. I can't test whether there is any difference compared to 0.6.3 because I have already upgraded the pool.

@lintonv
Author

lintonv commented May 1, 2015

@AndCycle Were you able to determine where exactly the bottleneck is? Is it in 'zfs send' or 'zfs recv' or 'ssh'? (assuming you are using ssh for the actual transfer)

0.6.4 has changes for hole_birth and the prefetch amount. hole_birth should help with the transfer of files with holes, and the prefetch should help with the transfer of larger files (>1GB).

@AndCycle

AndCycle commented May 1, 2015

@lintonv My backup storage is attached via a SATA port multiplier. I know that's not great, but it's the affordable solution for me.

I will try to explain my case.

My system is an AMD FX-8350 with 32G of ECC memory, running personal services and a few VMs, and it usually has 10GB of memory spare for ZFS. It mostly has a heavy metadata workload due to the CrashPlan backup service.

I set arc_max to 4GB; anything larger than this triggers a reboot on my machine with 0.6.4. In the past I let it use up to 16GB, which was fine on 0.6.3, but that's no longer the case.

ztank was originally created on 0.6.3;
zbackup was newly created on 0.6.4.

ztank is 5TBx6 raidz2;
zbackup is 4TBx8 raidz1.

The iso fs is about 1TB.

I used zrep to back up ztank/prod/iso to zbackup/ztank-prod/iso and then found this performance issue; taking a few days for the initial backup of 1TB isn't reasonable.

After the backup was done, I renamed zbackup/ztank-prod/iso to zbackup/ztank-prod/iso-orig; iso-orig is the original fs containing the files written with the 16k recordsize.

Then I destroyed ztank/prod/iso, recreated it with the default 128k recordsize, rsynced to the new iso fs, and zrep'd ztank/prod/iso to zbackup/ztank-prod/iso, a clean copy with 128k recordsize and no history.

zrep does not keep the original snapshot set, so there is only one full snapshot at zbackup:

zbackup/ztank-prod/iso@zrep_000000
zbackup/ztank-prod/iso-orig@zrep_000000
zfs send zbackup/ztank-prod/iso-orig@zrep_000000 | pv -trab > /dev/null
 443MiB 0:00:30 [16.7MiB/s] [14.8MiB/s]

zfs send zbackup/ztank-prod/iso@zrep_000000 | pv -trab > /dev/null
1.93GiB 0:00:30 [62.6MiB/s] [65.8MiB/s]

rsync from iso-orig can saturate the bandwidth on the PCI eSATA card at 100MB/s; these are all DVD/BD iso files, so that is expected.

The bottleneck is still there in zfs send, but I am not sure whether this is related to your case.

@lintonv
Author

lintonv commented May 1, 2015

@AndCycle
Not sure if it is related.

In 0.6.3, without the new prefetch code (and tuning parameter), my ZFS send speeds were approximately 10MiB/s. In 0.6.4, with the prefetch code, I am seeing a 6x speedup, approximately 75MiB/s.

ZFS send is a highly serialized operation. The bottleneck (as I stated in one of the comments above) is the way the stream is built: it reads block by block and assembles the stream. The fix would be to parallelize the building of this stream.

The other workaround is to increase the number of partitions on your drives and present them all to the pool. That way you have multiple devices in the pool and you create the parallelism yourself. In this setup, I was able to saturate my 1GIG link (getting 112 MiB/s).

@denis-b

denis-b commented May 9, 2017

Hello,
I have the same problem on my server: Debian wheezy, kernel 3.16, zfsonlinux 0.6.5.7.
On a pool with 2 x 5 SATA disks (raidz2) and a 128 G SSD L2ARC, sending zvols is very slow, sometimes less than 100KB/s!
Sending datasets from the same pool runs at normal speed (50MB/s), and read operations on KVM VMs using zvols exported over iSCSI run at the same speed.

hole_birth is active, and disk I/O is very low.

@hcw70

hcw70 commented Oct 12, 2018

Actually, I also saw slow send/receive for my pool: about 10MB/s on an EVO850 SSD.

What helped was: https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

So, since I was cloning local partitions, I did:

zfs send -R rpool-mediasrv@move-to-8k | mbuffer -s 128k -m 1g | zfs receive -F -v rpool-mediasrv-4k

That gave me at least >50MB/s, until it reached my VBox zvol image, at which point it dropped to 7-25MB/s.

@Sawtaytoes

Sawtaytoes commented Jan 20, 2023

What's mbuffer and how did it help you?

I'm using TrueNAS and don't have access to the command itself, but my transfers are going at 1/3rd the VPN speed even though SMB transfers (over VPN) are much faster. I've tried two different links; one local over VPN and another offsite over VPN. One case is 1/2 the link speed (10MB/s), the other case is 1/3rd the link speed (33MB/s).

I'd like to know if it's something to do with latency when doing a zfs send over SSH.

@ryao
Contributor

ryao commented Jan 23, 2023

mbuffer is a userland memory buffer. For example, you can do zfs send | ssh 'mbuffer -s 128k -m 1g | zfs receive ...' to cache things in mbuffer so that transfers over the network are not blocked by a buffer on the other end unless the buffer is so full that it does not make sense to keep receiving.
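
A common variant of that pattern puts a buffer on both ends and streams over a plain TCP port on a trusted network; the host, port, and dataset names below are placeholders:

    # receiving side: listen on a TCP port, buffer up to 1G, feed zfs receive
    mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup

    # sending side: buffer locally and stream to the receiver
    zfs send tank/data@snap | mbuffer -s 128k -m 1G -O receiver:9090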
