
zfs send/recv VERY slow for large files (>5GB) #2746

Closed
lintonv opened this issue Sep 29, 2014 · 35 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@lintonv

lintonv commented Sep 29, 2014

I have been using ZFS send/recv over SSH. The speeds are excellent for small files, but for large files (>5GB) they are terrible.

  • File size less than 5 GB: 112 MB/s (excellent, saturating the interface)
  • File size greater than 5 GB: 10 MB/s (very, very slow)

Initially, I thought the bottleneck was SSH and the MTU. But despite enabling jumbo frames (MTU 9000) and trying other transfer tools such as netcat, the transfer speed did not change.

The bottleneck has to be ZFS. After reading the code in zfs_log.c, I was convinced that zfs_immediate_write_sz was the problem; there is a hard-coded limit of 4K in there. But even after tweaking that value, I still saw no performance gain.
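
(For reference, on ZoL builds that expose it, zfs_immediate_write_sz is a runtime module parameter, so it can be changed without rebuilding; the 128K value below is only an example.)

    # assumption: the parameter is exposed under /sys/module/zfs/parameters on this build
    echo 131072 > /sys/module/zfs/parameters/zfs_immediate_write_sz
    cat /sys/module/zfs/parameters/zfs_immediate_write_sz    # verify the new value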

@ryao
Contributor

ryao commented Sep 29, 2014

@lintonv Thanks for letting us know about this; it was previously unknown to us. It might not get immediate attention, but it will be examined. Which version of ZoL are you using, and what are your distribution and kernel versions?

@lintonv
Author

lintonv commented Sep 29, 2014

@ryao Thank you for the response. Here is the info you requested -
ZoL version : 0.6.3
Linux Distro : CentOS 6.3
Linux Kernel : 2.6.32-279.el6.x86_64

I am willing to help with the patch. I just need more insight into what in send or recv may be the bottleneck. I have done a lot of investigation on this, and possible culprits include:

  1. ZIL
  2. Inflight IO size

@behlendorf
Contributor

@lintonv It would be great if you could help narrow this down. I'd suggest starting by running the following tests.

  1. Determine whether it's zfs send or zfs recv that's causing the performance bottleneck. You can do this by sending the stream to /dev/null, either on the local host or on the remote side of the network socket.
  2. Assuming zfs send is the limitation, watch the output of iostat -mx to determine whether the system is IO bound, and top or perf top to see whether it is CPU bound.

That should give a decent idea of where to look next.
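
A concrete way to run those checks might look like this (pool, dataset, snapshot, and host names are placeholders; pv is optional but gives a live rate):

    # 1a. Is zfs send the bottleneck? Stream locally to /dev/null and watch the rate.
    zfs send pool/fs@snap | pv > /dev/null

    # 1b. Is the network path the bottleneck? Push raw data over the same link.
    dd if=/dev/zero bs=1M count=2048 | ssh otherhost 'cat > /dev/null'

    # 1c. Is zfs recv the bottleneck? Feed a pre-generated stream file straight in.
    zfs recv pool/fs_copy < /tmp/stream_file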

@lintonv
Author

lintonv commented Sep 29, 2014

@behlendorf Thank you for your input.

I think the bottleneck is in ZFS send. Here's how I tested that.

  1. I created a send stream for that volume ('zfs send /vol/pvol > send_stream'). This took around 15 minutes.
  2. I then used SCP to copy that stream over to the next machine (the transfer rate was 119MB/sec).
  3. I then piped send_stream into zfs recv ('zfs recv /vol/svol < send_stream'). This completed in approximately 60 seconds.

Here is what I see while trying to transfer a 26G volume. Transferring the same disk.vmdk (26G) using SCP gives 119MB/sec.

With ZFS send/recv, I saw speeds as low as 7MB/sec.

Here are the results of TOP and IOSTAT for that workload -

TOP from send machine (columns: PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) -

919 root 39 19 0 0 0 S 32.2 0.0 14:39.52 spl_kmem_cache/
6992 root 20 0 68536 4320 1740 S 25.9 0.0 0:28.65 ssh
927 root 0 -20 0 0 0 D 4.0 0.0 0:16.22 spl_system_task
6991 root 20 0 123m 1568 1200 D 4.0 0.0 0:03.12 zfs
98 root 20 0 0 0 0 R 3.0 0.0 0:48.45 kswapd0
1643 root 10 -10 0 0 0 S 1.3 0.0 1:01.86 aoe_tx0

IOSTAT from send machine -
avg-cpu: %user %nice %system %iowait %steal %idle
2.83 0.00 5.53 0.00 0.00 91.63

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.0 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.11 0.32 0.32 11.10
etherd!e5000.5 0.00 0.00 348.00 0.00 1.36 0.00 8.00 0.09 0.24 0.24 8.50
etherd!e5000.6 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.12 0.34 0.34 11.60
etherd!e5000.7 0.00 0.00 348.00 0.00 1.36 0.00 8.00 0.11 0.33 0.33 11.50
etherd!e5000.1 0.00 0.00 347.00 0.00 1.36 0.00 8.00 0.11 0.32 0.33 11.30
etherd!e5000.2 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.09 0.25 0.25 8.70
etherd!e5000.3 0.00 0.00 347.00 0.00 1.36 0.00 8.00 0.10 0.29 0.29 10.20
etherd!e5000.4 0.00 0.00 342.00 0.00 1.34 0.00 8.00 0.09 0.27 0.27 9.30
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

TOP from recv machine (columns: PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND) -

4824 root 20 0 99.9m 8256 3020 S 11.6 0.0 0:25.05 sshd
14826 root 20 0 123m 1528 1160 S 4.6 0.0 0:11.38 zfs
2406 root 38 18 0 0 0 S 1.3 0.0 1:35.22 z_wr_iss/0
2408 root 38 18 0 0 0 S 1.3 0.0 1:35.86 z_wr_iss/2
2409 root 38 18 0 0 0 S 1.3 0.0 1:35.14 z_wr_iss/3
2410 root 38 18 0 0 0 S 1.3 0.0 1:35.75 z_wr_iss/4
2411 root 38 18 0 0 0 S 1.3 0.0 1:34.89 z_wr_iss/5
2112 root 10 -10 0 0 0 S 1.0 0.0 1:30.60 aoe_tx0
2407 root 38 18 0 0 0 S 1.0 0.0 1:35.46 z_wr_iss/1
2486 root 0 -20 0 0 0 S 1.0 0.0 4:16.85 txg_sync
2420 root 39 19 0 0 0 S 0.7 0.0 0:38.39 z_wr_int/3
2424 root 39 19 0 0 0 S 0.7 0.0 0:38.31 z_wr_int/7
2425 root 39 19 0 0 0 S 0.7 0.0 0:38.56 z_wr_int/8
2428 root 39 19 0 0 0 S 0.7 0.0 0:38.44 z_wr_int/11
2430 root 39 19 0 0 0 S 0.7 0.0 0:38.48 z_wr_int/13

IOSTAT from recv machine -

avg-cpu: %user %nice %system %iowait %steal %idle
0.64 0.00 7.68 3.20 0.00 88.48

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.0 0.00 0.00 0.00 1339.00 0.00 10.39 15.89 2.82 2.11 0.20 26.50
etherd!e5000.2 0.00 0.00 0.00 1347.00 0.00 10.42 15.85 2.81 2.08 0.20 27.40
etherd!e5000.3 0.00 0.00 0.00 1347.00 0.00 10.42 15.85 2.69 1.99 0.21 27.70
etherd!e5000.4 0.00 0.00 0.00 1347.00 0.00 10.42 15.84 2.82 2.09 0.22 29.80
etherd!e5000.1 0.00 0.00 0.00 1319.00 0.00 10.23 15.88 2.81 2.13 0.21 27.20
etherd!e5000.5 0.00 0.00 0.00 1326.00 0.00 10.26 15.84 2.81 2.12 0.21 28.30
etherd!e5000.6 0.00 0.00 0.00 1345.00 0.00 10.41 15.85 2.71 2.02 0.22 29.00
etherd!e5000.7 0.00 0.00 0.00 1345.00 0.00 10.41 15.85 2.91 2.16 0.21 28.80
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

@ryao
Contributor

ryao commented Sep 29, 2014

@lintonv Would you examine some of these large files to see if du -sh --apparent-size /path/to/file and du -sh /path/to/file differ? If they do, then the file has holes. This is a known issue that will be fixed in 0.6.4.
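
(For a reproducible test case, a fully sparse file can be created in a second; the path and size below are placeholders.)

    truncate -s 10G /tank/fs/sparse_test             # allocates no blocks, so the file is all holes
    du -sh --apparent-size /tank/fs/sparse_test      # reports 10G
    du -sh /tank/fs/sparse_test                      # reports (nearly) zero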

@lintonv
Author

lintonv commented Sep 29, 2014

@ryao Yes, the sizes differ: the first command shows 50G, the second shows 27G. Could you give me the bug number for that fix? Also, do you have a timeline for when 0.6.4 will be officially released?

By the way, good catch!

@lintonv
Author

lintonv commented Sep 30, 2014

@behlendorf @ryao Do you have any more information on the bug, as to what the root cause is? Is there a patch available? If not, could you give me the bug number?
Thank you both.

@behlendorf
Contributor

It does sound like the issue is on the send side. So, to summarize your results:

  • zfs send -> stream_file, 29MB/s
  • scp stream_file -> network, 119MB/s
  • stream_file -> zfs recv, 443 MB/s

The interesting thing I see on the zfs send side is that the average read request size is very small (4k). Normally I'd say this explains the performance issue, with the small IO sizes being due either to very small files or a highly fragmented pool. But according to iostat the drives aren't 100% utilized as I'd expect, and that's a bit surprising.

Since you've narrowed it down to the zfs send side, I'd be interested to see the IO performance over time. I'd run zfs send on the same dataset as a test and pipe the stream to /dev/null. While the test is running, run iostat -mx 10 in another window; that will give you 10-second samples of the disk performance. Are the reported rMB/s, avgrq-sz, and %util relatively constant for the life of the test?
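
Something like this, in two terminals (dataset and snapshot names are placeholders):

    # terminal 1: stream the snapshot to /dev/null so only the read path is exercised
    zfs send pool/fs@snap > /dev/null

    # terminal 2: 10-second samples of per-device throughput, request size, and utilization
    iostat -mx 10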

Given the information in this bug, it's not clear to me that any of the changes in what will become 0.6.4 will improve this.

@ryao
Contributor

ryao commented Sep 30, 2014

@lintonv @behlendorf The hole_birth feature coming in 0.6.4 is intended to resolve this:

b0bc7a8

It will require zpool upgrade before it takes effect and afterward, it will no longer be possible to use 0.6.3 with the pool. I suggest waiting for the 0.6.4 release.
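
(Once 0.6.4 is running, the feature can be enabled either wholesale or individually; the pool name below is a placeholder. As noted above, 0.6.3 will no longer be able to import the pool afterward.)

    zpool upgrade tank                            # enable all features supported by the running version
    zpool set feature@hole_birth=enabled tank     # or enable just this one feature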

@behlendorf
Contributor

@ryao The hole birth feature will only benefit incremental sends. It doesn't sound like that's the case here. Still it would be interesting to see if there's any benefit when 0.6.4 is tagged (perhaps a month).

@lintonv
Author

lintonv commented Sep 30, 2014

@behlendorf Thank you. How did you get those numbers for zfs recv? 443 MB/s?

Here is the information you requested. I am not attaching the whole output as it is too big, but there is plenty of information below:

avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 2.83 0.03 0.00 97.13

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.30 0.00 1.30 0.00 0.01 9.85 0.00 0.23 0.23 0.03
etherd!e5000.0 0.00 0.00 402.80 0.00 1.57 0.00 8.00 0.12 0.29 0.29 11.76
etherd!e5000.5 0.00 0.00 400.20 0.00 1.56 0.00 8.00 0.12 0.31 0.31 12.43
etherd!e5000.6 0.00 0.00 402.50 0.00 1.57 0.00 8.00 0.12 0.30 0.30 11.90
etherd!e5000.7 0.00 0.00 400.80 0.00 1.57 0.00 8.00 0.11 0.27 0.27 11.00
etherd!e5000.1 0.00 0.00 401.40 0.00 1.57 0.00 8.00 0.12 0.29 0.29 11.60
etherd!e5000.2 0.00 0.00 403.40 0.00 1.58 0.00 8.00 0.12 0.29 0.29 11.75
etherd!e5000.3 0.00 0.00 400.00 0.00 1.56 0.00 8.00 0.12 0.29 0.29 11.64
etherd!e5000.4 0.00 0.00 402.70 0.00 1.57 0.00 8.00 0.12 0.30 0.30 11.89
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 3.96 0.01 0.00 96.03

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.20 0.10 0.60 0.00 0.00 11.43 0.00 0.71 0.71 0.05
etherd!e5000.0 0.00 0.00 353.70 0.00 1.38 0.00 8.00 0.11 0.31 0.31 10.85
etherd!e5000.5 0.00 0.00 350.00 0.00 1.37 0.00 8.00 0.11 0.30 0.30 10.55
etherd!e5000.6 0.00 0.00 354.10 0.00 1.38 0.00 8.00 0.10 0.29 0.29 10.32
etherd!e5000.7 0.00 0.00 351.00 0.00 1.37 0.00 8.00 0.10 0.29 0.29 10.08
etherd!e5000.1 0.00 0.00 350.90 0.00 1.37 0.00 8.00 0.11 0.30 0.30 10.63
etherd!e5000.2 0.00 0.00 354.80 0.00 1.39 0.00 8.00 0.11 0.30 0.30 10.64
etherd!e5000.3 0.00 0.00 350.40 0.00 1.37 0.00 8.00 0.10 0.29 0.29 10.13
etherd!e5000.4 0.00 0.00 355.90 0.00 1.39 0.00 8.00 0.10 0.29 0.29 10.40
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 3.77 0.01 0.00 96.22

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.70 0.00 0.70 0.00 0.01 16.00 0.00 0.71 0.71 0.05
etherd!e5000.0 0.00 0.00 372.40 0.00 1.45 0.00 8.00 0.11 0.30 0.29 10.95
etherd!e5000.5 0.00 0.00 367.50 0.00 1.44 0.00 8.00 0.11 0.30 0.30 10.85
etherd!e5000.6 0.00 0.00 372.30 0.00 1.45 0.00 8.00 0.11 0.30 0.30 11.25
etherd!e5000.7 0.00 0.00 366.90 0.00 1.43 0.00 8.00 0.11 0.30 0.30 11.11
etherd!e5000.1 0.00 0.00 367.70 0.00 1.44 0.00 8.00 0.11 0.30 0.30 10.87
etherd!e5000.2 0.00 0.00 372.10 0.00 1.45 0.00 8.00 0.11 0.28 0.28 10.56
etherd!e5000.3 0.00 0.00 367.30 0.00 1.43 0.00 8.00 0.11 0.29 0.29 10.79
etherd!e5000.4 0.00 0.00 371.40 0.00 1.45 0.00 8.00 0.11 0.29 0.29 10.76
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 3.10 0.01 0.00 96.87

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.30 0.00 0.40 0.00 0.00 14.00 0.00 0.25 0.25 0.01
etherd!e5000.0 0.00 0.00 224.10 0.00 0.88 0.00 8.00 0.07 0.31 0.31 6.94
etherd!e5000.5 0.00 0.00 222.40 0.00 0.87 0.00 8.00 0.07 0.30 0.30 6.65
etherd!e5000.6 0.00 0.00 223.30 0.00 0.87 0.00 8.00 0.07 0.32 0.32 7.14
etherd!e5000.7 0.00 0.00 222.00 0.00 0.87 0.00 8.00 0.07 0.29 0.29 6.51
etherd!e5000.1 0.00 0.00 221.20 0.00 0.86 0.00 8.00 0.07 0.30 0.30 6.68
etherd!e5000.2 0.00 0.00 224.30 0.00 0.88 0.00 8.00 0.07 0.29 0.29 6.52
etherd!e5000.3 0.00 0.00 220.90 0.00 0.86 0.00 8.00 0.06 0.29 0.29 6.32
etherd!e5000.4 0.00 0.00 223.00 0.00 0.87 0.00 8.00 0.07 0.29 0.29 6.50
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 0.01 0.00 99.99

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.20 0.00 0.80 0.00 0.00 10.00 0.00 1.50 0.62 0.05
etherd!e5000.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e5000.4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
etherd!e9999.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

@ryao Thank you as well. The additional bug info was helpful for understanding the fix. That looks like a huge fix! Have you run any tests or performance tests to validate it?

@ryao @behlendorf At this point, I am curious to know if you both agree that this will be fixed by the hole_birth feature.

@ryao
Contributor

ryao commented Oct 1, 2014

@ryao The hole birth feature will only benefit incremental sends. It doesn't sound like that's the case here. Still it would be interesting to see if there's any benefit when 0.6.4 is tagged (perhaps a month).

If there are N snapshots, then sending the latest should be equivalent to sending the oldest plus N-1 incremental sends. That said, pull request #2729 might also help.

@ryao
Contributor

ryao commented Oct 1, 2014

@lintonv If you have multiple snapshots and are sending the latest, then it is quite likely that this will be fixed by the hole_birth feature. There is another bug that might also affect you which is fixed by #2729 though. It is possible both are issues in your situation.

@lintonv
Author

lintonv commented Oct 1, 2014

@ryao @behlendorf

Two important questions :

  1. I am in a production environment and run zfs 0.6.3. If I apply the patches (hole_birth and #2729) and upgrade my zpool, what are the consequences? When 0.6.4 officially comes out, will I be able to upgrade to it without issues?
  2. If there are issues, is there an alternative way to do a clean upgrade to 0.6.4, keeping in mind that I might be running 0.6.3 with those patches?

The transfer speed is a major problem for us, but avoiding problems with future upgrades of our pools in a production environment is also a priority.

@behlendorf
Contributor

@lintonv I'd advise against cherry-picking the patches for hole_birth. They have some significant dependencies and it would be easy to accidentally get something wrong. If you need to run with these changes, I think you'd be better off grabbing the source from master and using that until 0.6.4 is tagged.

The master branch sees a significant amount of real-world use and we try very hard to ensure it's always in a stable state. You're also less likely to have issues updating if you stick with the master branch rather than rolling your own thing.

As for what to expect from the hole_birth feature, it's well described in the updated man page. At a high level, this feature will help you if your datasets contain sparse files and you're doing incremental send/recvs. It will also only apply to snapshots created after the feature is enabled. See:

       hole_birth

           GUID                   com.delphix:hole_birth
           READ-ONLY COMPATIBLE   no
           DEPENDENCIES           enabled_txg

           This feature improves performance of incremental sends  ("zfs  send
           -i") and receives for objects with many holes. The most common case
           of hole-filled objects is zvols.

           An incremental send stream from snapshot A to snapshot  B  contains
           information  about every block that changed between A and B. Blocks
           which did not change between those snapshots can be identified  and
           omitted from the stream using a piece of metadata called the 'block
           birth time', but birth times are not  recorded  for  holes  (blocks
           filled  only  with  zeroes).  Since holes created after A cannot be
           distinguished from holes created before A, information about  every
           hole  in  the  entire  filesystem  or  zvol is included in the send
           stream.

           For workloads where holes are rare this is not a problem.  However,
           when incrementally replicating filesystems or zvols with many holes
           (for example a zvol formatted with another  filesystem)  a  lot  of
           time  will  be  spent sending and receiving unnecessary information
           about holes that already exist on the receiving side.

           Once the hole_birth feature has been enabled the block birth  times
           of  all new holes will be recorded. Incremental sends between snap-
           shots created after this feature is enabled will use this new meta-
           data to avoid sending information about holes that already exist on
           the receiving side.

           This feature becomes active as soon as it is enabled and will never
           return to being enabled.
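
You can check the state of the feature on a given pool with (pool name is a placeholder):

    zpool get feature@hole_birth tank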

@dswartz
Contributor

dswartz commented Oct 2, 2014

Brian, is the following a typo?

"This feature becomes active as soon as it is enabled and will never
return to being enabled."

Shouldn't the last word be 'disabled'?

@behlendorf
Contributor

@dswartz Actually that's correct. A feature may be enabled, active, or disabled. See zpool-features(5).

   Feature states
       Features can be in one of three states:

       active
                   This feature's on-disk format changes are in effect on  the
                   pool.  Support  for  this feature is required to import the
                   pool in read-write mode. If this feature is  not  read-only
                   compatible,  support is also required to import the pool in
                   read-only mode (see "Read-only compatibility").

       enabled
                   An administrator has marked this feature as enabled on  the
                   pool,  but  the  feature's  on-disk format changes have not
                   been made yet. The pool can still be imported  by  software
                   that does not support this feature, but changes may be made
                   to the on-disk format at any time which will move the  fea-
                   ture to the active state. Some features may support return-
                   ing to the enabled state after becoming  active.  See  fea-
                   ture-specific documentation for details.

       disabled
                   This  feature's  on-disk  format changes have not been made
                   and will not be made unless an administrator moves the fea-
                   ture to the enabled state. Features cannot be disabled once
                   they have been enabled.

@kpande The birth txg is stored in the block pointer for the hole. I haven't double checked the source but it should be set for any newly created holes.

@lintonv
Author

lintonv commented Oct 3, 2014

@behlendorf @ryao Thank you for the detailed information.

I did some testing on the version of the ZFS code in git, which has both the hole_birth and #2729 fixes. Unfortunately, it does not seem to fix the transfer speeds.

The number of snapshots on the system did not appear to make a difference, nor does it seem to matter whether the send is incremental.

@lintonv
Author

lintonv commented Oct 7, 2014

@ryao Was the hole_birth code tested against files that have holes? The bug I am seeing is easily reproducible.

behlendorf added the Type: Performance (Performance improvement or performance problem) and Difficulty - Medium labels on Oct 7, 2014
@lintonv
Author

lintonv commented Dec 4, 2014

I need to make a few corrections, which will hopefully help get to the root cause -

  1. I am not dealing with ZFS volumes but ZFS filesystems: I am running zfs send/recv on ZFS filesystem snapshots.
  2. I have been wary of the stability of zvols, so I have been using a ZFS filesystem, creating a file on it, and presenting that file to the upper layers (above the zpool) as a volume.

@lintonv
Author

lintonv commented Feb 20, 2015

@behlendorf

Just wanted to update you on some results from further testing and debug this week.

We were using a record size of 4K on the ZFS filesystems we use for send/receive. By increasing this to 128K, we saw the send speed increase from 11 MB/sec to 38 MB/sec.
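
(For reference, recordsize is a per-dataset property and only affects data written after the change; existing files keep their old block size until rewritten. The dataset name below is a placeholder.)

    zfs set recordsize=128K tank/fs
    zfs get recordsize tank/fs      # confirm the new value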

Could you let me know what other parameters below could be affecting the speeds we are seeing? We should be getting at least 120 MB/sec.

Here are the pool parameters :

NAME PROPERTY VALUE SOURCE
ssn-0-12-36 size 5.81T -
ssn-0-12-36 capacity 1% -
ssn-0-12-36 altroot - default
ssn-0-12-36 health ONLINE -
ssn-0-12-36 guid 16967859640226627487 default
ssn-0-12-36 version - default
ssn-0-12-36 bootfs - default
ssn-0-12-36 delegation on default
ssn-0-12-36 autoreplace off default
ssn-0-12-36 cachefile - default
ssn-0-12-36 failmode wait default
ssn-0-12-36 listsnapshots off default
ssn-0-12-36 autoexpand off default
ssn-0-12-36 dedupditto 0 default
ssn-0-12-36 dedupratio 1.00x -
ssn-0-12-36 free 5.72T -
ssn-0-12-36 allocated 91.2G -
ssn-0-12-36 readonly off -
ssn-0-12-36 ashift 12 local
ssn-0-12-36 comment - default
ssn-0-12-36 expandsize 0 -
ssn-0-12-36 freeing 0 default
ssn-0-12-36 fragmentation 0% -
ssn-0-12-36 leaked 0 default
ssn-0-12-36 feature@async_destroy enabled local
ssn-0-12-36 feature@empty_bpobj active local
ssn-0-12-36 feature@lz4_compress active local
ssn-0-12-36 feature@spacemap_histogram active local
ssn-0-12-36 feature@enabled_txg active local
ssn-0-12-36 feature@hole_birth active local
ssn-0-12-36 feature@extensible_dataset enabled local
ssn-0-12-36 feature@embedded_data active local
ssn-0-12-36 feature@bookmarks enabled local

Here is the parameter of one FS in the pool above:

[root@ssn-0-12-36 tmp]# zfs get all ssn-0-12-36/g_2G
NAME PROPERTY VALUE SOURCE
ssn-0-12-36/g_2G type filesystem -
ssn-0-12-36/g_2G creation Fri Feb 20 15:05 2015 -
ssn-0-12-36/g_2G used 162K -
ssn-0-12-36/g_2G available 4.74T -
ssn-0-12-36/g_2G referenced 162K -
ssn-0-12-36/g_2G compressratio 1.00x -
ssn-0-12-36/g_2G mounted yes -
ssn-0-12-36/g_2G quota none default
ssn-0-12-36/g_2G reservation none default
ssn-0-12-36/g_2G recordsize 128K local
ssn-0-12-36/g_2G mountpoint /ssn-0-12-36/g_2G default
ssn-0-12-36/g_2G sharenfs off default
ssn-0-12-36/g_2G checksum on default
ssn-0-12-36/g_2G compression off default
ssn-0-12-36/g_2G atime off local
ssn-0-12-36/g_2G devices on default
ssn-0-12-36/g_2G exec on default
ssn-0-12-36/g_2G setuid on default
ssn-0-12-36/g_2G readonly off default
ssn-0-12-36/g_2G zoned off default
ssn-0-12-36/g_2G snapdir hidden default
ssn-0-12-36/g_2G aclinherit restricted default
ssn-0-12-36/g_2G canmount on default
ssn-0-12-36/g_2G xattr on default
ssn-0-12-36/g_2G copies 1 default
ssn-0-12-36/g_2G version 5 -
ssn-0-12-36/g_2G utf8only off -
ssn-0-12-36/g_2G normalization none -
ssn-0-12-36/g_2G casesensitivity sensitive -
ssn-0-12-36/g_2G vscan off default
ssn-0-12-36/g_2G nbmand off default
ssn-0-12-36/g_2G sharesmb off default
ssn-0-12-36/g_2G refquota none default
ssn-0-12-36/g_2G refreservation none default
ssn-0-12-36/g_2G primarycache metadata local
ssn-0-12-36/g_2G secondarycache metadata local
ssn-0-12-36/g_2G usedbysnapshots 0 -
ssn-0-12-36/g_2G usedbydataset 162K -
ssn-0-12-36/g_2G usedbychildren 0 -
ssn-0-12-36/g_2G usedbyrefreservation 0 -
ssn-0-12-36/g_2G logbias latency default
ssn-0-12-36/g_2G dedup off default
ssn-0-12-36/g_2G mlslabel none default
ssn-0-12-36/g_2G sync standard default
ssn-0-12-36/g_2G refcompressratio 1.00x -
ssn-0-12-36/g_2G written 162K -
ssn-0-12-36/g_2G logicalused 12.5K -
ssn-0-12-36/g_2G logicalreferenced 12.5K -
ssn-0-12-36/g_2G snapdev hidden default
ssn-0-12-36/g_2G acltype off default
ssn-0-12-36/g_2G context none default
ssn-0-12-36/g_2G fscontext none default
ssn-0-12-36/g_2G defcontext none default
ssn-0-12-36/g_2G rootcontext none default
ssn-0-12-36/g_2G relatime off default
ssn-0-12-36/g_2G redundant_metadata all default
ssn-0-12-36/g_2G overlay off default
ssn-0-12-36/g_2G ssn:snap_reserve_percentage 100 local
ssn-0-12-36/g_2G ssn:template 0 local
ssn-0-12-36/g_2G ssn:snap_space_used 0 local
ssn-0-12-36/g_2G ssn:thin 0 local
ssn-0-12-36/g_2G ssn:snap_enabled TRUE local
ssn-0-12-36/g_2G ssn:snap_space_total 2048 local
ssn-0-12-36/g_2G ssn:vol_size 2048 local

@lintonv
Author

lintonv commented Feb 20, 2015

@behlendorf Just FYI, we are using 8 SSDs in a RAID-Z1 setup

@kernelOfTruth
Contributor

@lintonv please take a look at #3171

@lintonv
Author

lintonv commented Mar 16, 2015

@kernelOfTruth Thank you, but the bottleneck I am seeing looks like the ZFS send code. The main problem is that it is a highly serial operation as it reads the blocks from disk and builds the send stream.

I had a ZFS-DISCUSSION thread on this and the Delphix team is working on a prefetch mechanism that will ensure the blocks already exist in cache to mitigate the serial nature of the SEND operation.

Here is a pointer to that thread (look for Prakash Surya's responses):
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/OtWL-gO4y1o

Thoughts, comments?

@lintonv
Author

lintonv commented Apr 20, 2015

The root cause is as identified in the above comment.

But I found a way around the slow ZFS send performance. We increased the number of devices presented to the zpool by partitioning the disks: we use 8 SSDs and created 4 partitions on each, so the zpool sees a total of 32 devices. This reduces the impact of the highly serial ZFS send operation, as multiple threads are created in parallel (one per device) and the stream is built more quickly.

With this, we are saturating our 1GIG link consistently.
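
(One way to confirm the extra vdevs are being driven in parallel during a send is to watch per-device activity; the pool name below is a placeholder.)

    zpool iostat -v tank 5      # per-vdev read throughput, sampled every 5 seconds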

@behlendorf Do you want me to close this issue now?

@behlendorf
Contributor

@lintonv the fix for this from illumos, b738bc5, was merged into ZoL and is part of the 0.6.4 tag. I'm going to close this issue out, but if you have the time it would be great to verify that this change improves things for you even under the old configuration.

@lintonv
Author

lintonv commented Apr 27, 2015

@behlendorf
We reverted to the old configuration and tested with 0.6.4. I saw a 6x speed increase. This is fantastic! Thank you for the fix.

@AndCycle

I just ran into something probably related.

I found a filesystem on my system, used to store huge ISO files, that had accidentally been created with recordsize=16k in the past, so lots of files on it have a 16k dblk. rsync can saturate the bandwidth, but the send speed is as slow as 6MB/s under 0.6.4. I can't test whether there is any difference compared to 0.6.3 because I have already upgraded the pool.

@lintonv
Author

lintonv commented May 1, 2015

@AndCycle Were you able to determine where exactly the bottleneck is? Is it in 'zfs send' or 'zfs recv' or 'ssh'? (assuming you are using ssh for the actual transfer)

0.6.4 has changes for hole_birth and the prefetch amount. hole_birth should help with the transfer of files with holes, and the prefetch should help with the transfer of larger files (>1GB).

@AndCycle

AndCycle commented May 1, 2015

@lintonv My backup storage is attached via a SATA port multiplier. I know that's not great, but it's the affordable solution for me.

I will try to explain my case.

My system is an AMD FX-8350 with 32G of ECC memory, running personal services and a few VMs, and it usually has 10GB of memory spare for ZFS. It mostly has a heavy metadata workload due to the CrashPlan backup service.

I set arc_max to 4GB; anything larger than this triggers a reboot on my machine with 0.6.4. In the past I let it use up to 16GB, which was fine on 0.6.3, but that's no longer the case.

ztank was originally created on 0.6.3;
zbackup was newly created on 0.6.4.

ztank is 5TBx6 raidz2;
zbackup is 4TBx8 raidz1.

The iso fs is about 1TB.

I used zrep to back up ztank/prod/iso to zbackup/ztank-prod/iso and then found this performance issue; taking a few days for the initial backup of 1TB isn't reasonable.

After the backup was done, I renamed zbackup/ztank-prod/iso to zbackup/ztank-prod/iso-orig; iso-orig is the original fs containing the files written with the 16k recordsize.

Then I destroyed ztank/prod/iso, recreated it with the default 128k recordsize, rsynced to the new iso fs, and zrep'd ztank/prod/iso to zbackup/ztank-prod/iso, a clean copy with 128k recordsize and no history.

zrep does not keep the original snapshot set, so there is only one full snapshot at zbackup:

zbackup/ztank-prod/iso@zrep_000000
zbackup/ztank-prod/iso-orig@zrep_000000
zfs send zbackup/ztank-prod/iso-orig@zrep_000000 | pv -trab > /dev/null
 443MiB 0:00:30 [16.7MiB/s] [14.8MiB/s]

zfs send zbackup/ztank-prod/iso@zrep_000000 | pv -trab > /dev/null
1.93GiB 0:00:30 [62.6MiB/s] [65.8MiB/s]

rsync from iso-orig can saturate the bandwidth on the PCI eSATA card at 100MB/s; these are all DVD/BD iso files, so that is expected.

The bottleneck is still there in zfs send, but I am not sure whether this is related to your case.

@lintonv
Author

lintonv commented May 1, 2015

@AndCycle
Not sure if it is related.

In 0.6.3, without the new prefetch code (and tuning parameter), my ZFS send speeds were approximately 10MiB/s. In 0.6.4, with the prefetch code, I am seeing a 6x speedup, approximately 75MiB/s.

ZFS send is a highly serialized operation. The bottleneck (as I stated in one of the comments above) is the way the stream is built: it reads block by block and assembles the stream. The fix would be to parallelize the building of this stream.

The other workaround is to increase the number of partitions on your drives and present them all to the pool. That way you have multiple devices in the pool and you create the parallelism yourself. In this setup, I was able to saturate my 1GIG link (getting 112 MiB/s).

@denis-b

denis-b commented May 9, 2017

Hello,
I have the same problem on my server: Debian wheezy, kernel 3.16, zfsonlinux 0.6.5.7.
On a pool with 2 x 5 SATA disks (raidz2) and a 128 G SSD L2ARC, sending zvols is very slow, sometimes less than 100KB/s!
Sending datasets from the same pool runs at normal speed (50MB/s), and read operations on KVM VMs using zvols exported over iSCSI run at the same speed.

hole_birth is active, and disk I/O is very low.

@hcw70

hcw70 commented Oct 12, 2018

Actually, I also saw slow send/receive for my pool: about 10MB/s on an EVO850 SSD.

What helped was: https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

So, since I was cloning local partitions, I did:

zfs send -R rpool-mediasrv@move-to-8k | mbuffer -s 128k -m 1g | zfs receive -F -v rpool-mediasrv-4k

That gave me at least >50MB/s, until it reached my VBox zvol image, at which point it dropped to 7-25MB/s.

@Sawtaytoes

Sawtaytoes commented Jan 20, 2023

What's mbuffer and how did it help you?

I'm using TrueNAS and don't have access to the command itself, but my transfers are going at 1/3rd the VPN speed even though SMB transfers (over VPN) are much faster. I've tried two different links; one local over VPN and another offsite over VPN. One case is 1/2 the link speed (10MB/s), the other case is 1/3rd the link speed (33MB/s).

I'd like to know if it's something to do with latency when doing a zfs send over SSH.

@ryao
Contributor

ryao commented Jan 23, 2023

mbuffer is a userland memory buffer. For example, you can do zfs send | ssh 'mbuffer -s 128k -m 1g | zfs receive ...' to cache things in mbuffer so that transfers over the network are not blocked by a buffer on the other end unless the buffer is so full that it does not make sense to keep receiving.
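
A common variant of that pattern puts a buffer on both ends and streams over a plain TCP port on a trusted network; the host, port, and dataset names below are placeholders:

    # receiving side: listen on a TCP port, buffer up to 1G, feed zfs receive
    mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup

    # sending side: buffer locally and stream to the receiver
    zfs send tank/data@snap | mbuffer -s 128k -m 1G -O receiver:9090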
