High CPU usage by "z_fr_iss" after deleting large files #3976

Closed
blind-oracle opened this issue Nov 1, 2015 · 14 comments
Labels
Status: Inactive (Not being actively updated), Type: Performance (Performance improvement or performance problem)

Comments

@blind-oracle

Problem: after deleting several large files (~50 GB in total), ZFS goes crazy on CPU usage.

For about a minute it hogs all cores (it's 2 x 6-core E5-2620 with HT, 24 CPUs total) at 100%, with about 48 running "z_fr_iss_X" threads (or more, hard to say).

Then, when they are done, txg_sync runs for some time using 30% of one core, but that's OK, I presume.

Without a PREEMPT kernel this literally blocks all userspace activity for that period of time.
I could switch to PREEMPT, as it's now supported by ZFS, but I'd rather not.

Kernel 4.1.12, ZoL 0.6.5.3, 256 GB RAM.

ZFS created with:

zpool create zfs1 \
-o feature@async_destroy=enabled \
-o feature@empty_bpobj=enabled \
-o feature@lz4_compress=enabled \
-o feature@spacemap_histogram=enabled \
-o feature@enabled_txg=enabled \
-o feature@embedded_data=enabled \
-o feature@hole_birth=enabled \
-o feature@extensible_dataset=enabled \
-o ashift=9 \
-O recordsize=128k \
-O checksum=sha256 \
-O dedup=sha256 \
-O compression=lz4 \
-O atime=off \
-O relatime=off \
-O logbias=latency \
-O primarycache=all \
-O secondarycache=none \
raidz2 \
/dev/mapper/D01_9Y0RA \
/dev/mapper/D02_B9G2A \
/dev/mapper/D03_B01GA \
/dev/mapper/D04_B9GDA \
/dev/mapper/D05_AU1HA \
/dev/mapper/D06_BD4HA \
/dev/mapper/D07_7P85D \
/dev/mapper/D08_93WJA \
/dev/mapper/D09_4394K \
/dev/mapper/D10_BDYKA \
log mirror \
/dev/mapper/SSD1_SLOG \
/dev/mapper/SSD2_SLOG \
/dev/mapper/SSD3_SLOG

Pool size is ~21 TB and it's empty except for these test files.
If any more info is required, just ask.

@blind-oracle changed the title from "High CPU usage by after deleting large files" to "High CPU usage by "z_fr_iss" after deleting large files" on Nov 1, 2015
@dweeezil
Contributor

dweeezil commented Nov 1, 2015

@blind-oracle I have a feeling it may have to do with dedup and/or its interaction with snapshots, or possibly the pool geometry. Are there any snapshots on the pool? How effective is the dedup with this set of (how many?) files? What's the storage stack underneath the dm vdevs?

I did run a quick sanity test by removing four 50 GiB files from a 34 TiB (non-raidz) pool with no dedup, and there was no excessive CPU use by the z_fr_iss threads.

@blind-oracle
Author

Yep, I thought about dedup; maybe it's recklessly cleaning up references in the ARC... But it shouldn't be this CPU-hungry, I think.

DM vdevs are dm-crypt-encrypted 3 TB HDDs, plus SSDs for the SLOG. I don't think the problem is with them.

There are no snapshots, nothing at all. It's a newly created pool with ~10 HD movies copied to it for testing. These are all different files, so the dedup ratio is around 1.00x.

Pool and FS parameters:

# zpool get all zfs1
NAME     PROPERTY                    VALUE                       SOURCE
zfs1  size                        27.2T                       -
zfs1  capacity                    0%                          -
zfs1  altroot                     -                           default
zfs1  health                      ONLINE                      -
zfs1  guid                        18219375908895185010        default
zfs1  version                     -                           default
zfs1  bootfs                      -                           default
zfs1  delegation                  on                          default
zfs1  autoreplace                 off                         default
zfs1  cachefile                   -                           default
zfs1  failmode                    wait                        default
zfs1  listsnapshots               off                         default
zfs1  autoexpand                  off                         default
zfs1  dedupditto                  0                           default
zfs1  dedupratio                  1.00x                       -
zfs1  free                        27.2T                       -
zfs1  allocated                   62.0G                       -
zfs1  readonly                    off                         -
zfs1  ashift                      9                           local
zfs1  comment                     -                           default
zfs1  expandsize                  -                           -
zfs1  freeing                     0                           default
zfs1  fragmentation               0%                          -
zfs1  leaked                      0                           default
zfs1  feature@async_destroy       enabled                     local
zfs1  feature@empty_bpobj         active                      local
zfs1  feature@lz4_compress        active                      local
zfs1  feature@spacemap_histogram  active                      local
zfs1  feature@enabled_txg         active                      local
zfs1  feature@hole_birth          active                      local
zfs1  feature@extensible_dataset  enabled                     local
zfs1  feature@embedded_data       active                      local
zfs1  feature@bookmarks           enabled                     local
zfs1  feature@filesystem_limits   enabled                     local
zfs1  feature@large_blocks        enabled                     local

# zfs get all zfs1
NAME     PROPERTY              VALUE                  SOURCE
zfs1  type                  filesystem             -
zfs1  creation              Sat Oct 31 21:58 2015  -
zfs1  used                  49.4G                  -
zfs1  available             21.0T                  -
zfs1  referenced            35.9K                  -
zfs1  compressratio         1.00x                  -
zfs1  mounted               yes                    -
zfs1  quota                 none                   default
zfs1  reservation           none                   default
zfs1  recordsize            128K                   local
zfs1  mountpoint            /zfs1               default
zfs1  sharenfs              off                    default
zfs1  checksum              sha256                 local
zfs1  compression           lz4                    local
zfs1  atime                 off                    local
zfs1  devices               on                     default
zfs1  exec                  on                     default
zfs1  setuid                on                     default
zfs1  readonly              off                    default
zfs1  zoned                 off                    default
zfs1  snapdir               hidden                 default
zfs1  aclinherit            restricted             default
zfs1  canmount              on                     default
zfs1  xattr                 on                     default
zfs1  copies                1                      default
zfs1  version               5                      -
zfs1  utf8only              off                    -
zfs1  normalization         none                   -
zfs1  casesensitivity       sensitive              -
zfs1  vscan                 off                    default
zfs1  nbmand                off                    default
zfs1  sharesmb              on                     local
zfs1  refquota              none                   default
zfs1  refreservation        none                   default
zfs1  primarycache          all                    local
zfs1  secondarycache        none                   local
zfs1  usedbysnapshots       0                      -
zfs1  usedbydataset         35.9K                  -
zfs1  usedbychildren        49.4G                  -
zfs1  usedbyrefreservation  0                      -
zfs1  logbias               latency                local
zfs1  dedup                 sha256                 local
zfs1  mlslabel              none                   default
zfs1  sync                  standard               default
zfs1  refcompressratio      1.00x                  -
zfs1  written               35.9K                  -
zfs1  logicalused           49.5G                  -
zfs1  logicalreferenced     9.50K                  -
zfs1  filesystem_limit      none                   default
zfs1  snapshot_limit        none                   default
zfs1  filesystem_count      none                   default
zfs1  snapshot_count        none                   default
zfs1  snapdev               hidden                 default
zfs1  acltype               off                    default
zfs1  context               none                   default
zfs1  fscontext             none                   default
zfs1  defcontext            none                   default
zfs1  rootcontext           none                   default
zfs1  relatime              off                    local
zfs1  redundant_metadata    all                    default
zfs1  overlay               off                    default

Here's the output from top, taken after removing all the files:

top - 10:11:01 up 14:51,  2 users,  load average: 33.03, 7.77, 2.61
Tasks: 1199 total,  56 running, 1143 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  26418534+total, 67488984 used, 19669635+free,   102344 buffers
KiB Swap:        0 total,        0 used,        0 free.   272352 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8121 root       0 -20       0      0      0 R  38.4  0.0   0:31.77 z_fr_iss_3
 8077 root       0 -20       0      0      0 R  37.5  0.0   0:33.59 z_fr_iss_0
 8095 root       0 -20       0      0      0 R  36.0  0.0   0:29.58 z_fr_iss_1
 8079 root       0 -20       0      0      0 R  35.2  0.0   0:30.70 z_fr_iss_0
 8075 root       0 -20       0      0      0 R  34.1  0.0   0:29.30 z_fr_iss_0
 8082 root       0 -20       0      0      0 R  33.2  0.0   0:27.54 z_fr_iss_0
 8083 root       0 -20       0      0      0 R  32.3  0.0   0:32.50 z_fr_iss_0
 8098 root       0 -20       0      0      0 R  31.9  0.0   0:34.82 z_fr_iss_2
 8129 root       0 -20       0      0      0 R  31.6  0.0   0:31.09 z_fr_iss_4
 8131 root       0 -20       0      0      0 R  31.4  0.0   0:32.06 z_fr_iss_4
 8130 root       0 -20       0      0      0 R  31.2  0.0   0:31.13 z_fr_iss_4
 8110 root       0 -20       0      0      0 R  29.8  0.0   0:31.67 z_fr_iss_3
 8092 root       0 -20       0      0      0 R  29.6  0.0   0:30.44 z_fr_iss_1
 8127 root       0 -20       0      0      0 R  29.2  0.0   0:30.99 z_fr_iss_4
 8119 root       0 -20       0      0      0 R  28.3  0.0   0:31.13 z_fr_iss_3
 8105 root       0 -20       0      0      0 D  27.4  0.0   0:30.60 z_fr_iss_2
 8169 root       0 -20       0      0      0 D  27.4  0.0   0:31.86 z_fr_iss_7
 8151 root       0 -20       0      0      0 R  27.3  0.0   0:33.14 z_fr_iss_6
 8111 root       0 -20       0      0      0 R  26.7  0.0   0:29.48 z_fr_iss_3
 8109 root       0 -20       0      0      0 R  26.5  0.0   0:34.68 z_fr_iss_2
 8100 root       0 -20       0      0      0 R  26.4  0.0   0:31.77 z_fr_iss_2
 8112 root       0 -20       0      0      0 R  26.2  0.0   0:31.65 z_fr_iss_3
 8147 root       0 -20       0      0      0 R  26.2  0.0   0:35.85 z_fr_iss_6
 8106 root       0 -20       0      0      0 R  26.0  0.0   0:31.44 z_fr_iss_2
 8124 root       0 -20       0      0      0 R  25.8  0.0   0:33.12 z_fr_iss_4
 8133 root       0 -20       0      0      0 D  25.6  0.0   0:29.14 z_fr_iss_4
 8101 root       0 -20       0      0      0 D  25.1  0.0   0:31.35 z_fr_iss_2
 8084 root       0 -20       0      0      0 R  24.9  0.0   0:30.91 z_fr_iss_0
 8086 root       0 -20       0      0      0 D  24.0  0.0   0:30.92 z_fr_iss_1
 8140 root       0 -20       0      0      0 R  23.9  0.0   0:30.40 z_fr_iss_5
 8103 root       0 -20       0      0      0 D  23.7  0.0   0:30.30 z_fr_iss_2
 8160 root       0 -20       0      0      0 D  23.5  0.0   0:29.99 z_fr_iss_7
 8088 root       0 -20       0      0      0 D  23.0  0.0   0:32.13 z_fr_iss_1
 8108 root       0 -20       0      0      0 R  23.0  0.0   0:26.22 z_fr_iss_2
 8167 root       0 -20       0      0      0 R  23.0  0.0   0:30.03 z_fr_iss_7
 8085 root       0 -20       0      0      0 R  22.8  0.0   0:31.18 z_fr_iss_0
 8089 root       0 -20       0      0      0 R  22.2  0.0   0:32.61 z_fr_iss_1
 8138 root       0 -20       0      0      0 R  22.2  0.0   0:31.05 z_fr_iss_5
 8099 root       0 -20       0      0      0 R  21.2  0.0   0:31.59 z_fr_iss_2
 8132 root       0 -20       0      0      0 D  21.0  0.0   0:32.35 z_fr_iss_4
 8143 root       0 -20       0      0      0 R  20.6  0.0   0:27.08 z_fr_iss_5
 8135 root       0 -20       0      0      0 D  20.4  0.0   0:30.79 z_fr_iss_5
 8102 root       0 -20       0      0      0 R  19.9  0.0   0:27.19 z_fr_iss_2
 8155 root       0 -20       0      0      0 D  19.7  0.0   0:28.99 z_fr_iss_6
 8113 root       0 -20       0      0      0 D  19.5  0.0   0:28.47 z_fr_iss_3
 8126 root       0 -20       0      0      0 R  19.5  0.0   0:30.81 z_fr_iss_4
 8154 root       0 -20       0      0      0 R  19.2  0.0   0:30.33 z_fr_iss_6
 8090 root       0 -20       0      0      0 D  18.8  0.0   0:31.33 z_fr_iss_1
 8148 root       0 -20       0      0      0 R  18.8  0.0   0:33.16 z_fr_iss_6
 8163 root       0 -20       0      0      0 R  18.7  0.0   0:31.29 z_fr_iss_7
 8093 root       0 -20       0      0      0 R  18.3  0.0   0:29.12 z_fr_iss_1
 8156 root       0 -20       0      0      0 D  18.1  0.0   0:27.75 z_fr_iss_6
 8145 root       0 -20       0      0      0 R  17.9  0.0   0:28.86 z_fr_iss_5
 8157 root       0 -20       0      0      0 R  17.9  0.0   0:27.45 z_fr_iss_6
 8161 root       0 -20       0      0      0 R  17.8  0.0   0:31.66 z_fr_iss_7
 8080 root       0 -20       0      0      0 D  17.6  0.0   0:31.16 z_fr_iss_0

Load average of 33, 56 running threads, 0.0% idle.

P.S.
I rebuilt the kernel and the ZFS modules & userspace with PREEMPT; "zpool import -a" began to segfault after importing the pools, although the import process itself completes:

[  652.953019] zpool[5255]: segfault at 0 ip 00007f6f727b0c8a sp 00007ffde270cf68 error 4 in libc-2.19.so[7f6f7272f000+19f000]

The main problem persists, and the system didn't become much more responsive (maybe because these ZFS threads run at nice -20).

@dweeezil
Contributor

dweeezil commented Nov 2, 2015

@blind-oracle After re-running the test with dedup=sha256, I think I can see what's happening. Over time, both upstream and ZoL have fiddled with the z_fr_iss taskqs and the number of threads allowed. In the present code (>= 0.6.5), we allow 96 threads, which obviously oversubscribes the 24 you've got available. The process is very CPU intensive when dedup is enabled, and I think you're seeing the result of that. Also, ZoL now runs those threads at priority 100 (the highest non-realtime priority), whereas they were (erroneously) run at priority 139 in the past (prior to 0.6.5). Since normal user processes run at 120, these zio threads can now overwhelm the system. I can only imagine how painful this would be on a more modest system with, say, only 8 threads.
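
(For context on those numbers: for normal tasks the kernel derives the priority as 120 + nice, so priority 100 is the nice -20 you can see in the top output above, 120 is nice 0, and 139 is nice +19. A trivial check:)

#include <stdio.h>

/* Normal (SCHED_OTHER) tasks: prio = MAX_RT_PRIO (100) + 20 + nice = 120 + nice. */
static int nice_to_prio(int nice)
{
	return 120 + nice;
}

int main(void)
{
	printf("nice -20 -> prio %d\n", nice_to_prio(-20)); /* 100: zio taskq threads in >= 0.6.5 */
	printf("nice   0 -> prio %d\n", nice_to_prio(0));   /* 120: ordinary user processes */
	printf("nice +19 -> prio %d\n", nice_to_prio(19));  /* 139: where they effectively ran before 0.6.5 */
	return 0;
}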

I'll note that on the system where I ran the test there were 80 threads available, so it was still oversubscribed, but not nearly as badly. I also had no other load present, so it managed the freeing pretty quickly.

Based on this issue and others which may be related, it seems we may need to consider doing something about the number of zio threads, their priority, or both. I'm considering opening a more general issue which summarizes the various problems seen since the updates mentioned above.

@blind-oracle
Author

Aha. Can this number of threads be tuned (or be made tunable) through module parameters?

BTW, I think I've found the root problem. My initial kernel was built with the voluntary preemption model. Now I've rebuilt it with no preemption, and after deleting ~90 GB of files the CPU is hogged this way for only about 10 seconds, not for a minute or more like before.

So the preemption model has a major influence on ZoL performance in this particular situation.

@dweeezil
Contributor

dweeezil commented Nov 2, 2015

@blind-oracle I just analyzed a flame graph of a 500GiB deletion on a filesystem with dedup and what happens is that the 96 z_fr_iss threads, as expected, are all performing lookups against the DDT ZAP objects (similar to looking up an entry in a directory). All the actual wall time is spent in dbuf_find() spinlocks due to the contention from the 96 threads.
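
For anyone not familiar with that path, the lookup boils down to something like this (paraphrased from memory of the ZoL dbuf code, not an exact copy): a chained hash table whose buckets are each guarded by a DBUF_HASH_MUTEX() lock, and with 96 threads doing DDT lookups at once, that lock is where they all pile up:

static dmu_buf_impl_t *
dbuf_find_sketch(objset_t *os, uint64_t obj, uint8_t level, uint64_t blkid)
{
	dbuf_hash_table_t *h = &dbuf_hash_table;
	uint64_t idx = DBUF_HASH(os, obj, level, blkid) & h->hash_table_mask;
	dmu_buf_impl_t *db;

	mutex_enter(DBUF_HASH_MUTEX(h, idx));		/* <-- where the wall time goes */
	for (db = h->hash_table[idx]; db != NULL; db = db->db_hash_next) {
		if (DBUF_EQUAL(db, os, obj, level, blkid)) {
			mutex_enter(&db->db_mtx);
			mutex_exit(DBUF_HASH_MUTEX(h, idx));
			return (db);
		}
	}
	mutex_exit(DBUF_HASH_MUTEX(h, idx));
	return (NULL);
}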

Your observation regarding preemption makes sense: since you've oversubscribed the available cores, with preemption there will be plenty of extra context switching and waiting as the 96 tasks contend for only 24 hardware threads. With no preemption, the 24 that are actually running at any one time will likely be able to finish their work uninterrupted.

Presumably this didn't have as much impact in the past, when we erroneously ran the z_fr_iss threads at too low a priority, which actually allowed other user processes to continue functioning normally.

My test ran faster on my standard testing kernel, which also uses PREEMPT_VOLUNTARY, likely because the 96 tasks were contending for 80 threads.

As I alluded to originally, this all brings into question the current taskq parameters (quantity and/or priority).

I am going to re-run my own tests with CONFIG_PREEMPT_NONE. I don't expect to see a huge difference because 96 isn't that much greater than 80.

As to your other question, there's currently no way to tune the number of zio threads without editing the parameters in module/zfs/spa.c and recompiling.
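
For reference, the value in question is the FREE/ISSUE entry of the zio_taskqs table in module/zfs/spa.c, which (excerpted from memory, other rows omitted) looks roughly like this:

/* module/zfs/spa.c (paraphrased excerpt) */
const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
	/* ISSUE        ISSUE_HIGH  INTR      INTR_HIGH */
	...
	{ ZTI_P(12, 8), ZTI_NULL,   ZTI_ONE,  ZTI_NULL },  /* FREE */
	...
};

ZTI_P(12, 8) means 8 taskqs with 12 threads each, which is where the 96 z_fr_iss_[0-7] threads come from.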

@blind-oracle
Author

Okay, the role of preemption looks clear to me now, thanks.

I've modified the thread/taskq count in spa.c from (12, 8) to (6, 4) to match my hardware.
Now, after a delete operation, the threads are much more heavily loaded (each up to 100%), so there are fewer context switches, and the whole operation completes faster than with 96 threads.
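
For the record, the edit was just that one value in the FREE row of the zio_taskqs table (roughly):

	{ ZTI_P(6, 4),  ZTI_NULL,   ZTI_ONE,  ZTI_NULL },  /* FREE: was ZTI_P(12, 8) */

i.e. 4 taskqs of 6 threads = 24 threads, matching the 24 logical CPUs here.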

Maybe we should add some logic that analyzes CPU count on module load and adjusts thread count accordingly?

@dweeezil
Contributor

dweeezil commented Nov 3, 2015

@blind-oracle I'm not sure what the proper solution is for this general problem, which is why I'm thinking of opening a new, more general issue regarding the type of thrashing which can occur when the physical cores are oversubscribed. There are clearly a number of issues at play here, not the least of which are the particular spinlock implementation, preemption model, etc. For my part, I'm planning on running a new round of high-concurrency tests with various combinations of kernel settings. I'm also keen on testing with a 4.2 kernel now that queued spinlocks have been made the default. They might also help your situation, but the author (of the queued spinlock implementation) states that the big wins come into play on >= 4 socket systems.

It's also not clear to me whether ZoL is supposed to "officially" support CONFIG_PREEMPT_NONE kernels. In the distant past, I remember some talk about it being known to cause problems.

In light of this issue, I'll be adding a large dedup unlink workload to my set of tests.

One other note regarding future work in this area is that there's clearly a desire for ZoL to stay as close to the upstream as possible. It would seem this issue ought to affect all OpenZFS implementations but the impact would be dependent on all the other kernel-related variables I alluded to above.

@kernelOfTruth
Contributor

I'm curious if the following issue (https://www.illumos.org/issues/6288 , #3987) could be somewhat related:

The symptoms are at least partially similar: high load, low performance.

> Maybe we should add some logic that analyzes CPU count on module load and adjusts thread count accordingly?

It would be a start for mid-size boxen. I wonder if performance and responsiveness would be better in all cases if the number of threads stayed below the CPU count (or the number of CPU threads, with SMT). If the performance hit weren't too large, a working state in all cases would be preferable to occasional service interruptions and outages.

@dweeezil
Contributor

dweeezil commented Nov 4, 2015

@kernelOfTruth In this particular case, I don't think those issues are related. The CPU time is all being burned in the spinlocks underneath ddt_zap_lookup(). This flame graph shows the condition I'm describing: https://gist.github.com/dweeezil/7cd1fc9bc41452eeafdd#file-fg-rm-500g-svg (note that this is not the exact test I wanted to perform: there were actually some deduped blocks, whereas the original poster likely didn't have any, but the end result is the same).

@kernelOfTruth
Contributor

@dweeezil I see, thanks!

I did a quick search, and reports mention that snapshot handling combined with dedup is particularly tricky, like you mentioned above (though that doesn't apply here). Memory requirements are also rather high even when only small amounts of data are deduplicated (e.g. several Windows VMs with already-compressed NTFS in them). I also remember reading in the past that using both compression and deduplication tends to make things especially slow.

Related finds:
https://groups.google.com/a/zfsonlinux.org/forum/#!msg/zfs-discuss/Q8YEAXg8gmo/hu5O48CWYkwJ
#118
#258

> Bottom line - be very careful not to delete large amounts of deduplicated data that is also locked up in snapshots all at once. We had plenty of Zeus ZIL and L2ARC, and it was still very slow. After several days of troubleshooting, Sun/Oracle engineers concluded there was "lock contention", in addition to IOPS depletion, which was the cause of the performance impacts.

This is an area of ZFS that definitely needs improvement.

All of this while considering how the ARC works (https://en.wikipedia.org/wiki/Adaptive_replacement_cache) and adding these features on top: truly remarkable!

@nwf
Contributor

nwf commented Nov 10, 2015

I have an experimental approach to tamping down the CPU and IOPS used by large deduped removals in #3725, which attempts to throttle these operations and keep the rest of the system responsive. It does mean that these removals take a good deal longer, but that's often much preferable to the system going away.

@tuxoko
Contributor

tuxoko commented Nov 18, 2015

@dweeezil
Would using an rwlock for DBUF_HASH_MUTEX result in less contention?

@behlendorf added the "Type: Performance" label on Nov 19, 2015
@dweeezil
Contributor

@tuxoko Interesting idea. At the very least, it looks like it might help the dbuf_prefetch() case. For the moment, however, I'm a lot more concerned about contention related to the fanout of the various zio taskqs, and the write issue taskq in particular. We've got a lot more locking going on than illumos, and the single taskq can be a big point of contention for high-concurrency write workloads.

@yuri-ccp

yuri-ccp commented Oct 1, 2017

Another way to avoid this would be to put a customizable limit on the number of z_fr_iss threads in the module parameters, and to change the default limit to the number of available CPU threads. This could avoid the excessive load and allow each user to tune the values for their own needs.
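
A sketch of what such a knob might look like on the Linux side (hypothetical: the parameter name is made up, and nothing would honor it without also wiring it into the taskq sizing in spa.c):

#include <linux/module.h>

/* Hypothetical tunable: threads per z_fr_iss (FREE/ISSUE) taskq, 0 = built-in default. */
static int zfs_free_issue_threads = 0;
module_param(zfs_free_issue_threads, int, 0444);
MODULE_PARM_DESC(zfs_free_issue_threads,
	"Max threads per z_fr_iss taskq (0 = use built-in default)");

spa.c would then need to consult this value (clamped to the number of online CPUs) when sizing the FREE/ISSUE taskqs instead of using the hard-coded ZTI_P(12, 8).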
