Long import time with enabled deduplication #9339

Closed
adam-blaszczykowski opened this issue Sep 20, 2019 · 8 comments

Labels
Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem

Comments

@adam-blaszczykowski
System information

Type Version/Name
Distribution Name Debian
Distribution Version Jessie
Linux Kernel 4.19
Architecture x86_64
ZFS Version 0.8.1-1
SPL Version 0.8.1-1

Describe the problem you're observing

After testing various scenarios I've reached the conclusion that whenever there is activity in the DDT and the zpool is exported/imported, or the system is powered down while a background operation on the dedup table is in progress, a long import time afterwards is guaranteed.

Describe how to reproduce the problem

  1. Create a zpool (I used 20 x 280GB drives in various configurations: raidz1, raidz2, single disks, etc.)
  2. Create 2 datasets (dedup=on recordsize=128K; a bigger recordsize cleans up the table faster, a smaller one makes the tests take forever)
  3. Put a portion of sample data on the first dataset (in my case 500 x 500MB files written with dd)
  4. Copy the files from the first dataset to the second dataset (the DDT must contain deduplicated blocks)
  5. Destroy one of the datasets (deleting the data takes too long; with a destroy the DDT can clean up the data in the background)
  6. Immediately afterwards either export the zpool or restart the machine (the dedup table must still show data in the refcnt 2 line)
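Condensed into commands (pool and device names follow the zpool create line below; dataset and file names are just examples):

zfs create -o dedup=on -o recordsize=128K Pool-z1-aon/ds1
zfs create -o dedup=on -o recordsize=128K Pool-z1-aon/ds2
for i in {1..500}; do dd if=/dev/urandom of=/Pools/Pool-z1-aon/ds1/file$i bs=1M count=500; done
cp /Pools/Pool-z1-aon/ds1/* /Pools/Pool-z1-aon/ds2/
zfs destroy -r Pool-z1-aon/ds1
zpool export Pool-z1-aon          # or reboot; zpool status -D must still show refcnt-2 entries
time zpool import Pool-z1-aon     # the long import is observed here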

In my scenario a regular export/import took over 15 minutes; the import after powering down the machine took almost 20 minutes. In both cases zpool status -D still shows refcnt 2 present, and it takes a few minutes to clean up completely. Performing another reboot/export before it's done also takes a long time.

The pool is created with the following parameters:
zpool create Pool-z1-aon raidz1 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk -O acltype=posixacl -O compression=off -O dedup=off -o ashift=12 -o failmode=wait -o multihost=on -m /Pools/Pool-z1-aon -f
async_destroy is enabled

Is this expected behavior? If so, how can it be optimized? Larger datasets can take hours to import.

Include any warning/errors/backtraces from the system logs

@GregorKopka
Contributor

async_destroy is enabled

causes the destroy to be done in the background instead of synchronously. Exporting the pool suspends it and removes all related buffers from the ARC (this includes the DDT); when the async destroy continues on import, all of the DDT needs to be read back in from the disks, leading to a random read workload that can be massive. Having the DDT on spinning drives (with IOPS budgets in the low three-digit range) will naturally occupy the drives for an extended timespan (amount of reads needed * average seek time), while at the same time competing for the available IOPS with updates to the DDT that have to be persisted.

In case you want to use dedup and avoid that problem (though an async destroy with a cold ARC is basically the worst case) you should make sure the metadata (DDT) is on fast solid-state media that does not have a seek-time penalty for random reads.
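A minimal sketch of what that looks like (device path is an example; in production the special/dedup vdev should be mirrored, since losing it makes the pool unusable):

zpool add <pool> special /dev/nvme0n1   # metadata (and the DDT, if no dedup vdev exists) goes to the fast device
zpool add <pool> dedup /dev/nvme0n1     # or: dedicate a fast vdev to the DDT only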

@adam-blaszczykowski
Author

I don't know if I understand the special device function correctly: does it hold metadata (or DDT) instead of putting it on all drives, shortening ARC rebuild time after import/export?
As suggested I did some additional testing, adding an nvme drive to the z1 pool (created as before) as a dedicated dedup and special device.
zpool add -f Pool-z1-ded dedup /dev/nvme0n1
Dataset destruction took 6 minutes; export and import afterwards took 20 minutes total, 10 minutes each.
The special device acted slightly differently:
zpool add -f Pool-z1-sp special /dev/nvme0n1
A dataset with the same amount of data was destroyed in a few seconds (the only noticeable difference), but export/import also totaled 20 minutes as before (15 minutes for the export and 5 for the import).
From my observation, using a special device, at least in my configuration, caused even longer import/export times than without it and consumed a really large amount of RAM (over 100GB vs 10-20GB before).
Did I miss something? Any further tips on how to make it work with HDDs, or is the only way to shorten the times to use SSDs, because the ARC will always have to crosscheck DDT blocks against all drives in the pool?

@GregorKopka
Contributor

does it hold metadata (or DDT) instead of putting it on all drives, shortening ARC rebuild time after import/export?

Yes. Speeding up access to on-disk metadata by putting it onto a fast device (with as many IOPS as one can get, so metadata can be pulled into the ARC as fast as possible) is the idea behind allocation classes.

I take it for granted that you have recreated the complete test data from scratch after having added the special devices to the pool, as changing the pool layout only affects new writes; all existing data (which includes any metadata and the DDT) stays exactly where it already is.

Using deduped blocks gives the overhead of updating the DDT (which might need to be pulled from disk) on both writes and frees, adding a random read/write load on top of the random reads for the metadata to be traversed (to find the blocks to be released) and the writes needed to update the metaslabs on the pool drives. I would expect that additional load to happen exclusively on the special/dedup device, with a special device giving a higher speedup than a dedup one, as the former offloads both metadata and DDT from the main pool.

Could you please verify that this works as expected?

To test the actual performance (or increases to it stemming from special devices) it would make sense to do the test with steps 5 and 6 swapped: export/import the pool after the copy has completed (and the pool is idle again) to clear the ARC, then destroy the dataset and measure the time (plus collect performance data, as outlined below) until zpool get freeing reaches zero and the pool is idle again.

Off the top of my head I would think zpool iostat -v [interval] (which for the import case could be started in another terminal as soon as the pool shows up in zpool list, even with the zpool import still hanging in the other terminal) should give that answer. A look in yet another terminal with the normal iostat could give additional insight into how busy the drives really are, if possible correlated with the amount shown by zpool get freeing (if any) and with what arcstat.py thinks about the RAM side of things. A look at what the system thinks about CPU load could also give a hint about what might be wrong; a flame graph (#4227 (comment)) could be interesting as well.
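A minimal monitoring layout using the commands named above (pool name is a placeholder; the intervals are arbitrary):

zpool iostat -v <pool> 5                                       # terminal 1: per-vdev IOPS/bandwidth
watch -n 2 'zpool status -D <pool>; zpool get freeing <pool>'  # terminal 2: DDT histogram and freeing progress
iostat -x 5                                                    # terminal 3: utilization of the raw drives
arcstat.py 5                                                   # terminal 4: ARC size and hit rates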

Thinking about:

Destroy one of the datasets (deleting the data takes too long; with a destroy the DDT can clean up the data in the background)

The difference between deleting data (which seems to be too slow for your liking) on the dataset and destroying the whole dataset is that a delete happens synchronously (with rm hanging until completion) while the dataset destroy happens fully in the background. DDT-wise the workload should be practically identical (the DDT tracks on-disk blocks at the pool level), so I think the overall delay you see on the export/import cycle should basically be identical to what you would see when doing an rm on the test data (the metadata overhead of a dataset is not huge, so it shouldn't really matter).
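Spelled out (pool/dataset names are placeholders; both end up freeing the same blocks at the pool level):

rm -rf /Pools/<pool>/ds1/*        # synchronous from the caller's view: rm waits for the unlinks
zfs destroy -r <pool>/ds1         # returns quickly; blocks are freed in the background (async_destroy)
zpool get freeing <pool>          # shows how much space is still waiting to be reclaimed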

So to backtrack to your original problem (the export/import taking too long): the problem might be caused by async_destroy generating load, both in the export case (keeping the pool from settling into a state where it can deactivate) and in the import case (kicking in so soon that the generated IO load keeps the import from completing in a timely manner). Possibly the long export/import times could be solved by having the async_destroy feature pause itself as soon as a pool is marked for export, and having it wait until the import has fully completed before unpausing, so it keeps itself out of the way of the export/import processes (and does not compete with them by generating load).

It would be interesting to know whether the two cases (export/import to clear the ARC then destroy, vs. destroy then export/import) vastly differ in performance (until the pool is idle again).

That's one issue.

The other is the question of why a special or dedup device doesn't seem to speed things up in a relevant manner (which I would expect it to). What might happen is that destroying the test datasets spreads frees over many metaslabs and the sheer amount of frees causes the main pool vdevs to be overloaded by metadata IO (updates to the metaslabs / their space maps). Unsure, need more information (see above). Collecting performance-relevant data (as outlined above) could help answer that; posting the actual commands (and results) would help others to reproduce your issue.

The only other thing I can think of is not explicitly specifying ashift on the zpool add [special|dedup], causing a bad value to be picked (by zfs) for the nvme, but that shouldn't slow it down to the level of a spinning drive...
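For example, matching the ashift used at pool creation when adding the class vdev (this is what the follow-up test below does):

zpool add -o ashift=12 <pool> special /dev/nvme0n1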

@adam-blaszczykowski
Author

I take it for granted that you have recreated the complete test data from scratch after having added the special devices to the pool

I think this goes without saying.

It would be interesting to know whether the two cases (export/import to clear the ARC then destroy, vs. destroy then export/import) vastly differ in performance (until the pool is idle again).

Here are the exact steps and output from my tests (special device with its ashift set to match the pool; other than that the base test structure is the same as before).

  1. Created zpool z1
    zpool create Pool-z1-spec raidz1 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu -O acltype=posixacl -O compression=off -O dedup=off -o ashift=12 -o failmode=wait -o multihost=on -m /Pools/Pool-z1-spec -f
  2. added special device with the same ashift as zpool
    zpool add -f -o ashift=12 Pool-z1-spec special /dev/nvme0n1
  3. created 2 datasets
    zfs create -p -o dedup=on -o recordsize=128K Pool-z1-spec/ds1
    zfs create -p -o dedup=on -o recordsize=128K Pool-z1-spec/ds2
  4. put sample data on first dataset
    for i in {1..500}; do echo $i; dd if=/dev/urandom of=/Pools/Pool-z1-spec/ds1/500x1024k$i bs=1024k count=500 ; done
  5. copied data to second dataset
    cp -v /Pools/Pool-z1-spec/ds1/* /Pools/Pool-z1-spec/ds2/
    On idle (zpool iostat -v):
[...]
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 262G  5.31T      0     20      0  83.1K
  raidz1                     259G  5.20T      0     19      0  79.1K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   2.51G   116G      0      0      0  4.00K
--------------------------  -----  -----  -----  -----  -----  -----

zpool status -D; zpool get freeing 

 dedup: DDT entries 2000000, size 877B on disk, 283B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    1.91M    244G    244G    244G    3.81M    488G    488G    487G
 Total    1.91M    244G    244G    244G    3.81M    488G    488G    487G

  6. destroyed dataset and exported the pool
time zfs destroy Pool-z1-spec/ds1 -r -R -f ; time zpool export Pool-z1-spec

real	0m7.421s
user	0m0.000s
sys	0m0.341s

real	12m42.252s
user	0m0.007s
sys	0m1.946s

6b. monitored data movement (separate terminal windows)

zpool iostat -v Pool-z1-spec 5

pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 265G  5.31T      0    246      0  16.8M
  raidz1                     259G  5.20T      0     19      0  79.9K
    sdb                         -      -      0      0      0  4.00K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   5.83G   113G      0    226      0  16.7M
--------------------------  -----  -----  -----  -----  -----  -----

watch -n 2 'zpool status -D; zpool get freeing' 

 dedup: DDT entries 2000000, size 880B on disk, 284B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     293K   36.6G   36.6G   36.5G     293K   36.6G   36.6G   36.5G
     2    1.62M    208G    208G    207G    3.24M    415G    415G    414G
 Total    1.91M    244G    244G    244G    3.53M    452G    452G    451G

NAME          PROPERTY  VALUE    SOURCE
Pool-z1-spec  freeing   207G     -

The pool's 'freeing' value dropped to slightly above 100G just before it successfully exported.

  7. imported zpool
time zpool import Pool-z1-spec

real	7m37.584s
user	0m0.042s
sys	0m0.280s
 dedup: DDT entries 2000000, size 880B on disk, 284B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.81M    232G    232G    231G    1.81M    232G    232G    231G
     2    99.3K   12.4G   12.4G   12.4G     199K   24.8G   24.8G   24.8G
 Total    1.91M    244G    244G    244G    2.00M    257G    257G    256G

NAME          PROPERTY  VALUE    SOURCE
Pool-z1-spec  freeing   12.4G    -
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 265G  5.31T      0    289      0  16.9M
  raidz1                     259G  5.20T      0     19      0  79.9K
    sdb                         -      -      0      0      0  4.00K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   5.72G   113G      0    269      0  16.9M
--------------------------  -----  -----  -----  -----  -----  -----

Afterwards freeing slowly drops to 0 and the zpool is idle.

 dedup: DDT entries 2000000, size 880B on disk, 284B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.91M    244G    244G    244G    1.91M    244G    244G    244G
 Total    1.91M    244G    244G    244G    1.91M    244G    244G    244G

NAME          PROPERTY  VALUE    SOURCE
Pool-z1-spec  freeing   0        -
Pool-z1-spec                 261G  5.31T      0     20      0  83.1K
  raidz1                     259G  5.20T      0     19      0  79.1K
    sdb                         -      -      0      0      0  4.00K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   1.94G   117G      0      0      0  4.00K

This sums up the first test with the special device ashift explicitly defined.

Now to the test with the export and the dataset destruction swapped, using the same zpool as created above:
8. (3’) zfs create -p -o dedup=on -o recordsize=128K Pool-z1-spec/ds3
9. (5’) cp -v /Pools/Pool-z1-spec/ds2/* /Pools/Pool-z1-spec/ds3/
10. exported zpool

time zpool export Pool-z1-spec

real	0m4.684s
user	0m0.006s
sys	0m3.811s
11. imported zpool
time zpool import Pool-z1-spec

real	0m3.594s
user	0m0.027s
sys	0m0.234s
 dedup: DDT entries 2000000, size 880B on disk, 284B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    1.91M    244G    244G    244G    3.81M    488G    488G    487G
 Total    1.91M    244G    244G    244G    3.81M    488G    488G    487G

NAME          PROPERTY  VALUE    SOURCE
Pool-z1-spec  freeing   0        -
12. destroyed dataset
time zfs destroy Pool-z1-spec/ds2 -r -R -f                   

real	0m11.623s
user	0m0.000s
sys	0m0.014s

Freeing allocated data took around 20 minutes
Start: Wed Sep 25 14:46:42 CEST 2019
Finish: Wed Sep 25 15:06:02 CEST 2019
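(A simple way to time this is to poll the freeing property until it reaches zero; pool name as above, the 10 s poll interval is arbitrary:)

time sh -c 'while [ "$(zpool get -H -o value freeing Pool-z1-spec)" != "0" ]; do sleep 10; done'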
top -H showed several z_wr_iss threads in the meantime. As for zpool iostat and the dedicated device:
Beginning:

pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 265G  5.31T      0    342      0  20.0M
  raidz1                     259G  5.20T      0     19      0  79.9K
    sdb                         -      -      0      0      0  4.00K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   5.77G   113G      0    322      0  19.9M
--------------------------  -----  -----  -----  -----  -----  -----

Change once the data is completely freed (consecutive results show about 1.3G being freed from the special device):


--------------------------  -----  -----  -----  -----  -----  -----
                              capacity     operations     bandwidth 
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 263G  5.31T      0    314      0  14.7M
  raidz1                     259G  5.20T      0     35      0   143K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   3.25G   116G      0    278      0  14.6M
--------------------------  -----  -----  -----  -----  -----  -----
                              capacity     operations     bandwidth 
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-spec                 261G  5.31T      0     20      0  83.9K
  raidz1                     259G  5.20T      0     19      0  79.9K
    sdb                         -      -      0      0      0  4.00K
[...]
special                         -      -      -      -      -      -
  nvme0n1                   1.97G   117G      0      0      0  4.00K
--------------------------  -----  -----  -----  -----  -----  -----
 dedup: DDT entries 2000000, size 880B on disk, 284B in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.91M    244G    244G    244G    1.91M    244G    244G    244G
 Total    1.91M    244G    244G    244G    1.91M    244G    244G    244G

NAME          PROPERTY  VALUE    SOURCE
Pool-z1-spec  freeing   0        -

Overall, the import, export and destruction of the dataset always total around 20 minutes.

And now the first scenario, but with deduplication turned off:
1” Created fresh zpool z1
zpool create Pool-z1-noded raidz1 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu -O acltype=posixacl -O compression=off -O dedup=off -o ashift=12 -o failmode=wait -o multihost=on -m /Pools/Pool-z1-noded -f
2” added special device with same ashift
zpool add -f -o ashift=12 Pool-z1-noded special /dev/nvme0n1
3” created 2 datasets

zfs create -p -o dedup=off -o recordsize=128K Pool-z1-noded/ds1
zfs create -p -o dedup=off -o recordsize=128K Pool-z1-noded/ds2

4” put sample data on first dataset
for i in {1..500}; do echo $i; dd if=/dev/urandom of=/Pools/Pool-z1-noded/ds1/500x1024k$i bs=1024k count=500 ; done
5” copied data to second dataset
cp -v /Pools/Pool-z1-noded/ds1/* /Pools/Pool-z1-noded/ds2/

                              capacity     operations     bandwidth 
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-noded                519G  5.06T      0     21      0  84.7K
  raidz1                     519G  4.95T      0     20      0  80.7K
    sdb                         -      -      0      0      0  4.00K
special                         -      -      -      -      -      -
[...]
  nvme0n1                    346M   119G      0      0      0  4.00K

6" destroyed dataset and exported zpool

time zfs destroy Pool-z1-noded/ds1 -r -R -f ; time zpool export Pool-z1-noded

real	0m0.341s
user	0m0.000s
sys	0m0.029s

real	0m5.354s
user	0m0.012s
sys	0m2.075s

Freeing allocated data takes seconds
7" imported zpool

time zpool import Pool-z1-noded

real	0m6.275s
user	0m0.038s
sys	0m0.207s
zpool iostat -v Pool-z1-noded 5
                              capacity     operations     bandwidth 
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
Pool-z1-noded                260G  5.32T      6     31   241K   282K
  raidz1                     259G  5.20T      0     25  61.0K   214K
[...]
special                         -      -      -      -      -      -
  nvme0n1                    174M   119G      5      6   180K  67.2K
--------------------------  -----  -----  -----  -----  -----  -----

As mentioned in my original post, there is no issue when deduplication is off. Considering how fast the data is destroyed without a DDT, in my opinion async_destroy is not the issue, at least not on its own.

Everything points to activity in the deduplication table.
I still have no idea why my nvme drive is not speeding things up; it's certainly being used for data.

@richardelling
Contributor

frees are throttled. does this affect your results?
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_free_max_blocks
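A hedged sketch of inspecting and changing that tunable at runtime (the parameter was renamed zfs_async_block_max_blocks in 0.8 and, like most zfs module parameters, should be writable through sysfs while the module is loaded):

cat /sys/module/zfs/parameters/zfs_async_block_max_blocks            # current per-txg free limit (100000 by default)
echo 1000 > /sys/module/zfs/parameters/zfs_async_block_max_blocks    # throttle background frees further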

@adam-blaszczykowski
Author

frees are throttled. does this affect your results?
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_free_max_blocks

I've changed zfs_async_block_max_blocks (formerly zfs_free_max_blocks) from its default value to 1000. Same test scenario as before (dataset destruction, export, import):
with special device:

time zfs destroy Pool-z1-max1000/ds1 -r -R -f ; time zpool export Pool-z1-max1000 ; time zpool import Pool-z1-max1000

real	0m2.881s
user	0m0.000s
sys	0m0.324s

real	0m51.103s
user	0m0.008s
sys	0m3.270s

real	0m9.158s
user	0m0.061s
sys	0m0.217s

without special device:

time zfs destroy Pool-z1-max1000-nospec/ds1 -r -R -f ; time zpool export Pool-z1-max1000-nospec ; time zpool import Pool-z1-max1000-nospec

real	0m3.280s
user	0m0.006s
sys	0m0.410s

real	0m44.096s
user	0m0.006s
sys	0m3.355s

real	0m50.807s
user	0m0.053s
sys	0m0.214s

I also did one test powering down the server after the dataset destruction: the import was under a minute.
Freeing 240G in the background takes almost 2 hours, with and without special devices, but it finally works as it should: in the background, without heavily impacting command execution.
Do you have any further tips, or some formula to calculate and optimize the free max blocks / async block max blocks number?
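(A rough back-of-envelope estimate, assuming at most zfs_async_block_max_blocks blocks are freed per txg and the default txg interval of roughly 5 seconds:

240G / 128K recordsize            ≈ 1.9M blocks to free
1.9M blocks / 1000 blocks per txg ≈ 1900 txgs
1900 txgs * ~5 s per txg          ≈ 9500 s ≈ 2.6 hours

which is in the same ballpark as the ~2 hours observed; txgs can commit faster than the timeout when little other data is dirty.)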

@adam-blaszczykowski
Author

I was able to obtain several SSDs and run the same scenario as before (8x SSD, 2x dataset with 128K recordsize, deduplication on, and copied data).
zfs_async_block_max_blocks was tested with 2 settings: the default and cut down to 1000.

  1. 100000 (default): freeing around 12-13G per refresh (change seen while watching the DDT and the freeing property); about 2 minutes for destroy-export-import
time zfs destroy Pool-raidz1/ds1 -r -R -f ; time zpool export Pool-raidz1 ; time zpool import Pool-raidz1

real	0m17.631s
user	0m0.009s
sys	0m0.001s

real	1m55.073s
user	0m0.020s
sys	0m0.703s

real	0m0.742s
user	0m0.027s
sys	0m0.115s

  2. 1000: freeing around 900MB per refresh; less than 10 s for destroy-export-import
time zfs destroy Pool-raidz1/ds3 -r -R -f ; time zpool export Pool-raidz1 ; time zpool import Pool-raidz1

real	0m0.631s
user	0m0.002s
sys	0m0.009s

real	0m4.235s
user	0m0.005s
sys	0m0.121s

real	0m3.800s
user	0m0.012s
sys	0m0.090s

This shows that SSDs should be highly recommended when using deduplication.
I still have questions about tuning kernel parameters to optimize these operations on HDDs.
I'm at the point where I have to compromise: either drastically lower the zfs_async_block_max_blocks value and prolong the background freeing time, taking a toll on the overall performance of read/write operations until space reclamation finishes, or leave the default setting and risk a very long zpool import. Is it possible to modify zfs_async_block_max_blocks on the fly? I know it's a kernel module parameter, but does it allow such modification without exporting and importing the zpool and/or completely rebooting the machine? (I could implement a mechanism that checks whether a freeing operation is in progress and changes this parameter accordingly.) There are many other txg-related parameters; maybe modifying some of them would have a real impact on speeding up space reclamation on HDDs?
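A minimal sketch of the mechanism described above, assuming zfs_async_block_max_blocks is writable at runtime through sysfs (pool name and the 1000/100000 values are taken from the tests in this thread):

#!/bin/sh
# Lower the per-txg free limit while the pool is reclaiming space, restore the default once it is done.
PARAM=/sys/module/zfs/parameters/zfs_async_block_max_blocks
POOL=Pool-raidz1
if [ "$(zpool get -H -o value freeing "$POOL")" != "0" ]; then
    echo 1000 > "$PARAM"       # freeing in progress: keep import/export and foreground IO responsive
else
    echo 100000 > "$PARAM"     # no pending frees: allow full-speed reclamation again
fi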

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Oct 24, 2019
@stale

stale bot commented Oct 23, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Oct 23, 2020
@stale stale bot closed this as completed Jan 21, 2021