Long import time with enabled deduplication #9339
Comments
Having async_destroy enabled causes the destroy to be done in the background instead of synchronously. Exporting the pool suspends that work and removes all related buffers from the ARC (this includes the DDT). When the async destroy continues on import, the whole DDT needs to be read back in from the disks, leading to a random read workload that can be massive: having the DDT on spinning drives (with IOPS budgets in the low three-digit range) will naturally occupy the drives for an extended timespan (amount of reads needed * average seek time), while at the same time competing for the available IOPS with updates to the DDT that have to be persisted. If you want to use dedup and avoid that problem (though an async destroy with a cold ARC is basically the worst case), you should make sure the metadata (DDT) is on fast solid-state media that does not have a seek-time penalty for random reads.
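As a rough, purely illustrative calculation (the numbers here are assumptions, not measurements from this issue): if resuming the async destroy requires on the order of 2 million random reads to pull the DDT and related metadata back in, and a spinning raidz vdev sustains roughly 200 random read IOPS, that alone is about 2,000,000 / 200 = 10,000 seconds, i.e. close to three hours, before accounting for the competing DDT update writes.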
I don't know if I understand the special device function correctly - does it hold metadata (or the DDT) instead of spreading it across all drives, shortening the ARC rebuild time after export/import?
Yes, speeding up access to on-disk metadata by putting it onto a fast device (with as many IOPS as one can get, so metadata can be pulled into the ARC as fast as possible) is the idea behind allocation classes. I take it for granted that you have recreated the complete test data from scratch after having added the special devices to the pool, as changing the pool layout only affects new writes, while all existing data (which includes any metadata and the DDT) stays exactly where it already is.

Using deduped blocks adds the overhead of updating the DDT (which might need to be pulled from disk), both on writes and on frees, adding a random read/write load on top of the random reads for the metadata to be traversed (to find the blocks to be released) and the writes needed to update the metaslabs on the pool drives. I would expect that additional load to happen exclusively on the special/dedup device, with a special device giving a higher speedup than a dedup one, as the former would offload both metadata and the DDT from the main pool. Could you please verify that this works as expected?

To test the actual performance (or increases to it stemming from special devices) it would make sense to do the test with steps 5 and 6 swapped: export/import the pool after the copy has completed (and the pool is idle again) to clear the ARC, then destroy the dataset and measure the time (plus collect performance data, as outlined below) till the pool is idle again.
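For collecting that performance data, a minimal sketch (the commands are standard ZFS tooling, the intervals are arbitrary) of what could run in separate terminals while the destroy works through the pool:

```sh
# Per-vdev read/write IOPS and bandwidth every 5 seconds - shows whether
# the load actually lands on the special/dedup vdev or on the raidz drives.
zpool iostat -v Pool-z1-aon 5

# How much data is still queued for background freeing.
watch -n 10 zpool get freeing Pool-z1-aon

# Dedup table summary (entry counts and refcounts) to watch the DDT shrink.
zpool status -D Pool-z1-aon
```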
The difference between deleting data (which seems to be too slow for your liking) on the dataset and destroying the whole dataset is that a delete happens synchronously (with rm hanging till completion) while the dataset destroy happens fully in the background. DDT-wise the workload should be practically identical (as the DDT tracks on-disk blocks at the pool level), so I think the overall delay you see on the export/import cycle should basically be identical to what you would see when doing a delete.

So to backtrack to your original problem (the export/import taking too long)... the problem might be caused by async_destroy generating load, both on export (keeping the pool from settling into a state where it can deactivate) and on import (kicking in so soon that the generated IO load keeps the import from completing in a timely manner). Possibly the long export/import times could be solved by having the async_destroy feature pause itself as soon as a pool is marked for export and having it wait for the import to be fully completed before unpausing, so it keeps itself better out of the way of the export/import processes (not competing with them by generating load). It would be interesting to know whether the two cases (export/import to clear the ARC then destroy, vs. destroy then export/import) differ vastly in performance (till the pool is idle again).

That's one issue. The other is the question why a special or dedup device doesn't seem to speed things up in a relevant manner (which I would expect it to). What might happen is that destroying the test datasets spreads frees over many metaslabs and the sheer amount of frees causes the main pool vdevs to be overloaded by metadata IO (updates to the metaslabs / their space maps). Unsure, need more information (see above). Collecting performance-relevant data (as outlined above) could help to answer that; posting the actual commands (and results) would help others to reproduce your issue. The only other thing I could think of is not explicitly specifying ashift when adding the special device.
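For reference, a hedged sketch of adding a special vdev with ashift given explicitly (the device names are placeholders; whether -o ashift is actually needed depends on how the SSDs report their sector size):

```sh
# Add a mirrored special vdev using the same ashift as the main pool (12 = 4K sectors).
zpool add -o ashift=12 Pool-z1-aon special mirror sdx sdy
```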
I think this goes without saying.
Here are the exact steps and output from my tests (special device with ashift set to match the pool; other than that the base test structure is the same as I used before).
6b. monitored data movement (separate terminal windows)
The pool's 'freeing' value dropped to slightly above 100G just before it successfully exported.
Afterwards 'freeing' slowly drops to 0 and the zpool is idle.
This sums up the first test with the special device ashift defined. Now for the test with export and dataset destruction switched, using the same zpool as created above:
Freeing allocated data took around 20 minutes
Change after completely freeing data (consecutive results show 1.3G being freed from the special device)
Overall, import, export and destruction of the dataset always take around 20 minutes.

**And now the first scenario, but with deduplication turned off**
4. put sample data on first dataset
6. destroyed dataset and exported zpool
Freeing allocated data takes seconds
As mentioned in my original post, there is no issue when deduplication is off. Considering how fast data is destroyed without a DDT, in my opinion async_destroy is not the issue, at least not on its own. Everything points to movement in the deduplication table.
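To put numbers on how much DDT actually has to move, a short sketch (pool name taken from the report below, interpretation of the output left to the reader):

```sh
# Summary line plus refcount histogram of the dedup table.
zpool status -D Pool-z1-aon

# More detailed DDT statistics, including on-disk and in-core size per entry.
zdb -DD Pool-z1-aon
```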
Frees are throttled. Does this affect your results?
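For context, a sketch of the module parameters that throttle background frees in the 0.8-era code - treat the names and values as assumptions to be checked against zfs-module-parameters(5) on the system in question:

```sh
# Milliseconds per txg spent processing frees (default 1000).
cat /sys/module/zfs/parameters/zfs_free_min_time_ms

# Upper bound on the number of blocks freed in a single txg.
cat /sys/module/zfs/parameters/zfs_async_block_max_blocks

# Example only: allow more time per txg for frees.
echo 2000 > /sys/module/zfs/parameters/zfs_free_min_time_ms
```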
I've changed
without special device:
I also did one test powering down the server after dataset destruction - the import took under a minute.
I was able to obtain several SSDs and run the same scenario as before (8x SSD, 2 datasets with 128K block size, deduplication enabled and the same copied data).
This proves that using SSDs should be highly recommended when using deduplication.
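A hedged sketch of how such a pool could be laid out from the start, with metadata and the DDT on SSDs (device names are placeholders, not the reporter's actual layout):

```sh
# Data on the spinning raidz1, metadata on a mirrored SSD special vdev,
# and the dedup table on its own mirrored SSD dedup vdev.
zpool create -o ashift=12 Pool-z1-aon \
  raidz1 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
  special mirror sdl sdm \
  dedup mirror sdn sdo
```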
System information
Describe the problem you're observing
After testing various scenarios I've reached the conclusion that whenever there is movement in the DDT and the zpool is exported/imported, or the system is powered down while a background operation on the dedup table is in progress, a long import time afterwards is guaranteed.
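One way to avoid hitting this (a sketch, assuming the pool name used below) is to confirm that background freeing has finished before exporting or shutting down:

```sh
# Non-zero 'freeing' means an async destroy is still working in the background.
zpool get freeing Pool-z1-aon

# On OpenZFS 2.0 and later, block until background freeing has completed.
zpool wait -t free Pool-z1-aon
```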
Describe how to reproduce the problem
In my scenario a regular export/import took over 15 minutes; an import after powering down the machine took almost 20 minutes. In both cases zpool status -D still shows entries with refcnt 2 present, and it takes a few minutes for them to completely clean up. Performing another reboot/export before that is done also takes a long time.
The pool is created with the following parameters:
zpool create Pool-z1-aon raidz1 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk -O acltype=posixacl -O compression=off -O dedup=off -o ashift=12 -o failmode=wait -o multihost=on -m /Pools/Pool-z1-aon -f
async_destroy is enabled
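A minimal sketch of the reproduction flow described in this thread (dataset name, recordsize and the amount of sample data are assumptions based on the tests above):

```sh
# Dedup-enabled dataset with 128K records, as in the tests above.
zfs create -o dedup=on -o recordsize=128K Pool-z1-aon/test1

# ... copy a sizeable amount of duplicated sample data onto it, then destroy it:
zfs destroy Pool-z1-aon/test1

# Export while the async destroy is still running in the background,
# then time the re-import and watch the cleanup finish.
zpool export Pool-z1-aon
time zpool import Pool-z1-aon
zpool get freeing Pool-z1-aon
zpool status -D Pool-z1-aon
```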
Is this expected behavior? If yes, how can it be optimized? Larger datasets can take hours to import.
Include any warning/errors/backtraces from the system logs