Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add information about multi-threaded compression with geotiff creation #968

Merged
merged 1 commit into from
Dec 6, 2019

Conversation

djhoese
Copy link
Member

@djhoese djhoese commented Nov 13, 2019

Adding the num_threads option to my save_datasets calls has reduced my processing time down a ton. For example some processing that was taking 20m on a docker container on an older machine is now taking 5m30s by specifying num_threads=8. This PR adds documentation to the FAQ page about using this option.

  • Tests passed
  • Fully documented

Other options

As mentioned on Slack by @pnuu, maybe we should consider setting this by default or changing the default compression algorithm in trollimage to be something other than DEFLATE. If this is done I would recommend no compression.

@djhoese djhoese requested review from mraspaud and pnuu November 13, 2019 19:09
@djhoese djhoese self-assigned this Nov 13, 2019
@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 86.772% when pulling 8c610bf on djhoese:doc-geotiff-thread-compression into 2a24016 on pytroll:master.

@codecov
Copy link

codecov bot commented Nov 13, 2019

Codecov Report

Merging #968 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #968      +/-   ##
==========================================
+ Coverage   86.76%   86.77%   +<.01%     
==========================================
  Files         179      179              
  Lines       27298    27298              
==========================================
+ Hits        23686    23687       +1     
+ Misses       3612     3611       -1
Impacted Files Coverage Δ
satpy/writers/geotiff.py 90.56% <100%> (ø) ⬆️
satpy/scene.py 90.53% <0%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a24016...8c610bf. Read the comment docs.

Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for adding this valuable info!

Copy link
Member

@pnuu pnuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I'd prefer to keep the default DEFLATE compression. The imagery are saved only once, but e.g. in WMS usage read plenty of times so the fast reading is a must. I'd rather use the DASK_NUM_WORKERS as the default for num_threads to speed things up in a normal setup.

@djhoese
Copy link
Member Author

djhoese commented Nov 13, 2019

I'd rather use the DASK_NUM_WORKERS as the default for num_threads to speed things up in a normal setup.

I think we could use dask.config.get('num_workers') for that. So I guess it would be:

num_threads = kwargs.get('num_threads', dask.config.get('num_workers', 1))

@pnuu
Copy link
Member

pnuu commented Nov 13, 2019

There seems to be one gotcha in that. By default dask uses all threads, but:

$ python -c 'import dask; print(dask.config.get("num_workers", "nope"))'
nope

while

$ DASK_NUM_WORKERS=4 python -c 'import dask; print(dask.config.get("num_workers", "nope"))'
4

So by default the compression would use only one thread even if everything else used all available resources.

@djhoese
Copy link
Member Author

djhoese commented Nov 13, 2019

So by default the compression would use only one thread even if everything else used all available resources.

Yeah, that was my intention. I was hoping to avoid accidentally killing the user's machine. I thought 1 was safer. We could also do ALL_CPUS as the default which GDAL understands to mean the same thing. Maybe that would best to match dask?

@mraspaud
Copy link
Member

Do we want to fix the number of threads in this PR, or make a separate one ?

@djhoese
Copy link
Member Author

djhoese commented Nov 15, 2019

I think we should either change the default threads or the default compression (to no compression) and do it in trollimage. Doing either of those means this PR should be updated to reflect those changes.

I'm a little worried/uncomfortable with how we are changing GDAL's defaults to do what we want and making those changes effect everyone. Granted, we are using dask so changing the default number of threads kind of makes sense. Setting the default compression though was maybe not the "cleanest" solution.

@pnuu
Copy link
Member

pnuu commented Nov 16, 2019

If the user is asking for N threads, it isn't completely out-of-the-box-thinking that it would happen consistently all through the computation.

@djhoese
Copy link
Member Author

djhoese commented Nov 16, 2019

@pnuu True. To be clear, GDAL and dask are likely not using the same threads so I guess there could be some performance/kernel concerns.

@mraspaud
Copy link
Member

mraspaud commented Dec 6, 2019

I'm merging this as I think this information is useful as it is. Please make another PR if you think this should be further modified.

@mraspaud mraspaud merged commit 7219160 into pytroll:master Dec 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants