speed up tile download with concurrent processing #182

Merged: 7 commits, Aug 31, 2020

Contributor

@martham93 martham93 commented Aug 27, 2020

to address issue #12

  • uses ThreadPoolExecutor to speed up tile downloads
  • previously, downloading ~400 tiles from a Mapbox endpoint took ~31 seconds; with the ThreadPoolExecutor, downloading the same ~400 tiles takes ~8 seconds!
  • if this change aligns with what you were thinking, I think it's possible to do something similar with the child tile download for the supertiles special case, and I can investigate
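
For readers following along, here is a minimal sketch of the pattern described in the list above. It is not the code from this PR: `fake_download` is a hypothetical stand-in that sleeps instead of calling a tile endpoint, and the tile ids are made up. It only illustrates why running I/O-bound downloads in a thread pool can cut wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(tile):
    """Hypothetical stand-in for a tile download; sleeps to mimic network latency."""
    time.sleep(0.01)
    return tile


def download_sequential(tiles):
    # One request at a time: total time is roughly the sum of all latencies.
    return [fake_download(tile) for tile in tiles]


def download_concurrent(tiles, max_workers=8):
    # executor.map preserves input order and blocks until all calls finish.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_download, tiles))


tiles = [f"{x}-{y}-12" for x in range(5) for y in range(8)]  # 40 made-up tile ids

start = time.perf_counter()
sequential = download_sequential(tiles)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent_result = download_concurrent(tiles)
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

The actual speedup depends on the endpoint, rate limits, and network, so the timings reported in this thread should not be expected from the sketch.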

cc @drewbo

@martham93 martham93 requested a review from drewbo August 27, 2020 22:10
Contributor

@drewbo drewbo left a comment

Awesome and glad it will speed things up considerably. A few notes:

  • the thread count should be configurable via CLI, let's add it as a flag (not in the config.json since it isn't data related). Note that threads != CPU cores, so you can try like ~50 and see if you get faster downloads (my guess is somewhere between 20-50 is the fastest)
  • because we don't use the result of the function anywhere, we don't need to use .as_completed or .result. I'd recommend using shutdown or wait and making sure that the image function itself logs failures
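
A rough sketch of the second point, under stated assumptions: `fetch_tile` and this simplified `image_function` signature are hypothetical, and the `failures` list exists only so the behavior can be checked here. The idea is that the worker logs its own errors, so the caller can submit tasks, ignore the returned futures, and call `shutdown(wait=True)` to block until everything is done.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("tile_download_sketch")


def fetch_tile(tile):
    """Hypothetical fetch; raises for one tile to simulate a failed request."""
    if tile == "bad-tile":
        raise IOError("simulated download error")


def image_function(tile, failures):
    # The worker handles and logs its own errors, so nobody needs .result().
    try:
        fetch_tile(tile)
    except Exception as exc:
        logger.warning("failed to download %s: %s", tile, exc)
        failures.append(tile)  # illustration only, lets the caller inspect failures


def download_all(tiles, threadcount):
    failures = []
    executor = ThreadPoolExecutor(max_workers=threadcount)
    for tile in tiles:
        executor.submit(image_function, tile, failures)  # returned future is ignored
    executor.shutdown(wait=True)  # block until every submitted task has finished
    return failures


failed = download_all(["1-2-3", "bad-tile", "4-5-6"], threadcount=3)
```

Without the `try`/`except` inside the worker, an exception would be stored on the ignored future and never surface, which is why the logging has to live in the image function itself.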

@drewbo
Contributor

drewbo commented Aug 28, 2020

Re: supertile child download: that function will already be happening inside the ThreadPool (because download_tile_tms is one of the image_download options)

@martham93
Contributor Author

the download time for ~400 tiles with a thread count of 50 is ~4 seconds!

image_function(tile, imagery, tiles_dir, kwargs)
t = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=threadcount) as executor:
    {executor.submit(image_function, tile, imagery, tiles_dir, kwargs): tile for tile in tiles}
Contributor

this can be a list comprehension instead of dict
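
To illustrate the suggestion, a small sketch assuming a stand-in `image_function` and a placeholder imagery URL (neither is taken from the project). A dict keyed by future is mainly useful for mapping a finished future back to its tile; when that lookup is never done, a plain list of futures is enough.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def image_function(tile, imagery, tiles_dir, kwargs):
    """Hypothetical stand-in for the real download function."""
    return f"{tiles_dir}/{tile}.jpg"


tiles = ["1-2-3", "4-5-6", "7-8-9"]  # made-up tile ids
imagery = "http://example.com/{z}/{x}/{y}.jpg"  # placeholder URL template

with ThreadPoolExecutor(max_workers=3) as executor:
    # List comprehension: one future per tile, no future-to-tile mapping kept.
    futures = [
        executor.submit(image_function, tile, imagery, "tiles", {})
        for tile in tiles
    ]
    done, not_done = wait(futures)  # block until all futures complete
```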

@drewbo
Contributor

drewbo commented Aug 28, 2020

@martham93 sweet; looks better. Last two things: (1) make the default thread count ~10 and then people can up it if they want and (2) either remove the time elapsed or print it with nicer formatting
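
One possible shape for both requests, sketched with the standard library only. The flag name `--threadcount` and the message wording are assumptions for illustration; the flag and output actually merged in this PR may differ.

```python
import argparse
import time

parser = argparse.ArgumentParser(description="tile download sketch")
# Hypothetical flag name; defaults to 10 threads unless the user overrides it.
parser.add_argument("--threadcount", type=int, default=10,
                    help="number of download threads (default: 10)")

args = parser.parse_args([])  # no flag given: falls back to the default
override = parser.parse_args(["--threadcount", "50"])  # user raises the count


def format_elapsed(seconds):
    """Render a duration with two decimals for readable log output."""
    return f"downloaded tiles in {seconds:.2f} seconds"


start = time.perf_counter()
# ... the download work would happen here ...
message = format_elapsed(time.perf_counter() - start)
print(message)
```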

@martham93
Contributor Author

awesome thanks @drewbo all of these should be addressed now

@martham93 martham93 requested a review from drewbo August 28, 2020 20:50
@martham93 martham93 merged commit 0691422 into master Aug 31, 2020
@drewbo drewbo deleted the tile_download branch August 31, 2020 18:39