Parallel raster optimisation #228
Conversation
We often use […]. In any case, perhaps worth a try to change to […].

Using […]. For problem 1, I suggest replacing the progress bars with a spinner (#171) and log messages (and a […]):

```
$ terracotta optimize-rasters *.tif
Optimizing 17 files on 5 processes
foo1.tif ... (1/17)
foo2.tif ... (2/17)
foo3.tif ... (3/17)
foo4.tif ... (4/17)
foo5.tif ... (5/17)
<spinner>
```

But I think the real problem here is memory usage. We either need some heuristic that ensures we don't over-commit, or disable concurrency by default. Otherwise we risk crashing people's machines. BTW, since all of the heavy lifting happens inside GDAL, I suggest you also try using a thread pool instead of processes. Chances are, it's the same speed with less overhead.
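A rough sketch of the thread-pool idea, for reference: `multiprocessing.pool.ThreadPool` exposes the same API as `multiprocessing.Pool`, so swapping them is essentially a one-line change. The `optimize_raster` worker and file list below are hypothetical placeholders, not terracotta's actual code.

```python
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool
from pathlib import Path


def optimize_raster(path: Path) -> Path:
    # Hypothetical worker: the heavy lifting happens inside GDAL, which releases
    # the GIL, so threads can overlap the work much like separate processes.
    ...
    return path


def optimize_all(paths: list[Path], num_workers: int = 4) -> None:
    with ThreadPool(num_workers) as pool:
        for done in pool.imap_unordered(optimize_raster, paths):
            print(f"Optimized {done.name}")
```

Switching back to processes only requires importing `multiprocessing.Pool` instead; the rest of the call sites stay the same.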
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   98.46%   98.57%   +0.10%
==========================================
  Files          45       45
  Lines        2221     2251      +30
  Branches      278      287       +9
==========================================
+ Hits         2187     2219      +32
+ Misses         19       18       -1
+ Partials      15       14       -1
```

Continue to review full report at Codecov.
The branch is now updated to use a […]. On my test data, the optimization process takes somewhere between 1:00 and 1:10 (min:sec) when using a […]. The process indicator now shows […], where the progress bar shows the overall progress.

I still cannot figure out how to find out which files are currently being processed (i.e., how to be notified when the pool starts processing a new file) -- only which files are done being processed. Hence the 'Optimized' when a file is completed, rather than the 'Optimizing...' you showed in your example. […] I have added the […].
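For reference, a minimal sketch of this completion-only notification, assuming a hypothetical module-level `optimize_raster` worker: `Pool.apply_async` accepts a `callback` that fires in the parent process when a task finishes, but there is no built-in hook for the moment a worker picks a task up.

```python
import multiprocessing
from pathlib import Path


def optimize_raster(path: Path) -> Path:
    ...  # hypothetical stand-in for the per-file optimization
    return path


def optimize_all(paths: list[Path]) -> None:
    total = len(paths)
    done = 0

    def on_done(path: Path) -> None:
        # Runs in the parent process once a task has finished -- we only ever
        # learn about completed files, not about files that are being started.
        nonlocal done
        done += 1
        print(f"Optimized {path.name} ({done}/{total})")

    with multiprocessing.Pool() as pool:
        for p in paths:
            pool.apply_async(optimize_raster, (p,), callback=on_done)
        pool.close()
        pool.join()
```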
Let's stay with processes then. I like it better when the file currently being optimized is printed. This makes it a lot easier to debug hangs or running out of memory (because you know which file is the culprit). For this I would simply move the print to the subprocesses. Also, could we switch the progress bar for a spinner? When optimizing large rasters this can hang for a long time, so I would like to give users the feeling that something is happening. Here's a snippet to get you started:

```python
import time

import click
import click_spinner


@click.command()
def main():
    with click_spinner.spinner():
        i = 0
        while True:
            click.echo(f"\rfoo {i}")
            time.sleep(4)
            i += 1


if __name__ == "__main__":
    main()
```
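To make the "move the print to the subprocesses" idea concrete, here is a rough sketch combining the spinner above with a per-file message emitted from each worker; the `optimize_raster` body is a hypothetical placeholder for the actual GDAL work.

```python
import multiprocessing
import time
from pathlib import Path

import click
import click_spinner


def optimize_raster(path: Path) -> Path:
    # Printed from inside the worker process, so the file currently being
    # worked on is visible even if the task later hangs or runs out of memory.
    click.echo(f"\rOptimizing {path.name}")
    time.sleep(1)  # placeholder for the actual optimization
    return path


def optimize_all(paths: list[Path]) -> None:
    with click_spinner.spinner(), multiprocessing.Pool() as pool:
        pool.map(optimize_raster, paths)
```

Because each worker prints independently, messages from near-simultaneous workers can interleave, which is the slight out-of-order logging mentioned further down in this thread.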
It now prints something like […] while optimising the raster files. Does that match your intended style?
It's sort of a nitpick, but could we match the style of #228 (comment) exactly?
Do you want the dots and […]?
No alignment or truncation or fancy stuff. Just don't repeat "optimizing" and don't log when you're done :)
Updated to: […]

In order to know which files might be causing problems, we need to know which files are currently being processed, right? As the optimization of files probably won't complete in order, I thought we needed either […].

I considered the second option -- I think it would be a nice solution -- but deemed it overkill. Btw, occasionally the logging of files happens slightly out-of-order (file2 might log before file1) when both processes are started almost-simultaneously. Then the order seems slightly odd (prints […]).
Looks really good already, let's just get this polished a little more :)
Yes, no problem.
Status update: […]
To trigger an exception you can use monkeypatch to replace […].
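A generic sketch of that approach with pytest's `monkeypatch` fixture; `some_module`, `optimize_raster`, and `optimize_all` are hypothetical names, not terracotta's actual API.

```python
import pytest

import some_module  # hypothetical module under test


def test_worker_error_is_reported(monkeypatch):
    def boom(*args, **kwargs):
        raise RuntimeError("simulated failure inside the worker")

    # Replace the real worker so every task fails deterministically.
    # Note: with a process pool this only propagates to the workers under the
    # 'fork' start method; with a thread pool it always applies.
    monkeypatch.setattr(some_module, "optimize_raster", boom)

    with pytest.raises(RuntimeError):
        some_module.optimize_all(["foo.tif"])
```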
Thanks! This turned out to be quite a bit of work, but I think it was worth it. This has much better UX than before.
Use a `multiprocessing.Pool` to optimize raster files in parallel (each raster file is processed individually). Closes #55.

Currently uses `multiprocessing.cpu_count()` as the number of worker threads. This seems to be the number of logical (hyper-threading) cores; I'm unsure whether the number of physical cores might be preferable (see the sketch below).
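As a side note on that open question, a small sketch of the difference; `psutil` is only an assumption here (it is not something this PR uses), but it can report physical cores directly.

```python
import multiprocessing

# Logical cores, i.e. including hyper-threading siblings:
print(multiprocessing.cpu_count())

# Physical cores, assuming psutil is installed (not a terracotta dependency):
# import psutil
# print(psutil.cpu_count(logical=False))
```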
Currently has some caveats:

1: The sub-progress bars are removed, such that only the overall progress is shown. (And the filename postfix now shows the just-completed file, rather than the file currently being processed.)

2: When raising an exception because optimized files already exist (and neither the `--overwrite` nor the `--skip-existing` flag is set), the entire stack trace is now shown, where it previously just showed a nice little message.

3: The `test_reoptimize` test in `tests/scripts/test_optimize_rasters.py` fails. Not because the code behaves wrongly, but because of the way exceptions inside threads are handled. When submitting a task to the `Pool`, an `error_callback` function is given, which receives any error that occurs in the thread (and handles it in the main thread). But apparently pytest recognizes that an error has been thrown inside the threads, and fails the test based on that. If a preprocessed file already exists, and `terracotta optimize-rasters` is run without the `--overwrite` or `--skip-existing` flags, then the expected behaviour is to throw an exception. But somehow pytest expects the exception to be raised differently, or something like that? And I can't figure out how to make the test accept the new exception flow, such that the test only checks whether the code handles pre-existing optimized files properly, and not where/how the exception is raised. (A generic sketch of the `error_callback` flow is shown below.)
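For illustration, a minimal, self-contained sketch of the `error_callback` flow described in caveat 3: errors raised in workers are collected in the parent and re-raised after the pool has finished. The names `optimize_raster` and `optimize_all` are hypothetical stand-ins, not terracotta's actual functions.

```python
import multiprocessing
from pathlib import Path


def optimize_raster(path: Path) -> Path:
    out = path.with_name(path.stem + "_optimized.tif")
    if out.exists():
        raise FileExistsError(f"{out} already exists (use --overwrite or --skip-existing)")
    ...  # placeholder for the actual optimization work
    return path


def optimize_all(paths: list[Path]) -> None:
    errors: list[BaseException] = []

    with multiprocessing.Pool() as pool:
        for p in paths:
            # error_callback runs in the parent process whenever a task raises
            pool.apply_async(optimize_raster, (p,), error_callback=errors.append)
        pool.close()
        pool.join()

    if errors:
        # Re-raise in the parent so callers (and pytest) see an ordinary exception
        raise errors[0]
```

Because the exception is re-raised in the parent process, a plain `pytest.raises(FileExistsError)` around `optimize_all` should still see it as a normal exception, regardless of which worker originally hit it.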