-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MultiScene writer handling of multiple delayed objects #1639
Fix MultiScene writer handling of multiple delayed objects #1639
Conversation
If no groups are specified the cf_writer now returns only one delayed object. Therefore distributed writing of netcdfs is now possible with Multiscenes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about the internals of the CF writer, but what about having multiple delayed objects returned stops MultiScene
from working?
satpy/multiscene.py
Outdated
log.info("Finished saving %d scenes", idx) | ||
idx += 1 | ||
# for future in future_list: | ||
future_list.result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this left over from testing or is the for loop actually not needed?
@djhoese scratch that. Sorry for the confusion. You are quite right in general there is no problem with multiple delayed objects hence the for loop. I experimented with this a while ago and I fixed something at the wrong point (fixed the cf writer to return only one delayed which in turn needed removal of the loop). What I think actually needs fixing is: Line 348 in 2245c9a
While this catches situations with writers returning (source, target) combinations it also catches the cf writer case because it always returns a list of at least two delayed objects. This if clause should be more specific so it does not apply to the cf writer. |
Ah yes, that does seem like too simplistic of a condition. There may be utilities in |
I checked The changes to the reader are strictly not necessary then. The only "advantage" being that in the case with no groups only one delayed will be returned and therefore only one file access is needed but I am not sure if that has any relevant benefit. So I would propose to add the additional condition in the distributed writing and roll back the changes in the writer and maybe keep some of the code simplifying changes only. |
The MultiScene could use satpy/satpy/writers/__init__.py Lines 487 to 514 in 3984453
sort_writer_results .
|
Codecov Report
@@ Coverage Diff @@
## main #1639 +/- ##
=======================================
Coverage 92.66% 92.67%
=======================================
Files 258 258
Lines 37988 38024 +36
=======================================
+ Hits 35200 35237 +37
+ Misses 2788 2787 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Yes that's were I got the idea from. I am not really that familiar with all the writers and how their return values look like. It is very possible that what I proposed is not covering it and If using |
I don't think there is anything logically wrong with what you're suggesting, but if you use |
Sorry for the delay. I changed it to use |
satpy/multiscene.py
Outdated
@@ -345,7 +345,8 @@ def load_data(q): | |||
|
|||
for scene in scenes_iter: | |||
delayed = scene.save_datasets(compute=False, **kwargs) | |||
if isinstance(delayed, (list, tuple)) and len(delayed) == 2: | |||
sources, targets, _ = split_results(delayed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the _
be overwriting delayed
? This way, in the future, if the MultiScene can handle source/target combinations and delayed objects the code won't have to be changed. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I thought about it and I think it should not be overwritten since the result of split_results
is only needed for the check of sources. Currently delayed
is allowed to be a target/ list of targets (e.g. simple image writer) or a delayed/ list of delayeds (e.g. CF writer). So if delayed
got overwritten some additional code to either hand targets
or delayed
to client.compute
in line 355 would be needed. So for the check only sources
is needed and targets
could also be _
.
One thing I noticed was that it is not well documented what each writer returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Targets should never be returned without a source. They are source + target pairs. The simple image writer should return delayed objects. The geotiff writer returns source (dask array) + target (output file-like object). If you have another example, then I still propose that _
be replaced with delayed
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see now I got fooled by the test which returned targets without sources from split_results
for the simple image writer because the mock did not pass as a delayed object. I changed the code to overwrite delayed now and also fixed the test. Additionally I added a test for writers with source/target combinations to fail.
I re-titled this after it seemed like most of the work was in getting the MultiScene to work. However, I realize you still did a lot of work on the CF writer. Is this work still needed? If so, would you mind merging with the pytroll main branch and resolve the conflicts in the CF writer? |
satpy/multiscene.py
Outdated
@@ -341,17 +341,20 @@ def load_data(q): | |||
|
|||
input_q = Queue(batch_size if batch_size is not None else 1) | |||
load_thread = Thread(target=load_data, args=(input_q,)) | |||
# set threads to daemon so they are killed if error is raised from main thread | |||
load_thread.daemon = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be set in the Thread
init right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed. Changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
To support groups the
cf_writer
was implemented in a way that even if no groups were specified two delayeds were returned. This made it impossible to use distributed writing with multiscenes.Now if now groups are used a single delayed is returned since all datasets can be written at once without appending to the file.