Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MultiScene writer handling of multiple delayed objects #1639

Merged
merged 8 commits into from
May 17, 2021

Conversation

BENR0
Copy link
Collaborator

@BENR0 BENR0 commented Apr 13, 2021

To support groups the cf_writer was implemented in a way that even if no groups were specified two delayeds were returned. This made it impossible to use distributed writing with multiscenes.
Now if now groups are used a single delayed is returned since all datasets can be written at once without appending to the file.

  • Tests added

If no groups are specified the cf_writer now returns only one delayed
object. Therefore distributed writing of netcdfs is now possible with
Multiscenes.
@BENR0 BENR0 requested review from djhoese and mraspaud as code owners April 13, 2021 10:28
Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about the internals of the CF writer, but what about having multiple delayed objects returned stops MultiScene from working?

log.info("Finished saving %d scenes", idx)
idx += 1
# for future in future_list:
future_list.result()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this left over from testing or is the for loop actually not needed?

@BENR0
Copy link
Collaborator Author

BENR0 commented Apr 13, 2021

@djhoese scratch that. Sorry for the confusion. You are quite right in general there is no problem with multiple delayed objects hence the for loop. I experimented with this a while ago and I fixed something at the wrong point (fixed the cf writer to return only one delayed which in turn needed removal of the loop). What I think actually needs fixing is:

if isinstance(delayed, (list, tuple)) and len(delayed) == 2:

While this catches situations with writers returning (source, target) combinations it also catches the cf writer case because it always returns a list of at least two delayed objects. This if clause should be more specific so it does not apply to the cf writer.

@djhoese
Copy link
Member

djhoese commented Apr 13, 2021

Ah yes, that does seem like too simplistic of a condition. There may be utilities in satpy/writers/__init__.py that could be used to "parse" the return values from the writers.

@BENR0
Copy link
Collaborator Author

BENR0 commented Apr 19, 2021

I checked satpy/writers/__init__.py for something useful but could not find something directly applicable. But if I see it correctly if a reader has sources they are always the first item so adding and isintance(delayed[0], da.Array) to the condition would solve the issue I think.

The changes to the reader are strictly not necessary then. The only "advantage" being that in the case with no groups only one delayed will be returned and therefore only one file access is needed but I am not sure if that has any relevant benefit. So I would propose to add the additional condition in the distributed writing and roll back the changes in the writer and maybe keep some of the code simplifying changes only.

@djhoese
Copy link
Member

djhoese commented Apr 19, 2021

The MultiScene could use split_results:

def split_results(results):
"""Split results.
Get sources, targets and delayed objects to separate lists from a list of
results collected from (multiple) writer(s).
"""
from dask.delayed import Delayed
def flatten(results):
out = []
if isinstance(results, (list, tuple)):
for itm in results:
out.extend(flatten(itm))
return out
return [results]
sources = []
targets = []
delayeds = []
for res in flatten(results):
if isinstance(res, da.Array):
sources.append(res)
elif isinstance(res, Delayed):
delayeds.append(res)
else:
targets.append(res)
return sources, targets, delayeds
and if there are any sources then raise the error. Although, if this function is used here it should maybe be renamed to something like sort_writer_results.

@codecov
Copy link

codecov bot commented Apr 19, 2021

Codecov Report

Merging #1639 (3a07044) into main (2a1110e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1639   +/-   ##
=======================================
  Coverage   92.66%   92.67%           
=======================================
  Files         258      258           
  Lines       37988    38024   +36     
=======================================
+ Hits        35200    35237   +37     
+ Misses       2788     2787    -1     
Flag Coverage Δ
behaviourtests 4.82% <2.27%> (-0.01%) ⬇️
unittests 92.80% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
satpy/multiscene.py 91.10% <100.00%> (+0.33%) ⬆️
satpy/tests/test_multiscene.py 99.46% <100.00%> (+0.04%) ⬆️
satpy/writers/cf_writer.py 95.17% <100.00%> (-0.02%) ⬇️
satpy/writers/geotiff.py 90.90% <0.00%> (+0.16%) ⬆️
satpy/tests/writer_tests/test_geotiff.py 97.87% <0.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a1110e...3a07044. Read the comment docs.

@BENR0
Copy link
Collaborator Author

BENR0 commented Apr 20, 2021

Yes that's were I got the idea from. I am not really that familiar with all the writers and how their return values look like. It is very possible that what I proposed is not covering it and split_writer_results should be used. I was just thinking that it over complicates things if the small addition to the condition will fix the issue.

If using split_writer_results is preferred I will change it and push it again.

@djhoese
Copy link
Member

djhoese commented Apr 20, 2021

I don't think there is anything logically wrong with what you're suggesting, but if you use split_writer_results then we will catch issues like this sooner if writer results are changed in the future. And when developers need to change split_writer_results they will see it is being used in multiscene.py and that it also needs to be updated.

@BENR0
Copy link
Collaborator Author

BENR0 commented May 11, 2021

Sorry for the delay. I changed it to use split_results now.

@@ -345,7 +345,8 @@ def load_data(q):

for scene in scenes_iter:
delayed = scene.save_datasets(compute=False, **kwargs)
if isinstance(delayed, (list, tuple)) and len(delayed) == 2:
sources, targets, _ = split_results(delayed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the _ be overwriting delayed? This way, in the future, if the MultiScene can handle source/target combinations and delayed objects the code won't have to be changed. Right?

Copy link
Collaborator Author

@BENR0 BENR0 May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I thought about it and I think it should not be overwritten since the result of split_results is only needed for the check of sources. Currently delayed is allowed to be a target/ list of targets (e.g. simple image writer) or a delayed/ list of delayeds (e.g. CF writer). So if delayed got overwritten some additional code to either hand targets or delayed to client.compute in line 355 would be needed. So for the check only sources is needed and targets could also be _.

One thing I noticed was that it is not well documented what each writer returns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targets should never be returned without a source. They are source + target pairs. The simple image writer should return delayed objects. The geotiff writer returns source (dask array) + target (output file-like object). If you have another example, then I still propose that _ be replaced with delayed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see now I got fooled by the test which returned targets without sources from split_results for the simple image writer because the mock did not pass as a delayed object. I changed the code to overwrite delayed now and also fixed the test. Additionally I added a test for writers with source/target combinations to fail.

@djhoese djhoese changed the title Fix cf writer returning two delayeds Fix MultiScene writer handling of multiple delayed objects May 13, 2021
@djhoese djhoese self-assigned this May 13, 2021
@djhoese
Copy link
Member

djhoese commented May 13, 2021

I re-titled this after it seemed like most of the work was in getting the MultiScene to work. However, I realize you still did a lot of work on the CF writer. Is this work still needed? If so, would you mind merging with the pytroll main branch and resolve the conflicts in the CF writer?

@@ -341,17 +341,20 @@ def load_data(q):

input_q = Queue(batch_size if batch_size is not None else 1)
load_thread = Thread(target=load_data, args=(input_q,))
# set threads to daemon so they are killed if error is raised from main thread
load_thread.daemon = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be set in the Thread init right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed. Changed it.

Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

@djhoese djhoese merged commit 62c0e83 into pytroll:main May 17, 2021
@BENR0 BENR0 deleted the fix_cf_writer_returning_two_delayeds branch December 4, 2023 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants