Allow tarball as input for pygac-fdr-run #4

Closed
carloshorn opened this issue May 20, 2020 · 6 comments · Fixed by #82

Comments

@carloshorn

It would be nice if we could process entire tarballs filled with gzipped files.

The adaptation of the pygac-fdr-run script to open the tarballs could be similar to https://github.com/pytroll/pygac/blob/master/bin/pygac-run.

Then we can pass the open file object as a reader keyword argument to the scene constructor.
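A minimal sketch of the tarball-opening step, using only the standard library. The helper name `iter_tar_members` is hypothetical and not part of pygac or pygac-fdr; how the resulting file object is handed to the reader is left out, since that depends on the reader interface.

```python
import gzip
import tarfile


def iter_tar_members(tar_path):
    """Yield (name, file object) for every regular file in a tarball.

    Gzipped members are decompressed on the fly and the ".gz" suffix is
    dropped from the yielded name. Hypothetical helper for illustration.
    """
    # mode="r" lets tarfile detect a compressed archive transparently.
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            raw = archive.extractfile(member)
            name = member.name
            if name.endswith(".gz"):
                name = name[:-3]
                fileobj = gzip.GzipFile(fileobj=raw, mode="rb")
            else:
                fileobj = raw
            try:
                yield name, fileobj
            finally:
                # The file object is only valid while the archive is open.
                fileobj.close()
```

Each yielded file object has to be consumed inside the loop body, because it is closed as soon as the generator advances to the next member.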

@mraspaud
Member

So the expected output would be multiple netCDF files, right?

@sfinkens
Member

Yes, I think this should be possible (with an update of the satpy reader to make use of that new argument).

@carloshorn
Author

Hi @sfinkens and @mraspaud,

I have a hot fix for this issue, but I wanted to discuss the general concept with you.

I made a small change to the pygac-fdr-run script so that it opens tarballs and passes the file objects as a reader keyword argument. I also drop the gzip suffix from the filename to avoid trouble with satpy. In pygac.reader, I added the file object as a reader argument and attribute. In the read method of the KLM/POD readers, I check whether a file object was passed as an argument or whether the reader has one as an attribute; otherwise, the filename is used as the path to the file.
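The fallback order described here could look roughly like the sketch below. `SketchReader` is a hypothetical stand-in, not the actual pygac reader class, and its `read` method simply returns the raw bytes instead of parsing a GAC file.

```python
import io


class SketchReader:
    """Hypothetical reader illustrating the lookup order, not pygac's class."""

    def __init__(self, fileobj=None):
        # Optional file object stored as an attribute.
        self.fileobj = fileobj

    def read(self, filename, fileobj=None):
        # 1. Prefer a file object passed to read().
        # 2. Fall back to the file object stored on the reader.
        # 3. Otherwise treat the filename as a path on disk.
        source = fileobj if fileobj is not None else self.fileobj
        if source is not None:
            return source.read()
        with open(filename, "rb") as fd:
            return fd.read()


# Usage: the filename is only a label when a file object is available.
reader = SketchReader(fileobj=io.BytesIO(b"in-memory orbit"))
data = reader.read("orbit_file")
```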

It works without too many changes, but actually, I don't like it... I think the confusion comes from using filename and file location synonymously. As a result, satpy has too many expectations of the filename: it has to be a path, and it has to follow some pattern so that a dedicated reader can be found. Instead, the user should be able to pass a file location (either a path or a file object) together with a user-defined reader; users get what they order, and the right choice is their responsibility. Help in choosing the reader based on a filename heuristic should be an additional feature, offered only on explicit user demand. This is definitely a ticket that I should open on satpy, but I don't know its code well, and it could take a long time to get it working for all readers. However, I can imagine many use cases where a file exists only in memory and you don't want to dump it to disk.

Should I push my hot fix to pygac and pygac-fdr (maybe to some dev branches that never find their way into master), or tackle the issue in satpy? The latter would keep pygac unchanged, but carries the risk that I have no clue how long it could take.
What do you think? Any estimates of the workload from your side?

@sfinkens
Member

sfinkens commented Aug 5, 2020

@carloshorn Good question. I can see the advantages of your proposal. Just a couple of thoughts off the top of my head on why satpy is the way it is. @mraspaud can probably explain this better than me, but I'll try.

  • The majority of satpy readers use dask to read the data from the files in chunks. Having all the data in memory is a less common use case, I would say. That's why the strong dependence on filenames is kind of natural to satpy.
  • Furthermore, there are cases where data from the same instrument comes in a variety of different formats with varying contents. Here satpy uses the file name to determine the file type and provide the user information on the expected datasets in the files. I can imagine determining the file type from file object(s) could be hard in some cases. It would certainly be a lot of work to update all the readers in this regard.

If there is a consensus to move in this direction, I'd estimate it would take several weeks (including discussion, testing, etc.) to get this done.

@sfinkens
Member

sfinkens commented Aug 5, 2020

Using dev branches would have the disadvantage that we cannot reference a proper software version in the global attributes of the netCDF files.

@carloshorn
Author

Related: pytroll/pygac#92
Once merged, the only thing left is to create a tarball filesystem and use a PathLike object as the filename argument.
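One way such a PathLike object might look is sketched below. Both `TarMemberPath` and `open_source` are hypothetical names, and the assumption that the reader calls an `open()` method on the filename when one exists is mine; the interface actually introduced in pytroll/pygac#92 may differ.

```python
import io
import os
import tarfile


class TarMemberPath(os.PathLike):
    """Hypothetical PathLike pointing at one member inside a tarball."""

    def __init__(self, tar_path, member_name):
        self.tar_path = tar_path
        self.member_name = member_name

    def __fspath__(self):
        # A string for code that only needs a name (e.g. pattern matching).
        return os.path.join(self.tar_path, self.member_name)

    def open(self, mode="rb"):
        if mode != "rb":
            raise ValueError("only binary read mode is supported")
        with tarfile.open(self.tar_path, mode="r") as archive:
            member = archive.extractfile(self.member_name)
            if member is None:
                raise IsADirectoryError(self.member_name)
            # Copy into memory so the data outlives the archive handle.
            return io.BytesIO(member.read())


def open_source(filename):
    """Assumed reader-side helper: prefer filename.open() over builtin open."""
    if hasattr(filename, "open"):
        return filename.open("rb")
    return open(filename, "rb")
```

Copying the member into a `BytesIO` keeps the sketch simple at the cost of holding the whole file in memory; a real tarball filesystem would more likely keep the archive open and stream from it.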
