Allow reading files passing file objects #1299
Comments
Do you want this specifically for AVHRR GAC formats or for all readers? Some existing readers do support reading compressed formats because of how common the compression is, but as far as I know this is currently only supported by uncompressing on the fly, which requires creating a temporary directory. Other popular formats like HDF5 and NetCDF4 have builtin internal compression, which is preferred in my opinion. Otherwise, I also wanted to mention that So depending on the format you want this for, are there other options that would be "good enough" or does it have to support gzipped archives? |
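The "uncompress on the fly into a temporary directory" approach mentioned above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not satpy's actual implementation; the helper name `gunzip_to_tempdir` and the file name `orbit_file.gz` are made up for the example.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_to_tempdir(gz_path, tmp_dir):
    """Decompress ``gz_path`` into ``tmp_dir`` and return the new path.

    Hypothetical helper for illustration only.
    """
    gz_path = Path(gz_path)
    # Drop the trailing ".gz" so filename pattern checks see the plain name.
    target = Path(tmp_dir) / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# The temporary directory and the uncompressed copy are removed on exit.
with tempfile.TemporaryDirectory() as tmp_dir:
    compressed = Path(tmp_dir) / "orbit_file.gz"  # made-up example file
    with gzip.open(compressed, "wb") as f:
        f.write(b"example payload")
    plain = gunzip_to_tempdir(compressed, tmp_dir)
    recovered = plain.read_bytes()
```

The cost of this approach is that every file is written to disk in full before it is read, which is what the request in this issue is trying to avoid.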
Hi @djhoese, |
We've looked into something like this with S3. A lot of the time we depend on the underlying file reading library (ex. xarray, NetCDF4, HDF5, rasterio - geotiff) to handle this. Recently @gerritholl has looked at adding support for fsspec (https://filesystem-spec.readthedocs.io/en/latest/) to support S3 in the I'm not saying we shouldn't support BytesIO objects, just that there might be other ways of accessing the data rather than the user providing the bytes directly. |
It would be quite simple to write a sort of In my vision, I have a processing pipeline, where the data may come from any sort of stream, and as long as I can tell which reader to use, satpy should not complain. Of course, not all readers may support any sort of streams. Reader calling |
How so? Reading directly from cloud storage would use the proper HTTP requests to only read the data that is used from a file. This assumes that data is not stored on the cloud storage (ex. S3) in a compressed archive like a gzipped tarball.
NetCDF4 C (and the python wrapper) recently had the functionality added to read from S3 using HTTP byte ranges: https://twitter.com/dopplershift/status/1286415993347047425 HDF5 recently released version 1.12 with a read-only S3 virtual driver; however, as far as I know h5py doesn't currently use this, as they haven't released a version that supports HDF5 1.12.
Looks like there is already a zip implementation: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html Another nice thing about using fsspec as an interface is that it should overlap really well with the python Intake library (https://intake.readthedocs.io/en/latest/index.html) and the work the Pangeo community has been doing with defining "catalogs" of data: https://pangeo.io/catalog.html |
fsspec has this in an example: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.open_files
Not that this means anything, but it is an interface fsspec has for defining compression on a remote file. I think if we can wrap this functionality into a TarballFS as you've described then that would be great. I do think support would have to be added reader by reader, as I'm not sure we can support it for all readers (as you've said already). |
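To make the tarball idea concrete, here is a rough sketch of reading individually gzipped members straight out of a tarball without extracting anything to disk. It uses only the standard library `tarfile`, `gzip` and `io` modules and is not an fsspec file system; the function names and the member names are invented for the example.

```python
import gzip
import io
import tarfile


def iter_gzipped_members(tar_fileobj):
    """Yield (name_without_gz, decompressed_bytes) for each ".gz" member.

    Hypothetical helper: nothing is written to disk.
    """
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            raw = archive.extractfile(member)
            with gzip.GzipFile(fileobj=raw) as unzipped:
                yield member.name[: -len(".gz")], unzipped.read()


def build_example_tarball():
    """Create a small in-memory tarball holding two gzipped 'orbits'."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in [("orbit_a.gz", b"first orbit"),
                              ("orbit_b.gz", b"second orbit")]:
            data = gzip.compress(payload)
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


orbits = dict(iter_gzipped_members(build_example_tarball()))
```

A reader that accepts file-like objects could be handed `io.BytesIO(data)` for each entry; whether a given reader can work that way is exactly the reader-by-reader question raised above.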
Okay, let's think about a possible implementation. @djhoese, do you have any further suggestions? |
Disclaimer: I have very little experience with fsspec's FileSystem objects. I think the single file system per reader makes sense. You could think of your normal local file system as the default/simplest case. The user gives you "file identifiers" (I'm making this term up) and the reader accesses those file objects from the provided file system. There are a couple complications I can think of and some lower-level satpy stuff that I should describe here (some of which you probably know):
|
Thanks for sharing these thoughts.
Me too, but let's change this :-)
Hopefully, I find some time this week to start the PR. At least, I have an idea where to start. |
Your response for 1 makes me think that, if it doesn't already exist in fsspec, they should add a CacheFileSystem that takes two file systems. The first file system acts as a cache; anything not in the cache is requested from the second one, opened, and written to the first file system (the cache). Oh boy. I should probably do my real work today before I get any more distracted thinking about this. |
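The two-file-system cache idea can be sketched with toy in-memory stores. None of this is fsspec API; `DictStore` and `CacheStore` are hypothetical classes that only illustrate the lookup order (cache first, then source, then write back to the cache).

```python
import io


class DictStore:
    """Toy 'file system': maps file names to bytes and counts reads."""

    def __init__(self, files=None):
        self.files = dict(files or {})
        self.reads = 0

    def exists(self, name):
        return name in self.files

    def read(self, name):
        self.reads += 1
        return self.files[name]

    def write(self, name, data):
        self.files[name] = data


class CacheStore:
    """Serve files from ``cache``; on a miss, copy them over from ``source``."""

    def __init__(self, cache, source):
        self.cache = cache
        self.source = source

    def open(self, name):
        if not self.cache.exists(name):
            # Cache miss: fetch from the second store and write to the first.
            self.cache.write(name, self.source.read(name))
        return io.BytesIO(self.cache.read(name))


remote = DictStore({"orbit_a": b"first orbit"})
local = DictStore()
fs = CacheStore(cache=local, source=remote)
first = fs.open("orbit_a").read()   # miss: goes to ``remote``
second = fs.open("orbit_a").read()  # hit: served from ``local``
```

In this sketch the second store is only touched once per file, which is the behavior described above.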
Related: #1062 |
Feature Request
Is your feature request related to a problem? Please describe.
We are using satpy in pygac-fdr to process all AVHRR GAC level 1b data. The data (orbits) is packed in tarballs (one month of orbits) and each orbit file is individually gzip compressed. In order to create the Scene object, we would need to extract all files from the tarballs to give a filename. Furthermore, we would also need to unzip all of these files, even though pygac is able to process gzipped files; otherwise, satpy would not recognize the filename because of the .gz suffix (maybe renaming would be sufficient to trick satpy...).
Describe the solution you'd like
I would like to see a clear distinction between file location and filename, so I could pass an open file object, a pathlib.Path object, or a string pointing to a potentially renamed file, which together with a specified reader should be sufficient to read the product. If for some reason the filename is still needed, a potential strategy could be to allow the user to explicitly provide a filename or, if none is given, to try to extract it from the file location, e.g. using the .name attribute of the file/Path object.
Describe any changes to existing user workflow
If the pattern check were handled by a Scene argument that defaults to True (i.e. please do pattern checks with the filename), then there should be no impact on existing workflows, because passing a string as file location would result in the current behavior.
Additional context
The discussion started here: pytroll/pygac-fdr#4
Whenever I say filename in the context of the Scene argument, I mean Scene(filenames=[filename], ...).
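The filename fallback proposed under "Describe the solution you'd like" could look roughly like the sketch below. It is only an illustration of the idea, not satpy code; `resolve_filename` and its `filename` parameter are hypothetical names.

```python
import io
import os
from pathlib import Path


def resolve_filename(location, filename=None):
    """Return the name to use for pattern checks (hypothetical helper).

    ``location`` may be a string, a path-like object or an open file object.
    An explicitly given ``filename`` always wins.
    """
    if filename is not None:
        return filename
    if isinstance(location, (str, os.PathLike)):
        return os.path.basename(os.fspath(location))
    # File objects opened from a path expose it via their ``name`` attribute.
    name = getattr(location, "name", None)
    if isinstance(name, (str, os.PathLike)):
        return os.path.basename(os.fspath(name))
    raise ValueError("cannot derive a filename; pass filename= explicitly")


from_string = resolve_filename("/data/orbit_a.gz")
from_path = resolve_filename(Path("/data/orbit_b"))
# In-memory streams have no usable name, so one is supplied explicitly.
from_stream = resolve_filename(io.BytesIO(b"payload"), filename="orbit_c")
```

A stream without a name and without an explicit `filename` raises `ValueError` here; skipping the pattern check entirely in that case, as suggested above, would be an alternative design.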