fsspec FileSystem support #1321
Conversation
Hi @djhoese, maybe you can also give me some advice on an implementation detail: …
Really nice start. Thanks for getting this going. That bit of code is definitely not great, and if it ends up being the way this is implemented, I would hope it would only be a temporary workaround until other file handlers are updated to support it or to explicitly not support it. Other ways I see this being possible: …
I also saw that your use of … Nice start.
From how I understand it, each of the alternatives proposed by @djhoese requires that any reader seeking to support this be updated. Is that correct?
I'm looking at reading remote data over fsspec, but so far I'm not having any luck reading an fsspec-opened file, even outside satpy (with or without caching makes no difference):
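(The exact snippet is not preserved; a minimal sketch of this kind of attempt, assuming anonymous S3 access via s3fs and xarray's h5netcdf engine, with the file path taken from later in this thread:)

```python
import s3fs
import xarray as xr

# Open a remote GOES-16 file through fsspec/s3fs and hand the
# file-like object to xarray.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("noaa-goes16/ABI-L1b-RadF/2019/321/14/"
             "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
```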
which fails with a ValueError (I have reported this at pydata/xarray#4471). If it's not yet supported by xarray, I'm not sure how far we can get supporting this in satpy right now.
satpy/readers/yaml_reader.py (outdated)

```python
NotImplementedError("File handler {} does not support file systems".format(
    filetype_cls.__name__))
```
There is a `raise` keyword missing here.
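The corrected code, per this review comment, would read:

```python
raise NotImplementedError("File handler {} does not support file systems".format(
    filetype_cls.__name__))
```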
And when I try the same directly using …, it results in … (see h5netcdf/h5netcdf#80), so I don't know if we can support NetCDF files coming from fsspec, unless perhaps they use the …
@djhoese found that when passing …
What I don't understand is that I can open and close the file with xarray using the h5netcdf backend and …
The most important thing here, I think, is that I'm 90% sure that although h5netcdf can read file-like objects, it isn't doing it intelligently. Giving it an fsspec open file object doesn't mean that it is doing byte range requests; I think it is downloading the whole file before really reading it. I could be wrong (and pleasantly surprised) about that, though.

The other main thing is that h5netcdf is not as widely used, in my opinion, and sometimes has issues when combined with netcdf4-python in the same runtime (if you import them both, from my understanding). For your particular case you are playing with my …
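One way to test the whole-file-download suspicion above would be to wrap the file object and count the bytes actually requested. This is only a sketch: the `CountingReader` helper and the local file path are hypothetical, not part of satpy, fsspec, or h5netcdf.

```python
import xarray as xr

class CountingReader:
    """Wrap a binary file object and tally how many bytes are read."""

    def __init__(self, f):
        self._f = f
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._f.read(size)
        self.bytes_read += len(data)
        return data

    def readinto(self, buf):
        n = self._f.readinto(buf)
        self.bytes_read += n or 0
        return n

    def __getattr__(self, name):
        # Delegate seek, tell, close, etc. to the wrapped file.
        return getattr(self._f, name)

with open("local_copy.nc", "rb") as raw:  # hypothetical local file
    counting = CountingReader(raw)
    with xr.open_dataset(counting, engine="h5netcdf") as ds:
        ds[next(iter(ds.data_vars))].load()
    print("bytes read:", counting.bytes_read)
```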
There are two aspects of reading remotely that I'm particularly interested in:

1. reading data directly from a remote filesystem, and
2. caching remote files locally, so that repeatedly processing the same scenes does not repeatedly download them.
Point (2) has priority for me, because when I'm developing I often need to try my code on the same selection of scenes repeatedly, and when moving from manually downloading selected files to having some code do so, I found that I started to reinvent the wheel of …

I'm not sure how I could combine the two. Although I can wrap an …
Some network data usage measurements: …

The only case in which it downloads the whole file is with … I used:

```python
import s3fs
import fsspec.implementations.cached
import xarray

cache_storage_block = "/data/gholl/cache/fscache/blockcache"
cache_storage_whole = "/data/gholl/cache/fscache/filecache"

fs_s3 = s3fs.S3FileSystem(anon=True)
fs_whole = fsspec.implementations.cached.WholeFileCacheFileSystem(
    fs=fs_s3,
    cache_storage=cache_storage_whole,
    check_files=False,
    expiry_time=False,
    same_names=True)
fs_block = fsspec.implementations.cached.CachingFileSystem(
    fs=fs_s3,
    cache_storage=cache_storage_block,
    check_files=False,
    expiry_time=False,
    same_names=True)

target = "noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
print("loading", target)
# with fs_s3.open(target) as of:
with fs_block.open(target, block_size=2**20) as of:
    with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
        print(ds["Rad"][3000, 4000])
del ds, of  # prevents https://github.com/intake/filesystem_spec/issues/404
```

where I adapted the context manager opening block to test the different scenarios. Conclusions: …
I'd be curious whether the download size changes with the variable/attribute being read. Maybe it is only reading as far as it has to? Also, `mode=bytes` would be an interesting test.
I found that … I've added the loading of two more pixels, and an alternative with `netCDF4` using `#mode=bytes`:

```python
import netCDF4

bucket = "noaa-goes16"
path = "/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
target = bucket + path
url = f"https://{bucket:s}.s3.amazonaws.com" + path + "#mode=bytes"

print("loading", target)
# with fs_s3.open(target) as of:
with netCDF4.Dataset(url, mode="r") as ds:
    # with fs_block.open(target, block_size=2**20) as of:
    #     with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
    print(ds["Rad"][3000, 4000])
    print(ds["Rad"][8000, 9000])
    print(ds["DQF"][8000, 9000])
```

I find: …

I don't know what exactly the …
I'm very confused by those download numbers. In the near future we may have to see what requests the NetCDF4 library is actually making to the S3 endpoint. It shouldn't be that slow (in my opinion). What it should be doing is reading a known size for all the header information, then downloading exactly the block of data requested.

My guess is the caching file system is requesting a block that ends up including the header and maybe the data being requested. Perhaps the overhead of each request on this particular S3 bucket is too high, so even though NetCDF4 (the C library) is downloading as little as possible, it takes so long to make each request it needs that it is much slower than the other options.
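For the s3fs-based cases, one conceivable way to see the individual requests is s3fs's standard Python logger (logger name assumed here; the `#mode=bytes` path goes through the NetCDF C library and libcurl, so Python logging would not see it and a network sniffer would be needed instead):

```python
import logging

# Turn on debug output from s3fs so individual S3 requests,
# including byte-range reads, show up on the console.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```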
satpy/readers/__init__.py (outdated)

```diff
@@ -704,7 +704,7 @@ def find_files_and_readers(start_time=None, end_time=None, base_dir=None,

 def load_readers(filenames=None, reader=None, reader_kwargs=None,
-                 ppp_config_dir=None):
+                 ppp_config_dir=None, file_system=None):
```
Should this argument be called `file_system` or `fs`? It's already called `fs` in `find_files_and_readers`, and the fsspec classes `CachingFileSystem` and `WholeFileCacheFileSystem` also use `fs` for this parameter, so `fs` would be more consistent with naming elsewhere.
I think `file_system` is better, more explicit.
Fair point. Should we then change it in `find_files_and_readers` as well for consistency? Otherwise it's a tradeoff between consistency and what is more explicit. Personally, I prefer `fs`, which is shorter and already very common as an abbreviation for filesystem: nfs, ntfs, zfs, xfs, reiserfs, fsspec, s3fs, `fspath`, and elsewhere.
More download number confusion: when I use h5netcdf over …
Add support for passing an fsspec filesystem instance to the ABI L1B reader. Still needs testing and documentation.
Why pass …?
I suppose one reason would be that the …
Instead of passing file_system explicitly at every step of the stack, pass it as part of reader_kwargs and fh_kwargs. Instead of LBYL (look before you leap), use EAFP (easier to ask forgiveness than permission) to detect readers that fail to support it.
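A sketch of what such an EAFP check could look like; this is hypothetical and simplified, not the actual commit, with `filetype_cls` and the error message taken from the review snippet earlier in this thread:

```python
def create_file_handler(filetype_cls, filename, filename_info, filetype_info, fs):
    """Instantiate a file handler, EAFP-style.

    Just try passing the filesystem; translate a TypeError from an
    incompatible signature into a clearer error.
    """
    try:
        return filetype_cls(filename, filename_info, filetype_info, fs=fs)
    except TypeError:
        # The handler's signature does not accept fs, so it (probably)
        # does not support remote filesystems.
        raise NotImplementedError(
            "File handler {} does not support file systems".format(
                filetype_cls.__name__))
```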
Where ABI (or GLM) mocks open_dataset, also mock LocalFileSystem, since with the file_system passing there are two open calls. Now all tests pass except one in the GAC/LAC reader, which I don't understand well enough to fix, but perhaps @carloshorn can fix this one.
@carloshorn Would you be willing to give me write access to your branch so we can cooperate on this PR?
@gerritholl, you should now have access to the repository.
I just realised that, as currently written, this does not allow passing files corresponding to multiple readers when those files do not come from the same filesystem.
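One conceivable way around that limitation (a sketch of the idea only, not something this PR implements) would be to pair each path with its own filesystem, for example via fsspec's OpenFile objects:

```python
import fsspec

# Each OpenFile carries its own filesystem, so files for different
# readers could come from different places (paths are examples; the
# local one is hypothetical, the remote one is from this thread).
local_file = fsspec.open("file:///data/local/some_local_file.nc")
remote_file = fsspec.open(
    "s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/"
    "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc",
    anon=True)
open_files = [local_file, remote_file]
```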
Playing around with this some more, I discovered another problem. I was using it in a …

Traceback resulting in "too many open files": …
I tried to implement in the ABI reader to … A single …

Maybe we need to extend xarray's lazy loading to include lazy opening (and thus autoclosing), but that would require xarray to be filesystem-aware and to know the parameters for opening a file. Or we'd have to hack this ourselves by subclassing …

I don't know how to resolve this problem. Note that it doesn't block this PR at all; it just limits its usability for some use cases.
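To illustrate the lazy-opening idea mentioned above, a hypothetical helper (not part of xarray, satpy, or fsspec) could defer opening until data is actually read and release the descriptor immediately afterwards:

```python
import contextlib

class LazyOpener:
    """Hold a filesystem and path, opening the file only while reading.

    Sketch only: the file descriptor exists solely inside the context,
    so many instances can coexist without exhausting the open-file
    limit.
    """

    def __init__(self, fs, path):
        self.fs = fs
        self.path = path

    @contextlib.contextmanager
    def open(self):
        f = self.fs.open(self.path, "rb")
        try:
            yield f
        finally:
            f.close()  # descriptor released as soon as the read is done
```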
Another regression with this PR. This MCVE:

```python
from glob import glob

from satpy.multiscene import MultiScene

def get_ms():
    mscn = MultiScene.from_files(
        glob("/data/gholl/cache/fogtools/abi/2020/04/12/18/C1/OR_ABI-L1b-RadM1-M6C*_G16_s2020103180*.nc"),
        reader="abi_l1b")
    mscn.load(["C01"])
    mscn.scenes
    return mscn.resample(mscn.first_scene["C01"].attrs["area"])

ms = get_ms()
ms.save_animation("/tmp/{name}_{start_time:%Y%m%d_%H%M%S}.mp4")
```

fails with the following traceback: …
Apparently, when a MultiScene is resampled, references to the original file handlers (in …
The latter has to do with closing xarray objects vs. closing the underlying open files. This (analogous to this PR) fails with a …:

```python
import xarray

fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
fp = open(fn, "rb")
ds = xarray.open_dataset(fp, chunks=1024)
ds.close()
fp.close()
print(ds["Rad"][400:402, 300:302].load())
```

Interestingly, this (analogous to current master) succeeds:

```python
import xarray

fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
ds = xarray.open_dataset(fn, chunks=1024)
ds.close()
print(ds["Rad"][400:402, 300:302].load())
```

Apparently, …
Both would use the same resources. The first would probably be the more appropriate solution, but the second is easier to implement. Incidentally, considering that …
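For what it's worth, a workaround consistent with the two snippets above (a sketch, not a fix for the underlying issue) is to load everything that is needed before the file object is closed:

```python
import xarray

# File path taken from the snippets above; load the lazy data *before*
# closing the underlying file object, so nothing is re-read afterwards.
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with open(fn, "rb") as fp:
    with xarray.open_dataset(fp, chunks=1024) as ds:
        rad = ds["Rad"][400:402, 300:302].load()
print(rad)
```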
Resolving one conflict in satpy/readers/__init__.py
Do we want to keep this open?
I am fine with closing this draft PR, which was intended as a starting point for the discussion.
Thanks, closing then.
This PR adds a `file_system` argument to `Scene`, which can be used in the file handlers to access the filenames. This is useful if the filenames are not regular paths on the local file system.
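A sketch of the intended usage, assuming the `reader_kwargs`-based spelling from the later commit in this thread (the argument name and the final satpy API may differ; the file path is taken from earlier comments):

```python
import s3fs
from satpy import Scene

# Anonymous S3 filesystem handed to the reader via reader_kwargs.
fs = s3fs.S3FileSystem(anon=True)
scn = Scene(
    filenames=["noaa-goes16/ABI-L1b-RadF/2019/321/14/"
               "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"],
    reader="abi_l1b",
    reader_kwargs={"file_system": fs})
scn.load(["C01"])
```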
- Passes `flake8 satpy`
- Add your name to `AUTHORS.md` if not there already