
fsspec FileSystem support #1321

Closed · wants to merge 13 commits

Conversation

carloshorn

This PR adds a file_system argument to Scene which the file handlers can use to access the files behind the given filenames.
This is useful if the filenames are not regular paths on the local file system.

@carloshorn
Author

Hi @djhoese,
I started this draft PR to get a better idea of a possible implementation. Maybe you can help me with suggestions for tests and code review.

Maybe you can also give me some advice on an implementation detail:
https://github.com/carloshorn/satpy/blob/551887d3859584e81c48f90247cb1eb4619aaf7f/satpy/readers/yaml_reader.py#L458-L462
I thought that checking whether file_system is in the argument list of a given file handler's constructor could help determine whether the new file system feature is supported. Do you think this is a useful check, or do you know other ways to do it?
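
For concreteness, the kind of check being discussed might look roughly like this (the helper and its arguments are illustrative, not the actual code at the link above):

import inspect

def maybe_pass_file_system(filetype_cls, fh_kwargs, file_system):
    """Forward file_system only if the file handler constructor accepts it.

    Hypothetical helper sketching the check discussed above.
    """
    params = inspect.signature(filetype_cls.__init__).parameters
    if "file_system" in params:
        fh_kwargs["file_system"] = file_system
    elif file_system is not None:
        # Mirrors the error raised in the linked lines of yaml_reader.py.
        raise NotImplementedError(
            "File handler {} does not support file systems".format(
                filetype_cls.__name__))
    return fh_kwargs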

@djhoese
Member

djhoese commented Aug 25, 2020

Really nice start. Thanks for getting this going. That bit of code is definitely not great, and if it ends up being the way this is implemented, I would hope it is only a temporary workaround until other file handlers are updated to either support it or explicitly reject it.

Other ways I see this being possible:

  1. Create a separate reader base class that adds this functionality. This would mean reader developers would have to "opt in" to agreeing that their file handlers can support the file_system parameter. The base yaml reader would then either throw away the file_system keyword argument before creating file handlers (or, better, raise an exception).
  2. Create a separate file handler base class that implements this functionality. The base reader can then do an isinstance(file_handler_obj, FileSystemFileHandler) check and pass the keyword argument if True (see the sketch after this list).
  3. Update all file handlers currently in Satpy to accept any keyword arguments (**kwargs) and pass them to the base file handler. We'd then need to update the base file handler's __init__ so that it raises an exception on any unrecognized kwargs.
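
A minimal sketch of what option 2 could look like (FileSystemFileHandler and create_handler are illustrative names; BaseFileHandler is Satpy's existing file handler base class):

from satpy.readers.file_handlers import BaseFileHandler

class FileSystemFileHandler(BaseFileHandler):
    """Hypothetical opt-in base class for handlers that accept a file system."""

    def __init__(self, filename, filename_info, filetype_info, file_system=None):
        super().__init__(filename, filename_info, filetype_info)
        self.file_system = file_system

def create_handler(filetype_cls, filename, filename_info, filetype_info,
                   file_system=None):
    # The base reader would use an issubclass check (the pre-construction
    # analogue of the isinstance check mentioned above) to decide whether
    # to pass the keyword argument.
    if file_system is not None and issubclass(filetype_cls, FileSystemFileHandler):
        return filetype_cls(filename, filename_info, filetype_info,
                            file_system=file_system)
    return filetype_cls(filename, filename_info, filetype_info)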

I also saw that your use of open_file had it passed to a reader.read. I'm wondering if it would be cleaner to always have open_file return an xarray.Dataset version of the file... nah, maybe that's not possible (NetCDF files with groups).

Nice start.

@gerritholl
Member

As I understand it, each of the alternatives proposed by @djhoese requires that any reader seeking to support this be updated. Is that correct?

@gerritholl
Member

gerritholl commented Sep 29, 2020

I'm looking at reading remote data over fsspec, but so far I'm not having any luck reading an fsspec-opened file even outside satpy (with or without caching; it makes no difference):

import fsspec, xarray
with fsspec.open("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc", mode="rb", anon=True) as f:
    ds = xarray.open_dataset(f)
    print(ds["Rad"][400, 300])

which fails with a ValueError (I have reported this at pydata/xarray#4471). If it's not supported yet by xarray I'm not sure how far we can get supporting this in satpy right now.

Comment on lines 461 to 462
NotImplementedError("File handler {} does not support file systems".format(
    filetype_cls.__name__))
Member

there is a raise keyword missing here

@gerritholl
Member

gerritholl commented Sep 29, 2020

And when I try the same directly using h5netcdf, I get the requested value but also a core dump:

import fsspec, h5netcdf
with fsspec.open("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc", mode="rb", anon=True) as f:
    ds = h5netcdf.File(f)
    print(ds["Rad"][400, 300])

resulting in

1023
Exception ignored in: <function File.close at 0x7fd770efbf70>
Traceback (most recent call last):
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 802, in close
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 794, in flush
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 531, in _create_dim_scales
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 374, in _h5group
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5py/_hl/group.py", line 264, in __getitem__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
  File "h5py/h5i.pyx", line 43, in h5py.h5i.wrap_identifier
ImportError: sys.meta_path is None, Python is likely shutting down
Fatal Python error: PyThreadState_Get: no current thread
Python runtime state: finalizing (tstate=0x55a7e9fedc50)

Aborted (core dumped)

(see h5netcdf/h5netcdf#80)

so I don't know if we can support NetCDF files coming from fsspec, unless perhaps they use the WholeFileCacheFileSystem, which presumably provides an API to access the locally cached copy of the file (although I haven't yet found how that would work).

@gerritholl
Member

@djhoese found that when passing decode_cf=False, the netCDF4 file can be opened with xarray without (visible) errors, so at least for readers that apply their own scale factors etc. we may make some progress (see pydata/xarray#4471 for more information).
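
A minimal sketch of that workaround, using the same GOES-16 file as above (decoding of scale factors etc. would then be up to the reader):

import fsspec
import xarray

url = ("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/"
       "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc")
with fsspec.open(url, mode="rb", anon=True) as f:
    # decode_cf=False skips the CF decoding that tripped up the earlier
    # attempt (see pydata/xarray#4471); raw unscaled counts are returned.
    ds = xarray.open_dataset(f, decode_cf=False, engine="h5netcdf")
    print(ds["Rad"][400, 300])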

@gerritholl
Member

What I don't understand is that I can open and close the file with xarray using the h5netcdf backend and decode_cf=False, but when I try to open and close it with h5netcdf directly I get core dumps or AttributeErrors (see h5netcdf/h5netcdf#80). Currently netcdf_utils.py uses NetCDF4, which doesn't work with open file objects. What would be the potential caveats of switching to h5netcdf or xarray there?

@djhoese
Member

djhoese commented Sep 29, 2020

The most important thing here, I think, is that I'm 90% sure that although h5netcdf can read file-like objects, it isn't doing it intelligently. So giving it an fsspec open file object doesn't mean that it is doing byte-range requests; I think it is downloading the whole file before really reading it. I could be wrong (and pleasantly surprised) about that though.

The other main thing is that h5netcdf is not as widely used, in my opinion, and sometimes has issues when combined with netcdf4-python in the same runtime (if you import them both, from my understanding). For the particular case you are playing with, my #mode=bytes PR would be the better solution, as it intelligently requests only the bytes needed to load the data (from my understanding).

@gerritholl
Member

There are two aspects of reading remotely that I'm particularly interested in:

  1. Reading only the parts of the file I need, which can apparently be done with #mode=bytes if available through https (see Allow for #mode=bytes in file URLs to support NetCDF byte ranges #1349; a sketch follows this list), and
  2. caching on local disk with either WholeFileCacheFileSystem or (preferably) CachingFileSystem (which uses black magic sparse files).
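
A minimal sketch of aspect (1), assuming the bucket is also reachable over plain https and the netCDF-C library was built with byte-range support:

import netCDF4

# The '#mode=bytes' suffix asks the netCDF-C library to issue HTTP
# byte-range requests instead of downloading the whole file.
url = ("https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadF/2019/321/14/"
       "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
       "#mode=bytes")
with netCDF4.Dataset(url, mode="r") as ds:
    print(ds["Rad"][3000, 4000])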

Point (2) has priority for me, because when I'm developing I often need to try my code on the same selection of scenes repeatedly. When moving from manually downloading selected files to having some code do so, I found that I started to reinvent the wheel of WholeFileCacheFileSystem and to develop something inferior to CachingFileSystem.

I'm not sure how I could combine the two. Although I can wrap an S3FileSystem in a CachingFileSystem class, there isn't an fsspec implementation for https, probably because https does not usually expose typical filesystem-like behaviour such as directory listings.

@gerritholl
Member

Some network data usage measurements:

  • When I use an instance of S3FileSystem directly to read a single value from an 89 MiB NetCDF file using xarray and the h5netcdf engine, it receives 56.281 MB. When I run it again, it uses 56.167 MB.
  • When I wrap this S3FileSystem instance in a WholeFileCacheFileSystem instance, the first call downloads 90.838 MB and the second call downloads 0.009 MB.
  • When I wrap this S3FileSystem instance in a CachingFileSystem instance, the first call downloads 3.565 MB and the second call downloads 0.005 MB.

The only case in which it downloads the whole file is with WholeFileCacheFileSystem, which is expected. I don't quite understand why wrapping it in a CachingFileSystem causes much less data to be read compared to not using any caching, but in neither case does it download the entire file.

I used nethogs -v 3 to monitor this, and did so on my personal notebook because this tool requires superuser access which I don't have on my work computer. The script I used for this is approximately:

import s3fs, fsspec.implementations.cached, xarray
cache_storage_block = "/data/gholl/cache/fscache/blockcache"
cache_storage_whole = "/data/gholl/cache/fscache/filecache"
fs_s3 = s3fs.S3FileSystem(anon=True)
fs_whole = fsspec.implementations.cached.WholeFileCacheFileSystem(
        fs=fs_s3,
        cache_storage=cache_storage_whole,
        check_files=False,
        expiry_times=False,
        same_names=True)
fs_block = fsspec.implementations.cached.CachingFileSystem(
        fs=fs_s3,
        cache_storage=cache_storage_block,
        check_files=False,
        expiry_times=False,
        same_names=True)
target = "noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
print("loading", target)
#with fs_s3.open(target) as of:
with fs_block.open(target, block_size=2**20) as of:
    with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
        print(ds["Rad"][3000, 4000])
del ds, of  # prevents https://github.com/intake/filesystem_spec/issues/404

where I adapted the context manager opening block to test the different scenarios.

Conclusions:

  • h5netcdf isn't greatly efficient, but it does not download the entire file;
  • the block caching brings down the amount downloaded much further, but I don't understand how.

@djhoese
Member

djhoese commented Sep 30, 2020

I'd be curious if the download size changes with variable/attribute being read. Maybe it is only reading as far as it has to?

Also mode=bytes would be an interesting test.

@gerritholl
Member

I found that netCDF4.Dataset downloads the least data, but is nevertheless by far the slowest alternative (except for blockcache with a very small block size).

I've added the loading of two more pixels, and an alternative with netCDF4.Dataset such that the end of the script is now:

import netCDF4  # needed for the netCDF4.Dataset alternative below

bucket = "noaa-goes16"
path = "/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
target = bucket + path
url = f"https://{bucket:s}.s3.amazonaws.com" + path + "#mode=bytes"
print("loading", target)
#with fs_s3.open(target) as of:
with netCDF4.Dataset(url, mode="r") as ds:
#with fs_block.open(target, block_size=2**20) as of:
#    with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
        print(ds["Rad"][3000, 4000])
        print(ds["Rad"][8000, 9000])
        print(ds["DQF"][8000, 9000])

I find:

  • netCDF4.Dataset with #mode=bytes downloads 0.270 MB and takes 28s
  • netCDF4.Dataset without #mode=bytes fails with OSError: file not found.
  • h5netcdf directly over s3fs downloads 81.189 MB and takes 13.6s
  • h5netcdf over CachingFileSystem with block_size==2**20 downloads 5.810 MB and takes 5.5s
  • h5netcdf over CachingFileSystem with block_size==2**10 downloads 0.378 MB and takes 43s
  • h5netcdf over CachingFileSystem with block_size==2**15 downloads 0.446 MB and takes 6.1s
  • h5netcdf over CachingFileSystem with block_size==2**25 downloads 57.591 MB and takes 8.3s

I don't know what exactly the netCDF4 library does, but I find it interesting that it manages to take twice as long to download 0.270 MB as h5netcdf takes to download almost the entire file. The winner here would appear to be h5netcdf over CachingFileSystem with a smartly chosen block_size, although what counts as smartly chosen may depend on the use case.

@djhoese
Member

djhoese commented Sep 30, 2020

I'm very confused by those download numbers. In the near future we may have to see what requests the NetCDF4 library is actually making to the S3 end point. It shouldn't be that slow (in my opinion). However, what it should be doing is reading a known size for all the header information, then downloading exactly the block of data requested. My guess is the caching file system is requesting a block that ends up including the header and maybe the data being requested. Perhaps the overhead of each request on this particular S3 bucket is too much so even though NetCDF4 (the C library) is downloading as little as possible, it takes so long to make each request it needs that it is much slower than the other options.

@@ -704,7 +704,7 @@ def find_files_and_readers(start_time=None, end_time=None, base_dir=None,

 def load_readers(filenames=None, reader=None, reader_kwargs=None,
-                 ppp_config_dir=None):
+                 ppp_config_dir=None, file_system=None):
Member

Should this argument be called file_system or fs? It's already called fs in find_files_and_readers, and the fsspec classes CachingFileSystem and WholeFileCacheFileSystem also use fs for this parameter, so fs would be more consistent with naming elsewhere.

Member

I think file_system is better, more explicit.

Member

Fair point — should we then change it in find_files_and_readers as well for consistency? Otherwise it's a tradeoff between consistency and what is more explicit.

Personally, I prefer fs, which is shorter and already very common as an abbreviation for filesystem: nfs, ntfs, zfs, xfs, reiserfs, fsspec, s3fs, fspath, and elsewhere.

@gerritholl
Member

More download number confusion: when I use h5netcdf over CachingFileSystem, but through the satpy ABI reader with h5netcdf as the xarray backend, it again appears to download only up to the place requested (as determined in this case by the storage used by the sparse files in the block cache directory).

Add support for passing an fsspec filesystem instance to the ABI L1B
reader.  Still needs testing and documentation.
@gerritholl
Member

Why pass file_system explicitly around from method to method down the line, when it could be passed as part of the reader_kwargs or fh_kwargs dictionary? It doesn't seem to be used anywhere in the stack, apart from being passed from Scene.__init__ to find_files_and_readers, which has long been deprecated.

@gerritholl
Member

I suppose one reason would be that the file_system argument is envisioned to be shared by many different readers, so it's not reader-specific as such.

Instead of passing file_system explicitly in every step in the stack,
pass it as part of reader_kwargs and fh_kwargs.  Instead of LBYL, use
EAFP to check if reader fails to support it.
Where ABI (or GLM) mocks open_dataset, also mock LocalFileSystem, as
with the file_system passing there are two open commands.  Now all tests
pass except one in the GAC/LAC, which I don't understand well enough to
fix, but perhaps @carloshorn can fix this one.
@gerritholl
Member

@carloshorn Would you be willing to give me write access to your branch so we can cooperate on this PR?

@carloshorn
Author

@gerritholl, you should now have access to the repository.
Sorry for the late reply, I did not have much time in the last few weeks to work on it.

@gerritholl
Member

I just realised that, as currently written, this does not allow passing files corresponding to multiple readers when those are not from the same filesystem.

@gerritholl
Member

Playing around with this some more, I discovered another problem. I was using it in a MultiScene processing an hour's worth of ABI M1 + M2 data using a CachingFileSystem wrapping an S3FileSystem. When I'm processing all ABI channels, an hour of data consists of 2 * 60 * 16 = 1920 files. When processing files on disk and passing a filename to xarray.open_dataset, the autoclose=True argument (apparently true by default) ensures files are closed when not needed. When passing open files to xarray, xarray doesn't close them (which is very reasonable of xarray, because it can't reopen them). Unless satpy takes responsibility for closing files and reopening them as needed, this means the number of open files per process (1024 on my system) is quickly exhausted:

Traceback resulting in too many open files
  File "/home/gholl/checkouts/sattools/src/sattools/scutil.py", line 43, in _get_all_areas_from_multiscene
    for sc in ms.scenes:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 221, in scenes
    self._scenes = list(self._scenes)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 123, in __iter__
    scn = next(self._self_iter)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 113, in _create_cached_iter
    for scn in self._scene_gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 262, in _call_scene_func
    for scn in gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 123, in __iter__
    scn = next(self._self_iter)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 113, in _create_cached_iter
    for scn in self._scene_gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 200, in <genexpr>
    scenes = (Scene(filenames=fg, **scene_kwargs) for fg in file_groups)
  File "/home/gholl/checkouts/satpy/satpy/scene.py", line 166, in __init__
    self._readers = self._create_reader_instances(
  File "/home/gholl/checkouts/satpy/satpy/scene.py", line 215, in _create_reader_instances
    return load_readers(filenames=filenames,
  File "/home/gholl/checkouts/satpy/satpy/readers/__init__.py", line 513, in load_readers
    reader_instance.create_filehandlers(
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 612, in create_filehandlers
    filehandlers = self._new_filehandlers_for_filetype(filetype_info,
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 600, in _new_filehandlers_for_filetype
    return list(filtered_iter)
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 568, in filter_fh_by_metadata
    for filehandler in filehandlers:
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 502, in _new_filehandler_instances
    yield filetype_cls(filename, filename_info, filetype_info,
  File "/home/gholl/checkouts/satpy/satpy/readers/abi_base.py", line 52, in __init__
    self.of = self.file_system.open(self.filename, block_size=self.block_size)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/spec.py", line 897, in open
    f = self._open(
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 327, in _open
    self.save_cache()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 177, in save_cache
    move(fn2, fn)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 802, in move
    copy_function(src, real_dst)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 432, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
OSError: [Errno 24] Too many open files: '/data/gholl/cache/ABI-block-cache/cache'

I tried to make the ABI reader .close() and re-open() the file whenever data are read, but that doesn't resolve the problem because of the lazy loading in xarray and the dask-aware processing through satpy, which mean data reading is postponed until access. That access will fail if the file has been closed, and it could happen anywhere, which makes reopening just in time challenging. Yet it would have to be in satpy: it's satpy that (in this PR) uses the file system object (which may be a CachingFileSystem) to open a file, and it's to be expected that each file opened through a CachingFileSystem keeps an open file corresponding to the locally cached file.

A single Scene object is unlikely to run into the 1024-file problem (although we've had problems with FCI files in the past because we needed an xarray.open_dataset call for each group; hopefully different groups in the same file could share a file object), but with a MultiScene it's quite likely to happen. Perhaps the MultiScene could take some responsibility for not opening more Scenes than needed, but I'm not sure how exactly this could work.

Maybe we need to extend xarray's lazy loading to include lazy opening (and thus autoclosing), but that would require xarray to be filesystem-aware and to know the parameters for opening a file. Or we'd have to hack this ourselves by subclassing xarray.DataArray. Or provide some interface where users would need to signal explicitly that they will access data for a particular scene, which would then open and close, perhaps with a context manager.
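
To illustrate that last idea, a purely hypothetical sketch of such an explicit access interface (open()/close() on file handlers do not exist today; this only shows the shape a context manager could take):

from contextlib import contextmanager

@contextmanager
def open_scene_files(scn):
    """Hypothetically (re)open the files behind a Scene for the duration of the block."""
    handlers = [fh
                for reader in scn._readers.values()
                for fhs in reader.file_handlers.values()
                for fh in fhs]
    try:
        for fh in handlers:
            fh.open()    # hypothetical: reopen through the stored fsspec filesystem
        yield scn
    finally:
        for fh in handlers:
            fh.close()   # hypothetical: release the underlying open file

# Usage sketch: any computation touching the data happens inside the block.
# with open_scene_files(scn) as scn:
#     scn["C01"].compute()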

I don't know how to resolve this problem. Note that it doesn't block this PR at all; it just limits its usability for some use cases.

@gerritholl
Member

Another regression with this PR. This MCVE:

from satpy.multiscene import MultiScene
from glob import glob

def get_ms():
    mscn = MultiScene.from_files(glob("/data/gholl/cache/fogtools/abi/2020/04/12/18/C1/OR_ABI-L1b-RadM1-M6C*_G16_s2020103180*.nc"), reader="abi_l1b")
    mscn.load(["C01"])
    mscn.scenes
    return mscn.resample(mscn.first_scene["C01"].attrs["area"])

ms = get_ms()
ms.save_animation("/tmp/{name}_{start_time:%Y%m%d_%H%M%S}.mp4")

fails with the following traceback:

Traceback for MCVE
Traceback (most recent call last):                              
  File "mwe98.py", line 11, in <module>
    ms.save_animation("/tmp/{name}_{start_time:%Y%m%d_%H%M%S}.mp4")          
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 653, in save_animation
    self._simple_frame_compute(writers, frame_keys, frames_to_write)                                                        
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 533, in _simple_frame_compute                              
    w.append_data(product_frame.compute())                                                                 
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/base.py", line 167, in compute   
    (result,) = compute(self, traverse=False, **kwargs)                
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/base.py", line 452, in compute
    results = schedule(dsk, keys, **kwargs)             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
    results = get_async(                                          
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)      
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 316, in reraise                  
    raise exc                                                                                                             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)                                                                      
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))              
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))                    
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))                                                                   
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task           
    return func(*(_execute_task(a, cache) for a in args))                                                  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/array/core.py", line 102, in getter
    c = np.asarray(c)                                                  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)     
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 495, in __array__
    return np.asarray(self.array, dtype=dtype)                    
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 655, in __array__      
    return np.asarray(self.array, dtype=dtype)                                                                            
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)                                                         
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)                     
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 28, in __getitem__
    return indexing.explicit_indexing_adapter(             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 845, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)                              
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 38, in _getitem
    return array[key]                                                                                                       
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 146, in __getitem__         
    return self._h5ds[key]                                                                                 
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper                                     
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 573, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 181, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
  File "h5py/defs.pyx", line 484, in h5py.defs.H5Dread
  File "h5py/h5fd.pyx", line 155, in h5py.h5fd.H5FD_fileobj_read
ValueError: seek of closed file

Apparently, when a MultiScene is resampled, references to the original file handlers (in Scene._readers[...].file_handlers) are destroyed, which leads to satpy.readers.abi_l1b.NC_ABI_L1B.__del__ being called, where the xarray object is explicitly closed. I'm not sure why that is causing a problem only with this PR and not with satpy master.

@gerritholl
Member

The latter has to do with closing xarray objects vs. closing underlying open files.

This (analogous to this PR) fails with a ValueError due to operating on a closed file:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
fp = open(fn, "rb")
ds = xarray.open_dataset(fp, chunks=1024)
ds.close()
fp.close()
print(ds["Rad"][400:402, 300:302].load())

Interestingly, this (analogous to current master) succeeds:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
ds = xarray.open_dataset(fn, chunks=1024)
ds.close()
print(ds["Rad"][400:402, 300:302].load())

Apparently, Dataset.close() closes file objects associated with the Dataset, but xarray happily reopens them when data access is needed. To avoid the ValueError, we can either:

  1. make sure references to the FileHandler are not lost, so that NC_ABI_L1B.__del__ isn't called until data are actually no longer needed, or
  2. refrain from closing the file opened via fsspec in NC_ABI_L1B.__del__.

Both would use the same resources. The first would probably be the more appropriate solution, but the second is easier to implement.
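
A minimal sketch of option (2), assuming a hypothetical flag set at construction time to record whether the file came from a caller-provided filesystem (the real reader code may of course differ):

from satpy.readers.abi_l1b import NC_ABI_L1B

class FsAwareNcAbiL1b(NC_ABI_L1B):
    """Hypothetical variant of the ABI L1b handler illustrating option (2)."""

    def __del__(self):
        # Only close the dataset if we opened a plain local path ourselves;
        # a file handed in through an fsspec filesystem is left alone, since
        # xarray cannot reopen it once the underlying file object is closed.
        if not getattr(self, "_opened_from_file_system", False):
            try:
                self.nc.close()
            except (AttributeError, OSError):
                pass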

Incidentally, considering that autoclose=True is already the default when opening datasets, the currently existing self.nc.close() in the ABI reader probably does absolutely nothing at all.

@mraspaud
Member

Do we want to keep this open?

@carloshorn
Author

I am fine with closing this draft PR, which was intended as a starting point for the discussion.
The idea of how to implement the requested feature is given in #1439, using the newly introduced FSFile as the filename.
The task of enabling more readers to accept FSFile instances is beyond the scope of this PR.

@mraspaud
Member

Thanks, closing then.

@mraspaud closed this on Dec 16, 2020
Successfully merging this pull request may close these issues.

Allow reading files passing file objects