
fsspec FileSystem support #1321

Closed · wants to merge 13 commits

Conversation

carloshorn

This PR adds a file_system argument to Scene which the file handlers can use to access the files behind the given filenames.
This is useful if the filenames are not regular paths on the local file system.

@carloshorn
Author

Hi @djhoese,
I started this draft PR to get a better idea of a possible implementation. Maybe you can help me with suggestions for tests and code review.

Maybe you can also give me some advice on an implementation detail:
https://github.com/carloshorn/satpy/blob/551887d3859584e81c48f90247cb1eb4619aaf7f/satpy/readers/yaml_reader.py#L458-L462
I thought that checking whether file_system is in the argument list of a given file handler's constructor could help determine whether the new file system feature is supported. Do you think this is a useful check, or do you know other ways to do it?
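
For concreteness, the kind of check being discussed might look roughly like this (the helper and its arguments are illustrative, not the actual code at the link above):

import inspect

def maybe_pass_file_system(filetype_cls, fh_kwargs, file_system):
    """Forward file_system only if the file handler constructor accepts it.

    Hypothetical helper sketching the check discussed above.
    """
    params = inspect.signature(filetype_cls.__init__).parameters
    if "file_system" in params:
        fh_kwargs["file_system"] = file_system
    elif file_system is not None:
        # Mirrors the error raised in the linked lines of yaml_reader.py.
        raise NotImplementedError(
            "File handler {} does not support file systems".format(
                filetype_cls.__name__))
    return fh_kwargs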

@djhoese
Member

djhoese commented Aug 25, 2020

Really nice start. Thanks for getting this going. That bit of code is definitely not great, and if it ends up being the way this is implemented, I would hope it is only a temporary workaround until other file handlers are updated to either support it or explicitly reject it.

Other ways I see this being possible:

  1. Create a separate reader base class that adds this functionality. This would mean reader developers would have to "opt in" to agreeing that their file handlers can support the file_system parameter. The base yaml reader would then either throw away the file_system keyword argument before creating file handlers (or, better, raise an exception).
  2. Create a separate file handler base class that implements this functionality. The base reader can then do an isinstance(file_handler_obj, FileSystemFileHandler) check and pass the keyword argument if True (see the sketch after this list).
  3. Update all file handlers currently in Satpy to accept any keyword arguments (**kwargs) and pass them to the base file handler. We'd then need to update the base file handler's __init__ so that it raises an exception on any unrecognized kwargs.
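
A minimal sketch of what option 2 could look like (FileSystemFileHandler and create_handler are illustrative names; BaseFileHandler is Satpy's existing file handler base class):

from satpy.readers.file_handlers import BaseFileHandler

class FileSystemFileHandler(BaseFileHandler):
    """Hypothetical opt-in base class for handlers that accept a file system."""

    def __init__(self, filename, filename_info, filetype_info, file_system=None):
        super().__init__(filename, filename_info, filetype_info)
        self.file_system = file_system

def create_handler(filetype_cls, filename, filename_info, filetype_info,
                   file_system=None):
    # The base reader would use an issubclass check (the pre-construction
    # analogue of the isinstance check mentioned above) to decide whether
    # to pass the keyword argument.
    if file_system is not None and issubclass(filetype_cls, FileSystemFileHandler):
        return filetype_cls(filename, filename_info, filetype_info,
                            file_system=file_system)
    return filetype_cls(filename, filename_info, filetype_info)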

I also saw that your use of open_file had it passed to a reader.read. I'm wondering if it would be cleaner to always have open_file return an xarray.Dataset version of the file... nah, maybe that's not possible (NetCDF files with groups).

Nice start.

@gerritholl
Member

As I understand it, each of the alternatives proposed by @djhoese requires that any reader seeking to support this be updated. Is that correct?

@gerritholl
Member

gerritholl commented Sep 29, 2020

I'm looking at reading remote data over fsspec, but so far I'm not having any luck reading an fsspec-opened file even outside satpy (with or without caching; it makes no difference):

import fsspec, xarray
with fsspec.open("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc", mode="rb", anon=True) as f:
    ds = xarray.open_dataset(f)
    print(ds["Rad"][400, 300])

which fails with a ValueError (I have reported this at pydata/xarray#4471). If it's not supported yet by xarray I'm not sure how far we can get supporting this in satpy right now.

Comment on lines 461 to 462
NotImplementedError("File handler {} does not support file systems".format(
    filetype_cls.__name__))
Member

there is a raise keyword missing here

@gerritholl
Member

gerritholl commented Sep 29, 2020

And when I try the same directly using h5netcdf, I get the requested value but also a core dump:

import fsspec, h5netcdf
with fsspec.open("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc", mode="rb", anon=True) as f:
    ds = h5netcdf.File(f)
    print(ds["Rad"][400, 300])

resulting in

1023
Exception ignored in: <function File.close at 0x7fd770efbf70>
Traceback (most recent call last):
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 802, in close
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 794, in flush
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 531, in _create_dim_scales
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 374, in _h5group
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5py/_hl/group.py", line 264, in __getitem__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
  File "h5py/h5i.pyx", line 43, in h5py.h5i.wrap_identifier
ImportError: sys.meta_path is None, Python is likely shutting down
Fatal Python error: PyThreadState_Get: no current thread
Python runtime state: finalizing (tstate=0x55a7e9fedc50)

Aborted (core dumped)

(see h5netcdf/h5netcdf#80)

so I don't know if we can support NetCDF files coming from fsspec, unless perhaps they use the WholeFileCacheFileSystem, which presumably provides an API to access the locally cached copy of the file (although I haven't yet found how that would work).

@gerritholl
Member

@djhoese found that when passing decode_cf=False, the netCDF4 file can be opened with xarray without (visible) errors, so at least for readers that apply their own scale factors etc. we may make some progress (see pydata/xarray#4471 for more information).
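
A minimal sketch of that workaround, using the same GOES-16 file as above (decoding of scale factors etc. would then be up to the reader):

import fsspec
import xarray

url = ("s3://noaa-goes16/ABI-L1b-RadF/2019/321/14/"
       "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc")
with fsspec.open(url, mode="rb", anon=True) as f:
    # decode_cf=False skips the CF decoding that tripped up the earlier
    # attempt (see pydata/xarray#4471); raw unscaled counts are returned.
    ds = xarray.open_dataset(f, decode_cf=False, engine="h5netcdf")
    print(ds["Rad"][400, 300])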

@gerritholl
Member

What I don't understand is that I can open and close the file with xarray using the h5netcdf backend and decode_cf=False, but when I try to open and close it with h5netcdf directly I get core dumps or AttributeErrors (see h5netcdf/h5netcdf#80). Currently netcdf_utils.py uses NetCDF4, which doesn't work with open file objects. What would be the potential caveats of switching to h5netcdf or xarray there?

@djhoese
Member

djhoese commented Sep 29, 2020

The most important thing here, I think, is that I'm 90% sure that although h5netcdf can read file-like objects, it isn't doing it intelligently. So giving it an fsspec open file object doesn't mean that it is doing byte-range requests; I think it is downloading the whole file before really reading it. I could be wrong (and pleasantly surprised) about that though.

The other main thing is that h5netcdf is not as widely used, in my opinion, and sometimes has issues when combined with netcdf4-python in the same runtime (if you import them both, from my understanding). For the particular case you are playing with, my #mode=bytes PR would be the better solution, as it intelligently requests only the bytes needed to load the data (from my understanding).

@gerritholl
Member

There are two aspects of reading remotely that I'm particularly interested in:

  1. Reading only the parts of the file I need, which can apparently be done with #mode=bytes if available through https (see Allow for #mode=bytes in file URLs to support NetCDF byte ranges #1349; a sketch follows this list), and
  2. caching on local disk with either WholeFileCacheFileSystem or (preferably) CachingFileSystem (which uses black magic sparse files).
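
A minimal sketch of aspect (1), assuming the bucket is also reachable over plain https and the netCDF-C library was built with byte-range support:

import netCDF4

# The '#mode=bytes' suffix asks the netCDF-C library to issue HTTP
# byte-range requests instead of downloading the whole file.
url = ("https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadF/2019/321/14/"
       "OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
       "#mode=bytes")
with netCDF4.Dataset(url, mode="r") as ds:
    print(ds["Rad"][3000, 4000])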

Point (2) has priority for me, because when I'm developing I often need to try my code on the same selection of scenes repeatedly. When moving from manually downloading selected files to having some code do so, I found that I started to reinvent the wheel of WholeFileCacheFileSystem and to develop something inferior to CachingFileSystem.

I'm not sure how I could combine the two. Although I can wrap an S3FileSystem in a CachingFileSystem class, there isn't an fsspec implementation for https, probably because https does not usually expose typical filesystem-like behaviour such as directory listings.

@gerritholl
Member

Some network data usage measurements:

  • When I use an instance of S3FileSystem directly to read a single value from an 89 MiB NetCDF file using xarray and the h5netcdf engine, it receives 56.281 MB. When I run it again, it uses 56.167 MB.
  • When I wrap this S3FileSystem instance in a WholeFileCacheFileSystem instance, the first call downloads 90.838 MB and the second call downloads 0.009 MB.
  • When I wrap this S3FileSystem instance in a CachingFileSystem instance, the first call downloads 3.565 MB and the second call downloads 0.005 MB.

The only case in which it downloads the whole file is with WholeFileCacheFileSystem, which is expected. I don't quite understand why wrapping it in a CachingFileSystem causes much less data to be read compared to not using any caching, but in neither case does it download the entire file.

I used nethogs -v 3 to monitor this, and did so on my personal notebook because this tool requires superuser access which I don't have on my work computer. The script I used for this is approximately:

import s3fs, fsspec.implementations.cached, xarray
cache_storage_block = "/data/gholl/cache/fscache/blockcache"
cache_storage_whole = "/data/gholl/cache/fscache/filecache"
fs_s3 = s3fs.S3FileSystem(anon=True)
fs_whole = fsspec.implementations.cached.WholeFileCacheFileSystem(
        fs=fs_s3,
        cache_storage=cache_storage_whole,
        check_files=False,
        expiry_times=False,
        same_names=True)
fs_block = fsspec.implementations.cached.CachingFileSystem(
        fs=fs_s3,
        cache_storage=cache_storage_block,
        check_files=False,
        expiry_times=False,
        same_names=True)
target = "noaa-goes16/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
print("loading", target)
#with fs_s3.open(target) as of:
with fs_block.open(target, block_size=2**20) as of:
    with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
        print(ds["Rad"][3000, 4000])
del ds, of  # prevents https://github.com/intake/filesystem_spec/issues/404

where I adapted the context manager opening block to test the different scenarios.

Conclusions:

  • h5netcdf isn't greatly efficient, but it does not download the entire file;
  • the block caching brings down the amount downloaded much further, but I don't understand how.

@djhoese
Member

djhoese commented Sep 30, 2020

I'd be curious if the download size changes with variable/attribute being read. Maybe it is only reading as far as it has to?

Also mode=bytes would be an interesting test.

@gerritholl
Member

I found that netCDF4.Dataset downloads the least data, but is nevertheless by far the slowest alternative (except for blockcache with a very small block size).

I've added the loading of two more pixels, and an alternative with netCDF4.Dataset such that the end of the script is now:

import netCDF4  # needed for the netCDF4.Dataset alternative below

bucket = "noaa-goes16"
path = "/ABI-L1b-RadF/2019/321/14/OR_ABI-L1b-RadF-M6C01_G16_s20193211440283_e20193211449591_c20193211450047.nc"
target = bucket + path
url = f"https://{bucket:s}.s3.amazonaws.com" + path + "#mode=bytes"
print("loading", target)
#with fs_s3.open(target) as of:
with netCDF4.Dataset(url, mode="r") as ds:
#with fs_block.open(target, block_size=2**20) as of:
#    with xarray.open_dataset(of, decode_cf=False, engine="h5netcdf") as ds:
        print(ds["Rad"][3000, 4000])
        print(ds["Rad"][8000, 9000])
        print(ds["DQF"][8000, 9000])

I find:

  • netCDF4.Dataset with #mode=bytes downloads 0.270 MB and takes 28s
  • netCDF4.Dataset without #mode=bytes fails with OSError: file not found.
  • h5netcdf directly over s3fs downloads 81.189 MB and takes 13.6s
  • h5netcdf over CachingFileSystem with block_size==2**20 downloads 5.810 MB and takes 5.5s
  • h5netcdf over CachingFileSystem with block_size==2**10 downloads 0.378 MB and takes 43s
  • h5netcdf over CachingFileSystem with block_size==2**15 downloads 0.446 MB and takes 6.1s
  • h5netcdf over CachingFileSystem with block_size==2**25 downloads 57.591 MB and takes 8.3s

I don't know what exactly the netCDF4 library does, but I find it interesting that it manages to take twice as long to download 0.270 MB as h5netcdf takes to download almost the entire file. The winner here would appear to be h5netcdf over CachingFileSystem with a smartly chosen block_size, although what counts as smartly chosen may depend on the use case.

@djhoese
Member

djhoese commented Sep 30, 2020

I'm very confused by those download numbers. In the near future we may have to see what requests the NetCDF4 library is actually making to the S3 end point. It shouldn't be that slow (in my opinion). However, what it should be doing is reading a known size for all the header information, then downloading exactly the block of data requested. My guess is the caching file system is requesting a block that ends up including the header and maybe the data being requested. Perhaps the overhead of each request on this particular S3 bucket is too much so even though NetCDF4 (the C library) is downloading as little as possible, it takes so long to make each request it needs that it is much slower than the other options.

@@ -704,7 +704,7 @@ def find_files_and_readers(start_time=None, end_time=None, base_dir=None,

 def load_readers(filenames=None, reader=None, reader_kwargs=None,
-                 ppp_config_dir=None):
+                 ppp_config_dir=None, file_system=None):
Member

Should this argument be called file_system or fs? It's already called fs in find_files_and_readers, and the fsspec classes CachingFileSystem and WholeFileCacheFileSystem also use fs for this parameter, so fs would be more consistent with naming elsewhere.

Member

I think file_system is better, more explicit.

Member

Fair point — should we then change it in find_files_and_readers as well for consistency? Otherwise it's a tradeoff between consistency and what is more explicit.

Personally, I prefer fs, which is shorter and already very common as an abbreviation for filesystem: nfs, ntfs, zfs, xfs, reiserfs, fsspec, s3fs, fspath, and elsewhere.

@gerritholl
Member

More download number confusion: when I use h5netcdf over CachingFileSystem, but through the satpy ABI reader with h5netcdf as the xarray backend, it again appears to download only up to the place requested (as determined in this case by the storage used by the sparse files in the block cache directory).

Add support for passing an fsspec filesystem instance to the ABI L1B
reader.  Still needs testing and documentation.
@gerritholl
Member

Why pass file_system explicitly around from method to method down the line, when it could be passed as part of the reader_kwargs or fh_kwargs dictionary? It doesn't seem to be used anywhere in the stack, apart from being passed from Scene.__init__ to find_files_and_readers, which has long been deprecated.

@gerritholl
Member

I suppose one reason would be that the file_system argument is envisioned to be shared by many different readers, so it's not reader-specific as such.

Instead of passing file_system explicitly in every step in the stack,
pass it as part of reader_kwargs and fh_kwargs.  Instead of LBYL, use
EAFP to check if reader fails to support it.
Where ABI (or GLM) mocks open_dataset, also mock LocalFileSystem, as
with the file_system passing there are two open commands.  Now all tests
pass except one in the GAC/LAC, which I don't understand well enough to
fix, but perhaps @carloshorn can fix this one.
@gerritholl
Member

@carloshorn Would you be willing to give me write access to your branch so we can cooperate on this PR?

@carloshorn
Author

@gerritholl, you should now have access to the repository.
Sorry for the late reply, I did not have much time in the last few weeks to work on it.

@gerritholl
Member

I just realised that, as currently written, this does not allow passing files corresponding to multiple readers when those are not from the same filesystem.

@gerritholl
Member

Playing around with this some more, I discovered another problem. I was using it in a MultiScene processing an hour's worth of ABI M1 + M2 data using a CachingFileSystem wrapping an S3FileSystem. When I'm processing all ABI channels, an hour of data consists of 2 * 60 * 16 = 1920 files. When processing files on disk and passing a filename to xarray.open_dataset, the autoclose=True argument (apparently true by default) ensures files are closed when not needed. When passing open files to xarray, xarray doesn't close them (which is very reasonable of xarray, because it can't reopen them). Unless satpy takes responsibility for closing files and reopening them as needed, this means the number of open files per process (1024 on my system) is quickly exhausted:

Traceback resulting in too many open files
  File "/home/gholl/checkouts/sattools/src/sattools/scutil.py", line 43, in _get_all_areas_from_multiscene
    for sc in ms.scenes:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 221, in scenes
    self._scenes = list(self._scenes)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 123, in __iter__
    scn = next(self._self_iter)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 113, in _create_cached_iter
    for scn in self._scene_gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 262, in _call_scene_func
    for scn in gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 123, in __iter__
    scn = next(self._self_iter)
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 113, in _create_cached_iter
    for scn in self._scene_gen:
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 200, in <genexpr>
    scenes = (Scene(filenames=fg, **scene_kwargs) for fg in file_groups)
  File "/home/gholl/checkouts/satpy/satpy/scene.py", line 166, in __init__
    self._readers = self._create_reader_instances(
  File "/home/gholl/checkouts/satpy/satpy/scene.py", line 215, in _create_reader_instances
    return load_readers(filenames=filenames,
  File "/home/gholl/checkouts/satpy/satpy/readers/__init__.py", line 513, in load_readers
    reader_instance.create_filehandlers(
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 612, in create_filehandlers
    filehandlers = self._new_filehandlers_for_filetype(filetype_info,
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 600, in _new_filehandlers_for_filetype
    return list(filtered_iter)
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 568, in filter_fh_by_metadata
    for filehandler in filehandlers:
  File "/home/gholl/checkouts/satpy/satpy/readers/yaml_reader.py", line 502, in _new_filehandler_instances
    yield filetype_cls(filename, filename_info, filetype_info,
  File "/home/gholl/checkouts/satpy/satpy/readers/abi_base.py", line 52, in __init__
    self.of = self.file_system.open(self.filename, block_size=self.block_size)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/spec.py", line 897, in open
    f = self._open(
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 327, in _open
    self.save_cache()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 376, in <lambda>
    return lambda *args, **kw: getattr(type(self), item)(self, *args, **kw)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/fsspec/implementations/cached.py", line 177, in save_cache
    move(fn2, fn)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 802, in move
    copy_function(src, real_dst)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 432, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
OSError: [Errno 24] Too many open files: '/data/gholl/cache/ABI-block-cache/cache'

I tried to make the ABI reader .close() and re-open() the file whenever data are read, but that doesn't resolve the problem because of the lazy loading in xarray and the dask-aware processing through satpy, which mean data reading is postponed until access. That access will fail if the file has been closed, and it could happen anywhere, which makes reopening just in time challenging. Yet it would have to be in satpy: it's satpy that (in this PR) uses the file system object (which may be a CachingFileSystem) to open a file, and it's to be expected that each file opened through a CachingFileSystem keeps an open file corresponding to the locally cached file.

A single Scene object is unlikely to run into the 1024-file problem (although we've had problems with FCI files in the past because we needed an xarray.open_dataset call for each group; hopefully different groups in the same file could share a file object), but with a MultiScene it's quite likely to happen. Perhaps the MultiScene could take some responsibility for not opening more Scenes than needed, but I'm not sure how exactly this could work.

Maybe we need to extend xarray's lazy loading to include lazy opening (and thus autoclosing), but that would require xarray to be filesystem-aware and to know the parameters for opening a file. Or we'd have to hack this ourselves by subclassing xarray.DataArray. Or provide some interface where users would need to signal explicitly that they will access data for a particular scene, which would then open and close, perhaps with a context manager.
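
To illustrate that last idea, a purely hypothetical sketch of such an explicit access interface (open()/close() on file handlers do not exist today; this only shows the shape a context manager could take):

from contextlib import contextmanager

@contextmanager
def open_scene_files(scn):
    """Hypothetically (re)open the files behind a Scene for the duration of the block."""
    handlers = [fh
                for reader in scn._readers.values()
                for fhs in reader.file_handlers.values()
                for fh in fhs]
    try:
        for fh in handlers:
            fh.open()    # hypothetical: reopen through the stored fsspec filesystem
        yield scn
    finally:
        for fh in handlers:
            fh.close()   # hypothetical: release the underlying open file

# Usage sketch: any computation touching the data happens inside the block.
# with open_scene_files(scn) as scn:
#     scn["C01"].compute()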

I don't know how to resolve this problem. Note that it doesn't block this PR at all; it just limits its usability for some use cases.

@gerritholl
Member

Another regression with this PR. This MCVE:

from satpy.multiscene import MultiScene
from glob import glob

def get_ms():
    mscn = MultiScene.from_files(glob("/data/gholl/cache/fogtools/abi/2020/04/12/18/C1/OR_ABI-L1b-RadM1-M6C*_G16_s2020103180*.nc"), reader="abi_l1b")
    mscn.load(["C01"])
    mscn.scenes
    return mscn.resample(mscn.first_scene["C01"].attrs["area"])

ms = get_ms()
ms.save_animation("/tmp/{name}_{start_time:%Y%m%d_%H%M%S}.mp4")

fails with the following traceback:

Traceback for MCVE
Traceback (most recent call last):                              
  File "mwe98.py", line 11, in <module>
    ms.save_animation("/tmp/{name}_{start_time:%Y%m%d_%H%M%S}.mp4")          
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 653, in save_animation
    self._simple_frame_compute(writers, frame_keys, frames_to_write)                                                        
  File "/home/gholl/checkouts/satpy/satpy/multiscene.py", line 533, in _simple_frame_compute                              
    w.append_data(product_frame.compute())                                                                 
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/base.py", line 167, in compute   
    (result,) = compute(self, traverse=False, **kwargs)                
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/base.py", line 452, in compute
    results = schedule(dsk, keys, **kwargs)             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
    results = get_async(                                          
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)      
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 316, in reraise                  
    raise exc                                                                                                             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)                                                                      
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))              
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))                    
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))                                                                   
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task           
    return func(*(_execute_task(a, cache) for a in args))                                                  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/dask/array/core.py", line 102, in getter
    c = np.asarray(c)                                                  
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)     
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 495, in __array__
    return np.asarray(self.array, dtype=dtype)                    
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 655, in __array__      
    return np.asarray(self.array, dtype=dtype)                                                                            
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)                                                         
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)                     
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 28, in __getitem__
    return indexing.explicit_indexing_adapter(             
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 845, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)                              
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 38, in _getitem
    return array[key]                                                                                                       
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5netcdf/core.py", line 146, in __getitem__         
    return self._h5ds[key]                                                                                 
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper                                     
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 573, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 181, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
  File "h5py/defs.pyx", line 484, in h5py.defs.H5Dread
  File "h5py/h5fd.pyx", line 155, in h5py.h5fd.H5FD_fileobj_read
ValueError: seek of closed file

Apparently, when a MultiScene is resampled, references to the original file handlers (in Scene._readers[...].file_handlers) are destroyed, which leads to satpy.readers.abi_l1b.NC_ABI_L1B.__del__ being called, where the xarray object is explicitly closed. I'm not sure why that is causing a problem only with this PR and not with satpy master.

@gerritholl
Member

The latter has to do with closing xarray objects vs. closing underlying open files.

This (analogous to this PR) fails with a ValueError due to operating on a closed file:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
fp = open(fn, "rb")
ds = xarray.open_dataset(fp, chunks=1024)
ds.close()
fp.close()
print(ds["Rad"][400:402, 300:302].load())

Interestingly, this (analogous to current master) succeeds:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
ds = xarray.open_dataset(fn, chunks=1024)
ds.close()
print(ds["Rad"][400:402, 300:302].load())

Apparently, Dataset.close() closes file objects associated with the Dataset, but xarray happily reopens them when data access is needed. To avoid the ValueError, we can either:

  1. make sure references to the FileHandler are not lost, so that NC_ABI_L1B.__del__ isn't called until data are actually no longer needed, or
  2. refrain from closing the file opened via fsspec in NC_ABI_L1B.__del__.

Both would use the same resources. The first would probably be the more appropriate solution, but the second is easier to implement.
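
A minimal sketch of option (2), assuming a hypothetical flag set at construction time to record whether the file came from a caller-provided filesystem (the real reader code may of course differ):

from satpy.readers.abi_l1b import NC_ABI_L1B

class FsAwareNcAbiL1b(NC_ABI_L1B):
    """Hypothetical variant of the ABI L1b handler illustrating option (2)."""

    def __del__(self):
        # Only close the dataset if we opened a plain local path ourselves;
        # a file handed in through an fsspec filesystem is left alone, since
        # xarray cannot reopen it once the underlying file object is closed.
        if not getattr(self, "_opened_from_file_system", False):
            try:
                self.nc.close()
            except (AttributeError, OSError):
                pass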

Incidentally, considering that autoclose=True is already the default when opening datasets, the currently existing self.nc.close() in the ABI reader probably does absolutely nothing at all.

@mraspaud
Member

Do we want to keep this open?

@carloshorn
Author

I am fine with closing this draft PR, which was intended as a starting point for the discussion.
The idea of how to implement the requested feature is given in #1439, using the newly introduced FSFile as the filename.
The task of enabling more readers to accept FSFile instances is beyond the scope of this PR.

@mraspaud
Member

Thanks, closing then.

@mraspaud closed this on Dec 16, 2020
Successfully merging this pull request may close these issues.

Allow reading files passing file objects