Opening fsspec s3 file twice results in invalid start byte #6813
Comments
Note that the second example fails on the first open_dataset, which I assume is expected since it's not mode
Adding
I talked with @wroberts4 about this in person and if we're not missing some reason to not

import io

def read_magic_number_from_file(filename_or_obj, count=8) -> bytes:
    # check byte header to determine file type
    if isinstance(filename_or_obj, bytes):
        magic_number = filename_or_obj[:count]
    elif isinstance(filename_or_obj, io.IOBase):
        if filename_or_obj.tell() != 0:
            # rewind instead of raising
            filename_or_obj.seek(0)
            # warn about re-seeking?
        magic_number = filename_or_obj.read(count)
        # leave the pointer at the start for whoever reads next
        filename_or_obj.seek(0)
    else:
        raise TypeError(f"cannot read the magic number from {type(filename_or_obj)}")
    return magic_number

Additionally, the https://docs.python.org/3/library/io.html#class-hierarchy

@kmuehlbauer @lamorton I saw you commented on the almost related #3991, do you have any thoughts on this? Should we put a PR together to continue the discussion? Maybe the fsspec folks (@martindurant?) have an opinion on this?
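To make the proposal concrete, here is a minimal sketch that exercises the rewinding behaviour on an in-memory buffer. The `io.BytesIO` object is only a stand-in for an fsspec S3 file, and the payload bytes are made up; the function body mirrors the snippet in the comment above rather than any released xarray code.

```python
import io

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # the 8-byte HDF5 file signature


def read_magic_number_from_file(filename_or_obj, count=8) -> bytes:
    # Same logic as the proposal in this thread: rewind instead of raising.
    if isinstance(filename_or_obj, bytes):
        magic_number = filename_or_obj[:count]
    elif isinstance(filename_or_obj, io.IOBase):
        if filename_or_obj.tell() != 0:
            filename_or_obj.seek(0)
        magic_number = filename_or_obj.read(count)
        filename_or_obj.seek(0)
    else:
        raise TypeError(f"cannot read the magic number from {type(filename_or_obj)}")
    return magic_number


# Stand-in for an already-used remote file: a buffer whose pointer has moved.
fobj = io.BytesIO(HDF5_MAGIC + b"made-up payload bytes")
fobj.read(12)  # simulate an earlier reader leaving the pointer mid-file

first = read_magic_number_from_file(fobj)
second = read_magic_number_from_file(fobj)
print(first == second == HDF5_MAGIC, fobj.tell())
```

With this variant, repeated sniffing of the same object returns the same header and leaves the pointer at offset 0 each time, which is the behaviour the issue asks for.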
Yes, it is reasonable to always I am mildly against subclassing from RawIOBase, since some file-likes might choose to implement text mode right in the class (as opposed to a text wrapper layered on top). Pretty surprised that it doesn't have read()/write(), though, since all the derived classes do.
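The point about `io.IOBase` lacking `read()` can be checked directly with the standard library. The sketch below also shows one possible duck-typed check for objects that do not inherit from `io.IOBase` at all; `DuckFile` and `looks_readable_and_seekable` are hypothetical names used for illustration only and are not part of xarray or fsspec.

```python
import io

# io.IOBase itself declares neither read() nor write() ...
iobase_has_read = hasattr(io.IOBase, "read")
iobase_has_write = hasattr(io.IOBase, "write")
print("IOBase", iobase_has_read, iobase_has_write)

# ... while each derived abstract class declares read().
for cls in (io.RawIOBase, io.BufferedIOBase, io.TextIOBase):
    print(cls.__name__, hasattr(cls, "read"))


class DuckFile:
    """Hypothetical file-like that does not inherit from io.IOBase."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, n=-1):
        return self._buf.read(n)

    def seek(self, pos, whence=0):
        return self._buf.seek(pos, whence)

    def tell(self):
        return self._buf.tell()


def looks_readable_and_seekable(obj) -> bool:
    # Duck-typed alternative to isinstance(obj, io.IOBase): only asks
    # for the three methods that header sniffing actually needs.
    return all(callable(getattr(obj, name, None)) for name in ("read", "seek", "tell"))


duck = DuckFile(b"abc")
print(isinstance(duck, io.IOBase), looks_readable_and_seekable(duck))
```

An `isinstance(..., io.IOBase)` test accepts any subclass regardless of text or binary mode, whereas the duck-typed check also accepts classes outside the hierarchy; which trade-off is right here is for the maintainers to decide.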
Oh duh, that's a good point. So it might be fine dask-wise if the assumption is that Yeah I thought the
@djhoese Is that not already an existing problem since
Good point. My initial answer was going to be that it isn't a problem because in the second usage of the file we would get the exception about
@wroberts4 I'd say maybe make a pull request and we'll see what (if any) tests fail and what the people in charge of merging think about it. I think we've gone through the various possibilities and I think if there were any thread-safety issues trying to be protected against with the exception as it was, they weren't actually being protected against (later reading of the file could have caused an issue).
Just hitting this same issue mentioned downstream at xarray-contrib/datatree#130 while trying to read ICESat-2 HDF5 files from S3, but realized that the fix should happen in
What happened?
When I open an fsspec s3 file twice, it results in an error, "file-like object read/write pointer not at the start of the file".
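The failure mode can be illustrated without S3 access. The following is a simplified sketch, assuming the check behaves like a strict "pointer must be at offset 0" test; `strict_magic_number` and `fake_open` are hypothetical stand-ins written for this illustration and are not xarray's actual code, and the `io.BytesIO` buffer replaces the fsspec file object.

```python
import io


def strict_magic_number(fobj: io.IOBase, count: int = 8) -> bytes:
    # Hypothetical stand-in for a strict check: refuse to sniff the header
    # unless the read/write pointer is at the start of the file.
    if fobj.tell() != 0:
        raise ValueError(
            "file-like object read/write pointer not at the start of the file"
        )
    magic = fobj.read(count)
    fobj.seek(0)
    return magic


def fake_open(fobj: io.IOBase) -> bytes:
    # Sniff the header, then read some data, leaving the pointer mid-file
    # (as a real backend would after loading part of the dataset).
    magic = strict_magic_number(fobj)
    fobj.read(16)
    return magic


fobj = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\x00" * 64)
fake_open(fobj)  # first open succeeds
try:
    fake_open(fobj)  # second open finds the pointer at offset 16
    second_error = None
except ValueError as exc:
    second_error = str(exc)
print(second_error)
```

The first call succeeds because the pointer starts at 0; the second fails only because the first call left the pointer elsewhere, not because the data is invalid.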
Here's a Dockerfile I used for the environment:
Input1:
Output1:
----- INVALID EXAMPLE 2 -----
Input2:
Output2:
----- INVALID EXAMPLE 2 -----
What did you expect to happen?
I expect both calls to open_dataset to yield the same result and not error. The following runs without errors:
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
I see the same error mentioned in other issues like #3991, but it was determined to be a problem with the input data.
Environment
INSTALLED VERSIONS
commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.20.1.el8_5.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: None
xarray: 2022.6.0rc0
pandas: 1.4.3
numpy: 1.23.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: 1.0.1
h5py: 3.7.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.7.0
distributed: 2022.7.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.0.4
conda: 4.13.0
pytest: None
IPython: None
sphinx: None