Raise an informative error message when object array has mixed types #4700
Conversation
Before

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
Out[3]: dtype('O')

As pointed out in #2620, this doesn't seem problematic until the user tries writing the xarray object to disk. This results in a very cryptic error message:

In [7]: ds.to_netcdf('test.nc', engine='netcdf4')
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__setitem__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._put()
TypeError: expected bytes, int found

After

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-addaab43c03a> in <module>
----> 1 xr.conventions._infer_dtype(data, 'test')

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int

During I/O, the user gets:

...
~/devel/pydata/xarray/xarray/conventions.py in ensure_dtype_not_object(var, name)
    223         data[missing] = fill_value
    224     else:
--> 225         data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))
    226
    227     assert data.dtype.kind != "O" or data.dtype.metadata

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int
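For reference, here is a minimal standalone sketch of the check the tracebacks above point at, reconstructed from the lines visible in the traceback (the helper name infer_object_dtype_sketch is made up here; the actual implementation in xarray/conventions.py may differ in details):

import numpy as np

def infer_object_dtype_sketch(array, name):
    # Collect the native Python types present in the object array,
    # mirroring the traceback: set(map(lambda x: type(x), array.flatten())).
    native_dtypes = set(map(type, array.flatten()))
    if len(native_dtypes) > 1:
        raise ValueError(
            "unable to infer dtype on variable {!r}; object array "
            "contains mixed native types: {}".format(
                name, ",".join(sorted(t.__name__ for t in native_dtypes))
            )
        )
    # A homogeneous object array would fall through to the usual dtype inference;
    # converting via a plain list is used here only as a stand-in for that step.
    return np.asarray(array.tolist()).dtype

data = np.array([["x", 1], ["y", 2]], dtype="object")
infer_object_dtype_sketch(data, "test")  # ValueError: ... mixed native types: int,str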
This looks good!
I ran a quick performance check — I'm not sure how large an impact this is in proportional terms though.
Alternatives — not ideal ones — would be to wait until the main error is raised, or only test a subset of the values.
In [7]: np.asarray(list("abcdefghijklmnopqrstuvwxyz"))
Out[7]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
dtype='<U1')
In [14]: array = np.repeat(x, 5_000_000)  # x: the alphabet array from In [7]
In [15]: %timeit set(np.vectorize(type, otypes=[object])(array.ravel()))
10.2 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I thought of taking a random sample from the array and checking the types on the sample only, but I wasn't confident about how representative such a sample would be, or how to deal with misleading, skewed samples. If anyone has thoughts on this, please let me know.
It took around 40 s for an array of 10**9 elements. That would be around 150 years of daily data (180*360*150*365). I am not sure, though, how much sense it makes to have such a large array with object dtype. Also, an array of this size is likely a dask array, and there is already a performance warning for that (see lines 194 to 197 in 68d3c34). So I'd say go ahead.
@mathause, just to make sure I am not misinterpreting your comment, is this a go-ahead to sample the array to determine the types? :)
Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.)
@mathause, I am noticing a performance hit even for the special use cases. Here's how I am doing the sampling:

sample_indices = np.random.choice(array.size, size=min(20, array.size), replace=False)
native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))

and here's the code snippet I tested this on:

In [1]: import xarray as xr, numpy as np

In [2]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object")

In [3]: array = np.repeat(x, 5_000_000)

In [4]: array.size
Out[4]: 130000000

In [5]: array.dtype
Out[5]: dtype('O')

Without sampling

In [6]: %timeit xr.conventions._infer_dtype(array, "test")
7.63 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With sampling

In [15]: %timeit xr.conventions._infer_dtype(array, "test")
8.31 s ± 395 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I could be wrong, but the sampling doesn't seem to be worth it.
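For completeness, a self-contained version of the sampling variant being benchmarked above (the helper name sample_native_types is hypothetical; this variant is not what ended up in the PR):

import numpy as np

def sample_native_types(array, sample_size=20):
    # Inspect the native types of a small random sample of elements instead of
    # scanning the whole array; a rare type that falls outside the sample is missed.
    flat = array.ravel()
    sample_indices = np.random.choice(flat.size, size=min(sample_size, flat.size), replace=False)
    return set(np.vectorize(type, otypes=[object])(flat[sample_indices]))

data = np.array([["x", 1], ["y", 2]], dtype="object")
print(sample_native_types(data))  # {<class 'str'>, <class 'int'>}: the sample covers all four elements here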
No, I wouldn't subsample. By "normal use case" I meant saving legitimate object arrays; I am not sure how often those occur in the wild.
Unit Test Results: 6 files, 6 suites, 54m 27s ⏱️. Results for commit 2107949.
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Branch force-pushed from a6faf00 to 9b691c2.
I think this is most likely for lots of strings which aren't using a numpy string type. @andersy005, to the extent the perf difference is 7 seconds vs 8 seconds, I would definitely vote to merge. The case I was originally concerned about is something suddenly taking 1000 seconds. If we're confident that's not the case, I would also vote to merge...
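As a small illustration of that point (the arrays below are made-up examples, not from this PR): the element-by-element type scan only applies when strings are boxed in an object array, not when they are stored with a native numpy string dtype.

import numpy as np

native = np.asarray(list("abc"))                 # dtype('<U1'): fixed-width numpy strings
boxed = np.asarray(list("abc"), dtype=object)    # dtype('O'): each element is a Python str

print(native.dtype, boxed.dtype)  # <U1 object
# Only the object-dtype array goes through the per-element type check before serialization.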
* upstream/main:
  Raise an informative error message when object array has mixed types (pydata#4700)
  Start renaming `dims` to `dim` (pydata#8487)
  Reduce redundancy between namedarray and variable tests (pydata#8405)
  Fix Zarr region transpose (pydata#8484)
  Refine rolling_exp error messages (pydata#8485)
  Use numbagg for `ffill` by default (pydata#8389)
  Fix bug for categorical pandas index with categories with EA dtype (pydata#8481)
  Improve "variable not found" error message (pydata#8474)
  Add whatsnew for pydata#8475 (pydata#8478)
  Allow `rank` to run on dask arrays (pydata#8475)
  Fix mypy tests (pydata#8476)
  Use concise date format when plotting (pydata#8449)
  Fix `map_blocks` docs' formatting (pydata#8464)
  Consolidate `_get_alpha` func (pydata#8465)
Passes isort . && black . && mypy . && flake8
User visible changes documented in whats-new.rst
New functions/methods listed in api.rst