Raise an informative error message when object array has mixed types #4700
Conversation
Before

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
Out[3]: dtype('O')

As pointed out in #2620, this doesn't seem problematic until the user tries writing the xarray object to disk. This results in a very cryptic error message:

In [7]: ds.to_netcdf('test.nc', engine='netcdf4')
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__setitem__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._put()
TypeError: expected bytes, int found

After

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-addaab43c03a> in <module>
----> 1 xr.conventions._infer_dtype(data, 'test')

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int

During I/O, the user gets:

...
~/devel/pydata/xarray/xarray/conventions.py in ensure_dtype_not_object(var, name)
    223         data[missing] = fill_value
    224     else:
--> 225         data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))
    226
    227     assert data.dtype.kind != "O" or data.dtype.metadata

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int
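For reference, here is a minimal standalone sketch of the check the tracebacks above point at, reconstructed from the lines visible in the traceback (the helper name infer_object_dtype_sketch is made up here; the actual implementation in xarray/conventions.py may differ in details):

import numpy as np

def infer_object_dtype_sketch(array, name):
    # Collect the native Python types present in the object array,
    # mirroring the traceback: set(map(lambda x: type(x), array.flatten())).
    native_dtypes = set(map(type, array.flatten()))
    if len(native_dtypes) > 1:
        raise ValueError(
            "unable to infer dtype on variable {!r}; object array "
            "contains mixed native types: {}".format(
                name, ",".join(sorted(t.__name__ for t in native_dtypes))
            )
        )
    # A homogeneous object array would fall through to the usual dtype inference;
    # converting via a plain list is used here only as a stand-in for that step.
    return np.asarray(array.tolist()).dtype

data = np.array([["x", 1], ["y", 2]], dtype="object")
infer_object_dtype_sketch(data, "test")  # ValueError: ... mixed native types: int,str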
This looks good!
I ran a quick performance check — I'm not sure how large an impact this is in proportional terms though.
Alternatives — not ideal ones — would be to wait until the main error is raised, or only test a subset of the values.
In [7]: np.asarray(list("abcdefghijklmnopqrstuvwxyz"))
Out[7]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
dtype='<U1')
In [14]: array = np.repeat(x, 5_000_000)  # x: the alphabet array from In [7]
In [15]: %timeit set(np.vectorize(type, otypes=[object])(array.ravel()))
10.2 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I thought of taking a random sample from the array and checking the types on the sample only, but I wasn't confident about how representative such a sample would be, or how to deal with misleading, skewed samples. If anyone has thoughts on this, please let me know.
It took around 40 s for an array of 10**9 elements. That would be around 150 years of daily data (180*360*150*365). I am not sure, though, how much sense it makes to have such a large array with object dtype. Also, an array of this size is likely a dask array, and there is already a performance warning for that (see lines 194 to 197 in 68d3c34). So I'd say go ahead.
@mathause, just to make sure I am not misinterpreting your comment, is this a go-ahead to sample the array to determine the types? :)
Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.)
@mathause, I am noticing a performance hit even for the special use cases. Here's how I am doing the sampling:

sample_indices = np.random.choice(array.size, size=min(20, array.size), replace=False)
native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))

and here's the code snippet I tested this on:

In [1]: import xarray as xr, numpy as np

In [2]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object")

In [3]: array = np.repeat(x, 5_000_000)

In [4]: array.size
Out[4]: 130000000

In [5]: array.dtype
Out[5]: dtype('O')

Without sampling

In [6]: %timeit xr.conventions._infer_dtype(array, "test")
7.63 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With sampling

In [15]: %timeit xr.conventions._infer_dtype(array, "test")
8.31 s ± 395 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I could be wrong, but the sampling doesn't seem to be worth it.
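For completeness, a self-contained version of the sampling variant being benchmarked above (the helper name sample_native_types is hypothetical; this variant is not what ended up in the PR):

import numpy as np

def sample_native_types(array, sample_size=20):
    # Inspect the native types of a small random sample of elements instead of
    # scanning the whole array; a rare type that falls outside the sample is missed.
    flat = array.ravel()
    sample_indices = np.random.choice(flat.size, size=min(sample_size, flat.size), replace=False)
    return set(np.vectorize(type, otypes=[object])(flat[sample_indices]))

data = np.array([["x", 1], ["y", 2]], dtype="object")
print(sample_native_types(data))  # {<class 'str'>, <class 'int'>}: the sample covers all four elements here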
No, I wouldn't subsample. By "normal use case" I meant saving legitimate object arrays; I am not sure how often those occur in the wild.
Unit Test Results: 6 files, 6 suites, 54m 27s ⏱️. Results for commit 2107949.
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Branch force-pushed from a6faf00 to 9b691c2.
I think this is most likely for lots of strings which aren't using a numpy string type. @andersy005, to the extent the perf difference is 7 seconds vs 8 seconds, I would definitely vote to merge. The case I was originally concerned about is something suddenly taking 1000 seconds. If we're confident that's not the case, I would also vote to merge...
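As a small illustration of that point (the arrays below are made-up examples, not from this PR): the element-by-element type scan only applies when strings are boxed in an object array, not when they are stored with a native numpy string dtype.

import numpy as np

native = np.asarray(list("abc"))                 # dtype('<U1'): fixed-width numpy strings
boxed = np.asarray(list("abc"), dtype=object)    # dtype('O'): each element is a Python str

print(native.dtype, boxed.dtype)  # <U1 object
# Only the object-dtype array goes through the per-element type check before serialization.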
* upstream/main:
  Raise an informative error message when object array has mixed types (pydata#4700)
  Start renaming `dims` to `dim` (pydata#8487)
  Reduce redundancy between namedarray and variable tests (pydata#8405)
  Fix Zarr region transpose (pydata#8484)
  Refine rolling_exp error messages (pydata#8485)
  Use numbagg for `ffill` by default (pydata#8389)
  Fix bug for categorical pandas index with categories with EA dtype (pydata#8481)
  Improve "variable not found" error message (pydata#8474)
  Add whatsnew for pydata#8475 (pydata#8478)
  Allow `rank` to run on dask arrays (pydata#8475)
  Fix mypy tests (pydata#8476)
  Use concise date format when plotting (pydata#8449)
  Fix `map_blocks` docs' formatting (pydata#8464)
  Consolidate `_get_alpha` func (pydata#8465)
Passes isort . && black . && mypy . && flake8
User visible changes documented in whats-new.rst
New functions/methods listed in api.rst