to_dict() on a boolean series sometimes returns numpy types instead of Python types #27616

maximz · 2019-07-26T21:55:33Z

Problem description

I construct a Series in several ways that should give the same output from to_dict(), but instead I get different output types. In my case, this breaks downstream JSON serializers.

The code sample below includes cases with correct output (bool) and incorrect (numpy.bool_) -- see inline comments.

Related issues, though none seem exactly the same: #13258, #13830, #16048, #17491, #19381, #20791, #23753, #23921, #24908, #25969

Code sample

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({ 'a': [True, False], 'b': [0, 1]} )

In [3]: df
Out[3]:
       a  b
0   True  0
1  False  1

In [27]: type(df['a'].iloc[0])
Out[27]: numpy.bool_

In [48]: type(df[['a']].iloc[0, 0])
Out[48]: numpy.bool_

In [33]: type(df.iloc[0,0])
Out[33]: numpy.bool_

In [24]: type(df.iloc[0]['a'])
Out[24]: numpy.bool_

# ----

In [4]: df[['a']].iloc[0].to_dict()
Out[4]: {'a': True}

# correct
In [5]: type(df[['a']].iloc[0].to_dict()['a'])
Out[5]: bool

In [6]: df.iloc[0][['a']].to_dict()
Out[6]: {'a': True}

# this one is incorrect, should return bool
In [7]: type(df.iloc[0][['a']].to_dict()['a'])
Out[7]: numpy.bool_

# ----

In [8]: df[['a', 'b']].to_dict(orient='records')[0]
Out[8]: {'a': True, 'b': 0}

# correct
In [9]: type(df[['a', 'b']].to_dict(orient='records')[0]['a'])
Out[9]: bool

In [10]: df[['a', 'b']].iloc[0].to_dict()
Out[10]: {'a': True, 'b': 0}

# this one is incorrect, should return bool
In [11]: type(df[['a', 'b']].iloc[0].to_dict()['a'])
Out[11]: numpy.bool_

This may explain what's going on:

In [54]: df.iloc[0][['a']]
Out[54]:
a    True
Name: 0, dtype: object


In [56]: df[['a']].iloc[0]
Out[56]:
a    True
Name: 0, dtype: bool

That relates to #25969, where @mroeschke commented about a similar dtype discrepancy:

"This probably occurs because s2 is object dtype and it's trying to preserve the dtype of each input argument while the arguments in s1 can both be coerced to int64."
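For illustration, here is a minimal sketch of the dtype discrepancy shown above, using the same DataFrame as the code sample. The variable names `col_then_row` and `row_then_col` are made up for this example. It only compares dtypes and does not assert what `to_dict()` returns, since that is the behavior under discussion.

```python
import pandas as pd

df = pd.DataFrame({"a": [True, False], "b": [0, 1]})

# Select the column first: the frame holds only the bool column,
# so the row taken from it keeps dtype bool.
col_then_row = df[["a"]].iloc[0]

# Select the row first: the row spans a bool column and an int64 column,
# so it is stored as an object-dtype Series before 'a' is picked out.
row_then_col = df.iloc[0][["a"]]

print(col_then_row.dtype)
print(row_then_col.dtype)
```

Both Series hold the same value under the same label, yet they differ in dtype, which matches the Out[54] and Out[56] output above.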

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.0
numpy            : 1.16.4
pytz             : 2019.1
dateutil         : 2.8.0
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.6.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : 2.6.9
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

WillAyd · 2019-07-27T00:15:39Z

IIUC you have a mixed dtype column and the inferencing is ultimately winding up in different spots based on how you select things (one stays an object and the other a bool), right?

maximz · 2019-07-27T00:23:02Z

Yeah that seems like the right explanation. It seems the inferencing should ideally end up the same, unlike today. But even if it doesn't, shouldn't to_dict be expected to return a python type no matter what? I'm speculating on the expected behavior here.

jreback · 2019-07-27T00:52:05Z

There are likely some embedded bugs here. You are welcome to have a look.

mroeschke · 2019-07-27T04:13:20Z

Not sure if this is the expected behavior, but it might be an underlying bug/undocumented feature in numpy:

In [2]: import numpy as np

In [3]: type(np.array([np.bool_(True)], dtype=object).item())
Out[3]: numpy.bool_

In [4]: type(np.array([np.int64(1)], dtype=object).item())
Out[4]: numpy.int64

mroeschke · 2019-07-27T04:22:15Z

xref numpy/numpy#14139

mroeschke · 2019-07-28T20:30:43Z

Based on the response in numpy/numpy#14139, it looks like Python scalars are not guaranteed when calling to_dict on object columns.
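Since Python scalars are not guaranteed here, one possible workaround on the serializer side is a `default` hook for the standard `json` module. This is only a sketch: `numpy_default` is a hypothetical helper name, not part of pandas or numpy, and the `record` dict is hand-built to stand in for a `to_dict()` result that contains numpy scalars.

```python
import json

import numpy as np


def numpy_default(obj):
    """Hypothetical json 'default' hook: unwrap numpy scalars via item()."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable"
    )


# Hand-built stand-in for a to_dict() result holding numpy scalars.
record = {"a": np.bool_(True), "b": np.int64(0)}

# json.dumps calls numpy_default only for values it cannot encode itself.
encoded = json.dumps(record, default=numpy_default)
print(encoded)
```

This leaves the `to_dict()` output untouched and converts only at serialization time, so it does not address the question of what `to_dict()` itself should return.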

NLaws · 2019-08-05T21:38:29Z

I am having this problem, except with integers (not a boolean series). After df = pd.read_csv(path_to_csv) followed by to_dict(), I get a type of <class 'numpy.int64'> for integer values in the original CSV.

maximz · 2020-02-11T17:14:47Z

Sorry for my delay here. I followed up on numpy/numpy#14139. The numpy folks suggest the following for reliably converting to vanilla Python types:

"you could call item() recursively - but your recursion would be unbounded for np.longdouble, for which item returns self."

Do you believe pandas should account for this edge case and call item() recursively? I think the use case of passing to_dict() outputs to serializers is fairly common.
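To make the question concrete, here is a rough sketch of what a bounded version of that suggestion could look like. `to_native` is a hypothetical helper, not an existing pandas or numpy function. It assumes that stopping as soon as `item()` returns a value of the same type is an adequate guard for the np.longdouble case mentioned in the quote.

```python
import numpy as np


def to_native(value):
    """Hypothetical helper: call item() until no numpy scalar is left."""
    while isinstance(value, np.generic):
        unwrapped = value.item()
        if type(unwrapped) is type(value):
            # item() made no progress (the np.longdouble case from the
            # quote), so stop here instead of looping forever.
            break
        value = unwrapped
    return value


# The element of an object array is the case from the comments above.
element = np.array([np.bool_(True)], dtype=object).item()
print(type(to_native(element)))
print(type(to_native(np.int64(1))))
```

Values that are not numpy scalars pass through unchanged, so the helper could be applied to every value of a `to_dict()` result, for example with a dict comprehension.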

mroeschke added the "Dtype Conversions" label (Unexpected or buggy dtype conversions) on Nov 2, 2019

maximz mentioned this issue on Feb 11, 2020: DOC: np.array(<numpy data type>, dtype=object).item() doesn't return python scalar numpy/numpy#14139 (Closed)

dmlogv mentioned this issue on Jun 9, 2020: BUG: DataFrame.to_dict() converts Nullable Int types to numpy.int #34665 (Closed)

jreback mentioned this issue on Oct 24, 2020: ENH/BUG: implement __iter__ for IntegerArray so conversions (to_dict, tolist, etc.) return python native types #37377 (Closed)

arw2019 mentioned this issue on Nov 3, 2020: BUG: to_dict should return a native datetime object for NumPy backed dataframes #37571 (Merged)

mbiokyle29 mentioned this issue on Jan 21, 2021: BUG: Series.to_json returning a native python float from a column of type int64 #39324 (Closed)

mroeschke added the Bug label on Jul 10, 2021
