to_dict() on a boolean series sometimes returns numpy types instead of Python types #27616

Open
maximz opened this issue Jul 26, 2019 · 8 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@maximz

maximz commented Jul 26, 2019

Problem description

I construct a Series in several ways that should give the same output from to_dict(), but instead I get different output types. In my case, this breaks downstream JSON serializers.

The code sample below includes cases with correct output (bool) and incorrect (numpy.bool_) -- see inline comments.

Related issues, though none seem exactly the same: #13258, #13830, #16048, #17491, #19381, #20791, #23753, #23921, #24908, #25969
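To illustrate the serializer breakage mentioned above, here is a minimal sketch using the standard-library json module. The helper name `numpy_default` is hypothetical (it is not part of pandas or NumPy), and the record is built by hand rather than taken from to_dict() so that the result does not depend on the pandas version.

```python
import json

import numpy as np

# A record shaped like the problematic to_dict() output.
record = {"a": np.bool_(True), "b": 0}

# json.dumps has no encoder for numpy.bool_, so it raises TypeError.
try:
    json.dumps(record)
    failed = False
except TypeError:
    failed = True


def numpy_default(obj):
    """Hypothetical fallback hook: convert a NumPy scalar with one .item() call."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# With the hook, the NumPy scalar is converted before encoding.
encoded = json.dumps(record, default=numpy_default)
print(encoded)
```

The hook only runs for objects the encoder cannot handle on its own, so plain Python values are untouched.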

Code sample

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({ 'a': [True, False], 'b': [0, 1]} )

In [3]: df
Out[3]:
       a  b
0   True  0
1  False  1

In [27]: type(df['a'].iloc[0])
Out[27]: numpy.bool_

In [48]: type(df[['a']].iloc[0, 0])
Out[48]: numpy.bool_

In [33]: type(df.iloc[0,0])
Out[33]: numpy.bool_

In [24]: type(df.iloc[0]['a'])
Out[24]: numpy.bool_

# ----

In [4]: df[['a']].iloc[0].to_dict()
Out[4]: {'a': True}

# correct
In [5]: type(df[['a']].iloc[0].to_dict()['a'])
Out[5]: bool

In [6]: df.iloc[0][['a']].to_dict()
Out[6]: {'a': True}

# this one is incorrect, should return bool
In [7]: type(df.iloc[0][['a']].to_dict()['a'])
Out[7]: numpy.bool_

# ----

In [8]: df[['a', 'b']].to_dict(orient='records')[0]
Out[8]: {'a': True, 'b': 0}

# correct
In [9]: type(df[['a', 'b']].to_dict(orient='records')[0]['a'])
Out[9]: bool

In [10]: df[['a', 'b']].iloc[0].to_dict()
Out[10]: {'a': True, 'b': 0}

# this one is incorrect, should return bool
In [11]: type(df[['a', 'b']].iloc[0].to_dict()['a'])
Out[11]: numpy.bool_

This may explain what's going on:

In [54]: df.iloc[0][['a']]
Out[54]:
a    True
Name: 0, dtype: object


In [56]: df[['a']].iloc[0]
Out[56]:
a    True
Name: 0, dtype: bool

That relates to #25969, where @mroeschke commented about a similar dtype discrepancy:

This probably occurs because s2 is object dtype and it's trying to preserve the dtype of each input argument while the arguments in s1 can both be coerced to int64.
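The dtype difference shown in Out[54] and Out[56] can be reproduced in isolation. This is a small sketch of the two selection orders only; it deliberately does not assert anything about what to_dict() returns, since that may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [True, False], "b": [0, 1]})

# Row first: the row spans a bool column and an int64 column,
# so the resulting Series falls back to object dtype.
row_then_col = df.iloc[0][["a"]]

# Column first: only the bool column is involved, so bool dtype survives.
col_then_row = df[["a"]].iloc[0]

print(row_then_col.dtype)
print(col_then_row.dtype)
```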

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.0
numpy            : 1.16.4
pytz             : 2019.1
dateutil         : 2.8.0
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.6.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : 2.6.9
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None
@WillAyd
Member

WillAyd commented Jul 27, 2019

IIUC, you have mixed dtypes across the row, and dtype inference ends up in different places depending on how you select things (one result stays object and the other becomes bool), right?

@maximz
Author

maximz commented Jul 27, 2019

Yeah, that seems like the right explanation. Ideally the inference would end up the same in both cases, which it does not today. But even if it doesn't, shouldn't to_dict() be expected to return a Python type no matter what? I'm speculating on the expected behavior here.

@jreback
Contributor

jreback commented Jul 27, 2019

There are likely some embedded bugs here. You're welcome to have a look.

@mroeschke
Member

Not sure if this is the expected behavior, but it might be an underlying bug/undocumented feature in numpy:

In [2]: import numpy as np

In [3]: type(np.array([np.bool_(True)], dtype=object).item())
Out[3]: numpy.bool_

In [4]: type(np.array([np.int64(1)], dtype=object).item())
Out[4]: numpy.int64
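For contrast, a short sketch comparing a natively typed array with an object array. It assumes the behavior shown in the session above still holds in the installed NumPy: .item() converts elements of a bool array to Python bool, while an object array hands back whatever object was stored.

```python
import numpy as np

typed = np.array([True], dtype=bool)
boxed = np.array([np.bool_(True)], dtype=object)

# Native dtype: the element is converted to a Python scalar.
typed_item = typed.item()

# Object dtype: the stored object is returned unchanged.
boxed_item = boxed.item()

print(type(typed_item))
print(type(boxed_item))
```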

@mroeschke
Member

xref numpy/numpy#14139

@mroeschke
Member

Based on the response in numpy/numpy#14139, it looks like Python scalars are not guaranteed when calling to_dict on object columns.

@NLaws

NLaws commented Aug 5, 2019

I am having this problem, except with integers (not a boolean series). After loading with df = pd.read_csv(path_to_csv) and then calling to_dict(), I get a type of <class 'numpy.int64'> for integer values from the original CSV.
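One possible workaround for the CSV case, sketched under the assumption that a JSON round trip is acceptable: serialize with DataFrame.to_json and parse the result back with the json module, which yields only Python types. The CSV content here is made up for illustration.

```python
import io
import json

import pandas as pd

# Made-up CSV content standing in for path_to_csv.
csv_text = "id,flag\n1,True\n2,False\n"
df = pd.read_csv(io.StringIO(csv_text))

# to_json emits plain JSON, so parsing it back gives Python int and bool values.
records = json.loads(df.to_json(orient="records"))
print(records)
```

The trade-off is that the JSON layer re-encodes values such as dates and missing data, so this only suits cases where the output was headed for a JSON serializer anyway.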

@maximz
Author

maximz commented Feb 11, 2020

Sorry for my delay here. I followed up on numpy/numpy#14139. The numpy folks suggested the following for reliably converting to vanilla Python types:

you could call item() recursively - but your recursion would be unbounded for np.longdouble, for which item returns self.

Do you believe pandas should account for this edge case and call item() recursively? I think the use case of passing to_dict() outputs to serializers is fairly common.
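A user-side sketch of the idea, not a proposal for how pandas should implement it: call item() exactly once per value instead of recursing, which sidesteps the unbounded-recursion concern for np.longdouble at the cost of possibly leaving such a value as a NumPy scalar. The helper names `to_python_scalar` and `to_python_dict` are hypothetical.

```python
import numpy as np


def to_python_scalar(value):
    """Convert a NumPy scalar with a single .item() call; pass other values through."""
    if isinstance(value, np.generic):
        # One call only: no recursion, so a type whose item() returns
        # a NumPy scalar again cannot cause an infinite loop.
        return value.item()
    return value


def to_python_dict(mapping):
    """Apply to_python_scalar to every value of a flat dict."""
    return {key: to_python_scalar(value) for key, value in mapping.items()}


# Hand-built stand-in for a to_dict() result that contains NumPy scalars.
raw = {"a": np.bool_(True), "b": np.int64(1), "c": "text"}
clean = to_python_dict(raw)
print({key: type(value).__name__ for key, value in clean.items()})
```

Usage would look like `to_python_dict(df.iloc[0].to_dict())`; values that are already Python types come back unchanged.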
