BUG: DataFrame.to_dict() converts Nullable Int types to numpy.int #34665

Closed
dmlogv opened this issue Jun 9, 2020 · 4 comments · Fixed by #50874
Labels: Bug, NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments

@dmlogv

dmlogv commented Jun 9, 2020

Problem description

The DataFrame.to_dict() method does not cast nullable integer types (Int*Dtype) to the Python int type. Instead, it unwraps them into numpy.int* types.

Possibly related to: #27616, #25969, #21256

Expected Output

Native Python int type.

Reproduction

Make some data:

import pandas as pd

df = pd.DataFrame({'id': range(5),
                   'coeff': [i * 0.1 for i in range(5)],
                   'is_hot': [True] * 2 + [False] * 3,
                   'value': [1, None, 2, 3, None]})
df
   id  coeff  is_hot  value
0   0    0.0    True    1.0
1   1    0.1    True    NaN
2   2    0.2   False    2.0
3   3    0.3   False    3.0
4   4    0.4   False    NaN
df.dtypes

id          int64
coeff     float64
is_hot       bool
value     float64
dtype: object

The value column has to be a nullable int:

df['value'] = df['value'].astype(pd.Int64Dtype())
df.dtypes

id          int64
coeff     float64
is_hot       bool
value       Int64
dtype: object

Looks great. But now convert the DataFrame to a dict:

dicts = df.to_dict(orient='records')
dicts

[{'id': 0, 'coeff': 0.0, 'is_hot': True, 'value': 1},
 {'id': 1, 'coeff': 0.1, 'is_hot': True, 'value': nan},
 {'id': 2, 'coeff': 0.2, 'is_hot': False, 'value': 2},
 {'id': 3, 'coeff': 0.30000000000000004, 'is_hot': False, 'value': 3},
 {'id': 4, 'coeff': 0.4, 'is_hot': False, 'value': nan}]
pd.DataFrame(
    [[type(v) for k, v in row.items()] for row in dicts], 
    columns=dicts[0].keys())
              id            coeff          is_hot                  value
0  <class 'int'>  <class 'float'>  <class 'bool'>  <class 'numpy.int64'>
1  <class 'int'>  <class 'float'>  <class 'bool'>        <class 'float'>
2  <class 'int'>  <class 'float'>  <class 'bool'>  <class 'numpy.int64'>
3  <class 'int'>  <class 'float'>  <class 'bool'>  <class 'numpy.int64'>
4  <class 'int'>  <class 'float'>  <class 'bool'>        <class 'float'>

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.7.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.4.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : en_US.UTF-8

pandas           : 1.0.4
numpy            : 1.18.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.15.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
dmlogv added the Bug and Needs Triage labels Jun 9, 2020
@jorisvandenbossche
Member

@dm-logv thanks for the report. This is probably related to #29738

jorisvandenbossche added the NA - MaskedArrays label and removed the Needs Triage label Jun 9, 2020
jreback added this to the 1.2 milestone Oct 24, 2020
@arw2019
Member

arw2019 commented Oct 24, 2020

Minimal reproducer:

In [3]: import pandas as pd 
   ...:  
   ...: df = pd.DataFrame({'A': [1, None, 2, 3, None]}) 
   ...: df['A'] = df['A'].astype('Int64') 
   ...: dicts = df.to_dict(orient="records") 
   ...: pd.DataFrame( 
   ...:     [[type(v) for k, v in row.items()] for row in dicts],  
   ...:     columns=dicts[0].keys())                                                                                                                                                                               
Out[3]: 
                                       A
0                  <class 'numpy.int64'>
1  <class 'pandas._libs.missing.NAType'>
2                  <class 'numpy.int64'>
3                  <class 'numpy.int64'>
4  <class 'pandas._libs.missing.NAType'>

@arw2019
Member

arw2019 commented Oct 24, 2020

Even smaller:

In [2]: import pandas as pd 
   ...:  
   ...: df = pd.DataFrame({'A': [1, None]}) 
   ...: df['A'] = df['A'].astype('Int64') 
   ...: records_as_dicts = df.to_dict(orient="records") 
   ...: pd.DataFrame([[type(v) for v in row.values()] for row in records_as_dicts], columns=records_as_dicts[0].keys())                                                                                            
Out[2]: 
                                       A
0                  <class 'numpy.int64'>
1  <class 'pandas._libs.missing.NAType'>
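
On pandas versions that still show the behaviour in the reproducers above, one possible workaround is to post-process the records and unwrap NumPy scalars with their `.item()` method. The following is only a sketch, not the fix that was merged for this issue; `to_native` is a hypothetical helper name, and missing values (`pd.NA` or `nan`) are deliberately left untouched.

```python
import numpy as np
import pandas as pd


def to_native(value):
    """Hypothetical helper: unwrap NumPy scalars into built-in Python scalars.

    Anything that is not a NumPy scalar (including pd.NA) is returned unchanged.
    """
    if isinstance(value, np.generic):
        return value.item()
    return value


df = pd.DataFrame({"A": [1, None, 2]})
df["A"] = df["A"].astype("Int64")

# Convert every value of every record after to_dict() has run.
records = [
    {key: to_native(val) for key, val in row.items()}
    for row in df.to_dict(orient="records")
]

# Inspect the resulting types, as in the reproducers above.
types = [[type(v) for v in row.values()] for row in records]
```

On versions where `to_dict()` already returns built-in `int` values, the helper is a no-op, so the sketch should be safe to leave in place.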

jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@VikingPathak

I am using pd.read_sql(), which returned a DataFrame; calling .to_dict() on it gave me a dictionary in which the value True had type numpy.bool_.

I could not find a better approach, so I used to_json() instead of to_dict() and wrapped the result in json.loads(). All the numpy types were converted to Python objects.

json.loads(dataframe.iloc[0, :].to_json())
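
The JSON round trip described above can be sketched as follows, for one row and for the whole frame. The sample frame and its column names (`flag`, `count`) are made up for illustration. Everything goes through JSON, so only JSON-representable types survive (datetimes, for example, are serialized rather than kept as datetime objects), and the approach is likely slower than `to_dict()` on large frames.

```python
import json

import pandas as pd

# Illustrative data standing in for the result of pd.read_sql().
df = pd.DataFrame({"flag": [True, False], "count": [1, 2]})

# Single row, as in the comment above: Series -> JSON string -> dict.
row = json.loads(df.iloc[0, :].to_json())

# Whole frame: one dict per row, the analogue of to_dict(orient="records").
records = json.loads(df.to_json(orient="records"))

# Values are now built-in Python types rather than NumPy scalars.
row_types = {key: type(val) for key, val in row.items()}
```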
