Inconsistent types in output of series.to_dict() and DataFrame([series]).loc[0].to_dict() #13830

Closed
mikepqr opened this issue Jul 28, 2016 · 3 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Usage Question

Comments

@mikepqr

mikepqr commented Jul 28, 2016

to_dict() returns the elements of a Series as different types depending on whether the Series was created directly or obtained from a DataFrame, e.g. via loc[0]:

>>> s = pd.Series({'a': None, 'b': 99, 'c': 'hello'})
>>> df = pd.DataFrame([s])
>>> [type(v) for k, v in s.to_dict().items()]
[NoneType, str, int]
>>> [type(v) for k, v in df.loc[0].to_dict().items()]
[NoneType, str, numpy.int64]

Note that the number is a base int when extracted with s.to_dict(), but it's a numpy.int64 when extracted from df.loc[0]. The same inconsistency applies to tolist().

Is this inconsistency a feature or a bug? And if it's a feature, does anyone know how I can reliably extract the values of a DataFrame row as base Python types, using either to_dict() or tolist()?
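One possible way to get base Python types regardless of how the row was obtained is to unwrap NumPy scalars explicitly. This is only a sketch: to_native and row_to_native_dict are hypothetical helper names, not part of pandas, and the approach assumes every NumPy scalar in the row has a sensible Python equivalent via its item() method.

```python
import numpy as np
import pandas as pd


def to_native(value):
    """Return the Python equivalent of a NumPy scalar; pass anything else through."""
    if isinstance(value, np.generic):
        return value.item()
    return value


def row_to_native_dict(row):
    """Build a dict of plain Python values from a Series (hypothetical helper)."""
    return {key: to_native(value) for key, value in row.items()}


s = pd.Series({'a': None, 'b': 99, 'c': 'hello'})
df = pd.DataFrame([s])

# The row comes from the DataFrame, so 'b' may be stored as a NumPy integer;
# the helper unwraps it to a built-in int.
native = row_to_native_dict(df.loc[0])
```

The same to_native function could be mapped over the result of tolist() if a list is wanted instead of a dict.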

output of pd.show_versions()

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 12.2
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@mikepqr mikepqr changed the title Inconsistent types in output of series.to_dict() and df.loc[0].to_dict() Inconsistent types in output of series.to_dict() and DataFrame([s]).loc[0].to_dict() Jul 28, 2016
@mikepqr mikepqr changed the title Inconsistent types in output of series.to_dict() and DataFrame([s]).loc[0].to_dict() Inconsistent types in output of series.to_dict() and DataFrame([series]).loc[0].to_dict() Jul 28, 2016
@TomAugspurger
Contributor

Two things going on here.

  1. The int vs. int64 difference is already present after constructing the DataFrame; it has nothing to do with to_dict.
In [21]: type(s.iloc[1])
Out[21]: int

In [22]: type(df.iloc[0, 1])
Out[22]: numpy.int64

xref: #9108

@mikepqr
Author

mikepqr commented Jul 28, 2016

My use case was the same as #9108: I wanted to assemble a somewhat complicated JSON object that contains things that aren't only in the DataFrame. pd.io.json.dumps on the dictionary works fine. So my problem is solved. Thanks!
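For anyone with the same use case, another route that should sidestep the NumPy scalars is to round-trip the row through Series.to_json() and the standard json module. This is a rough sketch rather than the approach used above; the payload layout and the 'source' field are made up for illustration.

```python
import json

import pandas as pd

s = pd.Series({'a': None, 'b': 99, 'c': 'hello'})
df = pd.DataFrame([s])

# Series.to_json() serializes the row itself, and json.loads() then
# yields only built-in Python types (None, int, str, ...).
row = json.loads(df.loc[0].to_json())

# Combine the row with extra fields that do not live in the DataFrame.
payload = {'source': 'example', 'row': row}
text = json.dumps(payload)
```

The round trip costs an extra serialization pass, which is probably fine for a single row but worth measuring for large frames.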

Just out of interest then: is the inconsistency in the types being stored in s.iloc[1] and df.iloc[0, 1] correct behaviour?

@TomAugspurger
Contributor

TomAugspurger commented Jul 28, 2016

Great to hear.

Just out of interest then: is the inconsistency in the types being stored in s.iloc[1] and df.iloc[0, 1] correct behaviour?

Yeah, I think so. A Series has to have a single dtype, which must be object in this case since you have mixed types (not a good idea in general). That means we can't optimize to a NumPy dtype, so the Python int is stored as-is. When you go to a DataFrame, each column can have its own dtype, which will use NumPy if possible. A good comparison is pd.Series([1, 2]), which does use NumPy ints even though you pass in Python ints.
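The dtype difference described here can be inspected directly. The snippet below is a small sketch reusing the Series from the original report; the exact dtype names printed may vary with the pandas version, so no output is shown.

```python
import numpy as np
import pandas as pd

s = pd.Series({'a': None, 'b': 99, 'c': 'hello'})
df = pd.DataFrame([s])

print(s.dtype)                   # a single dtype for the whole mixed Series
print(df.dtypes)                 # one dtype per column
print(pd.Series([1, 2]).dtype)   # homogeneous ints get a NumPy integer dtype

# The same value, read from the Series and from the DataFrame column.
scalar_from_series = s.iloc[1]
scalar_from_frame = df.iloc[0, 1]
print(type(scalar_from_series), type(scalar_from_frame))
```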
