unique(), values discards timezone information in some cases #14548

Closed
aavanian opened this issue Nov 1, 2016 · 5 comments
Labels
Timezones Timezone data dtype Usage Question


aavanian commented Nov 1, 2016

I'm not sure whether this is expected behavior, but it seems strange to me.

A small, complete example of the issue

import pandas as pd

rng = pd.date_range('1/1/2011', periods=5, freq='D', tz='utc')
rng  # Timezone is good

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05'],
dtype='datetime64[ns, UTC]', freq='D')

rng.unique()  # Timezone is still good

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05'],
dtype='datetime64[ns, UTC]', freq='D')

rng.values  # Timezone lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

serie = rng.to_series().reset_index()['index']
serie  # Timezone is good

0 2011-01-01 00:00:00+00:00
1 2011-01-02 00:00:00+00:00
2 2011-01-03 00:00:00+00:00
3 2011-01-04 00:00:00+00:00
4 2011-01-05 00:00:00+00:00
Name: index, dtype: datetime64[ns, UTC]

serie.unique()  # Timezone is lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

serie.values  # Timezone is lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

Expected Output

The outputs of pd.DatetimeIndex.values, pd.Series.unique() and pd.Series.values should preserve the timezone information.

Note: I did a quick test with Python 3.4.5 and pandas 0.19.
The only difference is the following (to be clear, rng.values and serie.values still discard the tzinfo):

serie.unique()  # Timezone is good again!! but dtype is object (preserving pd.Timestamp)

array([Timestamp('2011-01-01 00:00:00+0000', tz='UTC', freq='D'),
Timestamp('2011-01-02 00:00:00+0000', tz='UTC', freq='D'),
Timestamp('2011-01-03 00:00:00+0000', tz='UTC', freq='D'),
Timestamp('2011-01-04 00:00:00+0000', tz='UTC', freq='D'),
Timestamp('2011-01-05 00:00:00+0000', tz='UTC', freq='D')], dtype=object)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

Contributor

jreback commented Nov 1, 2016

discussion here: #13395

To summarize:

For a Series, .values returns a numpy array; datetime64 values are converted to naive because numpy doesn't support anything else. .unique() gives you back an object array if the tz is set.

A DTI (DatetimeIndex) returns another DTI from .unique(), and its .values behaves the same as for a Series.

So .unique() IS different for a DTI and a Series.

It has been this way for quite some time, and it's not worth changing Series.unique to return a '1-D' object, which you would think would be a DTI. Note that a DTI is the only thing that would make sense there, as the index is pretty meaningless. Apparently a DTI is unexpected for numpy users.

See my comments in the xref issue.
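
The distinction summarized above can be sketched as follows. This is a minimal illustration, not part of the original discussion: it assumes a reasonably recent pandas, and wrapping the Series in pd.DatetimeIndex before calling unique() is just one possible way to keep the timezone, since exact return types of Series.unique() have varied between pandas versions.

```python
import pandas as pd

rng = pd.date_range("2011-01-01", periods=5, freq="D", tz="UTC")
serie = rng.to_series().reset_index(drop=True)

# .values hands back a plain NumPy datetime64 array, which cannot store a timezone.
raw = serie.values
print(raw.dtype)

# Assumption for illustration: going through a DatetimeIndex first,
# so that unique() returns another tz-aware DatetimeIndex.
unique_idx = pd.DatetimeIndex(serie).unique()
print(unique_idx.tz)
```

Here raw is timezone-naive, while unique_idx still reports UTC.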

@jreback jreback closed this as completed Nov 1, 2016
@jreback jreback added API Design Timezones Timezone data dtype labels Nov 1, 2016
@jreback jreback added this to the No action milestone Nov 1, 2016
@jorisvandenbossche
Member

Maybe we should add an explanation about this somewhere in the docs as a FAQ or gotcha, which we can refer to when this comes up.

Contributor

jreback commented Nov 1, 2016

We could update the docstrings with more cross-references.

To be honest, use of .values should be discouraged.

Author

aavanian commented Nov 2, 2016

Thanks for the clear answer, and sorry for not catching these other issues (I somehow forgot to remove is:open while searching for similar past issues).

As for .values, I can't say I have found a clear, consistent pattern for getting the values out of a Series/DataFrame to pass to non-pandas functions/objects.
I often resort to .values or .values.tolist(), which I assumed was probably ugly but found the most predictable: I can use it everywhere without checking too much what comes in, and I'm sure to get float-like for float-like and 'us since epoch' for datetime-like. The issue, of course, is the latter.
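
For the datetime case described above, two ways of handing values to non-pandas code can be sketched as below. This is a hedged illustration rather than a recommended pattern from the thread: it assumes a recent pandas, and the explicit cast to datetime64[ns] is an added step so that the integers are unambiguously nanoseconds since the epoch.

```python
import pandas as pd

serie = pd.Series(pd.date_range("2011-01-01", periods=3, freq="D", tz="UTC"))

# Series.tolist() yields pandas Timestamp objects, which keep their tzinfo.
aware = serie.tolist()

# Explicit epoch integers: normalise the unit to nanoseconds first,
# then reinterpret as int64 (the values are in UTC, with no timezone attached).
epoch_ns = serie.values.astype("datetime64[ns]").astype("int64").tolist()

print(aware[0].tzinfo)
print(epoch_ns[0])
```

The first form keeps the timezone at the cost of returning Python-level objects; the second is a flat list of integers whose meaning has to be documented by the caller.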

@jorisvandenbossche
Member

@aavanian many functions that expect a numpy array can also handle a Series, but if you need the values, then .values is the recommended way to get them.
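
As a small sketch of that point, using a float Series chosen purely for illustration: NumPy functions often accept a Series directly, and np.asarray is one way to get a plain array when one is required.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# NumPy ufuncs accept a Series directly and return a Series.
roots = np.sqrt(s)

# When a plain ndarray is required, np.asarray converts the Series explicitly.
arr = np.asarray(s)

print(type(roots).__name__, type(arr).__name__)
```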
