Specific Timestamps breaks time series indexing (.loc returns wrong results) #18029

Closed
linar-jether opened this issue Oct 30, 2017 · 10 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite Timezones Timezone data dtype

Comments

@linar-jether

linar-jether commented Oct 30, 2017

When I try to access labels with .loc using a list of Timestamp objects or a DatetimeIndex (see attached csv file), the resulting index is shifted to the UTC values but still carries the original timezone offset.

This seems to happen only in very specific cases, when the index passed to .loc contains labels that do not exist in the DataFrame and also contains duplicates.

@yuval-jether

import numpy as np
import pandas as pd
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')
s.loc[i]

Out[39]: 
2017-09-30 06:00:00-05:00    0.380138
2017-09-30 23:00:00-05:00    0.774696
2017-09-30 23:00:00-05:00    0.774696
2017-10-01 00:00:00-05:00    0.728027
dtype: float64

# Added a label that does not exist in the Series
import numpy as np
import pandas as pd
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')
s.loc[i]

Out[40]: 
2017-09-30 11:00:00-05:00    0.645350
2017-10-01 04:00:00-05:00    0.099323
2017-10-01 04:00:00-05:00    0.099323
2017-10-01 05:00:00-05:00    0.037136
2017-10-01 06:00:00-05:00         NaN
dtype: float64

Output of pd.show_versions()

pd.show_versions()
2017-10-30 03:15:08 [pip.utils] [DEBUG] lzma module is not available
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: git
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: hg
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_DEBUG' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'WITH_PYMALLOC' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_UNICODE_SIZE' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_DEBUG' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'WITH_PYMALLOC' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_UNICODE_SIZE' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: svn
2017-10-30 03:15:09 [pip.vcs] [DEBUG] Registered VCS backend: bzr
INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.13.1
scipy: 0.18.1
xarray: 0.9.6
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.8.0
bs4: 4.5.1
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0

@linar-jether
Author

linar-jether commented Oct 30, 2017

[deleted - see top post]

@jorisvandenbossche
Member

@linar-jether Thanks for the report!

First note: in the just-released 0.21.0, using loc with some non-existent labels raises a warning, and this functionality will be removed in a future version.
If you want that behaviour, the warning points to reindex as an alternative:

In [17]: idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
    ...: s = pd.Series(np.random.rand(len(idx)), index=idx)
    ...: i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')

In [18]: i2 = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')

In [19]: s.loc[i]
Out[19]: 
2017-09-30 06:00:00-05:00    0.037572
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
dtype: float64

In [20]: s.loc[i2]
/home/joris/miniconda3/envs/dev/bin/ipython:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[20]: 
2017-09-30 11:00:00-05:00    0.037572     # <------- still buggy (11:00 instead of 06:00)
2017-10-01 04:00:00-05:00    0.771407
2017-10-01 04:00:00-05:00    0.771407
2017-10-01 05:00:00-05:00    0.778859
2017-10-01 06:00:00-05:00         NaN
dtype: float64

In [22]: s.reindex(i)
Out[22]: 
2017-09-30 06:00:00-05:00    0.037572
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
dtype: float64

In [23]: s.reindex(i2)
Out[23]: 
2017-09-30 06:00:00-05:00    0.037572    # <------ now correct 06:00
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
2017-10-01 01:00:00-05:00         NaN
dtype: float64

So the bug in loc is still present, but the recommended alternative reindex already works correctly.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Timezones Timezone data dtype labels Oct 30, 2017
@gfyoung
Member

gfyoung commented Oct 30, 2017

@linar-jether : BTW, if you could by any chance present a smaller example to demonstrate the bug, that would be easier for us to read in the future.

@jorisvandenbossche
Member

@gfyoung the second post (not the top one) already includes a smaller example (it is the one I used in my response). But it would be good to update the top post with that for clarity.

@jorisvandenbossche
Member

So given this is deprecated behaviour, I am not sure we should put time into trying to fix this. But, I am trying to think of similar cases where this bug could come up that are not deprecated?

@linar-jether
Author

Thanks for the response @jorisvandenbossche,

this bug can easily be overlooked and lead to false results, so maybe even a small patch to raise an exception when using loc with both duplicates and non-existing labels?

  • I've updated the top post with the more concise example
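Until something like that exists in pandas itself, a user-side guard along these lines could catch the risky combination. This is only a minimal sketch: `checked_loc` is a hypothetical helper name, not a pandas API, and it assumes a Series with a unique, timezone-aware index like the one in the top post.

```python
import numpy as np
import pandas as pd


def checked_loc(series, labels):
    """Hypothetical helper: refuse lookups that mix duplicate and missing labels."""
    labels = pd.Index(labels)
    # Boolean mask of requested labels that are absent from the Series index.
    missing = ~labels.isin(series.index)
    if labels.has_duplicates and missing.any():
        raise KeyError(
            "labels contain duplicates and missing entries: %s"
            % list(labels[missing])
        )
    return series.loc[labels]


tz = "America/Chicago"
idx = pd.date_range("2017-09-30", "2017-10-01", freq="h", tz=tz)
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Duplicated labels that all exist in the index: the lookup goes through.
ok = pd.to_datetime(
    ["2017-09-30 06:00", "2017-09-30 23:00", "2017-09-30 23:00"]
).tz_localize(tz)
result = checked_loc(s, ok)
```

Raising `KeyError` here is a design choice made for the sketch; it mirrors what the FutureWarning says .loc itself will do for missing labels.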

@linar-jether
Author

And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe.
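To illustrate the difference, here is a small sketch with made-up data (the values and labels are assumptions for the example, not taken from the attached file). It builds a Series whose own index has a duplicated timestamp and compares the two calls.

```python
import pandas as pd

tz = "America/Chicago"

# Source Series whose index contains a duplicated timestamp.
dup_idx = pd.to_datetime(
    ["2017-09-30 06:00", "2017-09-30 06:00", "2017-09-30 23:00"]
).tz_localize(tz)
s_dup = pd.Series([1.0, 2.0, 3.0], index=dup_idx)

wanted = pd.to_datetime(["2017-09-30 06:00", "2017-10-01 01:00"]).tz_localize(tz)

try:
    s_dup.reindex(wanted)
    reindex_error = None
except ValueError as exc:
    # pandas refuses to reindex when the source axis has duplicate labels.
    reindex_error = exc

# .loc with a label that exists returns every row matching that label.
rows = s_dup.loc[wanted[:1]]
```

So with duplicates in the source index, reindex cannot serve as a drop-in replacement for .loc; the duplicates would have to be resolved first.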

@jreback
Contributor

jreback commented Oct 30, 2017

As @jorisvandenbossche points out, this already shows a deprecation warning; it will be changed to an exception in 1.0 (after 0.22.0). The 'this can be overlooked' concern is the reason.

> And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe.

Of course if you have duplicates in an index, then you are on your own, that behavior is also not directly supported.

@mroeschke
Member

This behavior looks fixed on master, if anyone wants to put up a test.

In [1]: pd.__version__
Out[1]: '0.24.0.dev0+371.g0b7a08b70'

In [3]: s.loc[i]
Out[3]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
dtype: float64

In [4]: s.reindex(i)
Out[4]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
dtype: float64

In [5]: s.loc[i2]
/anaconda3/envs/pandas-dev/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  #!/anaconda3/envs/pandas-dev/bin/python
Out[5]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
2017-10-01 01:00:00-05:00         NaN
dtype: float64

In [6]: s.reindex(i2)
Out[6]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
2017-10-01 01:00:00-05:00         NaN
dtype: float64

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite good first issue labels Jul 28, 2018
@gfyoung
Member

gfyoung commented Aug 2, 2018

In light of those two facts, I'm going to close this.

@gfyoung gfyoung closed this as completed Aug 2, 2018
@gfyoung gfyoung added this to the No action milestone Aug 2, 2018