PERF: PeriodIndex.size #14822

max-sixty · 2016-12-08T04:03:54Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: index = pd.PeriodIndex(start='1952', periods=50000, freq='B')

In [3]: %timeit index.size
10 loops, best of 3: 102 ms per loop

In [4]: %timeit index._values.size
The slowest run took 24.79 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 233 ns per loop

Problem description

@sinhrks - now that the PeriodIndex call to .values unboxes all the periods, operations like PeriodIndex.size are much slower.

What's the best way around this? Should we override more methods so that they call into ._values rather than .values?

Output of `pd.show_versions()`

In [6]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

max-sixty · 2016-12-15T14:39:32Z

@sinhrks thoughts?

jorisvandenbossche · 2016-12-15T14:50:01Z

I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included.

sinhrks · 2016-12-15T22:32:22Z

I think changing the definition of IndexOpsMixin.size to use _values should work (including Series with extension types)

max-sixty · 2016-12-15T23:07:46Z

@sinhrks do you know of any other properties or methods like this?

In this case, size could call __len__() and we can abstract over that

max-sixty · 2016-12-17T03:24:14Z

Should all of the methods of OpsIndexMixin: https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py#L817 use _values?

More generally, is it dangerous that calling .values on an object is a slow operation? Are we going to be chasing down performance issues for a while?

jreback · 2016-12-17T14:46:59Z

so _values are 'internal', while .values' is external. for lots of indexes (e.g. Int64) and non-pandas-dtypes Series these are the same. But for extensions these are different by design, e.g. a ._values should return for example a DatetimeIndex (for timezeone-aware series). So yes, using ._values will not involved coercions, while .values potentially could.

for PeriodIndex this previously didn't matter, but will going forward.

So changing the implementation would be fine (if any errors show up need to be looked at though), and potential perf comparisons...

max-sixty · 2016-12-20T00:20:49Z

OK, thanks @jreback

I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:

In [1]: import pandas as pd

In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')

In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop

In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop

In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True


In [9]: %timeit index.get_loc(index[500])
The slowest run took 26.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.46 ms per loop

So almost 1000x slower than Int64Index:

In [13]: index = pd.Int64Index(range(0,50000))

In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop

@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace .values with ._values in a few places, but the required changes are probably much greater than that...

FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use PeriodIndex. Keen to be part of the solution though...

jreback · 2016-12-20T00:45:07Z

there is probably some boxing going on

this should be similar speed to a DTI index

the _shallow_copy is the tip off

max-sixty · 2016-12-20T03:40:58Z

there is probably some boxing going on
this should be similar speed to a DTI index
the _shallow_copy is the tip off

Boxing everywhere! For _shallow_copy it's here: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L374 (no cover in PeriodIndex)

I think the core issue is that lots of places we rely on .values being a fast array. Should these all be ._values?

max-sixty · 2016-12-20T15:28:05Z

The weirdness deepens... I've tracked down the get_loc slowdown to getting the ordinal from the Int64Index:

index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [37]: index._int64index
Out[37]:
Int64Index([ 7827,  7828,  7829,  7830,  7831,  7832,  7833,  7834,  7835,
             7836,
            ...
            57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
            57826],
           dtype='int64', length=50000)

In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop  # really slow

But if I create exactly the same index directly as an Int64Index, the problem goes away:

In [40]: int_index = pd.Int64Index(range(7827,57827))

In [44]: int_index.equals(index._int64index)
Out[44]: True

In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop  # really fast

Any ideas?

jreback · 2016-12-20T15:51:08Z

I am not sure why this is not cached. The reason .asi8 doen't need caching it is a cheap view (as its on the values), but here we need an actual index. And there is some overhead to creating an index itself (even if in this case its a pass thru).

I don't think this will have any negative effects and should fix most of the speed issues.

In [1]: index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [2]: %timeit index._int64index.get_loc(12827)
The slowest run took 1381.93 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.3 µs per loop

diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 4bab3bc..593ac3d 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -356,9 +356,8 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
     def asi8(self):
         return self._values.view('i8')
 
-    @property
+    @cache_readonly
     def _int64index(self):
-        # do not cache, same as .asi8
         return Int64Index(self.asi8, name=self.name, fastpath=True)
 
     @property

jreback · 2016-12-20T15:52:24Z

@MaximilianR if you want to make this change (and do a perf comparison) and no negative effects, and you do it soon, then could include in 0.19.2.

jreback · 2016-12-20T15:53:15Z

the size issue is still related to boxing though.

max-sixty · 2016-12-20T15:56:49Z

OK I'll work on that now, + the boxing.

One more q - should the .shape-like properties be calling into ._values or ._data? I'm not sure of the distinction

jreback · 2016-12-20T16:01:49Z

._values is fine (you can try to do this in pandas.core.base and it should be ok, but if you encounter issues then should do it in sub-classes (but try in the base class first)

jreback · 2016-12-20T16:02:17Z

we also may not have sufficient asv for period (though not sure). for these cases pls add.

jreback · 2016-12-21T11:11:36Z

closed by #14931

max-sixty mentioned this issue Dec 8, 2016

PERF: Use len rather than size pydata/xarray#1157

Merged

chris-b1 added Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version labels Dec 8, 2016

jreback added this to the Next Major Release milestone Dec 20, 2016

jreback modified the milestones: 0.19.2, Next Major Release Dec 20, 2016

max-sixty mentioned this issue Dec 20, 2016

PERF: PeriodIndex speed up #14931

Merged

4 tasks

jreback closed this as completed Dec 21, 2016

max-sixty mentioned this issue Feb 8, 2017

PeriodIndex causes severe slow down pydata/xarray#1255

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: PeriodIndex.size #14822

PERF: PeriodIndex.size #14822

max-sixty commented Dec 8, 2016

INSTALLED VERSIONS

max-sixty commented Dec 15, 2016

jorisvandenbossche commented Dec 15, 2016

sinhrks commented Dec 15, 2016

max-sixty commented Dec 15, 2016

max-sixty commented Dec 17, 2016

jreback commented Dec 17, 2016

max-sixty commented Dec 20, 2016

jreback commented Dec 20, 2016

max-sixty commented Dec 20, 2016

max-sixty commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

max-sixty commented Dec 20, 2016 •

edited

Loading

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 21, 2016

PERF: PeriodIndex.size #14822

PERF: PeriodIndex.size #14822

Comments

max-sixty commented Dec 8, 2016

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

max-sixty commented Dec 15, 2016

jorisvandenbossche commented Dec 15, 2016

sinhrks commented Dec 15, 2016

max-sixty commented Dec 15, 2016

max-sixty commented Dec 17, 2016

jreback commented Dec 17, 2016

max-sixty commented Dec 20, 2016

jreback commented Dec 20, 2016

max-sixty commented Dec 20, 2016

max-sixty commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

max-sixty commented Dec 20, 2016 • edited Loading

jreback commented Dec 20, 2016

jreback commented Dec 20, 2016

jreback commented Dec 21, 2016

Output of `pd.show_versions()`

max-sixty commented Dec 20, 2016 •

edited

Loading