Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: PeriodIndex.size #14822

Closed
max-sixty opened this issue Dec 8, 2016 · 17 comments
Closed

PERF: PeriodIndex.size #14822

max-sixty opened this issue Dec 8, 2016 · 17 comments
Labels
Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@max-sixty
Copy link
Contributor

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: index = pd.PeriodIndex(start='1952', periods=50000, freq='B')

In [3]: %timeit index.size
10 loops, best of 3: 102 ms per loop

In [4]: %timeit index._values.size
The slowest run took 24.79 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 233 ns per loop

Problem description

@sinhrks - now that the PeriodIndex call to .values unboxes all the periods, operations like PeriodIndex.size are much slower.

What's the best way around this? Should we override more methods so that they call into ._values rather than .values?

Output of pd.show_versions()

In [6]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@chris-b1 chris-b1 added Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version labels Dec 8, 2016
@max-sixty
Copy link
Contributor Author

@sinhrks thoughts?

@jorisvandenbossche
Copy link
Member

I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included.

@sinhrks
Copy link
Member

sinhrks commented Dec 15, 2016

I think changing the definition of IndexOpsMixin.size to use _values should work (including Series with extension types)

@max-sixty
Copy link
Contributor Author

@sinhrks do you know of any other properties or methods like this?

In this case, size could call __len__() and we can abstract over that

@max-sixty
Copy link
Contributor Author

Should all of the methods of OpsIndexMixin: https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py#L817 use _values?

More generally, is it dangerous that calling .values on an object is a slow operation? Are we going to be chasing down performance issues for a while?

@jreback
Copy link
Contributor

jreback commented Dec 17, 2016

so _values are 'internal', while .values' is external. for lots of indexes (e.g. Int64) and non-pandas-dtypes Series these are the same. But for extensions these are different by design, e.g. a ._values should return for example a DatetimeIndex (for timezeone-aware series). So yes, using ._values will not involved coercions, while .values potentially could.

for PeriodIndex this previously didn't matter, but will going forward.

So changing the implementation would be fine (if any errors show up need to be looked at though), and potential perf comparisons...

@max-sixty
Copy link
Contributor Author

OK, thanks @jreback

I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:

In [1]: import pandas as pd

In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')

In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop

In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop

In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True

In [9]: %timeit index.get_loc(index[500])
The slowest run took 26.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.46 ms per loop

So almost 1000x slower than Int64Index:

In [13]: index = pd.Int64Index(range(0,50000))

In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop

@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace .values with ._values in a few places, but the required changes are probably much greater than that...

FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use PeriodIndex. Keen to be part of the solution though...

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

there is probably some boxing going on

this should be similar speed to a DTI index

the _shallow_copy is the tip off

@max-sixty
Copy link
Contributor Author

there is probably some boxing going on
this should be similar speed to a DTI index
the _shallow_copy is the tip off

Boxing everywhere! For _shallow_copy it's here: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L374 (no cover in PeriodIndex)

I think the core issue is that lots of places we rely on .values being a fast array. Should these all be ._values?

@max-sixty
Copy link
Contributor Author

The weirdness deepens... I've tracked down the get_loc slowdown to getting the ordinal from the Int64Index:

index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [37]: index._int64index
Out[37]:
Int64Index([ 7827,  7828,  7829,  7830,  7831,  7832,  7833,  7834,  7835,
             7836,
            ...
            57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
            57826],
           dtype='int64', length=50000)

In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop  # really slow

But if I create exactly the same index directly as an Int64Index, the problem goes away:

In [40]: int_index = pd.Int64Index(range(7827,57827))

In [44]: int_index.equals(index._int64index)
Out[44]: True

In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop  # really fast

Any ideas?

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

I am not sure why this is not cached. The reason .asi8 doen't need caching it is a cheap view (as its on the values), but here we need an actual index. And there is some overhead to creating an index itself (even if in this case its a pass thru).

I don't think this will have any negative effects and should fix most of the speed issues.

In [1]: index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [2]: %timeit index._int64index.get_loc(12827)
The slowest run took 1381.93 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.3 µs per loop
diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 4bab3bc..593ac3d 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -356,9 +356,8 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
     def asi8(self):
         return self._values.view('i8')
 
-    @property
+    @cache_readonly
     def _int64index(self):
-        # do not cache, same as .asi8
         return Int64Index(self.asi8, name=self.name, fastpath=True)
 
     @property

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

@MaximilianR if you want to make this change (and do a perf comparison) and no negative effects, and you do it soon, then could include in 0.19.2.

@jreback jreback added this to the Next Major Release milestone Dec 20, 2016
@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

the size issue is still related to boxing though.

@max-sixty
Copy link
Contributor Author

max-sixty commented Dec 20, 2016

OK I'll work on that now, + the boxing.

One more q - should the .shape-like properties be calling into ._values or ._data? I'm not sure of the distinction

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

._values is fine (you can try to do this in pandas.core.base and it should be ok, but if you encounter issues then should do it in sub-classes (but try in the base class first)

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

we also may not have sufficient asv for period (though not sure). for these cases pls add.

@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 20, 2016
@jreback
Copy link
Contributor

jreback commented Dec 21, 2016

closed by #14931

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Period Period data type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

No branches or pull requests

5 participants