PERF: PeriodIndex.size #14822
Comments
@sinhrks thoughts? |
I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included. |
I think changing the definition of |
@sinhrks do you know of any other properties or methods like this? In this case, |
Should all of the methods of More generally, is it dangerous that calling |
so for So changing the implementation would be fine (if any errors show up need to be looked at though), and potential perf comparisons... |
OK, thanks @jreback. I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:
In [1]: import pandas as pd
In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')
In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop
In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop
In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True
So almost 1000x slower than
In [13]: index = pd.Int64Index(range(0,50000))
In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop
@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace
FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use |
There is probably some boxing going on; this should be similar in speed to a DTI index. The _shallow_copy is the tip-off. |
Boxing everywhere! For I think the core issue is that lots of places we rely on |
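To illustrate the boxing cost discussed above, here is a minimal pure-Python sketch. It does not use pandas: `ToyIndex`, `Boxed`, `size_via_values` and `size_via_raw` are hypothetical names, and the only assumption carried over from the thread is that `.values` creates one object per element while `._values` returns the raw storage.

```python
from array import array


class Boxed:
    """Hypothetical stand-in for a boxed scalar such as a Period."""
    __slots__ = ("ordinal",)

    def __init__(self, ordinal):
        self.ordinal = ordinal


class ToyIndex:
    """Hypothetical index that stores raw integer ordinals."""

    def __init__(self, ordinals):
        self._values = array("q", ordinals)  # raw storage, cheap to access
        self.box_count = 0                   # how many objects were boxed so far

    @property
    def values(self):
        # Boxes every element on each access: O(n) object creation.
        self.box_count += len(self._values)
        return [Boxed(o) for o in self._values]

    @property
    def size_via_values(self):
        # Pays the full boxing cost just to count elements.
        return len(self.values)

    @property
    def size_via_raw(self):
        # Counts elements without creating any boxed objects.
        return len(self._values)


index = ToyIndex(range(50_000))
assert index.size_via_values == index.size_via_raw == 50_000
print(index.box_count)  # boxed objects created by the single size_via_values call
```

Both properties return the same answer, but only the raw path avoids allocating 50,000 temporary objects, which is the motivation for routing such methods through `._values`.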
The weirdness deepens... I've tracked down the
index = pd.PeriodIndex(start='2000', periods=50000, freq='B')
In [37]: index._int64index
Out[37]:
Int64Index([ 7827, 7828, 7829, 7830, 7831, 7832, 7833, 7834, 7835,
7836,
...
57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
57826],
dtype='int64', length=50000)
In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop # really slow
But if I create exactly the same index directly as an
In [40]: int_index = pd.Int64Index(range(7827,57827))
In [44]: int_index.equals(index._int64index)
Out[44]: True
In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop # really fast
Any ideas? |
I am not sure why this is not cached. The reason I don't think this will have any negative effects and should fix most of the speed issues.
|
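One way to picture the caching point above is the following pure-Python sketch. It is not pandas code: `Engine`, `IntIndex`, `UncachedWrapper` and `CachedWrapper` are hypothetical names, and it assumes that the expensive step is building a lookup table once per index object. Under that assumption, a wrapper that rebuilds its inner index on every access pays the build cost on every lookup, while a cached one pays it once.

```python
from functools import cached_property


class Engine:
    """Hypothetical lookup engine; building the table is the O(n) step."""
    builds = 0  # class-level counter of how many tables were built

    def __init__(self, values):
        Engine.builds += 1
        self._table = {v: i for i, v in enumerate(values)}

    def get_loc(self, key):
        return self._table[key]


class IntIndex:
    """Hypothetical integer index that builds its engine lazily, once."""

    def __init__(self, values):
        self._data = list(values)

    @cached_property
    def _engine(self):
        return Engine(self._data)

    def get_loc(self, key):
        return self._engine.get_loc(key)


class UncachedWrapper:
    """Creates a fresh inner index (and so a fresh engine) on each access."""

    def __init__(self, values):
        self._data = list(values)

    @property
    def _int_index(self):
        return IntIndex(self._data)

    def get_loc(self, key):
        return self._int_index.get_loc(key)


class CachedWrapper:
    """Creates the inner index once and reuses it afterwards."""

    def __init__(self, values):
        self._data = list(values)

    @cached_property
    def _int_index(self):
        return IntIndex(self._data)

    def get_loc(self, key):
        return self._int_index.get_loc(key)


values = range(7827, 57827)
slow, fast = UncachedWrapper(values), CachedWrapper(values)

Engine.builds = 0
for _ in range(3):
    slow.get_loc(12827)
slow_builds = Engine.builds  # one table build per lookup

Engine.builds = 0
for _ in range(3):
    fast.get_loc(12827)
fast_builds = Engine.builds  # a single table build, reused afterwards

print(slow_builds, fast_builds)
```

This would be consistent with the timings in the thread, where the directly constructed index is slow only on its first lookup, but whether it is the actual cause in pandas is an assumption here.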
@MaximilianR if you want to make this change (and do a perf comparison), there are no negative effects, and you do it soon, then we could include it in 0.19.2. |
the size issue is still related to boxing though. |
OK I'll work on that now, + the boxing. One more q - should the |
|
We also may not have sufficient asv benchmarks for Period (though not sure). Please add them for these cases. |
closed by #14931 |
Code Sample, a copy-pastable example if possible
Problem description
@sinhrks - now that the PeriodIndex call to .values unboxes all the periods, operations like PeriodIndex.size are much slower.
What's the best way around this? Should we override more methods so that they call into ._values rather than .values?
Output of pd.show_versions()
In [6]: pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None