Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: PeriodIndex speed up #14931

Merged
merged 1 commit into from
Dec 21, 2016
Merged

PERF: PeriodIndex speed up #14931

merged 1 commit into from
Dec 21, 2016

Conversation

max-sixty
Copy link
Contributor

@max-sixty max-sixty commented Dec 20, 2016

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

great! pls add a whatsnew note. and if you can post the asv results would be great.

@jreback jreback added Performance Memory or execution speed performance Period Period data type labels Dec 20, 2016
@jreback jreback added this to the 0.19.2 milestone Dec 20, 2016
@max-sixty
Copy link
Contributor Author

max-sixty commented Dec 20, 2016

    before     after       ratio
  [3ccb5013] [13aa1309]
+    4.48ms     6.10ms      1.36  period.period_standard_indexing.time_shape
-  659.85μs   572.45μs      0.87  period.period_standard_indexing.time_intersection
-  258.36μs   166.30μs      0.64  period.period_standard_indexing.time_series_loc
-  135.68μs    48.90μs      0.36  period.period_standard_indexing.time_get_loc
-    3.95ms    31.21μs      0.01  period.period_standard_indexing.time_shallow_copy

This is before the Shape change.

Current branch is v fast:

In [5]: index = pd.PeriodIndex(start=2000, periods=20000, freq='B')

In [7]: %timeit -n 1 -r 1 index.shape
1 loop, best of 1: 12.2 µs per loop

Re-running ASV now (doesn't seem easy to do a quick dev run...)

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

@MaximilianR that's fine. maybe needs np.random.seed to be set (to make it consitent). but no big deal.

@@ -49,3 +49,28 @@ def time_value_counts_pindex(self):
self.i.value_counts()


class period_standard_indexing(object):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would be parameterized with Index type. For another day though.

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

lgtm. ping on green.

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

======================================================================
FAIL: test_getitem_ndim2 (pandas.tseries.tests.test_period.TestPeriodIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\tseries\tests\test_period.py", line 2110, in test_getitem_ndim2
    self.assertEqual(result.shape, (len(idx), ))
AssertionError: Tuples differ: (3L, 1L) != (3,)
First tuple contains 1 additional elements.
First extra element 1:
1L
- (3L, 1L)
+ (3,)
------------------

@max-sixty
Copy link
Contributor Author

I'll look

@max-sixty
Copy link
Contributor Author

Any idea why this is here? Creating another axis on an index?

   2104     def test_getitem_ndim2(self):
   2105         idx = period_range('2007-01', periods=3, freq='M')
   2106
   2107         result = idx[:, None]
   2108         # MPL kludge, internally has incorrect shape
   2109         tm.assertIsInstance(result, PeriodIndex)
-> 2110         self.assertEqual(result.shape, (len(idx), ))

@max-sixty
Copy link
Contributor Author

Weird but different things happen with other indexes:

In [7]: pd.DatetimeIndex(start='2000', periods=10, freq='B')
Out[7]:
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14'],
              dtype='datetime64[ns]', freq='B')

In [8]: pd.DatetimeIndex(start='2000', periods=10, freq='B')[:, None].shape
Out[8]: (10, 1)

In [9]: pd.DatetimeIndex(start='2000', periods=10, freq='B')[:, None]
Out[9]:
array([['2000-01-03T00:00:00.000000000'],
       ['2000-01-04T00:00:00.000000000'],
       ['2000-01-05T00:00:00.000000000'],
       ['2000-01-06T00:00:00.000000000'],
       ['2000-01-07T00:00:00.000000000'],
       ['2000-01-10T00:00:00.000000000'],
       ['2000-01-11T00:00:00.000000000'],
       ['2000-01-12T00:00:00.000000000'],
       ['2000-01-13T00:00:00.000000000'],
       ['2000-01-14T00:00:00.000000000']], dtype='datetime64[ns]')

In [10]: pd.DatetimeIndex(start='2000', periods=10, freq='B').shape
Out[10]: (10,)

@@ -305,7 +306,7 @@ def _simple_new(cls, values, name=None, freq=None, **kwargs):
if (len(values) > 0 and is_float_dtype(values)):
raise TypeError("PeriodIndex can't take floats")
else:
return PeriodIndex(values, name=name, freq=freq, **kwargs)
return cls(values, name=name, freq=freq, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this here, maybe causing the ndim test sto fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not what's causing the test failure (it could only be different if cls != PeriodIndex

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

this is a bit bogus, but I suspect this is in there for some matplotib compat (we pass periods to it I think for it to display some things).

@max-sixty
Copy link
Contributor Author

max-sixty commented Dec 20, 2016

It's only changing the output of .shape - nothing else. This is from master:

In [3]: pd.PeriodIndex(start='2000', periods=10, freq='B')[:, None].shape
Out[3]: (10,)

In [4]: pd.PeriodIndex(start='2000', periods=10, freq='B')[:, None].values.shape
Out[4]: (10,)

In [5]: pd.PeriodIndex(start='2000', periods=10, freq='B')[:, None]._values.shape
Out[5]: (10, 1)

I can hack change .shape to only look at the first dimension, or if nothing else fails are you OK taking that test out?

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

ok taking the test out, if all plotting things pass (iow everything else is ok).

@codecov-io
Copy link

Current coverage is 84.64% (diff: 100%)

Merging #14931 into master will decrease coverage by <.01%

@@             master     #14931   diff @@
==========================================
  Files           144        144          
  Lines         51023      51025     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43190      43191     +1   
- Misses         7833       7834     +1   
  Partials          0          0          

Powered by Codecov. Last update 3ab369c...72b14aa

@jreback jreback merged commit 4c3d4d4 into pandas-dev:master Dec 21, 2016
@jreback
Copy link
Contributor

jreback commented Dec 21, 2016

thanks!

@max-sixty max-sixty deleted the period-perf branch December 21, 2016 03:28
@jreback jreback mentioned this pull request Dec 21, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
Version 0.19.2

* tag 'v0.19.2': (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
* releases: (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants