
Raise ValueError when using levels with non-unique values in MultiIndex constructor #17557

Closed
wants to merge 3 commits

Conversation


@alanbato alanbato commented Sep 17, 2017

verify_integrity now also checks if any level has non-unique values and raises ValueError if one does.

However, this new behaviour broke some existing tests.
I'd like to know what I should do in this case: should I pass verify_integrity=False in those tests, change them, or do something else?

Also, how and where should I document this change in the whatsnew file?

Thank you for your time! 🐼 🐍

for i, level in enumerate(levels):
    if len(level) != len(set(level)):
        raise ValueError("Level values must be unique: %s "
                         "on level %d" % ([value for value
                                           in level], i))
Member

Can you use .format syntax instead of %? I realize that other places within this file use % formatting, but there's an ongoing effort to transition all % formatting in the pandas codebase to .format, so might as well minimize the number of changes that will need to be made.

Contributor Author

Of course! I do prefer .format but decided to stick with % because of the other tests. Thank you for telling me; I'll keep it in mind for future contributions :)

for i, level in enumerate(levels):
    if len(level) != len(set(level)):
        raise ValueError("Level values must be unique: {0}"
                         " on level {1}".format([value for value
                                                 in level], i))
Contributor

your check is not checking the values

I would be surprised if you had any failures, though it is possible

Contributor Author

I'm checking whether all the values inside a level are unique by comparing the level's length with the length of the set of its values. Should I be doing it some other way? Maybe I misunderstood what needs to be checked.
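As a minimal illustration of the length-vs-set-length comparison described above (plain Python; the helper name and sample data are hypothetical, not part of the PR):

```python
# Minimal sketch of the length-vs-set-length uniqueness check.
def has_duplicates(values):
    """True when any value occurs more than once."""
    return len(values) != len(set(values))

print(has_duplicates([0, 1, 2, 3]))     # False: all values unique
print(has_duplicates([1, 1, 1, 1]))     # True: like level 0 in the failing test
print(has_duplicates(['A', 'B', 'A']))  # True: repeated string values
```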

Here are the tests that failed with the new check:
========================================================================================== FAILURES ==========================================================================================
__________________________________________________________________________________ TestMultiIndex.test_is_ ___________________________________________________________________________________

self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e671a780>

    def test_is_(self):
        mi = MultiIndex.from_tuples(lzip(range(10), range(10)))
        assert mi.is_(mi)
        assert mi.is_(mi.view())
        assert mi.is_(mi.view().view().view().view())
        mi2 = mi.view()
        # names are metadata, they don't change id
        mi2.names = ["A", "B"]
        assert mi2.is_(mi)
        assert mi.is_(mi2)
    
        assert mi.is_(mi.set_names(["C", "D"]))
        mi2 = mi.view()
        mi2.set_names(["E", "F"], inplace=True)
        assert mi.is_(mi2)
        # levels are inherent properties, they change identity
        mi3 = mi2.set_levels([lrange(10), lrange(10)])
        assert not mi3.is_(mi2)
        # shouldn't change
        assert mi2.is_(mi)
        mi4 = mi3.view()
>       mi4.set_levels([[1 for _ in range(10)], lrange(10)], inplace=True)

pandas/tests/indexes/test_multi.py:1584: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/indexes/multi.py:254: in set_levels
    verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
    self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=['E', 'F'])
labels = FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]), levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

    def _verify_integrity(self, labels=None, levels=None):
        """
    
            Parameters
            ----------
            labels : optional list
                Labels to check for validity. Defaults to current labels.
            levels : optional list
                Levels to check for validity. Defaults to current levels.
    
            Raises
            ------
            ValueError
                * if length of levels and labels don't match or any label would
                exceed level bounds
            """
        # NOTE: Currently does not check, among other things, that cached
        # nlevels matches nor that sortorder matches actually sortorder.
        labels = labels or self.labels
        levels = levels or self.levels
    
        if len(levels) != len(labels):
            raise ValueError("Length of levels and labels must match. NOTE:"
                             " this index is in an inconsistent state.")
        label_length = len(self.labels[0])
        for i, (level, label) in enumerate(zip(levels, labels)):
            if len(label) != label_length:
                raise ValueError("Unequal label lengths: %s" %
                                 ([len(lab) for lab in labels]))
            if len(label) and label.max() >= len(level):
                raise ValueError("On level %d, label max (%d) >= length of"
                                 " level  (%d). NOTE: this index is in an"
                                 " inconsistent state" % (i, label.max(),
                                                          len(level)))
        for i, level in enumerate(levels):
            if len(level) != len(set(level)):
                raise ValueError("Level values must be unique: {0}"
                                 " on level {1}".format([value for value
>                                                        in level], i))
E               ValueError: Level values must be unique: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] on level 0

pandas/core/indexes/multi.py:154: ValueError
____________________________________________________________________ TestMultiIndex.test_level_setting_resets_attributes _____________________________________________________________________

self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e69834e0>

    def test_level_setting_resets_attributes(self):
        ind = MultiIndex.from_arrays([
            ['A', 'A', 'B', 'B', 'B'], [1, 2, 1, 2, 3]
        ])
        assert ind.is_monotonic
        ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]],
>                      inplace=True)

pandas/tests/indexes/test_multi.py:2387: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/indexes/multi.py:254: in set_levels
    verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
    self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MultiIndex(levels=[['A', 'B'], [1, 2, 3]],
           labels=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]]), labels = FrozenList([[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]])
levels = FrozenList([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]])

    def _verify_integrity(self, labels=None, levels=None):
        """
    
            Parameters
            ----------
            labels : optional list
                Labels to check for validity. Defaults to current labels.
            levels : optional list
                Levels to check for validity. Defaults to current levels.
    
            Raises
            ------
            ValueError
                * if length of levels and labels don't match or any label would
                exceed level bounds
            """
        # NOTE: Currently does not check, among other things, that cached
        # nlevels matches nor that sortorder matches actually sortorder.
        labels = labels or self.labels
        levels = levels or self.levels
    
        if len(levels) != len(labels):
            raise ValueError("Length of levels and labels must match. NOTE:"
                             " this index is in an inconsistent state.")
        label_length = len(self.labels[0])
        for i, (level, label) in enumerate(zip(levels, labels)):
            if len(label) != label_length:
                raise ValueError("Unequal label lengths: %s" %
                                 ([len(lab) for lab in labels]))
            if len(label) and label.max() >= len(level):
                raise ValueError("On level %d, label max (%d) >= length of"
                                 " level  (%d). NOTE: this index is in an"
                                 " inconsistent state" % (i, label.max(),
                                                          len(level)))
        for i, level in enumerate(levels):
            if len(level) != len(set(level)):
                raise ValueError("Level values must be unique: {0}"
                                 " on level {1}".format([value for value
>                                                        in level], i))
E               ValueError: Level values must be unique: ['A', 'B', 'A', 'A', 'B'] on level 0

pandas/core/indexes/multi.py:154: ValueError
================================================================ 2 failed, 189 passed, 2 skipped, 1 xfailed in 18.21 seconds =================================================================

Contributor

The very first example is wrong

In [17]: mi
Out[17]: 
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=['E', 'F'])

In [18]: mi.levels[0]
Out[18]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='E')

In [19]: mi.levels[1]
Out[19]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='F')

In [20]: [set(level) for i, level in enumerate(mi.levels)]
Out[20]: [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
In [22]: list(map(lambda x: x.is_unique, mi.levels))
Out[22]: [True, True]

you can probably do something like this (or, better, iterate to show exactly where the error is)
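A sketch of the suggested is_unique-based check (assuming pandas is available; the function name is hypothetical, not the actual pandas internals):

```python
import pandas as pd

def check_level_uniqueness(levels):
    """Raise ValueError naming the first level whose values are not unique."""
    for i, level in enumerate(levels):
        idx = pd.Index(level)  # MultiIndex levels are Index objects
        if not idx.is_unique:
            raise ValueError("Level values must be unique: {0}"
                             " on level {1}".format(list(idx), i))

try:
    # The second level repeats each value, as in the example from the issue.
    check_level_uniqueness([range(2), [0, 0, 1, 1]])
except ValueError as exc:
    print(exc)  # Level values must be unique: [0, 0, 1, 1] on level 1
```

Using Index.is_unique keeps the check on the level's values themselves and reports the offending level index.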

Contributor

the example from the issue

In [31]: idx0 = range(2)
    ...: idx1 = np.repeat(range(2), 2)
    ...: 
    ...: midx = pd.MultiIndex(
    ...:     levels=[idx0, idx1],
    ...:     labels=[
    ...:         np.repeat(range(len(idx0)), len(idx1)),
    ...:         np.tile(range(len(idx1)), len(idx0))
    ...:     ],
    ...:     names=['idx0', 'idx1']
    ...: )
    ...: 

In [32]: list(map(lambda x: x.is_unique, midx.levels))
Out[32]: [True, False]

Contributor Author

Isn't the first example throwing an error because it's replacing those levels with these?
levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

And the is_unique method is indeed a clearer way to do it, thanks!

Also, I'm having trouble running the performance checks with asv 😞 some weird shutil I/O error while trying to run pip wheel

Contributor Author

@jreback I don't know if the conversation got lost due to the changes to that piece of code, so I'm pinging you in case you thought I didn't reply. If you're just busy, sorry to bother you!

@TomAugspurger if you have time could you look at it? :)

Thanks, both of you!

@jreback
Contributor

jreback commented Sep 17, 2017

needs performance checking

whatsnew is in the other api changes section

@jreback added the "Error Reporting" (Incorrect or improved errors from pandas) and "MultiIndex" labels on Sep 17, 2017
@TomAugspurger
Contributor

from #17557 (comment)

> Isn't the first example throwing an error because it's replacing those levels with these?
> levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
> And the is_unique method is indeed a clearer way to do it, thanks!
> Also, I'm having trouble running the performance checks with asv 😞 some weird shutil I/O error while trying to run pip wheel

For the asv, are you using conda? That's typically easiest.

@alanbato
Contributor Author

alanbato commented Sep 20, 2017

@TomAugspurger I ran the command again and got it to work, but I'm seeing really weird results in the performance test: everything is either above or below 10%. Is that normal?

+      96.5±0.2ms          367±5ms     3.80  frame_methods.Reindex.time_reindex_axis1
+     40.4±0.08ms          122±5ms     3.02  frame_methods.Reindex.time_reindex_both_axes
+      40.9±0.2ms        111±0.3ms     2.72  frame_methods.Reindex.time_reindex_both_axes_ix
+      93.2±0.1ms          243±4ms     2.61  frame_methods.Dropna.time_count_level_axis0_multi
+        29.1±2ms         74.5±7ms     2.56  frame_methods.Shift.time_shift_axis0
+      29.9±0.2ms       74.8±0.3ms     2.50  frame_methods.Shift.time_shift_axis_1
+      84.3±0.5ms          190±3ms     2.25  frame_methods.Dropna.time_count_level_axis1_multi
+     96.0±0.05ms          213±2ms     2.21  frame_methods.Dropna.time_count_level_axis1_mixed_dtypes_multi
+      109±0.08ms            231ms     2.13  frame_methods.Dropna.time_count_level_axis0_mixed_dtypes_multi
+       105±0.1ms        219±0.7ms     2.10  frame_methods.Dropna.time_dropna_axis1_all
+      104±0.03ms            198ms     1.90  frame_methods.Dropna.time_dropna_axis0_all
+      33.9±0.9ms         58.5±2ms     1.73  binary_ops.TimeseriesTZ.time_timestamp_ops_diff1
+         452±2ms            731ms     1.62  frame_methods.Dropna.time_dropna_axis0_all_mixed_dtypes
+         996±7ms          1.55±0s     1.56  gil.NoGilGroupby.time_groups_2
+     11.0±0.02ms           16.7ms     1.51  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('LastWeekOfMonth', 1)
+     11.3±0.03ms      17.1±0.03ms     1.51  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253_1', 2)
+     5.22±0.01ms       7.81±0.3ms     1.50  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('CustomBusinessDay', 1)
+     47.8±0.01ms           70.6ms     1.48  frame_methods.Dropna.time_dropna_axis1_any
+     50.9±0.04ms           74.9ms     1.47  frame_methods.Dropna.time_dropna_axis0_any
+          2.97ms           4.25ms     1.43  frame_methods.Iteration.time_iteritems_cached
+        1.28±0ms      1.82±0.03ms     1.42  groupby.GroupBySuite.time_first('float', 100)
+      90.4±0.5ms          126±4ms     1.40  eval.Eval.time_chained_cmp('python', 1)
+         450±0ms            617ms     1.37  frame_methods.Dropna.time_dropna_axis1_all_mixed_dtypes
+     1.10±0.01ms      1.51±0.04ms     1.37  frame_methods.frame_boolean_row_select.time_frame_boolean_row_select
+     5.36±0.01ms       7.31±0.3ms     1.36  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Minute', 1)
+     1.34±0.01ms      1.81±0.08ms     1.34  groupby.GroupBySuite.time_last('float', 100)
+        1.27±0ms      1.70±0.06ms     1.34  groupby.GroupBySuite.time_last('int', 100)
+        2.19±0ms      2.93±0.08ms     1.34  groupby.GroupBySuite.time_head('float', 100)
+     1.27±0.01ms      1.67±0.03ms     1.32  groupby.GroupBySuite.time_first('int', 100)
+     5.48±0.01ms      7.18±0.02ms     1.31  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('DateOffset', 2)
+     2.13±0.01ms      2.78±0.07ms     1.30  groupby.GroupBySuite.time_head('int', 100)
+     6.29±0.03ms       8.18±0.2ms     1.30  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Milli', 2)
+     6.47±0.01ms      8.41±0.02ms     1.30  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('QuarterBegin', 1)
+     6.27±0.02ms      8.13±0.03ms     1.30  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Micro', 2)
+     5.85±0.01ms      7.57±0.01ms     1.29  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Easter', 1)
+        40.0±2ms         51.6±2ms     1.29  eval.Eval.time_mult('python', 1)
+     23.6±0.09ms       30.5±0.2ms     1.29  binary_ops.Timeseries.time_timestamp_ops_diff1
+     29.5±0.05ms       38.0±0.2ms     1.29  frame_methods.Apply.time_apply_lambda_mean
+         138±4ms          178±5ms     1.29  gil.NoGilGroupby.time_count_2
+     7.85±0.07ms         10.1±0ms     1.29  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('CustomBusinessHour', 1)
+     6.29±0.02ms      8.08±0.01ms     1.28  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Second', 2)
+         425±1ms        546±0.8ms     1.28  frame_methods.Apply.time_apply_axis_1
+     6.51±0.03ms      8.32±0.02ms     1.28  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BYearBegin', 2)
+     7.29±0.02ms      9.30±0.02ms     1.28  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BusinessHour', 1)
+           682ms            866ms     1.27  gil.nogil_read_csv.time_read_csv
+          29.3μs           37.2μs     1.27  timeseries.DatetimeIndex.time_timestamp_tzinfo_cons
+     6.20±0.01ms      7.86±0.01ms     1.27  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('CustomBusinessDay', 2)
+     6.63±0.01ms         8.39±0ms     1.27  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BMonthEnd', 2)
+           674ms            853ms     1.27  frame_methods.Iteration.time_iteritems_indexing
+     4.59±0.02ms       5.80±0.2ms     1.26  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('DateOffset', 1)
+         175±2ms          221±2ms     1.26  groupby.GroupBySuite.time_mad('int', 100)
+           17.2s            21.7s     1.26  groupby.GroupBySuite.time_mad('int', 10000)
+       132±0.2ms        164±0.1ms     1.25  frame_ctor.FromDicts.time_frame_ctor_list_of_dict
+          70.4ms           87.8ms     1.25  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_1', 2)
+       393±0.4ms            489ms     1.25  frame_methods.Dropna.time_dropna_axis0_any_mixed_dtypes
+      108±0.02ms          134±7ms     1.24  frame_methods.frame_getitem_single_column.time_frame_getitem_single_column
+     43.2±0.05ms         53.6±0ms     1.24  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253_2', 2)
+      394±0.06ms            489ms     1.24  frame_methods.Dropna.time_dropna_axis1_any_mixed_dtypes
+     6.19±0.04ms       7.68±0.2ms     1.24  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Second', 1)
+      80.7±0.4μs        100±0.1μs     1.24  timeseries.Offsets.time_custom_bday_apply_dt64
+      131±0.04ms            163ms     1.24  frame_methods.Iteration.time_iteritems
+     5.89±0.06ms      7.30±0.07ms     1.24  binary_ops.Timeseries.time_series_timestamp_compare
+          42.8ms           52.7ms     1.23  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253_2', 1)
+           27.5s            33.9s     1.23  groupby.GroupBySuite.time_mad('float', 10000)
+      2.73±0.01s       3.36±0.03s     1.23  gil.NoGilGroupby.time_groups_4
+     4.99±0.06ms       6.13±0.2ms     1.23  groupby.GroupBySuite.time_last('int', 10000)
+         274±2ms         336±30ms     1.23  groupby.GroupBySuite.time_mad('float', 100)
+      42.5±0.2μs       52.0±0.2μs     1.22  timeseries.Offsets.time_timeseries_year_apply
+      61.1±0.4ms         74.4±1ms     1.22  groupby.GroupBySuite.time_diff('int', 100)
+     6.25±0.05ms      7.62±0.01ms     1.22  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Micro', 1)
+      44.3±0.9ms         53.8±1ms     1.22  groupby.FirstLast.time_groupby_nth_none('datetime')
+        1.25±0ms      1.51±0.04ms     1.22  groupby.GroupBySuite.time_cummin('int', 100)
+       120±0.3μs        146±0.2μs     1.21  timeseries.Offsets.time_timeseries_day_incr
+         214±2ms        260±0.3ms     1.21  timedelta.ToTimedelta.time_convert_coerce
+     28.2±0.06ms         34.1±2ms     1.21  frame_methods.Apply.time_apply_np_mean
+     6.12±0.01ms       7.38±0.2ms     1.21  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('CDay', 1)
+        1.24±0ms      1.50±0.03ms     1.20  groupby.GroupBySuite.time_cummax('int', 100)
+     5.97±0.02ms      7.18±0.03ms     1.20  frame_ctor.FromDicts.time_series_ctor_from_dict
+          75.7μs           90.8μs     1.20  index_object.Float64.time_construct
+           11.1s            13.2s     1.20  panel_methods.PanelMethods.time_pct_change_items
+          87.5μs            105μs     1.20  inference.to_numeric.time_from_float
+        5.14±0ms       6.14±0.4ms     1.20  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BDay', 1)
+       114±0.8ms        136±0.6ms     1.19  frame_ctor.FromDicts.time_frame_ctor_nested_dict
+           5.31s            6.31s     1.19  gil.NoGilGroupby.time_groups_8
+           8.96s            10.6s     1.19  groupby.GroupBySuite.time_diff('float', 10000)
+         582±2μs          687±1μs     1.18  period.period_standard_indexing.time_intersection
+     6.93±0.06ms      8.18±0.09ms     1.18  groupby.GroupBySuite.time_head('int', 10000)
+     5.29±0.05ms       6.22±0.1ms     1.18  groupby.GroupBySuite.time_first('int', 10000)
+     1.85±0.01ms      2.18±0.04ms     1.18  groupby.GroupBySuite.time_cumprod('float', 100)
+      22.6±0.1μs       26.5±0.1μs     1.17  indexing.DataFrameIndexing.time_get_value_ix
+           1.20s            1.40s     1.17  packers.Excel.time_write_excel_xlsxwriter
+     59.5±0.06ms       69.8±0.8ms     1.17  categoricals.Categoricals2.time_value_counts_dropna
+         893±1μs         1.05±0ms     1.17  indexing.MultiIndexing.time_frame_xs_mi_ix
+           8.12s            9.52s     1.17  panel_methods.PanelMethods.time_pct_change_major
+       233±0.4μs        273±0.3μs     1.17  timeseries.SemiMonthOffset.time_begin_decr_n
+           47.9s            56.0s     1.17  groupby.GroupBySuite.time_describe('int', 10000)
+       208±0.2μs        242±0.3μs     1.16  timeseries.SemiMonthOffset.time_begin_incr
+     3.95±0.01ms       4.60±0.3ms     1.16  categoricals.Categoricals.time_constructor_datetimes_with_nat
+     1.29±0.01ms      1.50±0.05ms     1.16  groupby.GroupBySuite.time_cumsum('int', 100)
+     2.89±0.01ms       3.35±0.3ms     1.16  frame_methods.frame_from_records_generator.time_frame_from_records_generator_nrows
+     9.17±0.03ms       10.6±0.1ms     1.16  groupby.GroupBySuite.time_head('float', 10000)
+       146±0.3ms          168±3ms     1.16  groupby.GroupBySuite.time_pct_change('int', 100)
+        1.21±0ms      1.40±0.01ms     1.16  groupby.GroupBySuite.time_var('float', 100)
+           8.06s            9.31s     1.15  panel_methods.PanelMethods.time_pct_change_minor
+     5.60±0.04ms       6.46±0.1ms     1.15  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('MonthBegin', 2)
+      15.7±0.1ms       18.1±0.4ms     1.15  algorithms.Algorithms.time_add_overflow_pos_scalar
+      31.0±0.1ms       35.8±0.6ms     1.15  eval.Eval.time_add('numexpr', 1)
+        1.39±0ms      1.60±0.05ms     1.15  groupby.GroupBySuite.time_cumsum('float', 100)
+           5.65s            6.50s     1.15  groupby.GroupBySuite.time_diff('int', 10000)
+           3.26s            3.76s     1.15  timeseries.SeriesArithmetic.time_add_offset_slow
+        97.2±2ms          112±3ms     1.15  eval.Eval.time_and('python', 1)
+           154ms            177ms     1.15  frame_methods.Iteration.time_itertuples
+     4.99±0.03ms       5.73±0.2ms     1.15  groupby.GroupBySuite.time_cummax('int', 10000)
+     6.56±0.01ms       7.53±0.1ms     1.15  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('SemiMonthBegin', 1)
+          9.84ms           11.3ms     1.15  index_object.Float64.time_boolean_series_indexer
+        1.31±0ms      1.50±0.04ms     1.14  groupby.GroupBySuite.time_cummax('float', 100)
+     7.73±0.06ms       8.84±0.1ms     1.14  groupby.GroupBySuite.time_last('float', 10000)
+     1.76±0.01ms      2.01±0.01ms     1.14  groupby.GroupBySuite.time_cumprod('int', 100)
+      92.4±0.3ms          106±3ms     1.14  groupby.GroupBySuite.time_diff('float', 100)
+          6.68ms           7.62ms     1.14  inference.DtypeInfer.time_int64
+     1.29±0.01ms      1.47±0.02ms     1.14  groupby.GroupBySuite.time_cumcount('float', 100)
+     6.04±0.04ms       6.86±0.7ms     1.14  binary_ops.TimeseriesTZ.time_series_timestamp_compare
+     5.90±0.03ms       6.70±0.3ms     1.14  binary_ops.Timeseries.time_timestamp_series_compare
+         392±1ms          446±9ms     1.14  frame_methods.frame_insert_100_columns_begin.time_frame_insert_500_columns_end
+     5.02±0.05ms      5.70±0.05ms     1.14  groupby.GroupBySuite.time_cummin('int', 10000)
+     5.80±0.03ms       6.58±0.1ms     1.14  groupby.GroupBySuite.time_cumprod('int', 10000)
+         770±6ms          874±8ms     1.13  groupby.GroupBySuite.time_describe('float', 100)
+           22.8s            25.9s     1.13  gil.nogil_datetime_fields.time_datetime_field_day
+     7.74±0.06ms       8.78±0.1ms     1.13  groupby.GroupBySuite.time_first('float', 10000)
+       108±0.5ms        122±0.3ms     1.13  frame_methods.FrameIsnull.time_isnull_obj
+         819±2μs          926±2μs     1.13  indexing.MultiIndexing.time_series_xs_mi_ix
+         479±2μs         541±20μs     1.13  frame_methods.frame_get_dtype_counts.time_frame_get_dtype_counts
+       101±0.3ms        114±0.2ms     1.13  frame_methods.FrameIsnull.time_isnull_strngs
+     9.86±0.06ms       11.1±0.8ms     1.13  binary_ops.Ops.time_frame_mult(False, 'default')
+         114±1ms          128±1ms     1.13  parser_vb.read_csv_categorical.time_convert_direct
+     6.28±0.02ms       7.07±0.2ms     1.13  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BQuarterEnd', 2)
+      37.7±0.1ms         42.3±1ms     1.12  groupby.GroupBySuite.time_any('int', 100)
+     3.82±0.05μs      4.29±0.02μs     1.12  period.Properties.time_minute
+         264±2ms          296±3ms     1.12  packers.packers_read_sas.time_read_sas7bdat
+     30.8±0.08ms         34.5±2ms     1.12  eval.Eval.time_mult('numexpr', 1)
+        2.64±0ms      2.96±0.09ms     1.12  categoricals.Categoricals2.time_rendering
+      7.40±0.1ms       8.29±0.1ms     1.12  groupby.GroupBySuite.time_cummin('float', 10000)
+     3.09±0.01ms      3.46±0.01ms     1.12  groupby.groupby_sum_multiindex.time_groupby_sum_multiindex
+       120±0.2ms        135±0.4ms     1.12  frame_ctor.FromDicts.time_frame_ctor_nested_dict_int64
+         485±2ms          543±4ms     1.12  groupby.GroupBySuite.time_describe('int', 100)
+           47.3s            52.9s     1.12  binary_ops.TimeseriesTZ.time_timestamp_ops_diff2
+         603±1μs          674±2μs     1.12  indexing.DataFrameIndexing.time_iloc_dups
+        1.02±0ms      1.14±0.04ms     1.12  categoricals.Categoricals.time_constructor_fastpath
+     3.83±0.05μs      4.27±0.07μs     1.12  period.Properties.time_hour
+     10.8±0.04ms      12.0±0.07ms     1.11  groupby.groupby_nth.time_groupby_frame_nth_any
+     21.6±0.05μs      24.1±0.06μs     1.11  indexing.MultiIndexing.time_multiindex_med_get_loc
+       350±0.8ms          390±2ms     1.11  inference.to_numeric_downcast.time_downcast('string-float', 'float')
+     4.03±0.03μs      4.47±0.05μs     1.11  period.Properties.time_year
+     2.40±0.03ms      2.66±0.01ms     1.11  frame_methods.FrameIsnull.time_isnull
+      24.1±0.1ms       26.7±0.2ms     1.11  reindex.LibFastZip.time_lib_fast_zip
+       103±0.6ms        114±0.5ms     1.10  strings.StringMethods.time_join_split_expand
+     6.29±0.01ms      6.94±0.02ms     1.10  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Day', 2)
+     6.78±0.03ms      7.47±0.06ms     1.10  frame_methods.Equals.time_frame_float_equal
+     4.97±0.02ms      5.48±0.05ms     1.10  groupby.GroupBySuite.time_cumsum('int', 10000)
-          26.8μs           24.2μs     0.90  index_object.Float64.time_slice_indexer_even
-          7.60ms           6.81ms     0.90  inference.DtypeInfer.time_float64
-        2.41±0ms      2.16±0.01ms     0.90  groupby.GroupBySuite.time_sem('float', 100)
-     9.28±0.02ms      8.27±0.05ms     0.89  groupby.groupby_datetimetz.time_groupby_sum
-         612±1ms          546±1ms     0.89  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
-        1.07±0ms          957±3μs     0.89  reindex.Duplicates.time_series_drop_dups_int
-      1.48±0.01s          1.32±0s     0.89  replace.replace_convert.time_replace_frame_timestamp
-        1.37±0ms         1.22±0ms     0.89  period.Algorithms.time_value_counts_pindex
-         486±4ms          432±2ms     0.89  groupby.Transform.time_transform_func
-      42.8±0.2μs       38.0±0.2μs     0.89  timestamp.TimestampProperties.time_is_year_end
-          40.9ms           36.1ms     0.88  gil.nogil_read_csv.time_read_csv_object
-      18.5±0.2ms       16.2±0.2ms     0.88  groupby.groupby_nth.time_groupby_series_nth_any
-      82.1±0.3μs       71.7±0.3μs     0.87  indexing.Int64Indexing.time_getitem_scalar
-      94.5±0.2ms       82.4±0.2ms     0.87  groupby.GroupBySuite.time_skew('float', 100)
-           752ms          649±1ms     0.86  inference.to_numeric_downcast.time_downcast('string-nint', 'signed')
-         380±3ms          328±1ms     0.86  inference.to_numeric_downcast.time_downcast('string-float', 'signed')
-          44.6ms           38.3ms     0.86  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_1', 1)
-      79.4±0.1ms       68.1±0.6ms     0.86  groupby.GroupBySuite.time_unique('float', 100)
-      1.54±0.01s       1.30±0.01s     0.85  replace.replace_convert.time_replace_frame_timedelta
-         959±6ms          808±5ms     0.84  replace.replace_convert.time_replace_series_timedelta
-          26.9μs           22.5μs     0.84  index_object.Float64.time_slice_indexer_basic
-         273±1μs        226±0.3μs     0.83  period.period_standard_indexing.time_series_loc
-       106±0.7μs       87.5±0.5μs     0.82  timeseries.Offsets.time_custom_bday_incr
-      17.2±0.3ms       13.7±0.3ms     0.80  binary_ops.Ops2.time_frame_int_mod
-      25.8±0.4ms       17.5±0.7ms     0.68  binary_ops.Ops.time_frame_add(True, 'default')
-      17.4±0.3ms       9.88±0.5ms     0.57  binary_ops.Ops2.time_frame_float_mod

@TomAugspurger
Contributor

The -f 1.1 limits it to only benchmarks that changed by more than 10%. The results can be noisy, depending on how much load your machine is under while running the benchmarks. Typically anything larger than 1.3 to 1.5 is significant.

I'll take a look later.
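For context, a comparison like the output above is typically produced with something along these lines (a sketch of the asv workflow; run from the pandas asv_bench directory, with branch names and the benchmark pattern purely illustrative):

```shell
# Compare benchmarks on the feature branch against upstream/master,
# reporting only results that changed by more than the 1.1x factor.
asv continuous -f 1.1 upstream/master HEAD

# Optionally restrict the run to a subset of benchmarks with a regex:
asv continuous -f 1.1 upstream/master HEAD -b ^frame_methods
```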

@alanbato
Contributor Author

Alright, ping me when you've gone through it. Thanks @TomAugspurger :)

@TomAugspurger
Contributor

Perf looked fine, though the CI failures look relevant.

@alanbato
Contributor Author

Yes, I think I'll need to change those tests because they have this faulty behavior. Any advice on that?

@jreback
Contributor

jreback commented Oct 29, 2017

This is being closed by #17971. Thanks for the effort here. There are many other issues if you'd like to take a look!

@jreback closed this on Oct 29, 2017
Labels
Error Reporting (Incorrect or improved errors from pandas), MultiIndex

Successfully merging this pull request may close these issues:

ERR: invalid MultiIndex construction

4 participants