Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise ValueError when using levels with non-unique values in MultiIndex constructor #17557

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,11 @@ def _verify_integrity(self, labels=None, levels=None):
" level (%d). NOTE: this index is in an"
" inconsistent state" % (i, label.max(),
len(level)))
for i, level in enumerate(levels):
if len(level) != len(set(level)):
raise ValueError("Level values must be unique: {0}"
" on level {1}".format([value for value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your check is not checking the values

i would be surprised that you have any failures though it is possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking if all the values inside a level are unique by comparing its length with the length of the set containing all the unique values. Should I be doing it in some other way? Maybe I misunderstood what needs to be checked.

Here are the tests that failed with the new check
========================================================================================== FAILURES ==========================================================================================
__________________________________________________________________________________ TestMultiIndex.test_is_ ___________________________________________________________________________________

self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e671a780>

    def test_is_(self):
        mi = MultiIndex.from_tuples(lzip(range(10), range(10)))
        assert mi.is_(mi)
        assert mi.is_(mi.view())
        assert mi.is_(mi.view().view().view().view())
        mi2 = mi.view()
        # names are metadata, they don't change id
        mi2.names = ["A", "B"]
        assert mi2.is_(mi)
        assert mi.is_(mi2)
    
        assert mi.is_(mi.set_names(["C", "D"]))
        mi2 = mi.view()
        mi2.set_names(["E", "F"], inplace=True)
        assert mi.is_(mi2)
        # levels are inherent properties, they change identity
        mi3 = mi2.set_levels([lrange(10), lrange(10)])
        assert not mi3.is_(mi2)
        # shouldn't change
        assert mi2.is_(mi)
        mi4 = mi3.view()
>       mi4.set_levels([[1 for _ in range(10)], lrange(10)], inplace=True)

pandas/tests/indexes/test_multi.py:1584: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/indexes/multi.py:254: in set_levels
    verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
    self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=['E', 'F'])
labels = FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]), levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

    def _verify_integrity(self, labels=None, levels=None):
        """
    
            Parameters
            ----------
            labels : optional list
                Labels to check for validity. Defaults to current labels.
            levels : optional list
                Levels to check for validity. Defaults to current levels.
    
            Raises
            ------
            ValueError
                * if length of levels and labels don't match or any label would
                exceed level bounds
            """
        # NOTE: Currently does not check, among other things, that cached
        # nlevels matches nor that sortorder matches actually sortorder.
        labels = labels or self.labels
        levels = levels or self.levels
    
        if len(levels) != len(labels):
            raise ValueError("Length of levels and labels must match. NOTE:"
                             " this index is in an inconsistent state.")
        label_length = len(self.labels[0])
        for i, (level, label) in enumerate(zip(levels, labels)):
            if len(label) != label_length:
                raise ValueError("Unequal label lengths: %s" %
                                 ([len(lab) for lab in labels]))
            if len(label) and label.max() >= len(level):
                raise ValueError("On level %d, label max (%d) >= length of"
                                 " level  (%d). NOTE: this index is in an"
                                 " inconsistent state" % (i, label.max(),
                                                          len(level)))
        for i, level in enumerate(levels):
            if len(level) != len(set(level)):
                raise ValueError("Level values must be unique: {0}"
                                 " on level {1}".format([value for value
>                                                        in level], i))
E               ValueError: Level values must be unique: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] on level 0

pandas/core/indexes/multi.py:154: ValueError
____________________________________________________________________ TestMultiIndex.test_level_setting_resets_attributes _____________________________________________________________________

self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e69834e0>

    def test_level_setting_resets_attributes(self):
        ind = MultiIndex.from_arrays([
            ['A', 'A', 'B', 'B', 'B'], [1, 2, 1, 2, 3]
        ])
        assert ind.is_monotonic
        ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]],
>                      inplace=True)

pandas/tests/indexes/test_multi.py:2387: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/indexes/multi.py:254: in set_levels
    verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
    self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MultiIndex(levels=[['A', 'B'], [1, 2, 3]],
           labels=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]]), labels = FrozenList([[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]])
levels = FrozenList([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]])

    def _verify_integrity(self, labels=None, levels=None):
        """
    
            Parameters
            ----------
            labels : optional list
                Labels to check for validity. Defaults to current labels.
            levels : optional list
                Levels to check for validity. Defaults to current levels.
    
            Raises
            ------
            ValueError
                * if length of levels and labels don't match or any label would
                exceed level bounds
            """
        # NOTE: Currently does not check, among other things, that cached
        # nlevels matches nor that sortorder matches actually sortorder.
        labels = labels or self.labels
        levels = levels or self.levels
    
        if len(levels) != len(labels):
            raise ValueError("Length of levels and labels must match. NOTE:"
                             " this index is in an inconsistent state.")
        label_length = len(self.labels[0])
        for i, (level, label) in enumerate(zip(levels, labels)):
            if len(label) != label_length:
                raise ValueError("Unequal label lengths: %s" %
                                 ([len(lab) for lab in labels]))
            if len(label) and label.max() >= len(level):
                raise ValueError("On level %d, label max (%d) >= length of"
                                 " level  (%d). NOTE: this index is in an"
                                 " inconsistent state" % (i, label.max(),
                                                          len(level)))
        for i, level in enumerate(levels):
            if len(level) != len(set(level)):
                raise ValueError("Level values must be unique: {0}"
                                 " on level {1}".format([value for value
>                                                        in level], i))
E               ValueError: Level values must be unique: ['A', 'B', 'A', 'A', 'B'] on level 0

pandas/core/indexes/multi.py:154: ValueError
================================================================ 2 failed, 189 passed, 2 skipped, 1 xfailed in 18.21 seconds =================================================================

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very first example is wrong

In [17]: mi
Out[17]: 
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=['E', 'F'])

In [18]: mi.levels[0]
Out[18]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='E')

In [19]: mi.levels[1]
Out[19]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='F')

In [20]: [set(level) for i, level in enumerate(mi.levels)]
Out[20]: [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
In [22]: list(map(lambda x: x.is_unique, mi.levels))
Out[22]: [True, True]

you can prob do something like this (or rather iterate to show exactly where the error is)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the example from the issue

In [31]: idx0 = range(2)
    ...: idx1 = np.repeat(range(2), 2)
    ...: 
    ...: midx = pd.MultiIndex(
    ...:     levels=[idx0, idx1],
    ...:     labels=[
    ...:         np.repeat(range(len(idx0)), len(idx1)),
    ...:         np.tile(range(len(idx1)), len(idx0))
    ...:     ],
    ...:     names=['idx0', 'idx1']
    ...: )
    ...: 

In [32]: list(map(lambda x: x.is_unique, midx.levels))
Out[32]: [True, False]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the first example throwing out an error because it's replacing those levels with these?
levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

And the is_unique method is indeed a more clear way to do it, thanks!

Also, I'm having troubles running the performance checks with asv 😞 Some weird I/O shutil error while trying to do pip wheel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I don't know if the conversation got lost due to the changes on that piece of code, so I'm pinging in case you thought I didn't reply. If you're just busy, sorry to bother you!

@TomAugspurger if you have time could you look at it? :)

Thanks, both of you!

in level], i))

def _get_levels(self):
return self._levels
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/indexes/test_multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -588,6 +588,14 @@ def test_constructor_mismatched_label_levels(self):
with tm.assert_raises_regex(ValueError, label_error):
self.index.copy().labels = [[0, 0, 0, 0], [0, 0]]

def test_constructor_non_unique_level_values(self):
# GH #17464
with tm.assert_raises_regex(ValueError, '^Level values'):
MultiIndex(levels=[[0, 1], [0, 0, 1, 1]],
labels=[[0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0, 1, 1]],
names=[u'idx0', u'idx1'])

def assert_multiindex_copied(self, copy, original):
# Levels should be (at least, shallow copied)
tm.assert_copy(copy.levels, original.levels)
Expand Down