-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise ValueError when using levels with non-unique values in MultiIndex constructor #17557
Conversation
pandas/core/indexes/multi.py
Outdated
for i, level in enumerate(levels): | ||
if len(level) != len(set(level)): | ||
raise ValueError("Level values must be unique: %s " | ||
"on level %d" % ([value for value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use .format
syntax instead of %
? I realize that other places within this file use %
formatting, but there's an ongoing effort to transition all %
formatting in the pandas codebase to .format
, so might as well minimize the number of changes that will need to be made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course! I do prefer .format
but decided to stick with %
because of the other tests. Thank you for telling me, I'll keep it in mind in future contributions :)
pandas/core/indexes/multi.py
Outdated
for i, level in enumerate(levels): | ||
if len(level) != len(set(level)): | ||
raise ValueError("Level values must be unique: {0}" | ||
" on level {1}".format([value for value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your check is not checking the values
i would be surprised that you have any failures though it is possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm checking if all the values inside a level are unique by comparing its length with the length of the set containing all the unique values. Should I be doing it in some other way? Maybe I misunderstood what needs to be checked.
Here are the tests that failed with the new check
========================================================================================== FAILURES ==========================================================================================
__________________________________________________________________________________ TestMultiIndex.test_is_ ___________________________________________________________________________________
self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e671a780>
def test_is_(self):
mi = MultiIndex.from_tuples(lzip(range(10), range(10)))
assert mi.is_(mi)
assert mi.is_(mi.view())
assert mi.is_(mi.view().view().view().view())
mi2 = mi.view()
# names are metadata, they don't change id
mi2.names = ["A", "B"]
assert mi2.is_(mi)
assert mi.is_(mi2)
assert mi.is_(mi.set_names(["C", "D"]))
mi2 = mi.view()
mi2.set_names(["E", "F"], inplace=True)
assert mi.is_(mi2)
# levels are inherent properties, they change identity
mi3 = mi2.set_levels([lrange(10), lrange(10)])
assert not mi3.is_(mi2)
# shouldn't change
assert mi2.is_(mi)
mi4 = mi3.view()
> mi4.set_levels([[1 for _ in range(10)], lrange(10)], inplace=True)
pandas/tests/indexes/test_multi.py:1584:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/indexes/multi.py:254: in set_levels
verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
names=['E', 'F'])
labels = FrozenList([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]), levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
def _verify_integrity(self, labels=None, levels=None):
"""
Parameters
----------
labels : optional list
Labels to check for validity. Defaults to current labels.
levels : optional list
Levels to check for validity. Defaults to current levels.
Raises
------
ValueError
* if length of levels and labels don't match or any label would
exceed level bounds
"""
# NOTE: Currently does not check, among other things, that cached
# nlevels matches nor that sortorder matches actually sortorder.
labels = labels or self.labels
levels = levels or self.levels
if len(levels) != len(labels):
raise ValueError("Length of levels and labels must match. NOTE:"
" this index is in an inconsistent state.")
label_length = len(self.labels[0])
for i, (level, label) in enumerate(zip(levels, labels)):
if len(label) != label_length:
raise ValueError("Unequal label lengths: %s" %
([len(lab) for lab in labels]))
if len(label) and label.max() >= len(level):
raise ValueError("On level %d, label max (%d) >= length of"
" level (%d). NOTE: this index is in an"
" inconsistent state" % (i, label.max(),
len(level)))
for i, level in enumerate(levels):
if len(level) != len(set(level)):
raise ValueError("Level values must be unique: {0}"
" on level {1}".format([value for value
> in level], i))
E ValueError: Level values must be unique: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] on level 0
pandas/core/indexes/multi.py:154: ValueError
____________________________________________________________________ TestMultiIndex.test_level_setting_resets_attributes _____________________________________________________________________
self = <pandas.tests.indexes.test_multi.TestMultiIndex object at 0x7fb0e69834e0>
def test_level_setting_resets_attributes(self):
ind = MultiIndex.from_arrays([
['A', 'A', 'B', 'B', 'B'], [1, 2, 1, 2, 3]
])
assert ind.is_monotonic
ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]],
> inplace=True)
pandas/tests/indexes/test_multi.py:2387:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/indexes/multi.py:254: in set_levels
verify_integrity=verify_integrity)
pandas/core/indexes/multi.py:183: in _set_levels
self._verify_integrity(levels=new_levels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = MultiIndex(levels=[['A', 'B'], [1, 2, 3]],
labels=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]]), labels = FrozenList([[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]])
levels = FrozenList([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]])
def _verify_integrity(self, labels=None, levels=None):
"""
Parameters
----------
labels : optional list
Labels to check for validity. Defaults to current labels.
levels : optional list
Levels to check for validity. Defaults to current levels.
Raises
------
ValueError
* if length of levels and labels don't match or any label would
exceed level bounds
"""
# NOTE: Currently does not check, among other things, that cached
# nlevels matches nor that sortorder matches actually sortorder.
labels = labels or self.labels
levels = levels or self.levels
if len(levels) != len(labels):
raise ValueError("Length of levels and labels must match. NOTE:"
" this index is in an inconsistent state.")
label_length = len(self.labels[0])
for i, (level, label) in enumerate(zip(levels, labels)):
if len(label) != label_length:
raise ValueError("Unequal label lengths: %s" %
([len(lab) for lab in labels]))
if len(label) and label.max() >= len(level):
raise ValueError("On level %d, label max (%d) >= length of"
" level (%d). NOTE: this index is in an"
" inconsistent state" % (i, label.max(),
len(level)))
for i, level in enumerate(levels):
if len(level) != len(set(level)):
raise ValueError("Level values must be unique: {0}"
" on level {1}".format([value for value
> in level], i))
E ValueError: Level values must be unique: ['A', 'B', 'A', 'A', 'B'] on level 0
pandas/core/indexes/multi.py:154: ValueError
================================================================ 2 failed, 189 passed, 2 skipped, 1 xfailed in 18.21 seconds =================================================================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The very first example is wrong
In [17]: mi
Out[17]:
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
names=['E', 'F'])
In [18]: mi.levels[0]
Out[18]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='E')
In [19]: mi.levels[1]
Out[19]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='F')
In [20]: [set(level) for i, level in enumerate(mi.levels)]
Out[20]: [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
In [22]: list(map(lambda x: x.is_unique, mi.levels))
Out[22]: [True, True]
you can prob do something like this (or rather iterate to show exactly where the error is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the example from the issue
In [31]: idx0 = range(2)
...: idx1 = np.repeat(range(2), 2)
...:
...: midx = pd.MultiIndex(
...: levels=[idx0, idx1],
...: labels=[
...: np.repeat(range(len(idx0)), len(idx1)),
...: np.tile(range(len(idx1)), len(idx0))
...: ],
...: names=['idx0', 'idx1']
...: )
...:
In [32]: list(map(lambda x: x.is_unique, midx.levels))
Out[32]: [True, False]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the first example throwing out an error because it's replacing those levels with these?
levels = FrozenList([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
And the is_unique
method is indeed a more clear way to do it, thanks!
Also, I'm having troubles running the performance checks with asv 😞 Some weird I/O shutil error while trying to do pip wheel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I don't know if the conversation got lost due to the changes on that piece of code, so I'm pinging in case you thought I didn't reply. If you're just busy, sorry to bother you!
@TomAugspurger if you have time could you look at it? :)
Thanks, both of you!
needs performance checking whatsnew is in the other api changes section |
from #17557 (comment)
For the asv, are you using conda? That's typically easiest. |
@TomAugspurger I ran the command again and I got it to work, but I'm seeing really weird results in the performance test, everything is either above or below 10% . Are these normal?
|
The I'll take a look later. |
Alright, ping me when you've gone through it. Thanks @TomAugspurger :) |
Perf looked fine, though the CI failures look relevant. |
Yes, I think I'll need to change those tests because they they have this faulty behavior. Any advice on that? |
this is being closed by #17971. thanks for the effort here. many other issues if you'd like to take a look! |
git diff upstream/master -u -- "*.py" | flake8 --diff
verify_integrity
now also checks if any level has non-unique values and raisesValueError
if one does.However, some tests were broken due to this new behaviour.
I'd like to know what should I do in this case, should I add
verify_integrity=False
to those tests, change them, or do something else?Also, how should I state this in the whatsnew file and where?
Thank you for your time! 🐼 🐍