-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix no raise dup index when using drop with axis=0 #19230
Conversation
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -453,6 +453,7 @@ Reshaping | |||
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`) | |||
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`) | |||
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`) | |||
- Bug in :func:`DataFrame.drop`, `ValueError` now raises when dropping an `Index` that has duplicates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`ValueError`
-->``ValueError``
(double backticks)`Index`
-->``Index``
(double backticks)- Add the issue number:
(:issue:`19186`)
pandas/core/generic.py
Outdated
@@ -2909,6 +2909,9 @@ def _drop_axis(self, labels, axis, level=None, errors='raise'): | |||
else: | |||
indexer = ~axis.isin(labels) | |||
|
|||
if indexer.all() and errors == 'raise': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I might have been a little too quick to say switch to .all()
. It looks like builtin all
will be faster for sizable arrays that have a False
early on. Leave it as is for now though, unless someone else says to switch it, since .all()
should have a lower bound for the worst case scenario.
Switching the errors == 'raise'
check to be first should be more efficient, since that's a cheaper check, and will cause the and
to terminate before checking indexer
in cases where this isn't applicable:
In [2]: a = np.repeat([True], 10**5)
In [3]: errors = 'other'
In [4]: %timeit errors == 'raise' and a.all()
58.7 ns ± 10 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit a.all() and errors == 'raise'
6.82 µs ± 856 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll leave it as is for now and make the other change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you're right about .all()
when a value is False
early on:
In[9]: arr = np.repeat([True], 1000000)
In[10]: arr[15] = False
In[11]: %timeit all(arr)
1.93 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In[12]: %timeit arr.all()
244 µs ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And then when every value is True
:
In[13]: arr = np.repeat([True], 1000000)
In[14]: %timeit all(arr)
33.8 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[15]: %timeit arr.all()
131 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But like you said worst case is much better with .all()
ebd2a72
to
b49f78d
Compare
c365e0c
to
43fe5b0
Compare
@@ -257,3 +257,21 @@ def test_insert_column_bug_4032(self): | |||
expected = DataFrame([[1.3, 1, 1.1], [2.3, 2, 2.2]], | |||
columns=['c', 'a', 'b']) | |||
assert_frame_equal(result, expected) | |||
|
|||
data = [[1, 2, 3], [1, 2, 3]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to test_axis_select_reindex where the other drop tests are
pandas/core/generic.py
Outdated
@@ -2909,6 +2909,9 @@ def _drop_axis(self, labels, axis, level=None, errors='raise'): | |||
else: | |||
indexer = ~axis.isin(labels) | |||
|
|||
if errors == 'raise' and indexer.all(): | |||
raise ValueError('{} not found in axis'.format(labels)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be a KeyError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should line 3770 in pandas.core.indexes.base also be KeyError
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u show a user facing example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In[3]: pd.DataFrame(index=['a', 'b']).drop('c', axis=1)
ValueError: labels ['c'] not contained in axis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u change that,
what breaks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It breaks 15 tests that are expecting that scenario to be a ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u show some more on this eg which tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
============================================================= FAILURES =============================================================
_______________________________________________________ TestPanel.test_drop ________________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/test_panel.py:2305: in test_drop
pytest.raises(ValueError, panel.drop, 'Three')
pandas/core/generic.py:2863: in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['Three'] not contained in axis"
____________________________________________ TestDataFrameSelectReindex.test_drop_names ____________________________________________
[gw1] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/frame/test_axis_select_reindex.py:44: in test_drop_names
pytest.raises(ValueError, df.drop, ['g'])
pandas/core/generic.py:2863: in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['g'] not contained in axis"
_______________________________________________ TestDataFrameSelectReindex.test_drop _______________________________________________
[gw0] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/frame/test_axis_select_reindex.py:90: in test_drop
pytest.raises(ValueError, simple.drop, 5)
pandas/core/generic.py:2863: in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: 'labels [5] not contained in axis'
_______________________________________________________ TestIndex.test_drop ________________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1399: in test_drop
pytest.raises(ValueError, self.strIndex.drop, ['foo', 'bar'])
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['foo' 'bar'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values0] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values1] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values2] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values0] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels [('c', 'd')] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values1] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels [('c', 'd')] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values2] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: "labels [('c', 'd')] not contained in axis"
_______________________________________________ TestPivotTable.test_pivot_no_values ________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:214: in test_pivot_no_values
res = df.pivot_table(index=df.index.month, columns=df.index.day)
pandas/core/frame.py:4513: in pivot_table
margins_name=margins_name)
pandas/core/reshape/pivot.py:77: in pivot_table
values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: 'labels [1 2 1 1 1] not contained in axis'
____________________________________________________ TestPivotTable.test_daily _____________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:916: in test_daily
columns=ts.index.dayofyear)
pandas/core/reshape/pivot.py:77: in pivot_table
values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: 'labels [2000 2000 2000 ..., 2004 2004 2004] not contained in axis'
___________________________________________________ TestPivotTable.test_monthly ____________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:934: in test_monthly
columns=ts.index.month)
pandas/core/reshape/pivot.py:77: in pivot_table
values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
labels[mask])
E KeyError: 'labels [2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001\n 2001 2001 2001 2001 2001 2001 2001 2001 2001 2002 2002 2002 2002 2002 2002\n 2002 2002 2002 2002 2002 2002 2003 2003 2003 2003 2003 2003 2003 2003 2003\n 2003 2003 2003 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004] not contained in axis'
___________________________________________________ TestSeriesIndexing.test_drop ___________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/series/test_indexing.py:1841: in test_drop
pytest.raises(ValueError, s.drop, 'bc')
pandas/core/generic.py:2863: in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
'labels %s not contained in axis' % labels[mask])
E KeyError: "labels ['bc'] not contained in axis"
Wouldn't be a ton of effort to just update the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah let me have a closer look
we could prob just change this to be a KeyError which is more consistent with other indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -453,6 +453,7 @@ Reshaping | |||
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`) | |||
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`) | |||
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`) | |||
- Bug in :func:`DataFrame.drop`, ``ValueError`` now raises when dropping an ``Index`` that has duplicates (:issue:`19186`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reword (e.g. the Index now corectly raises KeyError
if it has duplicates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to Bug Fixes, indexing section
|
||
data = [[1, 2, 3], [1, 2, 3]] | ||
|
||
@pytest.mark.parametrize('actual', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to test for axis=1 case as well (see the original issue)
Codecov Report
@@ Coverage Diff @@
## master #19230 +/- ##
==========================================
+ Coverage 91.56% 91.56% +<.01%
==========================================
Files 148 148
Lines 48856 48858 +2
==========================================
+ Hits 44733 44735 +2
Misses 4123 4123
Continue to review full report at Codecov.
|
@jreback I updated with proposed changes |
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -408,6 +408,8 @@ Indexing | |||
- Bug in :func:`MultiIndex.set_labels` which would cause casting (and potentially clipping) of the new labels if the ``level`` argument is not 0 or a list like [0, 1, ... ] (:issue:`19057`) | |||
- Bug in ``str.extractall`` when there were no matches empty :class:`Index` was returned instead of appropriate :class:`MultiIndex` (:issue:`19034`) | |||
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`) | |||
- Bug in :func:`DataFrame.drop`, ``KeyError`` now raises when dropping an ``Index`` or column that has duplicates (:issue:`19186`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add another entry in API breaking saysing that .drop()
for Series,Index,DataFrame,Panel .drop
now raises KeyError
rather than ValueError
if labels are missing (use this PR number)
@@ -3767,8 +3767,8 @@ def drop(self, labels, errors='raise'): | |||
mask = indexer == -1 | |||
if mask.any(): | |||
if errors != 'ignore': | |||
raise ValueError('labels %s not contained in axis' % | |||
labels[mask]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update the doc-strings for all (Index,Series,DataFrame,Panel) for drop to change ValueError -> KeyError in the Raises section (or add it if its not there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. @TomAugspurger if you'd have a look.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -272,6 +272,7 @@ Other API Changes | |||
- :class:`IntervalIndex` and ``IntervalDtype`` no longer support categorical, object, and string subtypes (:issue:`19016`) | |||
- The default ``Timedelta`` constructor now accepts an ``ISO 8601 Duration`` string as an argument (:issue:`19040`) | |||
- ``IntervalDtype`` now returns ``True`` when compared against ``'interval'`` regardless of subtype, and ``IntervalDtype.name`` now returns ``'interval'`` regardless of subtype (:issue:`18980`) | |||
- ``KeyError`` now raises instead of ``ValueError`` when using :meth:`drop()` to remove a non-existent element in an axis of ``Series``, ``Index``, ``DataFrame`` and ``Panel`` (:issue:`19186`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you re-word, start with :meth:`drop()`
(side issue does this render?) you may hve to say :meth:`DataFrame.drop`
(and prob list for all of them Index, Series, Panel).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think that'll render / link. I usually do
:meth:`~DataFrame.drop`
When I want to refer to a generic .drop
method, which renders as .drop()
, but links to DataFrame.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -415,6 +416,8 @@ Indexing | |||
- Bug in :func:`MultiIndex.set_labels` which would cause casting (and potentially clipping) of the new labels if the ``level`` argument is not 0 or a list like [0, 1, ... ] (:issue:`19057`) | |||
- Bug in ``str.extractall`` when there were no matches empty :class:`Index` was returned instead of appropriate :class:`MultiIndex` (:issue:`19034`) | |||
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`) | |||
- Bug in :func:`Index.drop()`, where no ``Exception`` is raised when dropping a non-existent element from an axis in ``Index`` (:issue:`19186`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say that this will now raise a KeyError
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -272,6 +272,7 @@ Other API Changes | |||
- :class:`IntervalIndex` and ``IntervalDtype`` no longer support categorical, object, and string subtypes (:issue:`19016`) | |||
- The default ``Timedelta`` constructor now accepts an ``ISO 8601 Duration`` string as an argument (:issue:`19040`) | |||
- ``IntervalDtype`` now returns ``True`` when compared against ``'interval'`` regardless of subtype, and ``IntervalDtype.name`` now returns ``'interval'`` regardless of subtype (:issue:`18980`) | |||
- ``KeyError`` now raises instead of ``ValueError`` when using :meth:`drop()` to remove a non-existent element in an axis of ``Series``, ``Index``, ``DataFrame`` and ``Panel`` (:issue:`19186`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think that'll render / link. I usually do
:meth:`~DataFrame.drop`
When I want to refer to a generic .drop
method, which renders as .drop()
, but links to DataFrame.
pandas/core/generic.py
Outdated
Raises | ||
------ | ||
KeyError | ||
* If labels are not found in the selected axis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think you need the *
here. And maybe clarify that it raises if none of the labels are found.
pandas/core/indexes/base.py
Outdated
Raises | ||
------ | ||
KeyError | ||
* If labels are not found in the selected axis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing
# issue 19186 | ||
level = 0 if isinstance(actual.index, MultiIndex) else None | ||
with pytest.raises(KeyError): | ||
actual.drop('c', level=level, axis=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split this into multiple with
statements, one per exception.
bc369f3
to
e31b4d2
Compare
e31b4d2
to
5776bf3
Compare
thanks @aschade nice patch! keep em coming! |
Will do :)
…On Wed, Jan 17, 2018 at 7:53 PM Jeff Reback ***@***.***> wrote:
thanks @aschade <https://github.com/aschade> nice patch! keep em coming!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19230 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ADMMOIe17borX0hdSiipVQyoTu0oW-eCks5tLpYMgaJpZM4RdWOd>
.
|
fix for the stricter error introduced by pandas-dev/pandas#19230
git diff upstream/master -u -- "*.py" | flake8 --diff