Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix no raise dup index when using drop with axis=0 #19230

Merged
merged 14 commits into from
Jan 18, 2018

Conversation

aschade
Copy link
Contributor

@aschade aschade commented Jan 13, 2018

@aschade aschade closed this Jan 13, 2018
@aschade aschade deleted the fix-no-raise-dup-cols branch January 13, 2018 20:04
@aschade aschade restored the fix-no-raise-dup-cols branch January 13, 2018 20:04
@aschade aschade reopened this Jan 13, 2018
@@ -453,6 +453,7 @@ Reshaping
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in :func:`DataFrame.drop`, `ValueError` now raises when dropping an `Index` that has duplicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • `ValueError` --> ``ValueError`` (double backticks)
  • `Index` --> ``Index`` (double backticks)
  • Add the issue number: (:issue:`19186`)

@@ -2909,6 +2909,9 @@ def _drop_axis(self, labels, axis, level=None, errors='raise'):
else:
indexer = ~axis.isin(labels)

if indexer.all() and errors == 'raise':
Copy link
Member

@jschendel jschendel Jan 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I might have been a little too quick to say switch to .all(). It looks like builtin all will be faster for sizable arrays that have a False early on. Leave it as is for now though, unless someone else says to switch it, since .all() should have a lower bound for the worst case scenario.

Switching the errors == 'raise' check to be first should be more efficient, since that's a cheaper check, and will cause the and to terminate before checking indexer in cases where this isn't applicable:

In [2]: a = np.repeat([True], 10**5)

In [3]: errors = 'other'

In [4]: %timeit errors == 'raise' and a.all()
58.7 ns ± 10 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit a.all() and errors == 'raise'
6.82 µs ± 856 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll leave it as is for now and make the other change

Copy link
Contributor Author

@aschade aschade Jan 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're right about .all() when a value is False early on:

In[9]: arr = np.repeat([True], 1000000)
In[10]: arr[15] = False
In[11]: %timeit all(arr)
1.93 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In[12]: %timeit arr.all()
244 µs ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And then when every value is True:

In[13]: arr = np.repeat([True], 1000000)
In[14]: %timeit all(arr)
33.8 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[15]: %timeit arr.all()
131 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But like you said worst case is much better with .all()

@aschade aschade force-pushed the fix-no-raise-dup-cols branch from ebd2a72 to b49f78d Compare January 13, 2018 21:03
@aschade aschade force-pushed the fix-no-raise-dup-cols branch from c365e0c to 43fe5b0 Compare January 13, 2018 21:12
@aschade aschade changed the title Fix no raise dup cols Fix no raise dup index when using drop with axis=0 Jan 13, 2018
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Error Reporting Incorrect or improved errors from pandas labels Jan 13, 2018
@@ -257,3 +257,21 @@ def test_insert_column_bug_4032(self):
expected = DataFrame([[1.3, 1, 1.1], [2.3, 2, 2.2]],
columns=['c', 'a', 'b'])
assert_frame_equal(result, expected)

data = [[1, 2, 3], [1, 2, 3]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to test_axis_select_reindex where the other drop tests are

@@ -2909,6 +2909,9 @@ def _drop_axis(self, labels, axis, level=None, errors='raise'):
else:
indexer = ~axis.isin(labels)

if errors == 'raise' and indexer.all():
raise ValueError('{} not found in axis'.format(labels))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a KeyError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should line 3770 in pandas.core.indexes.base also be KeyError then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u show a user facing example

Copy link
Contributor Author

@aschade aschade Jan 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In[3]: pd.DataFrame(index=['a', 'b']).drop('c', axis=1)
ValueError: labels ['c'] not contained in axis

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u change that,
what breaks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It breaks 15 tests that are expecting that scenario to be a ValueError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u show some more on this eg which tests

Copy link
Contributor Author

@aschade aschade Jan 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

============================================================= FAILURES =============================================================
_______________________________________________________ TestPanel.test_drop ________________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/test_panel.py:2305: in test_drop
    pytest.raises(ValueError, panel.drop, 'Three')
pandas/core/generic.py:2863: in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['Three'] not contained in axis"
____________________________________________ TestDataFrameSelectReindex.test_drop_names ____________________________________________
[gw1] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/frame/test_axis_select_reindex.py:44: in test_drop_names
    pytest.raises(ValueError, df.drop, ['g'])
pandas/core/generic.py:2863: in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['g'] not contained in axis"
_______________________________________________ TestDataFrameSelectReindex.test_drop _______________________________________________
[gw0] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/frame/test_axis_select_reindex.py:90: in test_drop
    pytest.raises(ValueError, simple.drop, 5)
pandas/core/generic.py:2863: in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: 'labels [5] not contained in axis'
_______________________________________________________ TestIndex.test_drop ________________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1399: in test_drop
    pytest.raises(ValueError, self.strIndex.drop, ['foo', 'bar'])
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['foo' 'bar'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values0] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values1] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop0-values2] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels ['a'] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values0] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels [('c', 'd')] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values1] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels [('c', 'd')] not contained in axis"
___________________________________________ TestIndex.test_drop_tuple[to_drop1-values2] ____________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/indexes/test_base.py:1451: in test_drop_tuple
    pytest.raises(ValueError, removed.drop, drop_me)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: "labels [('c', 'd')] not contained in axis"
_______________________________________________ TestPivotTable.test_pivot_no_values ________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:214: in test_pivot_no_values
    res = df.pivot_table(index=df.index.month, columns=df.index.day)
pandas/core/frame.py:4513: in pivot_table
    margins_name=margins_name)
pandas/core/reshape/pivot.py:77: in pivot_table
    values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: 'labels [1 2 1 1 1] not contained in axis'
____________________________________________________ TestPivotTable.test_daily _____________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:916: in test_daily
    columns=ts.index.dayofyear)
pandas/core/reshape/pivot.py:77: in pivot_table
    values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: 'labels [2000 2000 2000 ..., 2004 2004 2004] not contained in axis'
___________________________________________________ TestPivotTable.test_monthly ____________________________________________________
[gw3] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/reshape/test_pivot.py:934: in test_monthly
    columns=ts.index.month)
pandas/core/reshape/pivot.py:77: in pivot_table
    values = values.drop(key)
pandas/core/indexes/base.py:3771: in drop
    labels[mask])
E   KeyError: 'labels [2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001\n 2001 2001 2001 2001 2001 2001 2001 2001 2001 2002 2002 2002 2002 2002 2002\n 2002 2002 2002 2002 2002 2002 2003 2003 2003 2003 2003 2003 2003 2003 2003\n 2003 2003 2003 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004] not contained in axis'
___________________________________________________ TestSeriesIndexing.test_drop ___________________________________________________
[gw2] darwin -- Python 3.5.4 /Users/Alex/miniconda3/envs/pandas/bin/python
pandas/tests/series/test_indexing.py:1841: in test_drop
    pytest.raises(ValueError, s.drop, 'bc')
pandas/core/generic.py:2863: in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
pandas/core/generic.py:2895: in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
pandas/core/indexes/base.py:3771: in drop
    'labels %s not contained in axis' % labels[mask])
E   KeyError: "labels ['bc'] not contained in axis"

Wouldn't be a ton of effort to just update the tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let me have a closer look
we could prob just change this to be a KeyError which is more consistent with other indexing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@@ -453,6 +453,7 @@ Reshaping
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in :func:`DataFrame.drop`, ``ValueError`` now raises when dropping an ``Index`` that has duplicates (:issue:`19186`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reword (e.g. the Index now corectly raises KeyError if it has duplicates)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to Bug Fixes, indexing section


data = [[1, 2, 3], [1, 2, 3]]

@pytest.mark.parametrize('actual', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to test for axis=1 case as well (see the original issue)

@codecov
Copy link

codecov bot commented Jan 13, 2018

Codecov Report

Merging #19230 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19230      +/-   ##
==========================================
+ Coverage   91.56%   91.56%   +<.01%     
==========================================
  Files         148      148              
  Lines       48856    48858       +2     
==========================================
+ Hits        44733    44735       +2     
  Misses       4123     4123
Flag Coverage Δ
#multiple 89.93% <100%> (ø) ⬆️
#single 41.69% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.91% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.46% <100%> (ø) ⬆️
pandas/core/reshape/pivot.py 96.35% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c271d4d...5776bf3. Read the comment docs.

@aschade
Copy link
Contributor Author

aschade commented Jan 14, 2018

@jreback I updated with proposed changes

@@ -408,6 +408,8 @@ Indexing
- Bug in :func:`MultiIndex.set_labels` which would cause casting (and potentially clipping) of the new labels if the ``level`` argument is not 0 or a list like [0, 1, ... ] (:issue:`19057`)
- Bug in ``str.extractall`` when there were no matches empty :class:`Index` was returned instead of appropriate :class:`MultiIndex` (:issue:`19034`)
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`)
- Bug in :func:`DataFrame.drop`, ``KeyError`` now raises when dropping an ``Index`` or column that has duplicates (:issue:`19186`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add another entry in API breaking saysing that .drop() for Series,Index,DataFrame,Panel .drop now raises KeyError rather than ValueError if labels are missing (use this PR number)

@@ -3767,8 +3767,8 @@ def drop(self, labels, errors='raise'):
mask = indexer == -1
if mask.any():
if errors != 'ignore':
raise ValueError('labels %s not contained in axis' %
labels[mask])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update the doc-strings for all (Index,Series,DataFrame,Panel) for drop to change ValueError -> KeyError in the Raises section (or add it if its not there)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. @TomAugspurger if you'd have a look.

@@ -272,6 +272,7 @@ Other API Changes
- :class:`IntervalIndex` and ``IntervalDtype`` no longer support categorical, object, and string subtypes (:issue:`19016`)
- The default ``Timedelta`` constructor now accepts an ``ISO 8601 Duration`` string as an argument (:issue:`19040`)
- ``IntervalDtype`` now returns ``True`` when compared against ``'interval'`` regardless of subtype, and ``IntervalDtype.name`` now returns ``'interval'`` regardless of subtype (:issue:`18980`)
- ``KeyError`` now raises instead of ``ValueError`` when using :meth:`drop()` to remove a non-existent element in an axis of ``Series``, ``Index``, ``DataFrame`` and ``Panel`` (:issue:`19186`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you re-word, start with :meth:`drop()` (side issue does this render?) you may hve to say :meth:`DataFrame.drop` (and prob list for all of them Index, Series, Panel).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that'll render / link. I usually do

:meth:`~DataFrame.drop`

When I want to refer to a generic .drop method, which renders as .drop(), but links to DataFrame.

@@ -415,6 +416,8 @@ Indexing
- Bug in :func:`MultiIndex.set_labels` which would cause casting (and potentially clipping) of the new labels if the ``level`` argument is not 0 or a list like [0, 1, ... ] (:issue:`19057`)
- Bug in ``str.extractall`` when there were no matches empty :class:`Index` was returned instead of appropriate :class:`MultiIndex` (:issue:`19034`)
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`)
- Bug in :func:`Index.drop()`, where no ``Exception`` is raised when dropping a non-existent element from an axis in ``Index`` (:issue:`19186`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say that this will now raise a KeyError

@jreback jreback added this to the 0.23.0 milestone Jan 16, 2018
@@ -272,6 +272,7 @@ Other API Changes
- :class:`IntervalIndex` and ``IntervalDtype`` no longer support categorical, object, and string subtypes (:issue:`19016`)
- The default ``Timedelta`` constructor now accepts an ``ISO 8601 Duration`` string as an argument (:issue:`19040`)
- ``IntervalDtype`` now returns ``True`` when compared against ``'interval'`` regardless of subtype, and ``IntervalDtype.name`` now returns ``'interval'`` regardless of subtype (:issue:`18980`)
- ``KeyError`` now raises instead of ``ValueError`` when using :meth:`drop()` to remove a non-existent element in an axis of ``Series``, ``Index``, ``DataFrame`` and ``Panel`` (:issue:`19186`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that'll render / link. I usually do

:meth:`~DataFrame.drop`

When I want to refer to a generic .drop method, which renders as .drop(), but links to DataFrame.

Raises
------
KeyError
* If labels are not found in the selected axis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you need the * here. And maybe clarify that it raises if none of the labels are found.

Raises
------
KeyError
* If labels are not found in the selected axis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing

# issue 19186
level = 0 if isinstance(actual.index, MultiIndex) else None
with pytest.raises(KeyError):
actual.drop('c', level=level, axis=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split this into multiple with statements, one per exception.

@aschade aschade force-pushed the fix-no-raise-dup-cols branch from bc369f3 to e31b4d2 Compare January 16, 2018 02:15
@aschade aschade force-pushed the fix-no-raise-dup-cols branch from e31b4d2 to 5776bf3 Compare January 16, 2018 02:18
@jreback jreback merged commit d7a2e94 into pandas-dev:master Jan 18, 2018
@jreback
Copy link
Contributor

jreback commented Jan 18, 2018

thanks @aschade nice patch! keep em coming!

@aschade
Copy link
Contributor Author

aschade commented Jan 18, 2018 via email

rainwoodman added a commit to rainwoodman/nbodykit that referenced this pull request May 19, 2018
fix for the stricter error introduced by

pandas-dev/pandas#19230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: drop on axis with duplicate values doesn't raise when label is absent
4 participants