pandas.DataFrame.duplicated to allow take_all #6511

socheon · 2014-02-28T19:46:45Z

When working with external data, I often see rows with primary key violations. Currently, I cannot easily select all of the violating rows. For example, suppose I have a massive file with some inconsistent data:

datecol,valuecol
...
2014-01-01,12
2014-01-01,13
2014-01-02,10
...

In this use case, it would be good if we could do df[df.duplicated('datecol', take_all=True)] to get the bad rows directly:

2014-01-01,12
2014-01-01,13

jreback · 2014-02-28T20:38:36Z

You can do it like this. That said, this is not hard to implement for lib.duplicated anyhow.

In [108]: df = DataFrame({ 'A' : [1,1,2,2,2,4,5,2,2]})

In [109]: df
Out[109]: 
   A
0  1
1  1
2  2
3  2
4  2
5  4
6  5
7  2
8  2

[9 rows x 1 columns]

In [110]: df[df.A.isin(df.A[df.A.duplicated()].unique())]
Out[110]: 
   A
0  1
1  1
2  2
3  2
4  2
7  2
8  2

[7 rows x 1 columns]
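
The same workaround, applied to the datecol example from the original report, can be sketched as below. This is a minimal sketch: the three sample rows and the names dup_keys and bad_rows are illustrative assumptions, not something taken from the thread.

```python
import pandas as pd

# Sample data modelled on the report above (illustrative values only).
df = pd.DataFrame({
    "datecol": ["2014-01-01", "2014-01-01", "2014-01-02"],
    "valuecol": [12, 13, 10],
})

# duplicated() flags every repeat after the first occurrence, so
# unique() on the flagged values yields each repeated key exactly once.
dup_keys = df["datecol"][df["datecol"].duplicated()].unique()

# Select every row whose key is one of the repeated keys,
# including the first occurrence that duplicated() leaves unflagged.
bad_rows = df[df["datecol"].isin(dup_keys)]
print(bad_rows)
```

The extra isin step is needed because duplicated() on its own leaves the first occurrence of each key unflagged.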

sinhrks · 2014-11-22T21:23:25Z

I'm interested in this. To cover all 3 patterns, how about changing the duplicated / drop_duplicates keyword as below?

duplicated:

take='first' (default): Mark duplicates as True except for the first occurrence.
take='last': Mark duplicates as True except for the last occurrence.
take='none': Mark all duplicates as True.

drop_duplicates:

take='first' (default): Drop duplicates, keeping the first occurrence.
take='last': Drop duplicates, keeping the last occurrence.
take='none': Drop all duplicates, keeping none of them.
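
The three patterns can be sketched with the keep keyword named in the title of #10236, rather than the take spelling proposed here. This assumes a pandas version that includes that change, where keep accepts 'first', 'last', or False. Treating keep=False as the counterpart of take='none' is an interpretation, and the sample frame reuses the values from the earlier workaround.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 4, 5, 2, 2]})

# Flag repeats, leaving the first occurrence of each value unflagged.
flag_after_first = df.duplicated("A", keep="first")

# Flag repeats, leaving the last occurrence of each value unflagged.
flag_before_last = df.duplicated("A", keep="last")

# Flag every row whose value occurs more than once.
flag_all = df.duplicated("A", keep=False)

# All rows that share a value with another row.
all_duplicated_rows = df[flag_all]

# Only the rows whose value occurs exactly once.
unique_rows = df.drop_duplicates("A", keep=False)

print(all_duplicated_rows)
print(unique_rows)
```

With keep=False, df[flag_all] selects the same rows as the isin-based workaround in one step.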

shoyer · 2014-11-23T23:34:31Z

@sinhrks take a look at #8505 (a duplicate issue) where we discussed this.

jreback added the Algos label Feb 28, 2014

jreback added this to the 0.15.0 milestone Feb 28, 2014

shoyer mentioned this issue Jan 7, 2015

DOC: Edited doc string of pandas/core/frame.duplicated(). Redefined take... #9203

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

sinhrks mentioned this issue May 30, 2015

ENH: duplicated and drop_duplicates now accept keep kw #10236

Merged

jreback modified the milestones: 0.17.0, Next Major Release Aug 8, 2015

sinhrks closed this as completed in #10236 Aug 8, 2015
