pandas.DataFrame.duplicated to allow take_all #6511

Closed
socheon opened this issue Feb 28, 2014 · 3 comments · Fixed by #10236
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Indexing Related to indexing on series/frames, not to indexes themselves Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@socheon

socheon commented Feb 28, 2014

When working with external data, I often see rows with primary key violations. Currently, there is no easy way to select all of the violating rows. For example, suppose I have a massive file with some inconsistent data:

datecol,valuecol
...
2014-01-01,12
2014-01-01,13
2014-01-02,10
...

In this use case, it would be good to be able to write df[df.duplicated('datecol', take_all=True)] to get the bad rows directly:

2014-01-01,12
2014-01-01,13
@jreback
Contributor

jreback commented Feb 28, 2014

You can do it like this. That said, this is not hard to implement for lib.duplicated anyhow.

In [108]: df = DataFrame({ 'A' : [1,1,2,2,2,4,5,2,2]})

In [109]: df
Out[109]: 
   A
0  1
1  1
2  2
3  2
4  2
5  4
6  5
7  2
8  2

[9 rows x 1 columns]

In [110]: df[df.A.isin(df.A[df.A.duplicated()].unique())]
Out[110]: 
   A
0  1
1  1
2  2
3  2
4  2
7  2
8  2

[7 rows x 1 columns]
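
Applied to the date example from the original report, the same workaround might look like the sketch below. The column names come from the report; the three-row frame is a made-up stand-in for the real file. The second selection assumes a later pandas release (0.17 or newer, as far as I know) in which duplicated accepts keep=False. That keyword is not the take_all spelling requested in this issue.

```python
import pandas as pd

# Made-up stand-in for the file in the report (same column names).
df = pd.DataFrame({
    "datecol": ["2014-01-01", "2014-01-01", "2014-01-02"],
    "valuecol": [12, 13, 10],
})

# Workaround: collect the keys that occur more than once,
# then select every row carrying one of those keys.
dup_keys = df.loc[df["datecol"].duplicated(), "datecol"].unique()
bad_rows = df[df["datecol"].isin(dup_keys)]

# Assumption: a pandas version where duplicated() accepts keep=False,
# which flags every member of a duplicate group.
bad_rows_keep = df[df.duplicated("datecol", keep=False)]

print(bad_rows)
```

Both selections should return the two 2014-01-01 rows and leave the 2014-01-02 row out.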

@jreback jreback added this to the 0.15.0 milestone Feb 28, 2014
@sinhrks
Member

sinhrks commented Nov 22, 2014

Interested in this. To cover the 3 patterns, how about changing the duplicated / drop_duplicates keyword as below?

duplicated:

  • take='first' (default): Mark duplicates as True, except for the first occurrence.
  • take='last': Mark duplicates as True, except for the last occurrence.
  • take='none': Mark all duplicates as True.

drop_duplicates:

  • take='first' (default): Drop duplicates, keeping the first occurrence.
  • take='last': Drop duplicates, keeping the last occurrence.
  • take='none': Drop all duplicates, keeping none of them.
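
To make the three proposed modes concrete, here is a minimal pure-Python sketch of the semantics described above. The functions duplicated and drop_duplicates below are hypothetical stand-ins that operate on plain lists; they are not the pandas implementation, and the take keyword is only the spelling proposed in this comment.

```python
from collections import Counter


def duplicated(values, take="first"):
    """Return one boolean flag per element (hypothetical helper, not pandas)."""
    if take == "none":
        # Flag every element whose value occurs more than once.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]
    if take == "first":
        # Flag an element if its value was already seen earlier.
        seen = set()
        flags = []
        for v in values:
            flags.append(v in seen)
            seen.add(v)
        return flags
    if take == "last":
        # Same as 'first', but scanning from the end of the list.
        return duplicated(values[::-1], take="first")[::-1]
    raise ValueError("take must be 'first', 'last' or 'none'")


def drop_duplicates(values, take="first"):
    """Keep only the elements that duplicated() does not flag."""
    flags = duplicated(values, take)
    return [v for v, is_dup in zip(values, flags) if not is_dup]


values = [1, 1, 2, 2, 2, 4, 5, 2, 2]
print(drop_duplicates(values, take="first"))  # [1, 2, 4, 5]
print(drop_duplicates(values, take="last"))   # [1, 4, 5, 2]
print(drop_duplicates(values, take="none"))   # [4, 5]
```

With take='none', indexing by the flags themselves (rather than dropping them) selects every row involved in a duplicate, which is the use case raised at the top of this issue.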

@shoyer
Member

shoyer commented Nov 23, 2014

@sinhrks take a look at #8505 (a duplicate issue) where we discussed this.
