Pandas (0.18) Rank: unexpected behavior for method = 'dense' and pct = True #15630

Closed
FXLab91 opened this issue Mar 9, 2017 · 8 comments · Fixed by #15639
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff); Numeric Operations (arithmetic, comparison, and logical operations)
Milestone

Comments

@FXLab91

FXLab91 commented Mar 9, 2017

I find the behavior of the rank function with method = 'dense' and pct = True unexpected: to calculate percentile ranks, the function appears to divide by the total number of observations instead of the number of distinct observations.

Code Sample, a copy-pastable example if possible

import pandas as pd
n_rep = 2
ts = pd.Series([1,2,3,4] * n_rep )
output = ts.rank(method = 'dense', pct = True)

Problem description

ts.rank(method = 'dense', pct = True)
Out[116]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500

Expected Output

Something similar to:

pd.Series([1,2,3,4] * 2).rank(method = 'dense', pct = True) * n_rep 
Out[118]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00

I would also expect the result above to be invariant to n_rep.
That is, I would expect a "mapping" {value -> pct_rank} that does not depend on how many times each value is repeated, which is not the case here.
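
To make the invariance request concrete, here is a minimal sketch. It assumes the desired percentile is the dense rank divided by the number of distinct values; `dense_pct` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def dense_pct(series: pd.Series) -> pd.Series:
    """Hypothetical helper: dense rank divided by the number of distinct values."""
    return series.rank(method="dense") / series.nunique()


mappings = {}
for n_rep in (1, 2, 5):
    ts = pd.Series([1, 2, 3, 4] * n_rep)
    # Build the {value -> pct_rank} mapping for this repetition count.
    mappings[n_rep] = dict(zip(ts.tolist(), dense_pct(ts).tolist()))
    print(n_rep, mappings[n_rep])
```

Under that assumption, the mapping is the same for every n_rep, because repeating values changes neither the dense ranks nor the distinct count.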

@jreback
Contributor

jreback commented Mar 9, 2017

So all pct=True does is divide by nobs (the number of observations), which seems correct for all of the other methods.

In [3]: ts.rank(method='dense')
Out[3]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
5    2.0
6    3.0
7    4.0
dtype: float64

# this is the result
In [4]: ts.rank(method='dense')/8
Out[4]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500
dtype: float64

I suppose you want something like the following. Note that the original definitions are from https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.rankdata.html (though scipy doesn't do pct, so I guess this doesn't matter).

In [14]: ts.rank(method='dense')/len(ts.drop_duplicates())
Out[14]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00
dtype: float64

code is here:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_rank_helper.pxi.in#L201

That is the place to look if you'd like to see what (if anything) this change would break. Note that you cannot directly use .drop_duplicates; you would have to call the Cython routine. Or, maybe better, we push the pct calcs higher up in the stack so we could call that routine (I don't think perf is an issue; it's more about clarity).
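
As a rough illustration of doing the pct calculation above the ranking step, the sketch below ranks first and then chooses the denominator per method. `rank_pct` is a hypothetical wrapper written for this example; it is not the pandas implementation and not necessarily how the linked fix works. It relies on the largest dense rank being equal to the number of distinct non-null values, so no drop_duplicates call is needed.

```python
import pandas as pd


def rank_pct(series: pd.Series, method: str = "average") -> pd.Series:
    """Hypothetical wrapper: rank first, then pick the denominator by method."""
    ranks = series.rank(method=method)
    if method == "dense":
        # The largest dense rank equals the number of distinct non-null values.
        denominator = ranks.max()
    else:
        # Every other method keeps dividing by the number of non-null observations.
        denominator = series.count()
    return ranks / denominator


ts = pd.Series([1, 2, 3, 4] * 2)
dense_result = rank_pct(ts, method="dense")
min_result = rank_pct(ts, method="min")
print(dense_result.tolist())
print(min_result.tolist())
```

With this split, only the 'dense' branch changes; the other methods still divide by the observation count as described above.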

jreback added the Algos and Numeric Operations labels Mar 9, 2017
@jreback
Contributor

jreback commented Mar 9, 2017

@FXLab91 another option is to not allow pct=True with dense and let the user decide.

@jreback
Contributor

jreback commented Mar 9, 2017

@shoyer any thoughts?

@shoyer
Member

shoyer commented Mar 9, 2017

I agree with @FXLab91 that this is very strange behavior, and I can't see why anyone would want it. So I would be inclined to treat it as a bug and fix it for the next release.

@dsm054
Contributor

dsm054 commented Mar 9, 2017

Does this suggest we should rethink the pct behaviour of some of the others as well? Something like [1,2,2] will give the same pct results under both min and dense (1/3, 2/3, 2/3).

@jreback
Contributor

jreback commented Mar 10, 2017

@dsm054 surely!

Yep, these are probably not tested at all.

@jreback jreback added this to the Next Major Release milestone Mar 10, 2017
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 10, 2017
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630
@rouzazari
Contributor

This may be a bit premature, but I just worked through a possible solution that only touches method='dense' and does not require .drop_duplicates. Comments and recommendations are appreciated.

rouzazari added a commit to rouzazari/pandas that referenced this issue Apr 5, 2017
@rouzazari
Contributor

Restating @dsm054's question (and asking a few of my own): should all other methods return a "dense percentage" on a 100% basis when pct=True?

As @dsm054 noted, Series([1,2,2]).rank(method='min', pct=True) will return [1/3, 2/3, 2/3]. Should this return [1/2, 2/2, 2/2]?

Now if method='max', Series([1,2,2]).rank(method='max', pct=True) will return [1/3, 3/3, 3/3]. Is that the desired output, or should it again be [1/2, 2/2, 2/2]?

#15639 will fix the method='dense' case, but we need to address other methods as well.
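
For reference, the numbers in this question can be reproduced by dividing the raw ranks by hand. The sketch below is only an illustration: it divides explicitly rather than passing pct=True, so the output does not depend on the pandas version, and the variable names are chosen for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 2])
n_obs = s.count()          # 3 observations
n_distinct = s.nunique()   # 2 distinct values

rows = {}
for method in ("min", "max", "dense"):
    ranks = s.rank(method=method)
    rows[method] = {
        "ranks": ranks.tolist(),
        # Dividing by the observation count is the behavior described in this thread.
        "divided_by_nobs": (ranks / n_obs).round(4).tolist(),
    }
    print(method, rows[method])

# The "dense percentage" reading: dense rank over the distinct count.
dense_percentage = (s.rank(method="dense") / n_distinct).tolist()
print(dense_percentage)
```

Here 'min' and 'dense' coincide when divided by the observation count, while the "dense percentage" reading gives [1/2, 2/2, 2/2] regardless of which tie-breaking method was requested.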

@jreback jreback modified the milestones: 0.21.0, Next Major Release May 7, 2017
rouzazari added a commit to rouzazari/pandas that referenced this issue May 22, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 2, 2018
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Mar 8, 2018
gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 8, 2018