Pandas (0.18) Rank: unexpected behavior for method = 'dense' and pct = True #15630

FXLab91 · 2017-03-09T12:07:32Z

I find the behavior of the rank function with method = 'dense' and pct = True unexpected: to calculate percentile ranks, it appears to divide by the total number of observations rather than by the number of distinct observations.

Code Sample, a copy-pastable example if possible

import pandas as pd
n_rep = 2
ts = pd.Series([1,2,3,4] * n_rep )
output = ts.rank(method = 'dense', pct = True)

Problem description

ts.rank(method = 'dense', pct = True)
Out[116]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500

Expected Output

Something similar to:

pd.Series([1,2,3,4] * 2).rank(method = 'dense', pct = True) * n_rep 
Out[118]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00

Also, I would expect the result above to be invariant to n_rep,
i.e. I would expect a "mapping" {value -> pct_rank} that does not depend on how many times each value is repeated, which is not the case here.
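The invariance described above can be sketched as follows. This is only an illustration of the expected behavior, not the pandas implementation: `dense_pct_rank` and `value_to_pct` are hypothetical helper names, and the normalization by `Series.nunique()` is an assumption about what "percentile of a dense rank" should mean.

```python
import pandas as pd


def dense_pct_rank(s: pd.Series) -> pd.Series:
    """Hypothetical helper: dense rank divided by the number of distinct values."""
    # Computed by hand (no pct=True), so the result does not depend on
    # how a given pandas version normalizes dense ranks.
    return s.rank(method="dense") / s.nunique()


def value_to_pct(n_rep: int) -> dict:
    """Build the {value -> pct_rank} mapping for [1, 2, 3, 4] repeated n_rep times."""
    ts = pd.Series([1, 2, 3, 4] * n_rep)
    pct = dense_pct_rank(ts)
    return dict(zip(ts.tolist(), pct.tolist()))


# The mapping should be identical no matter how often the values repeat.
mappings = [value_to_pct(n) for n in (1, 2, 5)]
print(mappings[0])
```

With this normalization the largest value always maps to 1.0, and every repetition count yields the same mapping.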


jreback · 2017-03-09T14:33:50Z

So all pct=True does is divide by nobs (the number of observations), which seems correct for all of the other methods.

In [3]: ts.rank(method='dense')
Out[3]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
5    2.0
6    3.0
7    4.0
dtype: float64

# this is the result
In [4]: ts.rank(method='dense')/8
Out[4]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500
dtype: float64

You want something like this, I suppose. Note that the original definitions are from https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.rankdata.html (though scipy doesn't do pct, so I guess this doesn't matter).

In [14]: ts.rank(method='dense')/len(ts.drop_duplicates())
Out[14]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00
dtype: float64

code is here:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_rank_helper.pxi.in#L201

if you'd like to see what (if anything) this change would break. (Note that you cannot directly use .drop_duplicates; you would have to call the cython routine, or, maybe better, we push the pct calcs higher up in the stack so we could call that routine. I don't think perf is an issue; this is more about clarity.)
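As a user-level sketch of a normalization that avoids .drop_duplicates: the largest dense rank already equals the number of distinct non-null values, so dividing by it gives the same scaling. This is not the cython change discussed above, only an illustration; `dense_pct_via_max` is a hypothetical name, and the handling of missing values assumes the default na_option='keep'.

```python
import numpy as np
import pandas as pd


def dense_pct_via_max(s: pd.Series) -> pd.Series:
    """Hypothetical helper: scale dense ranks by the largest dense rank."""
    # With the default na_option="keep", NaN entries get a NaN rank.
    dense = s.rank(method="dense")
    # Series.max() skips NaN, so this is the count of distinct non-null values.
    return dense / dense.max()


s = pd.Series([3.0, 1.0, np.nan, 3.0, 2.0])
result = dense_pct_via_max(s)
print(result)
```

Missing values stay missing in the output, and the top rank is always 1.0 regardless of how many duplicates or NaNs the series holds.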

jreback · 2017-03-09T14:34:48Z

@FXLab91 another option is to not allow pct=True with dense and let the user decide.

jreback · 2017-03-09T14:35:13Z

@shoyer any thoughts

shoyer · 2017-03-09T16:55:34Z

I agree with @FXLab91 that this is very strange behavior, and I can't see why anyone would want it. So I would be inclined to treat it as a bug and fix it for the next release.

dsm054 · 2017-03-09T23:34:59Z

Does this suggest we should rethink the pct behaviour of some of the others as well? Something like [1,2,2] will give the same pct results under both min and dense (1/3, 2/3, 2/3).

jreback · 2017-03-10T00:04:31Z

@dsm054 surely!

yep these are prob not tested at all.


rouzazari · 2017-03-10T01:10:03Z

This may be a bit premature, but I just worked through a possible solution that only touches method='dense' and does not require .drop_duplicates. Comments and recommendations appreciated.


rouzazari · 2017-04-06T00:14:22Z

Restating @dsm054's question (and asking a few of my own): should all other methods return a "dense percentage" on a 100% basis when pct=True?

As @dsm054 noted, Series([1,2,2]).rank(method='min', pct=True) will return [1/3, 2/3, 2/3]. Should this return [1/2, 2/2, 2/2]?

Now if method='max', Series([1,2,2]).rank(method='max', pct=True) will return [1/3, 3/3, 3/3]. Is that the desired output, or should it again be [1/2, 2/2, 2/2]?

#15639 will fix the method='dense' case, but we need to address other methods as well.
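To make the question concrete, here is a small sketch that tabulates both candidate normalizations for Series([1,2,2]) across all ranking methods. Both columns are computed by hand from the raw ranks rather than with pct=True, so the table does not depend on the pandas version; the column names `by_nobs` and `by_max_rank` are made up for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 2])

rows = {}
for method in ("average", "min", "max", "first", "dense"):
    ranks = s.rank(method=method)
    rows[method] = {
        # Divide by the number of non-null observations.
        "by_nobs": (ranks / s.count()).round(3).tolist(),
        # Divide by the largest rank, so the top value is always 1.0.
        "by_max_rank": (ranks / ranks.max()).round(3).tolist(),
    }

comparison = pd.DataFrame(rows).T
print(comparison)
```

For 'max' and 'first' the two normalizations coincide, because the largest rank equals the number of observations; they differ for 'min', 'dense', and 'average', which is where a decision is needed.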


jreback added the Algos and Numeric Operations labels Mar 9, 2017

jreback added the Difficulty Intermediate label Mar 9, 2017

jreback added this to the Next Major Release milestone Mar 10, 2017

rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 10, 2017

BUG: Dense ranking with percent now uses 100% basis

55827c8

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

rouzazari mentioned this issue Mar 10, 2017

BUG: Dense ranking with percent now uses 100% basis #15639

Merged

4 tasks

rouzazari added a commit to rouzazari/pandas that referenced this issue Apr 5, 2017

BUG: Dense ranking with percent now uses 100% basis

ea077d3

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

jreback modified the milestones: 0.21.0, Next Major Release May 7, 2017

rouzazari added a commit to rouzazari/pandas that referenced this issue May 22, 2017

BUG: Dense ranking with percent now uses 100% basis

ba3da79

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

This was referenced Nov 15, 2017

BUG: Series.rank(pct=True, method='dense').max() != 1 for repeated values #18296

Closed

BUG: Use total_tie_count to normalize dense ranking #18297

Closed

BUG: Series.rank(pct=True).max() != 1 for a large series of floats #18271

Closed

gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 2, 2018

BUG: Dense ranking with percent now uses 100% basis

0421dc5

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 2, 2018

BUG: Dense ranking with percent now uses 100% basis

0f9bea3

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

jreback modified the milestones: Next Major Release, 0.23.0 Mar 8, 2018

gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 8, 2018

BUG: Dense ranking with percent now uses 100% basis

edc8f85

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

gfyoung pushed a commit to rouzazari/pandas that referenced this issue Mar 8, 2018

BUG: Dense ranking with percent now uses 100% basis

6299790

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and `pct=True` now scales to 100%. See pandas-dev#15630

jreback closed this as completed in #15639 Mar 9, 2018

velikod mentioned this issue Mar 19, 2021

BUG: Unexpected behaviour of groupby + rank(pct=True, method="dense") #40518

Closed

3 tasks
