BUG: Series.rank(pct=True, method='dense').max() != 1 for repeated values #18296

Closed
proinsias opened this issue Nov 15, 2017 · 1 comment
Labels
Bug Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@proinsias
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(data=[1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5], columns=['abc']).sort_values('abc')
df['Rank'] = df['abc'].rank(method='dense')
df['Rank_Pct'] = df['abc'].rank(pct=True, method='dense')
df['Rank_Pct_Manual'] = df['Rank'] / df['Rank'].max()

df  # show all 12 rows; df.head() would display only the first 5

Output:

    abc  Rank  Rank_Pct  Rank_Pct_Manual
0     1   1.0  0.083333              0.2
1     2   2.0  0.166667              0.4
2     3   3.0  0.250000              0.6
3     3   3.0  0.250000              0.6
4     3   3.0  0.250000              0.6
5     3   3.0  0.250000              0.6
6     4   4.0  0.333333              0.8
7     4   4.0  0.333333              0.8
8     4   4.0  0.333333              0.8
9     5   5.0  0.416667              1.0
10    5   5.0  0.416667              1.0
11    5   5.0  0.416667              1.0

Problem description

If you pass both pct=True and method='dense' to Series.rank, the maximum percentile is not 1 when the Series contains repeated values. This is because the ranking function (e.g., rank_1d_float64()) always divides by the total number of elements in the Series. With the dense method, it should instead divide by the maximum rank value. In the example above, the top dense rank is 5 and the Series has 12 elements, so the maximum of Rank_Pct is 5/12 ≈ 0.416667 instead of 1.
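To make the difference between the two denominators concrete, here is a minimal pure-Python sketch. It is not the pandas implementation; `dense_ranks` is a hypothetical helper written only for illustration. It computes dense ranks for the sample data, then divides them once by the element count and once by the maximum rank.

```python
def dense_ranks(values):
    """Return 1-based dense ranks (no gaps between ranks) for a list of values."""
    # Map each distinct value to its position among the sorted distinct values.
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


data = [1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5]
ranks = dense_ranks(data)

# Dividing by the element count (12) reproduces the reported Rank_Pct column.
pct_by_count = [r / len(data) for r in ranks]

# Dividing by the maximum dense rank (5) gives the expected column.
pct_by_max_rank = [r / max(ranks) for r in ranks]

print(max(pct_by_count))     # 5 / 12
print(max(pct_by_max_rank))  # 1.0
```

With repeated values the maximum dense rank is smaller than the element count, so only the second normalization reaches 1.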

I'm working on a PR now.

Expected Output

I would expect Rank_Pct and Rank_Pct_Manual to be identical, with a maximum of 1 in both columns.

    abc  Rank  Rank_Pct  Rank_Pct_Manual
0     1   1.0       0.2              0.2
1     2   2.0       0.4              0.4
2     3   3.0       0.6              0.6
3     3   3.0       0.6              0.6
4     3   3.0       0.6              0.6
5     3   3.0       0.6              0.6
6     4   4.0       0.8              0.8
7     4   4.0       0.8              0.8
8     4   4.0       0.8              0.8
9     5   5.0       1.0              1.0
10    5   5.0       1.0              1.0
11    5   5.0       1.0              1.0
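On a pandas version that shows the behavior reported here, the expected column can be obtained by normalizing the dense rank manually, the same way Rank_Pct_Manual is built in the code sample. A minimal sketch, where `dense_pct_rank` is a hypothetical helper name and not part of the pandas API:

```python
import pandas as pd


def dense_pct_rank(s: pd.Series) -> pd.Series:
    """Dense rank scaled so that the largest value maps to 1.0."""
    dense = s.rank(method="dense")
    # Divide by the maximum dense rank rather than by the number of elements.
    return dense / dense.max()


s = pd.Series([1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5], name="abc")
result = dense_pct_rank(s)
print(result.max())  # 1.0
```

Because this relies only on rank(method='dense') and a division, it does not depend on how pct=True is computed internally.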

Output of pd.show_versions()

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 15, 2017

duplicate of #15630.

@jreback jreback closed this as completed Nov 15, 2017
@jreback jreback added Bug Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 15, 2017
@jreback jreback added this to the No action milestone Nov 15, 2017