crosstab gives wrong result if a categorical Series contains NaNs #21565

Closed
simon-anders opened this issue Jun 20, 2018 · 5 comments
Labels
Categorical Categorical Data Type Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@simon-anders

simon-anders commented Jun 20, 2018

Test code:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict( {"objcol": ("A", "B", np.nan, "C", "C", "A", "D" ) })
df["catcol"] = df.objcol.astype('category')

pd.crosstab( df.objcol, 1 )
pd.crosstab( df.catcol, 1 )

Problem description

We have this data frame:

>>> df
  objcol catcol
0      A      A
1      B      B
2    NaN    NaN
3      C      C
4      C      C
5      A      A
6      D      D

The first column is of dtype object, the second column of dtype 'category'. Running crosstab on the two columns gives different results:

>>> pd.crosstab( df.objcol, 1 )
col_0   1
objcol   
A       2
B       1
C       2
D       1

>>> pd.crosstab( df.catcol, 1 )
col_0   1
catcol   
A       2
B       1
NaN     2
C       1

Clearly, the second result is wrong: "C" has a count of 1 instead of 2, a spurious NaN row appears with a count of 2, and the "D" row is missing.

value_counts, on the other hand, works correctly:

>>> df.objcol.value_counts()
C    2
A    2
D    1
B    1
Name: objcol, dtype: int64

>>> df.catcol.value_counts()
C    2
A    2
D    1
B    1
Name: catcol, dtype: int64

Expected Output

pd.crosstab( df.catcol, 1 ) should give the same output as pd.crosstab( df.objcol, 1 ).
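
The expectation above can be written as a small check. This is only a sketch: the helper names `crosstab_counts` and `reference_counts` are made up for illustration, and a constant Series of ones (named `const` here) stands in for the scalar `1` from the report. It uses `value_counts` as the reference, since that is reported to work correctly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"objcol": ["A", "B", np.nan, "C", "C", "A", "D"]})
df["catcol"] = df["objcol"].astype("category")


def crosstab_counts(series):
    """Return {label: count} from a one-column crosstab of `series`."""
    # Hypothetical helper: a constant column replaces the scalar 1.
    ones = pd.Series(1, index=series.index, name="const")
    table = pd.crosstab(series, ones)
    return {str(label): int(n) for label, n in table.iloc[:, 0].items()}


def reference_counts(series):
    """Return {label: count} from value_counts, which skips NaN by default."""
    return {str(label): int(n) for label, n in series.value_counts().items()}


expected = reference_counts(df["objcol"])
obj_counts = crosstab_counts(df["objcol"])
cat_counts = crosstab_counts(df["catcol"])

# On a pandas version without the bug, both comparisons should hold.
print(obj_counts == expected, cat_counts == expected)
```

On an affected version (0.23.0 in this report), the second comparison would be expected to fail, because the categorical table carries the wrong labels.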

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@simon-anders
Author

simon-anders commented Jun 20, 2018

Curiously, the behaviour changes if one uses dropna=False:

>>> pd.crosstab( df.catcol, 1, dropna=False )
col_0   1
catcol   
A       2
B       1
C       2
D       1

>>> pd.crosstab( df.objcol, 1, dropna=False )
col_0   1
objcol   
A       2
B       1
C       2
D       1

Now the two outputs are the same. (Although: shouldn't they contain a row for NA?)
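
On the question of a row for NA: one possible workaround, not something proposed in this thread, is to replace the missing values with an explicit label before tabulating. The placeholder `"(missing)"` and the constant column `const` are arbitrary choices for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"objcol": ["A", "B", np.nan, "C", "C", "A", "D"]})

# "(missing)" is an arbitrary placeholder label chosen for this sketch.
labelled = df["objcol"].fillna("(missing)")

# A constant column of ones stands in for the scalar 1 used in the report.
ones = pd.Series(1, index=df.index, name="const")

with_na_row = pd.crosstab(labelled, ones)
na_counts = {str(k): int(v) for k, v in with_na_row.iloc[:, 0].items()}
print(with_na_row)
```

Because the missing values are ordinary strings by the time `crosstab` sees them, this does not depend on how `dropna` treats NaN.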

Note also that in the example above (without dropna=False), the count values are actually correct; it is the labels that are wrong. They seem to be shifted because the NaN was not removed from the index.
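
The shifted-labels observation can be reproduced by hand. The sketch below does not show what pandas does internally (that is not established in this thread). It only shows that pairing the correct per-category counts with a label list that still contains NaN gives the same faulty table as the report. The names `good_labels` and `stale_labels` are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series(["A", "B", np.nan, "C", "C", "A", "D"])
cat = values.astype("category")

# Counts per category in category order (A, B, C, D); code -1 marks NaN.
codes = cat.cat.codes.to_numpy()
counts = np.bincount(codes[codes >= 0], minlength=len(cat.cat.categories))

good_labels = list(cat.cat.categories)   # NaN is not a category
stale_labels = list(pd.unique(values))   # first-appearance order, NaN included

# zip stops at the shorter sequence, so the trailing "D" label is lost.
correct = list(zip(good_labels, counts.tolist()))
shifted = list(zip(stale_labels, counts.tolist()))

print(correct)
print(shifted)
```

In `shifted`, NaN picks up the count that belongs to "C" and "C" picks up the count that belongs to "D", which matches the wrong output in the original report.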

@jschendel
Member

jschendel commented Jun 21, 2018

Thanks, this looks to be the same underlying issue as #21133, which was fixed by #21252. Upgrading to 0.23.1 should give you the expected behavior:

In [2]: pd.__version__
Out[2]: '0.23.1'

In [3]: df = pd.DataFrame.from_dict({"objcol": ("A", "B", np.nan, "C", "C", "A", "D" )})
   ...: df["catcol"] = df.objcol.astype('category')

In [4]: pd.crosstab( df.catcol, 1 )
Out[4]: 
col_0   1
catcol   
A       2
B       1
C       2
D       1

In [5]: pd.crosstab( df.objcol, 1 )
Out[5]: 
col_0   1
objcol   
A       2
B       1
C       2
D       1

@jschendel jschendel added this to the No action milestone Jun 21, 2018
@jschendel jschendel added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Duplicate Report Duplicate issue or pull request Categorical Categorical Data Type labels Jun 21, 2018
@aganatramoat

Issue is still present in 0.23.4

@jschendel
Member

@aganatramoat : please provide a reproducible example, as this appears to be working fine on 0.23.4:

In [1]: import pandas as pd; import numpy as np; pd.__version__
Out[1]: '0.23.4'

In [2]: df = pd.DataFrame.from_dict({"objcol": ("A", "B", np.nan, "C", "C", "A", "D" )})
   ...: df["catcol"] = df.objcol.astype('category')
   ...: 

In [3]: pd.crosstab(df.catcol, 1)
Out[3]: 
col_0   1
catcol   
A       2
B       1
C       2
D       1

In [4]: pd.crosstab(df.objcol, 1)
Out[4]: 
col_0   1
objcol   
A       2
B       1
C       2
D       1

@aganatramoat

aganatramoat commented Sep 3, 2018

Sorry, I misspoke: the problem is with crosstab on categorical data with margins=True.
To reproduce:

from numpy.random import choice
import pandas as pd
from pandas.api.types import CategoricalDtype

# Two ordered categorical dtypes with five categories each
t0 = CategoricalDtype(categories=['a0', 'b0', 'c0', 'd0', 'e0'], ordered=True)
t1 = CategoricalDtype(categories=['a1', 'b1', 'c1', 'd1', 'e1'], ordered=True)

# 100 random draws per column (unseeded, so the counts differ between runs)
mydf = pd.DataFrame({
    'col0': pd.Series(choice(t0.categories, 100), dtype=t0),
    'col1': pd.Series(choice(t1.categories, 100), dtype=t1),
})
pd.crosstab(mydf.col0, mydf.col1, margins=True)

In [315]: pd.__version__
Out[315]: '0.23.4'

On one run of the above, I get

col1  a1  b1  c1  d1  e1  All
col0
a0     3   4   4   8   5   19
b0     5   1   5   3   5   24
c0     4   0   2   8   5   23
d0     8   0   1   1  13   19
e0     6   3   3   3   0   15
All   26  28  15  23   8  100

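The margins are permuted: for example, row a0 sums to 24 but its All entry is 19 (the sum of row b0), and the column totals for b1 and e1 (8 and 28) are swapped.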
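
A permuted margin like this can be detected mechanically by comparing the All row and column with the sums of the inner cells. The following is a sketch under some assumptions: `margins_consistent` is a hypothetical helper, it assumes `margins=True` places All in the last row and last column, and the seeded generator is an addition for reproducibility (the original snippet is unseeded).

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def margins_consistent(table):
    """Check that the 'All' row and column equal the sums of the inner cells."""
    # Assumes 'All' is the last row and the last column of the crosstab.
    inner = table.iloc[:-1, :-1].to_numpy()
    row_margin = table.iloc[:-1, -1].to_numpy()
    col_margin = table.iloc[-1, :-1].to_numpy()
    return bool(
        np.array_equal(inner.sum(axis=1), row_margin)
        and np.array_equal(inner.sum(axis=0), col_margin)
        and inner.sum() == table.iloc[-1, -1]
    )


# The table as reported in this comment, typed in by hand.
reported = pd.DataFrame(
    [[3, 4, 4, 8, 5, 19],
     [5, 1, 5, 3, 5, 24],
     [4, 0, 2, 8, 5, 23],
     [8, 0, 1, 1, 13, 19],
     [6, 3, 3, 3, 0, 15],
     [26, 28, 15, 23, 8, 100]],
    index=["a0", "b0", "c0", "d0", "e0", "All"],
    columns=["a1", "b1", "c1", "d1", "e1", "All"],
)
print("reported table consistent:", margins_consistent(reported))

# The same check on a fresh crosstab, to run on the installed pandas version.
t0 = CategoricalDtype(categories=["a0", "b0", "c0", "d0", "e0"], ordered=True)
t1 = CategoricalDtype(categories=["a1", "b1", "c1", "d1", "e1"], ordered=True)
rng = np.random.default_rng(0)  # fixed seed: an assumption of this sketch
mydf = pd.DataFrame({
    "col0": pd.Series(rng.choice(list(t0.categories), 100), dtype=t0),
    "col1": pd.Series(rng.choice(list(t1.categories), 100), dtype=t1),
})
table = pd.crosstab(mydf["col0"], mydf["col1"], margins=True)
print("fresh crosstab consistent:", margins_consistent(table))
```

The check compares values only, by position, so it does not depend on whether the result index is categorical or plain.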


3 participants