crosstab gives wrong result if a categorical Series contains NaNs #21565

simon-anders · 2018-06-20T22:03:35Z

Test code:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict( {"objcol": ("A", "B", np.nan, "C", "C", "A", "D" ) })
df["catcol"] = df.objcol.astype('category')

pd.crosstab( df.objcol, 1 )
pd.crosstab( df.catcol, 1 )

Problem description

We have this data frame:

>>> df
  objcol catcol
0      A      A
1      B      B
2    NaN    NaN
3      C      C
4      C      C
5      A      A
6      D      D

The first column is of dtype object, the second column of dtype 'category'. Running crosstab on the two columns gives different results:

>>> pd.crosstab( df.objcol, 1 )
col_0   1
objcol   
A       2
B       1
C       2
D       1

>>> pd.crosstab( df.catcol, 1 )
col_0   1
catcol   
A       2
B       1
NaN     2
C       1

Clearly, the second result is wrong. Note how "C" has the wrong count, 1 instead of 2.

value_counts, on the other hand, works correctly:

>>> df.objcol.value_counts()
C    2
A    2
D    1
B    1
Name: objcol, dtype: int64

>>> df.catcol.value_counts()
C    2
A    2
D    1
B    1
Name: catcol, dtype: int64

Expected Output

pd.crosstab( df.catcol, 1 ) should give the same output as pd.crosstab( df.objcol, 1 ).
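The expectation above can be written as a small check. This is only a sketch: the helper name `crosstab_counts` is made up for illustration, and the comparison is expected to hold only on a pandas version where the categorical case behaves like the object case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"objcol": ["A", "B", np.nan, "C", "C", "A", "D"]})
df["catcol"] = df["objcol"].astype("category")


def crosstab_counts(series):
    """Return {label: count} from pd.crosstab(series, 1) (hypothetical helper)."""
    table = pd.crosstab(series, 1)
    # The table has a single column; pair each row label with its count.
    return dict(zip(table.index.tolist(), table.iloc[:, 0].tolist()))


obj_counts = crosstab_counts(df["objcol"])
cat_counts = crosstab_counts(df["catcol"])
print(obj_counts)
print(cat_counts)
print("match:", obj_counts == cat_counts)
```

Comparing plain dictionaries sidesteps the differences that are expected and harmless here, namely the index name (objcol vs. catcol) and the index type.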

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

simon-anders · 2018-06-20T22:11:27Z

Curiously, the behaviour changes if one uses dropna=False:

>>> pd.crosstab( df.catcol, 1, dropna=False )
col_0   1
catcol   
A       2
B       1
C       2
D       1

>>> pd.crosstab( df.objcol, 1, dropna=False )
col_0   1
objcol   
A       2
B       1
C       2
D       1

Now the two outputs are the same. (Although: shouldn't they contain a row for NA?)

Note also that in the example above (without dropna=False), the count values are actually correct; it is the labels that are wrong. They seem to be shifted because the NaN was not removed from the index.
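For context, a categorical Series stores each value as an integer code into its list of categories, and a missing value gets the code -1 instead of a category of its own. The sketch below only inspects that encoding; whether it is what trips up crosstab in 0.23.0 is an assumption, not something established in this thread.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", np.nan, "C", "C", "A", "D"]).astype("category")

categories = s.cat.categories.tolist()
codes = s.cat.codes.tolist()
print(categories)  # NaN is not listed as a category
print(codes)       # the missing value is encoded as -1

# value_counts can report the missing value as its own row
counts_with_na = s.value_counts(dropna=False)
print(counts_with_na)
```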

jschendel · 2018-06-21T00:24:42Z

Thanks, this looks to be the same underlying issue as #21133, which was fixed by #21252. Upgrading to 0.23.1 should give you the expected behavior:

In [2]: pd.__version__
Out[2]: '0.23.1'

In [3]: df = pd.DataFrame.from_dict({"objcol": ("A", "B", np.nan, "C", "C", "A", "D" )})
   ...: df["catcol"] = df.objcol.astype('category')

In [4]: pd.crosstab( df.catcol, 1 )
Out[4]: 
col_0   1
catcol   
A       2
B       1
C       2
D       1

In [5]: pd.crosstab( df.objcol, 1 )
Out[5]: 
col_0   1
objcol   
A       2
B       1
C       2
D       1

aganatramoat · 2018-09-02T17:11:30Z

Issue is still present in 0.23.4

jschendel · 2018-09-03T00:17:47Z

@aganatramoat : please provide a reproducible example, as this appears to be working fine on 0.23.4:

In [1]: import pandas as pd; import numpy as np; pd.__version__
Out[1]: '0.23.4'

In [2]: df = pd.DataFrame.from_dict({"objcol": ("A", "B", np.nan, "C", "C", "A", "D" )})
   ...: df["catcol"] = df.objcol.astype('category')
   ...: 

In [3]: pd.crosstab(df.catcol, 1)
Out[3]: 
col_0   1
catcol   
A       2
B       1
C       2
D       1

In [4]: pd.crosstab(df.objcol, 1)
Out[4]: 
col_0   1
objcol   
A       2
B       1
C       2
D       1

aganatramoat · 2018-09-03T03:03:49Z

Sorry, I misspoke: the problem is with crosstab on categorical data when margins=True is used.
To reproduce:

from numpy.random import choice
from pandas.api.types import CategoricalDtype
import pandas as pd

t0 = CategoricalDtype(categories=['a0', 'b0', 'c0', 'd0', 'e0'], ordered=True)
t1 = CategoricalDtype(categories=['a1', 'b1', 'c1', 'd1', 'e1'], ordered=True)
# Two ordered categorical columns of 100 random draws each
mydf = pd.DataFrame({
    'col0': pd.Series(choice(t0.categories, 100), dtype=t0),
    'col1': pd.Series(choice(t1.categories, 100), dtype=t1),
})
pd.crosstab(mydf.col0, mydf.col1, margins=True)

In [315]: pd.__version__
Out[315]: '0.23.4'

On one run of the above, I get

col1  a1  b1  c1  d1  e1  All
col0
a0     3   4   4   8   5   19
b0     5   1   5   3   5   24
c0     4   0   2   8   5   23
d0     8   0   1   1  13   19
e0     6   3   3   3   0   15
All   26  28  15  23   8  100

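The margins are permuted: row a0, for example, sums to 24 but its All entry reads 19, and column b1 sums to 8 but its All entry reads 28.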
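To check the margins mechanically, the sketch below re-enters the table posted above by hand and compares each All entry with the sum it should equal. The numbers are copied from that one run; the variable names are illustrative only.

```python
import pandas as pd

rows = ["a0", "b0", "c0", "d0", "e0"]
cols = ["a1", "b1", "c1", "d1", "e1"]

# Body of the posted crosstab, without the "All" row and column
body = pd.DataFrame(
    [[3, 4, 4, 8, 5],
     [5, 1, 5, 3, 5],
     [4, 0, 2, 8, 5],
     [8, 0, 1, 1, 13],
     [6, 3, 3, 3, 0]],
    index=rows, columns=cols,
)
# Margins exactly as they appear in the posted output
row_margin = pd.Series([19, 24, 23, 19, 15], index=rows)
col_margin = pd.Series([26, 28, 15, 23, 8], index=cols)

row_sums = body.sum(axis=1)
col_sums = body.sum(axis=0)

print(pd.DataFrame({"reported": row_margin, "actual": row_sums}))
print(pd.DataFrame({"reported": col_margin, "actual": col_sums}))

# Label by label the margins disagree with the sums ...
rows_match = bool((row_margin == row_sums).all())
cols_match = bool((col_margin == col_sums).all())
# ... but as multisets they are identical, i.e. only the order is off.
rows_permuted = sorted(row_margin.tolist()) == sorted(row_sums.tolist())
cols_permuted = sorted(col_margin.tolist()) == sorted(col_sums.tolist())
print(rows_match, cols_match, rows_permuted, cols_permuted)
```

Comparing the sorted values separates "right numbers, wrong labels" from margins that are simply miscounted.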

jschendel closed this as completed Jun 21, 2018

jschendel added this to the No action milestone Jun 21, 2018

jschendel added the Reshaping, Duplicate Report, and Categorical labels Jun 21, 2018

tchklovski mentioned this issue Dec 11, 2018 in "Series.apply on categorical with NaN has wrong behavior" #24241 (closed)
