Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: dropna incorrect with categoricals in pivot_table #21252

Merged
merged 3 commits into from
Jun 7, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,8 @@ Fixed Regressions
- Bug in :meth:`~DataFrame.to_csv` causes encoding error when compression and encoding are specified (:issue:`21241`, :issue:`21118`)
- Bug preventing pandas from being importable with -OO optimization (:issue:`21071`)
- Bug in :meth:`Categorical.fillna` incorrectly raising a ``TypeError`` when `value` the individual categories are iterable and `value` is an iterable (:issue:`21097`, :issue:`19788`)
- Regression in :func:`pivot_table` where an ordered ``Categorical`` with missing
values for the pivot's ``index`` would give a mis-aligned result (:issue:`21133`)


.. _whatsnew_0231.performance:
Expand Down
20 changes: 18 additions & 2 deletions pandas/core/reshape/pivot.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
# pylint: disable=E1103


from pandas.core.dtypes.common import is_list_like, is_scalar
from pandas.core.dtypes.common import (
is_list_like, is_scalar, is_integer_dtype)
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
from pandas.core.dtypes.cast import maybe_downcast_to_dtype

from pandas.core.reshape.concat import concat
from pandas.core.series import Series
Expand Down Expand Up @@ -79,8 +81,22 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
pass
values = list(values)

grouped = data.groupby(keys, observed=dropna)
# group by the cartesian product of the grouper
# if we have a categorical
grouped = data.groupby(keys, observed=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling this workaround would not be needed if the bug in groupby would be solved? (#21151)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agged = grouped.agg(aggfunc)
if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
agged = agged.dropna(how='all')

# gh-21133
# we want to down cast if
# the original values are ints
# as we grouped with a NaN value
# and then dropped, coercing to floats
for v in [v for v in values if v in data and v in agged]:
if (is_integer_dtype(data[v]) and
not is_integer_dtype(agged[v])):
agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)

table = agged
if table.index.nlevels > 1:
Expand Down
26 changes: 25 additions & 1 deletion pandas/tests/reshape/test_pivot.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-

from datetime import datetime, date, timedelta

Expand All @@ -16,6 +17,11 @@
from pandas.api.types import CategoricalDtype as CDT


@pytest.fixture(params=[True, False])
def dropna(request):
return request.param


class TestPivotTable(object):

def setup_method(self, method):
Expand Down Expand Up @@ -109,7 +115,6 @@ def test_pivot_table_categorical(self):
index=exp_index)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize('dropna', [True, False])
def test_pivot_table_dropna_categoricals(self, dropna):
# GH 15193
categories = ['a', 'b', 'c', 'd']
Expand Down Expand Up @@ -137,6 +142,25 @@ def test_pivot_table_dropna_categoricals(self, dropna):

tm.assert_frame_equal(result, expected)

def test_pivot_with_non_observable_dropna(self, dropna):
# gh-21133
df = pd.DataFrame(
{'A': pd.Categorical([np.nan, 'low', 'high', 'low', 'high'],
categories=['low', 'high'],
ordered=True),
'B': range(5)})

result = df.pivot_table(index='A', values='B', dropna=dropna)
expected = pd.DataFrame(
{'B': [2, 3]},
index=pd.Index(
pd.Categorical.from_codes([0, 1],
categories=['low', 'high'],
ordered=True),
name='A'))

tm.assert_frame_equal(result, expected)

Copy link
Member

@jschendel jschendel May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add tests where columns/values and index/columns/values are specified for pivot_table? It looks like these fail with a similar setup on 0.23.0.

Using the same definition of df as you used in your test, columns/values is incorrect:

In [3]: pd.__version__
Out[3]: '0.23.0'

In [4]: df.pivot_table(columns='A', values='B')
Out[4]:
A  NaN  low
B  2.0  3.0

Similarly index/columns/values is incorrect:

In [5]: df['AA'] = df['A']

In [6]: df.pivot_table(index='A', columns='AA', values='B')
Out[6]:
AA   NaN  low
A
NaN  2.0  NaN
low  NaN  3.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel I tested that those are working correctly with this PR, but given I wanted to get this in for the release I already merged. But it's indeed true it would be good to add those as additional test case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #21370 to keep track of this

def test_pass_array(self):
result = self.data.pivot_table(
'D', index=self.data.A, columns=self.data.C)
Expand Down