Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: avoid unnecessary casting when unstacking index with unused levels #18460

Merged
merged 1 commit into from
Jan 16, 2018

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Nov 23, 2017

closes #17845
closes #18562

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@toobaz toobaz force-pushed the slice_unstack_unused branch from 7dd32be to 99a5dce Compare November 23, 2017 23:08
@codecov
Copy link

codecov bot commented Nov 24, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@de42bee). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #18460   +/-   ##
=========================================
  Coverage          ?   91.55%           
=========================================
  Files             ?      147           
  Lines             ?    48827           
  Branches          ?        0           
=========================================
  Hits              ?    44703           
  Misses            ?     4124           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.92% <100%> (?)
#single 41.69% <6.25%> (?)
Impacted Files Coverage Δ
pandas/core/reshape/reshape.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de42bee...a629b82. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Nov 24, 2017

related to #17886 or is that separate?

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 24, 2017
new_names = [self.value_columns.name, self.removed_name]
new_labels = [propagator]

new_labels.append(np.tile(np.arange(stride) - self.lift, width))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what is going on here (e.g. the unsused bizness)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see below)

exp_data[idces] = data
cols = pd.MultiIndex.from_product([[0, 1], col_level])
expected = pd.DataFrame(exp_data.reshape(3, 6),
index=idx_level, columns=cols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have an exact expected frame and assert_frame_equal (maybe more code, but it really locks it down to the exact result).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frame is an exact copy, but assert_frame_equal fails (two lines below) because of #18455 . So until that is fixed, I guess I can only add a check on the dtypes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So until that is fixed, I guess I can only add a check on the dtypes.

(shall I?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worthwhile to fix #18455 first actually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : seems like #18455 won't be fixed before #18600, #18626, #18769... and the present PR (probably together with #18562) is hanging only for a workaround in a test. Still want to wait?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

@toobaz
Copy link
Member Author

toobaz commented Nov 24, 2017

related to #17886 or is that separate?

Separate - this fix handles specifically the case of unused levels. But I plan to look at #17886 too.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiirc this was pretty good. pls rebase and let's see where this is.

@@ -318,6 +318,7 @@ Sparse
Reshaping
^^^^^^^^^

- Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.23

@toobaz toobaz force-pushed the slice_unstack_unused branch from 26a9968 to 2dc6cab Compare January 15, 2018 10:32
@toobaz
Copy link
Member Author

toobaz commented Jan 15, 2018

@jreback rebased, I also verified that this fixes #18562 and added a test for that

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. style comments.

new_names = [self.value_columns.name, self.removed_name]
new_labels = [propagator]

new_labels.append(np.tile(np.arange(stride) - self.lift, width))
# The two indices differ iff the unstacked level had unused items.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iff -> if

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if and only if - but OK)

else:
# Otherwise, we just use each level item exactly once:
repeater = np.arange(stride) - self.lift
# The entire level is then just a repetition of the single chunk:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line here

@@ -560,6 +560,73 @@ def test_unstack_dtypes(self):
assert left.shape == (3, 2)
tm.assert_frame_equal(left, right)

def test_unstack_unused_levels(self):
# GH 17845: sliced columns of int DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give 1 more line of expl here

result = df.unstack()
expected = pd.DataFrame(np.concatenate([block * 2, block * 2 - 1],
axis=1),
columns=idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are not comparing here?

expected.columns = MultiIndex.from_product([expected.columns, ['I']],
names=[None, 'C'])
expected.index = expected.index.droplevel('C')
assert_frame_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_frame_equal for consistentcy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the following test uses assert_frame_equal, but if that's a leftover, OK


@pytest.mark.parametrize("cols", [['A', 'C'], slice(None)])
def test_unstack_unused_level(self, cols):
# GH 18562 : unused labels on the unstacked level
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you had 2 cases for #18562 does this cover both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, they are the two "cols" values

@toobaz toobaz force-pushed the slice_unstack_unused branch from 2dc6cab to a629b82 Compare January 15, 2018 14:17
@toobaz
Copy link
Member Author

toobaz commented Jan 15, 2018

@jreback ping

@jreback jreback added this to the 0.23.0 milestone Jan 16, 2018
@jreback jreback merged commit 7208610 into pandas-dev:master Jan 16, 2018
@jreback
Copy link
Contributor

jreback commented Jan 16, 2018

thanks @toobaz

@toobaz toobaz deleted the slice_unstack_unused branch January 16, 2018 04:47
@TomAugspurger
Copy link
Contributor

FYI, this caused a 25-30% slowdown in this ASV: http://pandas.pydata.org/speed/pandas/#reshape.SparseIndex.time_unstack

@toobaz
Copy link
Member Author

toobaz commented Jan 17, 2018

I guess it's the remove_unused_levels call. I can see if (and how) it can be ran on specific levels only.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2018

can u create an issue to track this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
3 participants