Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na_position doesn't work for sort_index() with MultiIndex #15845

Closed
wants to merge 1 commit into from
Closed

na_position doesn't work for sort_index() with MultiIndex #15845

wants to merge 1 commit into from

Conversation

linebp
Copy link
Contributor

@linebp linebp commented Mar 30, 2017

@codecov
Copy link

codecov bot commented Mar 30, 2017

Codecov Report

Merging #15845 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15845      +/-   ##
==========================================
- Coverage   90.97%   90.96%   -0.02%     
==========================================
  Files         143      143              
  Lines       49442    49446       +4     
==========================================
- Hits        44982    44977       -5     
- Misses       4460     4469       +9
Flag Coverage Δ
#multiple 88.72% <ø> (ø) ⬆️
#single 40.69% <ø> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.57% <ø> (-0.1%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de589c2...e56a524. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 30, 2017

Codecov Report

Merging #15845 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15845      +/-   ##
==========================================
- Coverage   90.79%   90.79%   -0.01%     
==========================================
  Files         156      156              
  Lines       50534    50537       +3     
==========================================
+ Hits        45883    45885       +2     
- Misses       4651     4652       +1
Flag Coverage Δ
#multiple 88.56% <100%> (-0.01%) ⬇️
#single 40.44% <20%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.97% <100%> (ø) ⬆️
pandas/core/frame.py 97.65% <100%> (ø) ⬆️
pandas/core/indexes/multi.py 96.71% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.33% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f114af0...66f809e. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2017

@linebp I'd prefer if you did something like I illustrated here: #14784 (comment)

Put these tests (and on Series as well) in test_multilevel.py where these types of things live. Further Ideally we'd do a lower-level test in pandas/tests/indexes/tests/test_multi.py and/or pandas/tests/tests_sorting.py

@linebp
Copy link
Contributor Author

linebp commented Mar 31, 2017

I am not sure what I am missing, I tried to do as suggested. These are my steps figuring out what to do:

  • I looked for a function called _lexsort_indexer and didn't manage to locate it.
  • I figured what you meant was the function lexsort_indexer, which made sense to me, as this function is called while sorting the index.
  • The lexsort_indexer is called directly from sort_index, no code in MultiIndex is called(?)
  • I put the fix in sort_index as that is the function calling lexsort_indexer. This made good sense to me, as lexsort_indexer is not actually broken, but the data passed to it is.

Where and how did I go wrong?

I'll move the tests to a more appropriate location! ... and test the Series sorting too.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs testing for Series (which won't work as you have changed the code)

@@ -3392,7 +3392,13 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
if not labels.is_lexsorted():
labels = MultiIndex.from_tuples(labels.values)

indexer = lexsort_indexer(labels.labels, orders=ascending,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said, you can simply create a method on MultiIndex, called _lexsort_indexer (let's make it private), that returns the correct indexer.

from pandas.core.sorting import lexsort_indexer

# make sure that the axis is lexsorted to start
# if not we need to reconstruct to get the correct indexer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so leave the is_lexsorted check in here; this is fixed in #15694 w/o this subtle things will break.


indexer = lexsort_indexer(labels.labels, orders=ascending,
na_position=na_position)
indexer = labels._lexsort_indexer(orders=ascending,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do

indexer = lexsort_indexer(labels._get_labels_for_sorting(), orders=ascending,
                                      na_position=na_position)

so IOW rename pandas.indexes.multi._lexsort_indexer (your new routines) to _get_labels_for_sorting

@@ -1763,8 +1763,8 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
new_index, indexer = index.sortlevel(level, ascending=ascending,
sort_remaining=sort_remaining)
elif isinstance(index, MultiIndex):
from pandas.core.sorting import lexsort_indexer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the same change as above

# else:
# mi = self

keys = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to _get_labels_for_sorting()

add a doc-string.

rename keys -> labels

@linebp
Copy link
Contributor Author

linebp commented Apr 5, 2017

so, add the is_lexsorted check to both series and frames sort_index right before call to _lexsort_indexer? What is the reason for not adding the check to the _lexsort_indexer function?

Panels seems to have a function similar to sort_index on frames and series, is that in the scope of this bug?

If you use the level option when calling sort_index, the na_position option is ignored. Should this be fixed to?

@jreback
Copy link
Contributor

jreback commented Apr 5, 2017

@linebp this needs to be very targeted. don't touch Panel and don't worry about level.

instead of alabels.labels, call labels._get_labels_for_sorting()

it is very easy to break things w/o very specific patches.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

@linebp maybe I wasn't clear. These ALL the existing code, except for

indexer = lexsort_indexer(index._get_labels_for_sorting(), orders=ascending)

where _get_labels_for_sorting() is the routine that you added _lexsort_indexer

so you are adding 1 function to MultiIndex, and changing 1 line in series & 1 line in frame.py thats it.

@linebp
Copy link
Contributor Author

linebp commented Apr 7, 2017

I am still a bit confused about what to do, but this adds 1 function and only changes 1 line in series and frame each. No comments on the tests, so I let those stay as is.

@jreback
Copy link
Contributor

jreback commented Apr 7, 2017

ok, pls rebase this as I pushed in a bunch of fixes for sorting (generally)

@linebp
Copy link
Contributor Author

linebp commented Apr 16, 2017

I made an effort to rebase, how is it looking now?

@jreback
Copy link
Contributor

jreback commented Apr 16, 2017

@linebp code looks good. pls add a whatsnew (bug fix). make sure you rebase on origin (there are still conflicts showing)

git rebase -i thisbranch origin/master

then push with -f to update.

@linebp
Copy link
Contributor Author

linebp commented Apr 19, 2017

It fails when Categorical.from_codes() is called on an index where some rows have been removed, for example with dropna(). The index is not relabelled and so the labels may be outside the -1 to len(categories) range. Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.to_datetime([
        '20121002', '20121007', '20130130', '20130202', '20130305',
        '20121002', '20121207', '20130130', '20130202', '20130305',
        '20130202', '20130305'
    ]),
    'user_id': [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5],
    'whole_cost': [1790, np.nan, 280, 259, np.nan, 623, 90, 312, np.nan, 301,
                   359, 801],
    'cost': [12, 15, 10, 24, 39, 1, 0, np.nan, 45, 34, 1, 12]
}).set_index(['date', 'user_id'])

df.dropna().sort_index()

@jreback
Copy link
Contributor

jreback commented Apr 19, 2017

@linebp I pushed a fix to your branch. can you add the above as an additional tests. ping on green.

@jreback jreback added this to the 0.20.0 milestone Apr 19, 2017
@jreback jreback closed this in dd5cef5 Apr 19, 2017
@linebp linebp deleted the json_normalize_seperator branch April 20, 2017 08:41
analyticalmonk pushed a commit to analyticalmonk/pandas that referenced this pull request Apr 20, 2017
closes pandas-dev#14784

Author: Line Pedersen <linebp@users.noreply.github.com>

Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits:

66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
closes pandas-dev#14784

Author: Line Pedersen <linebp@users.noreply.github.com>

Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits:

66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

na_position doesn't work for sort_index() with MultiIndex
2 participants