Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: __getitem__ raise blank KeyError for IntervalIndex and missing keys #37873

Merged
merged 7 commits into from
Nov 26, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 15, 2020

Without tolist error would look like KeyError: array([6])

@phofl phofl added Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type labels Nov 15, 2020
@jbrockmendel
Copy link
Member

Tracking this down, it looks like the KeyError raised in Loc._validate_read_indexer has the well-written version of the message, and it looks like it is special-casing Categoricalndex for no apparent reason (CategoricalIndex and IntervalIndex being the only two Index subclasses that do the check this PR is looking at).

Not yet totally sure, but it may be that we should always have this exception be raised in _validate_read_indexer

@phofl
Copy link
Member Author

phofl commented Nov 16, 2020

@jbrockmendel I was thinking along the same line. One problem is, that we arrive in _get_listlike_indexer sometimes through _getitem_iterable where raise_missing ist set to False. This leads to supressed KeyErrors in some cases when not raising in IntervalIndex implementation. If we could change this to True, we do not need the raise above. But comment says will stop working in the future, so I thought we can not change this now.

@jbrockmendel
Copy link
Member

But comment says will stop working in the future, so I thought we can not change this now.

Can you track down when these comments got added? I think its likely that they are out of date.

@jbrockmendel
Copy link
Member

@phofl i came to the conclusion that the future-behavior you mentioned was supposed to have already been implemented, so those comments should be removed/updated.

moreover, if we just remove IntervalIndex._convert_list_indexer altogether (and disable it from being called from CategoricalIndex._convert_list_indexer, we end up falling though to raise raise KeyError(f"None of [{key}] are in the [{axis_name}]") in Loc._validate_read_indexer, like we would for any other Index.

The remaining trouble is that the exception message isn't quite right for IntervalIndex bc it isnt a matter of the key not being present, but the key not being contained in any of the intervals.

@phofl
Copy link
Member Author

phofl commented Nov 17, 2020

@jbrockmendel ok great. Will remove the code paths and update this pr with it. Should come to this tomorrow evening

@jreback jreback added this to the 1.2 milestone Nov 18, 2020
@phofl
Copy link
Member Author

phofl commented Nov 18, 2020

Removing

IntervalIndex._convert_list_indexer

completly does not work out of the box.

ser = Series(np.arange(5), IntervalIndex.from_breaks(np.arange(6)))
result = s.loc[[1.5, 2.5, 3.5]]

where result has index
idx=Index([1.5,2.5,3.5]) instead of intervals.

Additionally, if disable calling it from CategoricalIndex._convert_list_indexer

df = DataFrame({"A": range(10)})
s = pd.cut(df.A, 5)
df["B"] = s
df = df.set_index("B")
result = df.loc[[4]]

raises {KeyError}'a list-indexer must only include values that are in the categories' which is probably wrong because we have an interval containing 4?

@jbrockmendel
Copy link
Member

completly does not work out of the box.

Thanks for giving it a try. I may have been overly optimistic about reducing code here.

@phofl
Copy link
Member Author

phofl commented Nov 19, 2020

cc @jbrockmendel
Looked a bit more into this and got a follow up question.

df = DataFrame({"A": range(10)})
s = pd.cut(df.A, 5)
df["B"] = s
df = df.set_index("B")
result = df.loc[[4]]

This raises because

Int64Index([4], dtype='int64').difference(IntervalIndex([(-0.009, 1.8], (1.8, 3.6], (3.6, 5.4], (5.4, 7.2], (7.2, 9.0]],  closed='right', dtype='interval[float64]')

returns

Int64Index([4], dtype='int64')

I am not that familiar with interval support in this operation, but this seems at least odd. I would have expected one of the following:

  • Raising because incompatible
  • Empty return

Is my assumption correct? Then fixing this would bring us one step closer to removing this as you proposed

@jbrockmendel
Copy link
Member

This raises because

This example doesn't raise for me on master. Do you mean in this branch?

I am not that familiar with interval support in this operation, but this seems at least odd. I would have expected one of the following:

indexing on a CategoricalIndex is difficult, and I think not entirely consistent (pretty sure there's an issue about this).

IntervalIndex.get_loc/get_indexer behavior is pretty much unconnected to the .difference operation (though that would make sense for e.g. 2 Int64Indexes). So im not sure i understand the question.

@phofl
Copy link
Member Author

phofl commented Nov 19, 2020

Sorry, my question was rubbish without context.

If we remove

     if self.categories._defer_to_indexing:
            indexer = self.categories._convert_list_indexer(keyarr)
            return Index(self.codes).get_indexer_for(indexer)

from _convert_list_indexer in category.py as discussed above, my example raises. In this case it gets piped through to the code snippet posted above with the differences return.

@jbrockmendel
Copy link
Member

from _convert_list_indexer in category.py as discussed above, my example raises. In this case it gets piped through to the code snippet posted above with the differences return.

yah lets ignore this as orthogonal to this PR. can revisit later if we get ambitious about code simplification.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2020

this looks good, can you merge master @phofl if we want to open an issue about the actual error message ok by me.

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
@jreback jreback merged commit 2863428 into pandas-dev:master Nov 26, 2020
@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

thanks @phofl very nice, keep em coming (as you have been!)

@simonjayhawkins
Copy link
Member

if we want to open an issue about the actual error message ok by me.

#25996 already exists for key error messages in general.

topper-123 pushed a commit to topper-123/pandas that referenced this pull request Nov 26, 2020
@phofl phofl deleted the 27365 branch November 26, 2020 21:13
@phofl
Copy link
Member Author

phofl commented Nov 26, 2020

Thanks very much, its fun working with you all

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Nov 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: missing key not displayed in exception message for IntervalIndex
4 participants