BUG: IndexError: positional indexers are out-of-bounds iloc boolean indexing #39004
Comments
Hi, thanks for your report. Why are you using a DataFrame as a mask? I am not sure if this is intended to work. |
I'm using a DataFrame mask because I have a 2D boolean mask that marks all the places I want to change in my 2D float DataFrame. I was hoping there would be a function for this that saves me from writing a loop (not for performance, just for convenience). The documentation says it is possible to index via iloc with "A boolean array (any NA values will be treated as False)." What I found so confusing is that the indexing actually works if I use an assignment, but it breaks if I only evaluate the indexing operation. As the assignment is something that is done on top of the indexing, it makes no sense to me that it would work. |
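For the use case described here (changing every cell marked by a 2D boolean mask without writing a loop), a minimal sketch using label-aligned tools rather than iloc might look like the following. The frame, threshold, and replacement value are made up for illustration; `DataFrame.mask` and assignment through `[]` with a boolean DataFrame are existing pandas features.

```python
import pandas as pd

# Illustrative data, not taken from the original report.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mask = df > 2.5  # boolean DataFrame with the same labels as df

# Replace every masked cell; df itself is left unchanged.
replaced = df.mask(mask, 0.0)

# Equivalent assignment through [] on a copy.
in_place = df.copy()
in_place[mask] = 0.0
```

Both forms align the mask on labels, so they are only equivalent to a positional mask when the mask and the frame share the same index and columns.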
Our fallback for assignments is to operate on arrays. Hence it works. iloc is two dimensional, but I am not quite sure what you want to achieve. Something like
should not be possible. You have to specify rows to select and columns to select. @jbrockmendel thoughts here? I understood array as numpy array or pandas array not DataFrame. |
@phofl It does work though, check out this:
I don't have to specify rows and columns separately. |
Sorry, was not clear enough. I did not expect
to work. I don't even know what this should return in your case. Meaning getitem |
I think what numpy does is pretty good, so returning the elements, selected by the mask, in a Series sounds appropriate to me. That would solve two things at once, my concern about it being weird that something is only working if you do an assignment on top of it, plus it provides more functionality. |
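The NumPy behaviour referred to here can be sketched as follows: indexing a 2D array with a 2D boolean mask of the same shape returns the selected elements as a 1D array. The data is made up, and wrapping the result in a Series is only an illustration of the suggestion, not something pandas does for iloc.

```python
import pandas as pd

# Illustrative data, not taken from the original report.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mask = (df > 2.5).to_numpy()     # plain 2-D boolean ndarray

selected = df.to_numpy()[mask]   # 1-D array of the masked elements, row-major order
as_series = pd.Series(selected)  # what a Series-returning getitem could look like
```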
This is not intended to work at all. iloc does not align, and especially not on a DataFrame. Instead, the indexer gets turned into an ndarray and can easily be the wrong shape. |
Maybe we should raise when a DataFrame is given as indexer? |
I think this makes sense. It definitely shouldn't be working for iloc.setitem while failing for iloc.getitem. |
yep agreed |
I want to further clarify that I had a misunderstanding, which was also the main thing I originally objected to in this issue. I assumed that if a code snippet B contains a code snippet A, then B would depend on A working independently. This is however not true if there are operators in the expression, since they can change the meaning. So in this example
Snippet B does not depend on A working, as it does something completely different (writing instead of reading memory), even though A is contained in the code of B (if you look at the characters). So the reason I originally created the issue, out of a misunderstanding, is resolved. Thankfully you found out that iloc.setitem shouldn't be possible with a DataFrame indexer, so something good came out of my misunderstanding. A raised error would also have prevented me from running into this mess. |
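The point about reading versus writing can be shown with a toy class: Python routes `obj[key]` to `__getitem__` and `obj[key] = value` to `__setitem__`, so the assignment never evaluates the read. `Recorder` is a hypothetical class for illustration and is unrelated to the pandas internals.

```python
class Recorder:
    """Toy container that records which indexing hook Python calls."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("getitem", key))
        return None

    def __setitem__(self, key, value):
        self.calls.append(("setitem", key, value))


r = Recorder()
r["mask"]      # snippet A: a read, dispatched to __getitem__
r["mask"] = 3  # snippet B: a write, dispatched to __setitem__ only
```

After these two statements `r.calls` holds one `getitem` entry and one `setitem` entry, which is why one path can succeed while the other raises.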
I got a question. If I wanted to use a boolean array with pandas for indexing I can use |
Indexing with a boolean DataFrame is definitely supported with But if that's indeed the case: if it works for |
The problem is the missing alignment in iloc. See the following:
returns
What to do in case of iloc?
returns
while iloc returns
which is counterintuitive compared to loc and []. I would argue that the fact that this works with iloc is simply a bug. |
Why would I pass indices that are not considered in the indexing process? |
Ok another example to clarify the point I am trying to make concerning iloc and []
The last one actually raises, which should be expected because of the missing alignment in iloc
|
I get what you mean. What bugs me though is the fact that you can pass indices that will not be considered in the indexing process. Let's say we have this:
This will still work even though the last row won't be considered in the indexing process at all. I'm addressing this because I ran into it in my code. I had a boolean mask of the same shape as my df, but the indices of the mask were different (due to some data mangling), and then I was wondering why df[mask] wouldn't work as expected (that is, that all the indices provided in the mask would actually be considered, and not only those in the intersection). |
This is what loc and [] do, they align the objects. This is documented in the docstrings and in the user guide (here for example https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label) |
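A small sketch of the alignment being discussed, with made-up data: the mask carries an extra row label (`2`) that does not exist in the frame, and `df[mask]` quietly drops it instead of raising.

```python
import pandas as pd

# Illustrative data; the mask has one more row label than the frame.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
mask = pd.DataFrame(
    {"a": [True, False, True], "b": [False, True, True]},
    index=[0, 1, 2],
)

# The mask is aligned to df's labels; row label 2 is ignored.
result = df[mask]
```

The result keeps the index of `df`, with NaN wherever the aligned mask is False.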
Yeah it makes sense to have partial selection where indexers can be fully aligned, but does it make sense to accept indexers that can only be partially aligned? |
Wouldn't it be proper to throw an out of bounds? When I do |
@gooney47 it's certainly a tricky issue if you run into it and don't expect the behaviour. But as @phofl said, this is certainly not accidental, but intentional and documented behaviour. Which of course doesn't mean we cannot reconsider it if it turns out to be confusing / non-ideal. @phofl thanks for those examples. Also Series.iloc already raises an error about this, supporting your argument that for
Now, I personally still think we could reconsider this for aligned objects. Your examples used intentionally unaligned dataframes, but allowing aligned objects with equal index (or even more strict with identical index) would at least enable the use case like For example, assume you want to select positionally on the columns, and a mask for the rows. Currently you cannot easily do this with |
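One way to approximate the use case above (a boolean mask for the rows combined with positional selection on the columns) is to convert the boolean Series to a NumPy array before handing it to iloc, which sidesteps alignment entirely. This is a sketch with made-up data, and it assumes the mask is already in the same row order as the frame.

```python
import pandas as pd

# Illustrative data, not taken from the thread.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)
row_mask = df["a"] > 1  # boolean Series, labelled like df

# Positional selection: boolean ndarray for rows, integer positions for columns.
subset = df.iloc[row_mask.to_numpy(), [0, 2]]
```

Because the labels are discarded, a mask with a different order or length would select the wrong rows or raise, which is the risk raised earlier in this thread.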
I tend to agree that
So looks like this is intentionally disallowed because it is not implemented, not because it should not work. For cases like |
@jorisvandenbossche I don't believe this is intended behavior, because there is inconsistent raising behaviour when you pass 1D indexer compared to passing 2D indexer:
Why should something suddenly be allowed just because the indexer has more dimensions? The reason you raise on non-existing indices is that you want people to notice that the indices they passed do not apply to the data, and I'm sure this has helped many people in the one-dimensional case. I just think it would make sense in more dimensions too. |
The 1d case should raise, this is not a boolean array. If you pass in a list or array like of labels this raises as soon as one label is not found |
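The 1D label case mentioned here can be sketched as follows, assuming pandas 1.0 or later, where a list of labels containing a missing label raises a `KeyError`. The data is made up.

```python
import pandas as pd

# Illustrative data; label 2 does not exist in the index.
df = pd.DataFrame({"a": [1, 2]}, index=[0, 1])

try:
    df.loc[[0, 2]]
    raised = False
except KeyError:
    raised = True
```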
I guess my general question is: why would it make sense to let someone pass indices that are not present in the data? Why not help people and tell them those indices aren't present, as we already do in the case of a 1D indexer? I would say that this should hold as a general principle, independent of what type of data you pass. Maybe it's just me though. |
Do I have to open up a new issue to ask my question, because this one is being seen as closed? I'm sorry, but I thought I had a point here that I like to be addressed. |
This issue is still open? |
@phofl So what do you think about
I don't understand why 2d boolean array labels should be treated differently than list labels. |
Here it is in code:
I also wouldn't make a distinction between index and column labels and check for validity of both. Just to be clear here, by validity I don't mean already existing alignment, but existence of all labels of the mask in the to be indexed dataframe. |
Since getitem dispatches to where in the case of a DataFrame, this is expected. I don't know what I would expect, but based on what we are doing with loc and the alignment of DataFrames on the rhs, this is somewhat consistent. I don't know how we should define alignable in the case of DataFrames. |
for |
If we don't raise, I don't understand how we want to align labels that are not present in the to be indexed DataFrame. Do we just want to leave them out in the alignment process? |
@gooney47 happy to take a PR for raising on the 2D inputs. let's see what this breaks. |
Will do on the weekend. Should I add a deprecation warning or just raise? Also raising/warning on general 2D's will prevent things like |
FutureWarning I would say, this has a pretty high impact on user code. |
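As a rough illustration of the deprecation route suggested here, a check could emit a `FutureWarning` before the behaviour is turned into an error. `check_positional_indexer` and `FakeFrame` are hypothetical names invented for this sketch; this is not the pandas implementation.

```python
import warnings


def check_positional_indexer(indexer):
    """Hypothetical helper: warn when a 2-D labelled object is used positionally."""
    is_labelled_2d = getattr(indexer, "ndim", 1) == 2 and hasattr(indexer, "columns")
    if is_labelled_2d:
        warnings.warn(
            "Passing a DataFrame as a positional indexer is deprecated "
            "and will raise in a future version.",
            FutureWarning,
            stacklevel=2,
        )
    return indexer


class FakeFrame:
    """Stand-in for a DataFrame so the sketch runs without pandas."""

    ndim = 2
    columns = ["a"]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_positional_indexer(FakeFrame())    # warns
    check_positional_indexer([True, False])  # plain list: no warning
```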
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
The pandas documentation says I can use a boolean array with iloc. The iloc call with the assignment works fine, but it fails without the assignment. As a side note: in pandas 1.0.1 not even the assignment works (it starts working again if there is more than one row of data).
Expected Output
3 (similar as df.values[mask])
Output of
pd.show_versions()