BUG: IndexError: positional indexers are out-of-bounds iloc boolean indexing #39004

Closed
2 of 3 tasks
gooney47 opened this issue Jan 6, 2021 · 35 comments · Fixed by #39022
Labels
Bug; Error Reporting (Incorrect or improved errors from pandas); Indexing (Related to indexing on series/frames, not to indexes themselves)
Comments

@gooney47

gooney47 commented Jan 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame([[0, 1, 2]], columns=['a', 'b', 'c'])
mask = pd.DataFrame([[False, True, False]], columns=['a', 'b', 'c'])
df.iloc[mask] = 3 # Works fine with assignment
print(df.iloc[mask]) # Throws IndexError, but should give similar result as df.values[mask]

Problem description

The pandas documentation says I can use a boolean array with iloc. The iloc call with assignment works fine, but the same call without assignment fails. As a side note: in pandas 1.0.1 not even the assignment works (it starts working again if there is more than one row of data).

Expected Output

3 (similar to df.values[mask])
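
For reference, a minimal sketch of the NumPy behaviour the expected output alludes to: a 2D boolean mask applied to the underlying array returns the selected elements as a 1D array. The explicit `to_numpy()` conversion is a workaround shown for illustration only, and the frame here is the one from the code sample before any assignment.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=["a", "b", "c"])
mask = pd.DataFrame([[False, True, False]], columns=["a", "b", "c"])

# Drop the labels explicitly and index the raw array with the raw mask.
selected = df.to_numpy()[mask.to_numpy()]

# Label-aware alternative: cells that are not selected become NaN
# instead of disappearing from the result.
masked = df[mask]
```

The first form discards labels entirely, so it is only safe when the mask is known to have the same shape and ordering as the frame.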

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 3e89b4c
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-58-generic
Version : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.18.4
pytz : 2019.3
dateutil : 2.7.3
pip : 20.2
setuptools : 46.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.7.3
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.1

gooney47 added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jan 6, 2021
@phofl
Member

phofl commented Jan 6, 2021

Hi, thanks for your report.

Why are you using a DataFrame as a mask? I am not sure if this is intended to work.

@gooney47
Author

gooney47 commented Jan 6, 2021

I'm using a DataFrame mask because I have a 2D boolean mask that marks all the places I want to change in my 2D float DataFrame. I was hoping there would be a function for this that saves me from writing a loop (not for performance, just for convenience). The documentation says it is possible to use "A boolean array (any NA values will be treated as False)." to index via iloc. What I found so confusing is that the indexing works if I use an assignment, but breaks if I only evaluate the indexing operation. As the assignment is something done on top of the indexing, it makes no sense to me that it would work.

@phofl
Member

phofl commented Jan 6, 2021

Our fallback for assignments is to operate on arrays. Hence it works.

iloc is two dimensional, but I am not quite sure what you want to achieve. Something like

mask = pd.DataFrame(
[
    [False, True, False],
    [True, False, True],
], columns=['a', 'b', 'c'])

should not be possible. You have to specify rows to select and columns to select.

@jbrockmendel thoughts here? I understood "array" to mean a NumPy array or pandas array, not a DataFrame.

@gooney47
Author

gooney47 commented Jan 6, 2021

@phofl It does work though, check out this:

import pandas as pd
df = pd.DataFrame([
    [0, 1, 2], 
    [3, 4, 5]
], columns=['a', 'b', 'c'])
mask = pd.DataFrame(
[
    [False, True, False],
    [True, False, True],
], columns=['a', 'b', 'c'])
df.iloc[mask] = -1 # Works fine
print(df)

I don't have to specify rows and columns separately.

@phofl
Member

phofl commented Jan 7, 2021

Sorry, was not clear enough. I did not expect

df.iloc[mask]

to work. I don't even know what this should return in your case. I mean getitem here.

@gooney47
Author

gooney47 commented Jan 7, 2021

I think what numpy does is pretty good, so returning the elements, selected by the mask, in a Series sounds appropriate to me. That would solve two things at once, my concern about it being weird that something is only working if you do an assignment on top of it, plus it provides more functionality.
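
One way the suggested behaviour could look, as a rough sketch only: `select_as_series` is a hypothetical helper (not a pandas API) that ignores the mask's labels, picks the flagged cells positionally as NumPy would, and labels each returned element with its row and column.

```python
import numpy as np
import pandas as pd


def select_as_series(df, mask):
    """Hypothetical helper: return the cells flagged by a 2D boolean mask."""
    flags = np.asarray(mask, dtype=bool)
    if flags.shape != df.shape:
        raise IndexError("mask shape must match the DataFrame shape")
    # Positions of the True cells, in row-major order like NumPy.
    rows, cols = np.nonzero(flags)
    index = pd.MultiIndex.from_arrays([df.index[rows], df.columns[cols]])
    return pd.Series(df.to_numpy()[rows, cols], index=index)


df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])
mask = pd.DataFrame(
    [[False, True, False], [True, False, True]], columns=["a", "b", "c"]
)
picked = select_as_series(df, mask)
```

Because the helper is purely positional, it has the same alignment blind spot that is discussed later in the thread.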

@jreback
Contributor

jreback commented Jan 7, 2021

this is not intended to work at all

iloc does not align, and especially does not align on a DataFrame

instead this gets turned into an ndarray and can easily be the wrong shape
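
A small sketch of why dropping the labels matters, under the assumption that the mask's columns are merely in a different order than the frame's. Converting the mask to an ndarray keeps only positions, so the selection lands on a different cell than the label-aligned `df[mask]` does.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Same labels as df, but the columns are listed in the opposite order.
mask = pd.DataFrame({"b": [True, False], "a": [False, False]})

# Positional view: the True sits at position (0, 0), which is column "a" in df.
positional = df.to_numpy()[mask.to_numpy()]

# Label-aligned view: the True belongs to column "b".
aligned = df[mask]
```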

@phofl
Member

phofl commented Jan 7, 2021

Maybe we should raise when a DataFrame is given as indexer?

@jbrockmendel
Member

Maybe we should raise when a DataFrame is given as indexer?

I think this makes sense. It definitely shouldn't be working for iloc.setitem but not iloc.getitem.

@jreback
Contributor

jreback commented Jan 7, 2021

yep agreed

phofl added the Error Reporting (Incorrect or improved errors from pandas) and Indexing (Related to indexing on series/frames, not to indexes themselves) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Jan 7, 2021
@gooney47
Author

gooney47 commented Jan 7, 2021

I want to further clarify that I had a misunderstanding, which was also the main thing I originally objected to in this issue. I assumed that if a code snippet B contains a code snippet A, then B would depend on A working independently. This is however not true if there are operators in the expression, since they can change the meaning. So in this example

[A] df[mask]
[B] df[mask] = 3

The snippet B does not depend on A to work as it's doing something completely different (writing instead of reading memory), even though A is contained in the code of B (if you look at the characters). So the reason why I originally created the issue, out of a misunderstanding, is solved.

Thankfully you found out that iloc.setitem shouldn't be possible with a DataFrame indexer, so something good came out of my misunderstanding. A raised error would also have prevented me from running into this mess.

@gooney47
Author

gooney47 commented Jan 7, 2021

I have a question. If I wanted to use a boolean array with pandas for indexing, I can use df[mask] instead of df.iloc[mask], right? Because if I do something like df[df == 2], it's basically using a boolean DataFrame as an index. Or is this just more of the same, something that is working right now but actually shouldn't be?

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 7, 2021

Indexing with a boolean DataFrame is definitely supported with __getitem__, like the example @gooney47 gives (df[df == 2] or df[df == 2] = 3).
I think this is intentionally supported and documented?

But if that's indeed the case: if it works for __getitem__, is there a specific reason not to allow it for loc/iloc? (There might be a good reason, I didn't think it through fully, but I think generally we want getitem to be a subset of loc/iloc?)
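
The supported `__getitem__` and `__setitem__` forms mentioned above can be sketched as follows. The frame is made up for illustration; the mask is derived from the frame itself, so it is aligned by construction.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 2, 5]], columns=["a", "b", "c"])

# Reading: cells where the mask is False come back as NaN.
selected = df[df == 2]

# Writing: only the cells where the mask is True are overwritten.
df[df == 2] = 3
```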

@phofl
Member

phofl commented Jan 7, 2021

The problem is the missing alignment in iloc. See the following:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})
df[indexer]

returns

    a    b
0 NaN  3.0
1 NaN  NaN

What to do in case of iloc?
Same for setitem

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})
df[indexer] = 5

returns

   a  b
0  1  5
1  2  4

while iloc returns

   a  b
0  5  3
1  2  5

which is counterintuitive compared to loc and []

I would argue that the fact that this is working with iloc is simply a bug
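
The two outcomes above can be reproduced without passing a DataFrame to `iloc`. This is a sketch under the assumption that the positional result corresponds to indexing the underlying array with the mask's raw values, which is how the assignment fallback was described earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})

# Label-aligned assignment: only column "b" exists in both objects.
aligned = df.copy()
aligned[indexer] = 5

# Positional assignment: labels are ignored, only the 2x2 layout matters.
values = df.to_numpy().copy()
values[indexer.to_numpy()] = 5
positional = pd.DataFrame(values, index=df.index, columns=df.columns)
```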

@gooney47
Author

gooney47 commented Jan 8, 2021

Why would I pass indices that are not considered in the indexing process?

@phofl
Member

phofl commented Jan 8, 2021

Ok another example to clarify the point I am trying to make concerning iloc and []

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

indexer = pd.DataFrame({"b": [True, False]})
df[indexer]

    a    b   c
0 NaN  3.0 NaN
1 NaN  NaN NaN

df[indexer] = 10
df

   a   b  c
0  1  10  5
1  2   4  6

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df.iloc[indexer] = 10

The last one actually raises, which should be expected because of the missing alignment in iloc

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.3/scratches/scratch_4.py", line 389, in <module>
    df.iloc[indexer] = 10
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 691, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1640, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1866, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 562, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 428, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 989, in setitem
    values[indexer] = value
IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 1
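
The last frame of the traceback is a plain NumPy assignment, so the failure can be sketched without pandas at all. The arrays below mirror the shapes in the example (a 2x3 block of values and a 2x1 boolean indexer); the exact wording of the message may differ between NumPy versions.

```python
import numpy as np

values = np.array([[1, 3, 5], [2, 4, 6]])  # shape (2, 3), like df
indexer = np.array([[True], [False]])      # shape (2, 1), like the mask

raised = False
try:
    # NumPy does not broadcast boolean masks; the shapes must match.
    values[indexer] = 10
except IndexError as exc:
    raised = True
    message = str(exc)
```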

@gooney47
Author

gooney47 commented Jan 8, 2021

I get what you mean. What bugs me though is the fact that you can pass indices that will not be considered in the indexing process. Let's say we have this:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
indexer = pd.DataFrame({"b": [True, False, True], 'a': [False, True, False]})

This will still work even though the last row won't be considered in the indexing process at all. I'm raising this because I ran into it in my code. I had a boolean mask of the same shape as my df, but the indices of the mask were different (due to some data mangling), and then I was wondering why df[mask] wouldn't work as expected (that is, that all the indices provided in the mask would actually be considered, and not only those where the indices intersect).

@phofl
Member

phofl commented Jan 8, 2021

This is what loc and [] do, they align the objects. This is documented in the docstrings and in the user guide (here for example https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label)
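
A minimal sketch of that alignment, using a made-up frame and a boolean Series whose index is in a different order. `loc` matches the mask to rows by label, while a purely positional selection with the raw values picks a different row.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
# Same labels as df.index, but in reverse order.
mask = pd.Series([True, False, False], index=["z", "y", "x"])

# Label-based: the True is attached to label "z".
by_label = df.loc[mask]

# Position-based: the True is in the first slot, which is row "x" in df.
by_position = df.iloc[mask.to_numpy()]
```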

@gooney47
Author

gooney47 commented Jan 8, 2021

Yeah it makes sense to have partial selection where indexers can be fully aligned, but does it make sense to accept indexers that can only be partially aligned?

@gooney47
Author

gooney47 commented Jan 8, 2021

Wouldn't it be proper to throw an out-of-bounds error? When I do df[['a', 'b', 'd']] and my df does not have d, it throws too.

@jorisvandenbossche
Member

@gooney47 it's certainly a tricky issue if you run into it and don't expect the behaviour. But as @phofl says, this is certainly not accidental, but intentional and documented behaviour. Which of course doesn't mean we cannot reconsider it if it turns out to be confusing / non-ideal.
(personally, I don't think I ever made use of this feature)

@phofl thanks for those examples. Also, Series.iloc already raises an error about this, supporting your argument that for iloc it is not supposed to work:

In [172]: s = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [173]: s.iloc[s > 1]
...
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

Now, I personally still think we could reconsider this for aligned objects. Your examples used intentionally unaligned dataframes, but allowing aligned objects with equal index (or even more strict with identical index) would at least enable the use case like df.iloc[df == 2] (where the dataframe itself is used to create the mask), a case for which I think there is no ambiguity?

For example, assume you want to select positionally on the columns, and with a mask on the rows. Currently you cannot easily do this with iloc. E.g. df.iloc[df['a'] > 2, 0] is also not allowed, although I think it is not ambiguous (and the workarounds are not an improvement IMO, e.g. df.loc[df['a'] > 2].iloc[:, 0] or df.iloc[(df['a'] > 2).values, 0]).
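
The two workarounds named above, written out on a made-up frame. `to_numpy()` is used here in place of `.values`; both strip the index from the boolean Series so that `iloc` receives a plain boolean array.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5], "b": [10, 30, 50]})
row_mask = df["a"] > 2

# Workaround 1: filter rows by label alignment, then pick the column by position.
via_loc = df.loc[row_mask].iloc[:, 0]

# Workaround 2: strip the index from the mask and stay entirely positional.
via_numpy = df.iloc[row_mask.to_numpy(), 0]
```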

@phofl
Member

phofl commented Jan 8, 2021

I tend to agree that df.iloc[df['a'] > 2, 0] should be allowed, but this raises

NotImplementedError: iLocation based boolean indexing on an integer type is not available

So looks like this is intentionally disallowed because it is not implemented, not because it should not work.

For cases like df.iloc[df == 2]: I don't think this is a case which has to work for iloc; you can simply use loc here, which should give the same result, shouldn't it? Using iloc without positions is somewhat redundant?

@gooney47
Author

gooney47 commented Jan 8, 2021

@jorisvandenbossche I don't believe this is intended behaviour, because the raising behaviour is inconsistent when you pass a 1D indexer compared to a 2D indexer:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
indexer_1d = pd.Series(['b', 'c'])
indexer_2d = pd.DataFrame({'b': [True, False], 'c': [False, True]})
print(df[indexer_2d]) # works
df[indexer_1d] # raises

Why should something suddenly be allowed just because the indexer has more dimensions? The reason you raise on non-existing indices is that you want people to notice that the indices they passed do not apply to the data, and I'm sure this has helped many people in the one-dimensional case. I just think it would make sense in more dimensions too.

@phofl
Member

phofl commented Jan 8, 2021

The 1D case should raise, because this is not a boolean array. If you pass in a list or array-like of labels, this raises as soon as one label is not found.
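
The distinction can be sketched side by side: a list of labels is a lookup and fails on a missing label, while a boolean DataFrame is a mask and is aligned, so labels that do not exist in the frame are silently dropped.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A list of labels: "c" does not exist, so the lookup raises.
label_lookup_raised = False
try:
    df[["b", "c"]]
except KeyError:
    label_lookup_raised = True

# A boolean mask: column "c" is ignored during alignment, no error.
bool_mask = pd.DataFrame({"b": [True, False], "c": [False, True]})
masked = df[bool_mask]
```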

@gooney47
Author

gooney47 commented Jan 8, 2021

I guess my general question is: why would it make sense to let someone pass indices that are not present in the data? Why not help people and say that they aren't present, as we already do in the case of a 1D indexer? I would say that this should hold as a general principle, independent of what type of data you pass. Maybe it's just me though.

jreback added this to the 1.3 milestone Jan 8, 2021
@gooney47
Author

Do I have to open a new issue to ask my question, because this one is being seen as closed? I'm sorry, but I thought I had a point here that I would like to be addressed.

@phofl
Member

phofl commented Jan 12, 2021

This issue is still open?

@gooney47
Author

gooney47 commented Jan 13, 2021

@phofl So what do you think about

I guess my general question is: why would it make sense to let someone pass indices that are not present in the data? Why not help people and say that they aren't present, as we already do in the case of a 1D indexer? I would say that this should hold as a general principle, independent of what type of data you pass. Maybe it's just me though.

I don't understand why 2d boolean array labels should be treated differently than list labels.

@gooney47
Author

gooney47 commented Jan 13, 2021

Here it is in code:

import pandas as pd
df = pd.DataFrame([
    [0, 1, 2],
    [3, 4, 5]
], columns=['a', 'b', 'c'], index=[0, 1])
valid_mask_1d = pd.Series([False, True], index=[0, 1])
invalid_mask_1d = pd.Series([False, True], index=[0, 2])
df[valid_mask_1d] # Works
# df[invalid_mask_1d] # Raises due to invalid index

valid_mask_2d = pd.DataFrame([
    [False, True, False],
    [False, True, False]
], index=[0, 1], columns=['a', 'b', 'c'])
invalid_mask_2d = pd.DataFrame([
    [False, True, False],
    [False, True, False]
], index=[0, 2], columns=['a', 'b', 'c'])
df[valid_mask_2d] # Works
df[invalid_mask_2d] # Works (I want this to raise too as in 1d case)

I also wouldn't make a distinction between index and column labels and check for validity of both. Just to be clear here, by validity I don't mean already existing alignment, but existence of all labels of the mask in the to be indexed dataframe.
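
The check proposed here could be sketched as below. `checked_getitem` is a hypothetical helper written for illustration, not a pandas API: it raises when the mask carries row or column labels that the frame lacks, and otherwise falls back to plain `df[mask]`.

```python
import pandas as pd


def checked_getitem(df, mask):
    """Hypothetical helper: refuse masks with labels unknown to df."""
    missing_rows = mask.index.difference(df.index)
    missing_cols = mask.columns.difference(df.columns)
    if len(missing_rows) or len(missing_cols):
        raise KeyError(
            f"mask labels not found: rows={list(missing_rows)}, "
            f"columns={list(missing_cols)}"
        )
    return df[mask]


df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"], index=[0, 1])
cells = [[False, True, False], [False, True, False]]
valid_mask = pd.DataFrame(cells, index=[0, 1], columns=["a", "b", "c"])
invalid_mask = pd.DataFrame(cells, index=[0, 2], columns=["a", "b", "c"])

ok = checked_getitem(df, valid_mask)
try:
    checked_getitem(df, invalid_mask)
    rejected = False
except KeyError:
    rejected = True
```

Note that this only checks that the labels exist, not that they are in the same order, which matches the definition of validity given above.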

@phofl
Member

phofl commented Jan 17, 2021

Since getitem dispatches to where in the case of a DataFrame, this is expected. I don't know what I would expect, but based on what we are doing with loc and alignment of DataFrames on the rhs this is somewhat consistent.
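
A short sketch of that dispatch, reusing the frame and the misaligned mask from the previous comment: `df[mask]` and `df.where(mask)` give the same result, and the row label that exists only in the mask simply contributes nothing.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"], index=[0, 1])
invalid_mask_2d = pd.DataFrame(
    [[False, True, False], [False, True, False]],
    index=[0, 2],
    columns=["a", "b", "c"],
)

via_getitem = df[invalid_mask_2d]
via_where = df.where(invalid_mask_2d)
```

Row 1 of the frame has no counterpart in the mask, so it is treated as all False and comes back entirely as NaN.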

Don't know how we should define alignable in case of DataFrames

@jreback
Contributor

jreback commented Jan 18, 2021

For .loc we do align labels on the values and the masks/indexers. I think we could do this for a DataFrame as well. For getitem though, we do much less, and I think an argument could be made to raise on an alignable for 2D (e.g. DataFrame).

@gooney47
Author

If we don't raise, I don't understand how we want to align labels that are not present in the to be indexed DataFrame. Do we just want to leave them out in the alignment process?

@jreback
Contributor

jreback commented Jan 21, 2021

If we don't raise, I don't understand how we want to align labels that are not present in the to be indexed DataFrame. Do we just want to leave them out in the alignment process?

@gooney47 happy to take a PR for raising on the 2D inputs. let's see what this breaks.

@gooney47
Author

gooney47 commented Jan 22, 2021

Will do on the weekend. Should I do a deprecation warning or just raise? Also, raising/warning on 2D indexers in general will prevent things like df[df < 0]. We could check for existing alignment and allow things like df[df < 0].

@phofl
Member

phofl commented Jan 22, 2021

FutureWarning I would say, this has a pretty high impact on user code.
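
As a rough illustration of the direction discussed here, and not the change that was eventually made in pandas, a user-level wrapper could warn when the mask is not already aligned while leaving cases like `df[df < 0]` untouched. The function name and the warning text are assumptions made for this sketch.

```python
import warnings

import pandas as pd


def getitem_warn_if_unaligned(df, mask):
    """Hypothetical wrapper: warn when a 2D mask needs alignment."""
    same_labels = mask.index.equals(df.index) and mask.columns.equals(df.columns)
    if not same_labels:
        warnings.warn(
            "boolean DataFrame indexer is not aligned with the object",
            FutureWarning,
            stacklevel=2,
        )
    return df[mask]


df = pd.DataFrame({"a": [-1, 2], "b": [3, -4]})
shifted_mask = pd.DataFrame({"a": [True, False], "b": [False, True]}, index=[0, 2])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    negatives = getitem_warn_if_unaligned(df, df < 0)     # aligned, no warning
    partial = getitem_warn_if_unaligned(df, shifted_mask)  # unaligned, warns

ours = [w for w in caught if "not aligned" in str(w.message)]
```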

5 participants