BUG: IndexError: positional indexers are out-of-bounds iloc boolean indexing #39004

gooney47 · 2021-01-06T15:49:44Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame([[0, 1, 2]], columns=['a', 'b', 'c'])
mask = pd.DataFrame([[False, True, False]], columns=['a', 'b', 'c'])
df.iloc[mask] = 3 # Works fine with assignment
print(df.iloc[mask]) # Throws IndexError, but should give similar result as df.values[mask]

Problem description

The pandas documentation says I can use a boolean array with iloc. The iloc call with the assignment works fine, but without the assignment it fails. Just as a note: in pandas 1.0.1 not even the assignment works (it starts working again if there is more than one row of data).

Expected Output

3 (similar to df.values[mask])

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : 3e89b4c
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-58-generic
Version          : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.2.0
numpy            : 1.18.4
pytz             : 2019.3
dateutil         : 2.7.3
pip              : 20.2
setuptools       : 46.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.8.5 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fsspec           : 0.7.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : 0.49.1


phofl · 2021-01-06T19:14:19Z

Hi, thanks for your report.

Why are you using a DataFrame as a mask? I am not sure this is intended to work.

gooney47 · 2021-01-06T20:36:05Z

I'm using a DataFrame mask because I have a 2D boolean mask that marks all the places I want to change in my 2D DataFrame of float values. I was hoping there would be a function for this that saves me from writing a loop (not for performance, just for convenience). The documentation says it is possible to use "A boolean array (any NA values will be treated as False)." to index via iloc. What I found so confusing is that the indexing actually works if I use an assignment, but breaks if I only evaluate the indexing operation. As the assignment is something that is done on top of the indexing, it makes no sense to me that it would work.
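For the use case described here (overwriting every cell flagged by a same-shaped boolean frame, without a loop), a minimal sketch using DataFrame.mask might look like the following. It assumes the mask shares the frame's index and columns; the frame, the name `flags`, and the values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]], columns=["a", "b", "c"])
flags = pd.DataFrame(
    [[False, True, False], [True, False, True]], columns=["a", "b", "c"]
)

# DataFrame.mask replaces the cells where the condition is True and
# returns a new frame; the original df is left untouched.
replaced = df.mask(flags, -1.0)
print(replaced)
```

Because DataFrame.mask aligns the condition by label, this only behaves like a positional mask when the labels of `flags` match those of `df`.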

phofl · 2021-01-06T20:41:49Z

Our fallback for assignments is to operate on arrays. Hence it works.

iloc is two dimensional, but I am not quite sure what you want to achieve. Something like

mask = pd.DataFrame(
[
    [False, True, False],
    [True, False, True],
], columns=['a', 'b', 'c'])

should not be possible. You have to specify rows to select and columns to select.

@jbrockmendel thoughts here? I understood array as numpy array or pandas array not DataFrame.
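To illustrate what "specify rows to select and columns to select" can look like with iloc, here is a minimal sketch that passes one 1-D boolean NumPy array per axis. The frame and the mask names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])

row_mask = np.array([False, True])        # keep the second row
col_mask = np.array([True, False, True])  # keep columns "a" and "c"

# One 1-D boolean array per axis: rows first, then columns.
selected = df.iloc[row_mask, col_mask]
print(selected)
```

This selects a rectangular block (rows crossed with columns), which is different from picking arbitrary cells with a 2-D mask.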

gooney47 · 2021-01-06T23:59:22Z

@phofl It does work though, check out this:

import pandas as pd
df = pd.DataFrame([
    [0, 1, 2], 
    [3, 4, 5]
], columns=['a', 'b', 'c'])
mask = pd.DataFrame(
[
    [False, True, False],
    [True, False, True],
], columns=['a', 'b', 'c'])
df.iloc[mask] = -1 # Works fine
print(df)

I don't have to specify rows and columns separately.

phofl · 2021-01-07T00:29:53Z

Sorry, was not clear enough. I did not expect

df.iloc[mask]

to work. I don't even know what this should return in your case. By this I mean getitem.

gooney47 · 2021-01-07T00:46:38Z

I think what numpy does is pretty good, so returning the elements, selected by the mask, in a Series sounds appropriate to me. That would solve two things at once, my concern about it being weird that something is only working if you do an assignment on top of it, plus it provides more functionality.
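For reference, the NumPy behaviour mentioned here can be sketched by converting both the frame and the mask to arrays first. This is only an illustration of plain NumPy masking, not of anything iloc does; the data is the two-row example from earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])
mask = pd.DataFrame(
    [[False, True, False], [True, False, True]], columns=["a", "b", "c"]
)

# Convert both to ndarrays: NumPy returns the selected cells as a flat
# 1-D array in row-major order and ignores all labels.
picked = df.to_numpy()[mask.to_numpy()]
print(picked)
```

Note that the result carries no row or column labels, so the information about where each value came from is lost.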

jreback · 2021-01-07T01:46:36Z

this is not intended to work at all

iloc does not align and esp does not on a Dataframe

instead this gets turned into a ndarray and can easily be the wrong shape

phofl · 2021-01-07T01:48:27Z

Maybe we should raise when a DataFrame is given as indexer?

jbrockmendel · 2021-01-07T02:05:58Z

> Maybe we should raise when a DataFrame is given as indexer?

I think this makes sense. It definitely shouldn't be working for iloc.setitem but not iloc.getitem.

jreback · 2021-01-07T02:17:09Z

yep agreed

gooney47 · 2021-01-07T13:31:08Z

I want to further clarify that I had a misunderstanding, which was also the main thing I originally objected to in this issue. I assumed that if a code snippet B contains a code snippet A, then B would depend on A working independently. This is however not true if there are operators in the expression, since they can change the meaning. So in this example

[A] df[mask]
[B] df[mask] = 3

The snippet B does not depend on A to work as it's doing something completely different (writing instead of reading memory), even though A is contained in the code of B (if you look at the characters). So the reason why I originally created the issue, out of a misunderstanding, is solved.

Thankfully you found out that iloc.setitem shouldn't be possible with a DataFrame indexer, so something good came out of my misunderstanding. A raised error would also have prevented me from running into this mess.

gooney47 · 2021-01-07T13:55:45Z

I have a question. If I want to use a boolean array for indexing in pandas, I can use df[mask] instead of df.iloc[mask], right? Because if I do something like df[df == 2], it's basically using a boolean DataFrame as an index. Or is this just more of the same: something that works right now but actually shouldn't?

jorisvandenbossche · 2021-01-07T19:51:34Z

Indexing with a boolean DataFrame is definitely supported with __getitem__, like the example @gooney47 gives (df[df == 2] or df[df == 2] = 3).
I think this is intentionally supported and documented?

But if that's indeed the case: if it works for __getitem__, is there a specific reason not to allow it for loc/iloc? (There might be a good reason, I didn't think it fully through, but I think generally we want getitem to be a subset of loc/iloc?)

phofl · 2021-01-07T19:58:31Z

The problem is the missing alignment in iloc. See the following:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})
df[indexer]

returns

    a    b
0 NaN  3.0
1 NaN  NaN

What to do in case of iloc?
Same for setitem

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})
df[indexer] = 5

returns

   a  b
0  1  5
1  2  4

while iloc returns

   a  b
0  5  3
1  2  5

which is inconsistent with loc and []

I would argue that the fact that this is working with iloc is simply a bug
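The difference between label alignment and positional masking in the example above can be reproduced without iloc. The sketch below reuses the same `df` and `indexer` and compares `df[indexer]` with plain NumPy masking on the underlying arrays; the NumPy line is a stand-in for "no alignment", not a claim about how iloc is implemented.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
indexer = pd.DataFrame({"b": [True, False], "c": [False, True]})

# Label-based: [] aligns the mask to df, so only column "b" is matched
# and the unmatched mask column "c" is dropped.
by_label = df[indexer]

# Position-based: the raw arrays are compared cell by cell, so the first
# mask column is applied to "a" and the second to "b".
by_position = df.to_numpy()[indexer.to_numpy()]

print(by_label)
print(by_position)
```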

gooney47 · 2021-01-08T10:26:36Z

Why would I pass indices that are not considered in the indexing process?

phofl · 2021-01-08T10:41:31Z

Ok another example to clarify the point I am trying to make concerning iloc and []

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

indexer = pd.DataFrame({"b": [True, False]})
df[indexer]

    a    b   c
0 NaN  3.0 NaN
1 NaN  NaN NaN

df[indexer] = 10
df

   a   b  c
0  1  10  5
1  2   4  6

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df.iloc[indexer] = 10

The last one actually raises, which should be expected because of the missing alignment in iloc

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.3/scratches/scratch_4.py", line 389, in <module>
    df.iloc[indexer] = 10
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 691, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1640, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1866, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 562, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 428, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 989, in setitem
    values[indexer] = value
IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 1

gooney47 · 2021-01-08T10:52:00Z

I get what you mean. What bugs me though is the fact that you can pass indices that will not be considered in the indexing process. Let's say we have this:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
indexer = pd.DataFrame({"b": [True, False, True], 'a': [False, True, False]})

This will still work even though the last row won't be considered in the indexing process at all. I'm addressing this because I ran into it in my code. I had a boolean mask of the same shape as my df, but the indices of the mask were different (due to some data mangling), and then I was wondering why df[mask] wouldn't work as expected (that is, that all the indices provided in the mask would actually be considered, and not only those where the indices intersect).

phofl · 2021-01-08T10:54:41Z

This is what loc and [] do, they align the objects. This is documented in the docstrings and in the user guide (here for example https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label)
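As a small illustration of the alignment described here, the sketch below passes .loc a boolean Series whose index holds the same labels as the frame but in a different order. The frame and labels are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Same labels as df.index, but in reverse order.
row_mask = pd.Series([True, False], index=["y", "x"])

# .loc matches the mask to the rows by label, so row "y" is selected
# even though True sits in the first position of the mask.
selected = df.loc[row_mask]
print(selected)
```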

gooney47 · 2021-01-08T11:03:42Z

Yeah it makes sense to have partial selection where indexers can be fully aligned, but does it make sense to accept indexers that can only be partially aligned?

gooney47 · 2021-01-08T11:12:03Z

Wouldn't it be proper to raise an out-of-bounds error? When I do df[['a', 'b', 'd']] and my df does not have d, it raises too.

jorisvandenbossche · 2021-01-08T13:53:21Z

@gooney47 it's certainly a tricky issue if you run into it and don't expect the behaviour. But as @phofl said, this is certainly not accidental, but intentional and documented behaviour. Which of course doesn't mean we cannot reconsider it if it turns out to be confusing / non-ideal.
(personally, I don't think I ever made use of this feature)

@phofl thanks for those examples. Also, Series.iloc already raises an error about this, supporting your argument that for iloc it is not supposed to work:

In [172]: s = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [173]: s.iloc[s > 1]
...
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

Now, I personally still think we could reconsider this for aligned objects. Your examples used intentionally unaligned dataframes, but allowing aligned objects with equal index (or even more strict with identical index) would at least enable the use case like df.iloc[df == 2] (where the dataframe itself is used to create the mask), a case for which I think there is no ambiguity?

For example, assume you want to select positionally on the columns, with a mask for the rows. Currently you cannot easily do this with iloc. E.g. df.iloc[df['a'] > 2, 0] is also not allowed, while it is not ambiguous I think (and the workarounds are not an improvement IMO, e.g. df.loc[df['a'] > 2].iloc[:, 0] or df.iloc[(df['a'] > 2).values, 0])
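The two workarounds mentioned above can be sketched as follows. The frame is made up for illustration, and `.to_numpy()` is used in place of `.values`; both strip the labels from the mask.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5], "b": [10, 20, 30]})
row_mask = df["a"] > 2  # boolean Series labelled like df.index

# Workaround 1: filter rows by label first, then pick the column by position.
via_loc = df.loc[row_mask].iloc[:, 0]

# Workaround 2: strip the labels from the mask so that iloc accepts it.
via_array = df.iloc[row_mask.to_numpy(), 0]

print(via_loc.equals(via_array))
```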

phofl · 2021-01-08T14:01:24Z

I tend to agree that df.iloc[df['a'] > 2, 0] should be allowed, but this raises

NotImplementedError: iLocation based boolean indexing on an integer type is not available

So looks like this is intentionally disallowed because it is not implemented, not because it should not work.

For cases like df.iloc[df == 2]: I don't think this is a case that has to work for iloc. You can simply use loc here, which should give the same result, shouldn't it? Using iloc without positions is somewhat redundant?

gooney47 · 2021-01-08T14:14:28Z

@jorisvandenbossche I don't believe this is intended behavior, because there is inconsistent raising behaviour when you pass 1D indexer compared to passing 2D indexer:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
indexer_1d = pd.Series(['b', 'c'])
indexer_2d = pd.DataFrame({'b': [True, False], 'c': [False, True]})
print(df[indexer_2d]) # works
df[indexer_1d] # raises

Why should something suddenly be allowed just because the indexer has more dimensions? The reason you raise on non-existing indices is that you want people to notice that the indices they passed do not apply to the data, and I'm sure this has helped many people in the 1-dimensional case. I'm just thinking that it would make sense in more dimensions too.

phofl · 2021-01-08T14:20:05Z

The 1d case should raise; this is not a boolean array. If you pass in a list or array-like of labels, this raises as soon as one label is not found.

gooney47 · 2021-01-08T14:30:41Z

I guess my general question is: Why it would make sense to let someone pass indices that are not present in the data? Why not help the people and say that they aren't present, how we already do it in case of 1D indexer. I would say that this should hold as a general principle independent of what type of data you pass. Maybe it's just me though.

gooney47 · 2021-01-12T15:19:48Z

Do I have to open up a new issue to ask my question, because this one is being seen as closed? I'm sorry, but I thought I had a point here that I like to be addressed.

phofl · 2021-01-12T22:08:48Z

This issue is still open?

gooney47 · 2021-01-13T11:07:36Z

@phofl So what do you think about

> I guess my general question is: Why it would make sense to let someone pass indices that are not present in the data? Why not help the people and say that they aren't present, how we already do it in case of 1D indexer. I would say that this should hold as a general principle independent of what type of data you pass. Maybe it's just me though.

I don't understand why 2d boolean array labels should be treated differently than list labels.

gooney47 · 2021-01-13T11:19:39Z

Here it is in code:

import pandas as pd
df = pd.DataFrame([
    [0, 1, 2],
    [3, 4, 5]
], columns=['a', 'b', 'c'], index=[0, 1])
valid_mask_1d = pd.Series([False, True], index=[0, 1])
invalid_mask_1d = pd.Series([False, True], index=[0, 2])
df[valid_mask_1d] # Works
# df[invalid_mask_1d] # Raises due to invalid index

valid_mask_2d = pd.DataFrame([
    [False, True, False],
    [False, True, False]
], index=[0, 1], columns=['a', 'b', 'c'])
invalid_mask_2d = pd.DataFrame([
    [False, True, False],
    [False, True, False]
], index=[0, 2], columns=['a', 'b', 'c'])
df[valid_mask_2d] # Works
df[invalid_mask_2d] # Works (I want this to raise too as in 1d case)

I also wouldn't make a distinction between index and column labels, and would check the validity of both. Just to be clear: by validity I don't mean already existing alignment, but the existence of all labels of the mask in the DataFrame to be indexed.
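The validity check proposed here could be sketched roughly as below. `check_mask_labels` is a hypothetical helper written for this illustration, not a pandas function, and raising KeyError is an assumption about what the error could look like.

```python
import pandas as pd


def check_mask_labels(df: pd.DataFrame, mask: pd.DataFrame) -> None:
    """Hypothetical helper: raise if the mask uses labels unknown to df."""
    extra_rows = mask.index.difference(df.index)
    extra_cols = mask.columns.difference(df.columns)
    if len(extra_rows) or len(extra_cols):
        raise KeyError(
            f"mask labels not in frame: rows={list(extra_rows)}, "
            f"columns={list(extra_cols)}"
        )


df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"], index=[0, 1])
invalid_mask_2d = pd.DataFrame(
    [[False, True, False], [False, True, False]],
    index=[0, 2],
    columns=["a", "b", "c"],
)

try:
    check_mask_labels(df, invalid_mask_2d)
    outcome = "accepted"
except KeyError as err:
    outcome = f"rejected: {err}"
print(outcome)
```

A mask built from the frame itself, such as `df < 0`, passes this check because its labels are identical to those of `df`.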

phofl · 2021-01-17T02:19:05Z

Since getitem dispatches to where in the case of a DataFrame, this is expected. I don't know what I would expect, but based on what we are doing with loc and the alignment of DataFrames on the rhs, this is somewhat consistent.

Don't know how we should define alignable in case of DataFrames

jreback · 2021-01-18T15:12:15Z

for .loc we do align labels on the values & the masks / indexers. I think we could do this for a DataFrame as well. For getitem though, we do much less and i think an argument could be made to raise on an alignable for 2D (eg DataFrame)

gooney47 · 2021-01-20T15:33:24Z

If we don't raise, I don't understand how we want to align labels that are not present in the to be indexed DataFrame. Do we just want to leave them out in the alignment process?

jreback · 2021-01-21T17:34:48Z

> If we don't raise, I don't understand how we want to align labels that are not present in the to be indexed DataFrame. Do we just want to leave them out in the alignment process?

@gooney47 happy to take a PR for raising on the 2D inputs. let's see what this breaks.

gooney47 · 2021-01-22T09:57:22Z

Will do on the weekend. Should I add a deprecation warning or just raise? Also, raising/warning on 2D indexers in general will prevent things like df[df < 0]. We could check for existing alignment and still allow things like df[df < 0].

phofl · 2021-01-22T10:16:11Z

FutureWarning I would say, this has a pretty high impact on user code.
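The idea discussed in the last two comments (warn rather than raise, and stay silent for already aligned masks such as df < 0) could be sketched as follows. `warn_if_unaligned` is a hypothetical function for illustration only; it does not reflect what pandas or any of the linked pull requests actually implement, and the warning text is invented.

```python
import warnings

import pandas as pd


def warn_if_unaligned(df: pd.DataFrame, mask: pd.DataFrame) -> bool:
    """Hypothetical check: warn when the mask's labels differ from df's.

    Returns True if a warning was issued.
    """
    aligned = mask.index.equals(df.index) and mask.columns.equals(df.columns)
    if not aligned:
        warnings.warn(
            "boolean DataFrame indexer is not aligned with the frame",
            FutureWarning,
            stacklevel=2,
        )
    return not aligned


df = pd.DataFrame({"a": [-1, 2], "b": [3, -4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_unaligned(df, df < 0)  # mask derived from df: stays silent
    warn_if_unaligned(df, pd.DataFrame({"b": [True, False]}))  # warns

print(len(caught))
```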

gooney47 added the Bug and Needs Triage labels Jan 6, 2021

phofl added the Error Reporting and Indexing labels and removed the Needs Triage label Jan 7, 2021

phofl mentioned this issue Jan 7, 2021

Deprecate DataFrame indexer for iloc setitem and getitem #39022

Merged

4 tasks

jreback added this to the 1.3 milestone Jan 8, 2021

quant-dc mentioned this issue Jan 18, 2021

BUG: iloc assignment in Pandas 1.2.0 #39261

Closed

3 tasks

gooney47 mentioned this issue Jan 24, 2021

Warn on boolean frame indexer #39373

Closed

4 tasks

jreback closed this as completed in #39022 Mar 2, 2021
