Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: RangeIndex.get_indexer is incorrect for some decreasing RangeIndex #28678

Closed
jschendel opened this issue Sep 30, 2019 · 1 comment · Fixed by #28680
Closed

BUG: RangeIndex.get_indexer is incorrect for some decreasing RangeIndex #28678

jschendel opened this issue Sep 30, 2019 · 1 comment · Fixed by #28680
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jschendel
Copy link
Member

jschendel commented Sep 30, 2019

For some decreasing RangeIndex, the get_indexer method will indicate that all of it's own values are missing, and find matches for values not included in the index:

In [2]: ri = pd.RangeIndex(10, 0, -3)

In [3]: ri.get_indexer(ri)
Out[3]: array([-1, -1, -1, -1])

In [4]: ri.get_indexer(ri - 1)
Out[4]: array([ 1,  2,  3, -1])

This will in turn result in methods like Series.reindex not working properly:

In [5]: s = pd.Series(list('abcd'), index=ri) 

In [6]: s
Out[6]: 
10    a
7     b
4     c
1     d
dtype: object

In [7]: s.reindex([10, 9, 7])
Out[7]: 
10    NaN
9       b
7     NaN
dtype: object

The issue appears to occur specifically for decreasing RangeIndex that are not in their canonical form. By canonical form, I mean when stop is the next valid value in the range that's not included, e.g. when you think of a more standard range like range(1, 7, 1), 7 is the next valid value that's not present, but when the step is larger than 1 you lose uniqueness of representation with stop (i.e. range(1, 6, 2) == range(1, 7, 2)).

Note that the code above works properly for the equivalent RangeIndex in it's canonical form:

In [8]: ri2 = pd.RangeIndex(start=10, stop=-2, step=-3)

In [9]: ri2.equals(ri)
Out[9]: True

In [10]: ri2.get_indexer(ri2)
Out[10]: array([0, 1, 2, 3])

In [11]: ri2.get_indexer(ri2 - 1)
Out[11]: array([-1, -1, -1, -1])

In [12]: s2 = pd.Series(list('abcd'), index=ri2)

In [13]: s2
Out[13]: 
10    a
7     b
4     c
1     d
dtype: object

In [14]: s2.reindex([10, 9, 7])
Out[14]: 
10      a
9     NaN
7       b
dtype: object

The cause of the issue appears to be that the code to determine start, stop, step when dealing with decreasing RangeIndex in get_indexer assumes self.stop is the canonical form:

if self.step > 0:
start, stop, step = self.start, self.stop, self.step
else:
# Work on reversed range for simplicity:
start, stop, step = (self.stop - self.step, self.start + 1, -self.step)

Instead of directly computing the reversed values ourselves, I think we should simply take the values from the reversed underlying range object:

diff --git a/pandas/core/indexes/range.py b/pandas/core/indexes/range.py
index 8783351cc..4c5904e5a 100644
--- a/pandas/core/indexes/range.py
+++ b/pandas/core/indexes/range.py
@@ -387,7 +387,8 @@ class RangeIndex(Int64Index):
             start, stop, step = self.start, self.stop, self.step
         else:
             # Work on reversed range for simplicity:
-            start, stop, step = (self.stop - self.step, self.start + 1, -self.step)
+            reverse = self._range[::-1]
+            start, stop, step = reverse.start, reverse.stop, reverse.step
 
         target_array = np.asarray(target)
         if not (is_integer_dtype(target_array) and target_array.ndim == 1):
@jschendel jschendel added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Sep 30, 2019
@jschendel jschendel added this to the Contributions Welcome milestone Sep 30, 2019
@jschendel jschendel modified the milestones: Contributions Welcome, 0.25.2 Sep 30, 2019
@jschendel
Copy link
Member Author

Note that this is a regression as this worked in 0.24.2:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'

In [2]: ri = pd.RangeIndex(10, 0, -3)

In [3]: ri.get_indexer(ri)
Out[3]: array([0, 1, 2, 3])

In [4]: ri.get_indexer(ri - 1)
Out[4]: array([-1, -1, -1, -1])

In [5]: s = pd.Series(list('abcd'), index=ri)

In [6]: s
Out[6]: 
10    a
7     b
4     c
1     d
dtype: object

In [7]: s.reindex([10, 9, 7])
Out[7]: 
10      a
9     NaN
7       b
dtype: object

@jschendel jschendel added the Regression Functionality that used to work in a prior pandas version label Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant