[ENH] select_rows function implementation #1173

Merged
merged 83 commits into from
Oct 31, 2022

samukweku
Collaborator

@samukweku samukweku commented Oct 11, 2022

PR Description

Please describe the changes proposed in the pull request:

  • Flexible row selection, similar to select_columns.
  • The level argument is dropped in favour of a dictionary, which is more flexible. No deprecation cycle is required.
  • _select_columns and _select_rows now return a positional indexer (a slice, a boolean array, an integer, or an array of integers, obtained via get_loc or get_locs). This performs slightly better than the earlier round trip, in which the numeric indexer was obtained, the labels were extracted from it, and those labels were then passed to loc. The indexer now goes straight to iloc, since the position of the target label is already known.
  • More explicit column selection for MultiIndex in pivot_longer and pivot_wider.
  • Selection with tuples is implemented, which is handy for MultiIndex selection.
  • Dictionary support for easy MultiIndex selection; it delegates to pd.MultiIndex.get_locs under the hood.
  • Explicit support for pandas/numpy objects when no preprocessing is required. This is usually faster, and close to loc performance, since there is very little indirection or checking.
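
The difference between the label round trip and the direct positional lookup described above can be sketched in plain pandas. This is only an illustration under assumptions, not the code in this PR: the frame is rebuilt along the lines of `dfmi` in the pandas Advanced Indexing guide, and the names `selectors`, `positions`, `via_loc`, and `via_iloc` are made up for the example.

```python
import numpy as np
import pandas as pd

# Frame built along the lines of `dfmi` in the pandas Advanced Indexing guide.
index = pd.MultiIndex.from_product(
    [
        [f"A{i}" for i in range(4)],
        [f"B{i}" for i in range(2)],
        [f"C{i}" for i in range(4)],
        [f"D{i}" for i in range(2)],
    ]
)
columns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
dfmi = (
    pd.DataFrame(
        np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    .sort_index()
    .sort_index(axis=1)
)

# One selector per index level; slice(None) keeps everything on that level.
selectors = [slice("A1", "A3"), slice(None), ["C1", "C3"], slice(None)]

# Integer positions of the matching rows.
positions = dfmi.index.get_locs(selectors)

# Round trip: positions -> labels -> loc.
via_loc = dfmi.loc[dfmi.index[positions].tolist()]

# Direct: positions -> iloc, skipping the second label lookup.
via_iloc = dfmi.iloc[positions]
```

Both paths return the same rows here; the direct path simply avoids resolving the labels a second time.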

This PR relates to #1124.

MultiIndex selection with a dictionary. The example below is based on the pandas Advanced Indexing guide:

import pandas as pd
import janitor

# select on a slice and a list, on different levels
dfmi.select_rows({0:slice('A1','A3'), 2:['C1','C3']})
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
         D1  189  188  191  190
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

# filter deeper on different levels
dfmi.select_rows({0:['A1','A3'], 2:['C1','C3']})
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
   B0 C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C3 D0  121  120  123  122
         D1  125  124  127  126
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
   B0 C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C3 D0  249  248  251  250
         D1  253  252  255  254

# filter on rows and columns
# we can merge this into one in the future
# when the generic `select` function is implemented
dfmi.select_rows('A1').select_columns({-1:'foo'})
lvl0           a    b
lvl1         foo  foo
A1 B0 C0 D0   64   66
         D1   68   70
      C1 D0   72   74
         D1   76   78
      C2 D0   80   82
         D1   84   86
      C3 D0   88   90
         D1   92   94
   B1 C0 D0   96   98
         D1  100  102
      C1 D0  104  106
         D1  108  110
      C2 D0  112  114
         D1  116  118
      C3 D0  120  122
         D1  124  126

# select on a mix of booleans and labels
mask = dfmi.select_columns(('a','foo')).gt(200).squeeze() # shrink to a Series
dfmi.select_rows({0:mask, 2:['C1','C3']}).select_columns({1:'foo'})
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
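
For comparison, the same mix of a boolean mask and level labels can be written with plain pandas calls. This is a hedged sketch, not what `select_rows` does internally: the frame is rebuilt along the lines of the pandas guide, and `on_level_2`, `foo_columns`, and `result` are names invented for the example.

```python
import numpy as np
import pandas as pd

# Frame built along the lines of `dfmi` in the pandas Advanced Indexing guide.
index = pd.MultiIndex.from_product(
    [
        [f"A{i}" for i in range(4)],
        [f"B{i}" for i in range(2)],
        [f"C{i}" for i in range(4)],
        [f"D{i}" for i in range(2)],
    ]
)
columns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
dfmi = (
    pd.DataFrame(
        np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    .sort_index()
    .sort_index(axis=1)
)

# Row mask: values in column ('a', 'foo') greater than 200.
mask = dfmi[("a", "foo")].gt(200)

# Row mask: label on index level 2 is 'C1' or 'C3'.
on_level_2 = dfmi.index.isin(["C1", "C3"], level=2)

# Column mask: label on column level 1 is 'foo'.
foo_columns = dfmi.columns.get_level_values(1) == "foo"

# Combine the row masks and apply both row and column selection with loc.
result = dfmi.loc[mask & on_level_2, foo_columns]
```

The dictionary form keeps the per-level intent in one place, whereas the plain version needs a separate mask for each level.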

You can also select partially on a MultiIndex with tuples, just as with loc, and pass several tuples at once:

dfmi.select_rows(('A0', 'B0'), ('A1','B1'))
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A1 B1 C0 D0   97   96   99   98
         D1  101  100  103  102
      C1 D0  105  104  107  106
         D1  109  108  111  110
      C2 D0  113  112  115  114
         D1  117  116  119  118
      C3 D0  121  120  123  122
         D1  125  124  127  126
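
One way to picture partial-tuple selection is as a `get_loc` lookup per tuple, with the resulting positions concatenated and passed to `iloc`. The sketch below is an assumption about the general approach rather than the PR's code; `positions_for` is a hypothetical helper, and the frame is again rebuilt along the lines of the pandas guide.

```python
import numpy as np
import pandas as pd

# Frame built along the lines of `dfmi` in the pandas Advanced Indexing guide.
index = pd.MultiIndex.from_product(
    [
        [f"A{i}" for i in range(4)],
        [f"B{i}" for i in range(2)],
        [f"C{i}" for i in range(4)],
        [f"D{i}" for i in range(2)],
    ]
)
columns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
dfmi = (
    pd.DataFrame(
        np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    .sort_index()
    .sort_index(axis=1)
)


def positions_for(index, key):
    """Hypothetical helper: integer positions matching a label or partial tuple."""
    loc = index.get_loc(key)
    # get_loc may return an integer, a slice, or a boolean mask;
    # indexing an arange normalises all three to integer positions.
    return np.atleast_1d(np.arange(len(index))[loc])


keys = [("A0", "B0"), ("A1", "B1")]
positions = np.concatenate([positions_for(dfmi.index, key) for key in keys])

# All four index levels are kept, unlike dfmi.loc[("A0", "B0")].
result = dfmi.iloc[positions]
```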

PR Checklist

Please ensure that you have done the following:

  1. Open the PR from a feature branch on your fork. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

@samukweku samukweku self-assigned this Oct 11, 2022
@ericmjl
Member

ericmjl commented Oct 11, 2022

@codecov

codecov bot commented Oct 11, 2022

Codecov Report

Merging #1173 (3168732) into dev (f0c7906) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1173      +/-   ##
==========================================
+ Coverage   97.93%   98.01%   +0.07%     
==========================================
  Files          76       76              
  Lines        3387     3472      +85     
==========================================
+ Hits         3317     3403      +86     
+ Misses         70       69       -1     

@samukweku samukweku changed the title [ENH] select_rows function implemented [ENH] select_rows function implementation Oct 11, 2022
@ericmjl
Member

ericmjl commented Oct 22, 2022

@samukweku how do you feel about the PR? It's been a while, but looks like it might be good to merge.

@samukweku
Collaborator Author

@ericmjl yea, lets merge

@ericmjl
Member

ericmjl commented Oct 29, 2022

Okie dokes, @samukweku! I think we should be good to merge here, is that right?

@samukweku
Collaborator Author

Yes, good 👍

@ericmjl ericmjl merged commit 8445dc0 into dev Oct 31, 2022
@ericmjl ericmjl deleted the samukweku/select_rows branch October 31, 2022 00:06
2 participants