[ENH] `select_rows` function implementation #1173

samukweku · 2022-10-11T01:05:01Z

PR Description

Please describe the changes proposed in the pull request:

Flexible row selection with `select_rows`, similar to `select_columns`.
The `level` argument is dropped in favour of a dictionary, which is more flexible. No deprecation cycle is required.
`_select_columns` and `_select_rows` now return a slice, a boolean array, an integer, or an array of integers (via `get_loc` or `get_locs`). This performs slightly better than the earlier round trip, in which the numeric indexer was obtained, the labels were extracted from it, and the labels were then passed to `loc`. The indexers now go straight to `iloc`, since the position of each target label is already known.
More explicit column selection for a MultiIndex in `pivot_longer` and `pivot_wider`.
Selection with tuples is implemented, which is handy for MultiIndex selection.
Dictionary support for easy MultiIndex selection; this devolves to `MultiIndex.get_locs` under the hood.
Explicit support for pandas/numpy objects when no preprocessing is required. This is usually faster, and close to `loc` performance, since there is very little indirection or checking.
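
The switch from a label round trip to positional indexing described in the list above can be illustrated with plain pandas. This is a minimal sketch, not the code from this PR; the frame `df` and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, made up for illustration.
df = pd.DataFrame({"x": range(5)}, index=list("abcde"))

# Round trip: find the positions, turn them back into labels, pass the labels to loc.
positions = df.index.get_indexer(["b", "d"])
labels = df.index[positions]
via_loc = df.loc[labels]

# Direct: pass the positions to iloc.
via_iloc = df.iloc[positions]

# get_loc returns different indexer types depending on the index.
as_int = pd.Index(list("abc")).get_loc("b")      # unique index: an integer
as_slice = pd.Index(list("aabbc")).get_loc("b")  # sorted, with duplicates: a slice
as_mask = pd.Index(list("abab")).get_loc("a")    # unsorted, with duplicates: a boolean array
```

Both paths select the same rows; the direct path simply skips the label lookup that `loc` would repeat.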

This PR relates to #1124 .

MultiIndex selection with a dictionary. The example below is based on the pandas Advanced Indexing guide:
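
`dfmi` is not defined in this description. Assuming it is the frame built in that guide, it can be reconstructed roughly as follows. The helper name `mklbl` is taken from the guide, and this block needs only pandas and numpy, not `janitor`.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    # Labels such as A0, A1, ... for one index level.
    return [f"{prefix}{i}" for i in range(n)]


miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            (len(miindex), len(micolumns))
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()         # sort the rows so slicing on levels works
    .sort_index(axis=1)   # sort the columns as well
)
```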

import pandas as pd
import janitor

# select on a slice and a list, on different levels
dfmi.select_rows({0:slice('A1','A3'), 2:['C1','C3']})
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
         D1  189  188  191  190
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

# filter deeper on different levels
dfmi.select_rows({0:['A1','A3'], 2:['C1','C3']})
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
   B0 C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C3 D0  121  120  123  122
         D1  125  124  127  126
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
   B0 C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C3 D0  249  248  251  250
         D1  253  252  255  254

# filter on rows and columns
# we can merge this into one in the future
# when the generic `select` function is implemented
dfmi.select_rows('A1').select_columns({-1:'foo'})
lvl0           a    b
lvl1         foo  foo
A1 B0 C0 D0   64   66
         D1   68   70
      C1 D0   72   74
         D1   76   78
      C2 D0   80   82
         D1   84   86
      C3 D0   88   90
         D1   92   94
   B1 C0 D0   96   98
         D1  100  102
      C1 D0  104  106
         D1  108  110
      C2 D0  112  114
         D1  116  118
      C3 D0  120  122
         D1  124  126

# select on a mix of booleans and labels
mask = dfmi.select_columns(('a','foo')).gt(200).squeeze() # shrink to a Series
dfmi.select_rows({0:mask, 2:['C1','C3']}).select_columns({1:'foo'})
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
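
The description says that dictionary selection devolves to `get_locs`. A rough plain-pandas counterpart of the first dictionary example (a slice on level 0 and a list on level 2) might look like the sketch below. It assumes `dfmi` is the frame from the pandas guide and is not the code from this PR; the builder function `build_dfmi` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def build_dfmi():
    # Hypothetical helper: rebuilds the frame assumed in the examples above.
    index = pd.MultiIndex.from_product(
        [[f"{p}{i}" for i in range(n)] for p, n in [("A", 4), ("B", 2), ("C", 4), ("D", 2)]]
    )
    columns = pd.MultiIndex.from_tuples(
        [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
        names=["lvl0", "lvl1"],
    )
    values = np.arange(len(index) * len(columns)).reshape(len(index), len(columns))
    return pd.DataFrame(values, index=index, columns=columns).sort_index().sort_index(axis=1)


dfmi = build_dfmi()

# One indexer per level; slice(None) keeps everything on level 1.
locs = dfmi.index.get_locs([slice("A1", "A3"), slice(None), ["C1", "C3"]])
result = dfmi.iloc[locs]

# The same selection written with loc, for comparison.
via_loc = dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
```

`get_locs` returns integer positions, so the result can be passed directly to `iloc`.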

You can also select partially on a MultiIndex with tuples, just like with loc, and pass multiple tuples:

dfmi.select_rows(('A0', 'B0'), ('A1','B1'))
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A1 B1 C0 D0   97   96   99   98
         D1  101  100  103  102
      C1 D0  105  104  107  106
         D1  109  108  111  110
      C2 D0  113  112  115  114
         D1  117  116  119  118
      C3 D0  121  120  123  122
         D1  125  124  127  126
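
For comparison, partial tuples can be resolved to positions in plain pandas with `MultiIndex.get_loc`. The sketch below assumes `dfmi` is the frame from the pandas guide; it illustrates the idea and is not the code from this PR. The builder function `build_dfmi` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def build_dfmi():
    # Hypothetical helper: rebuilds the frame assumed in the examples above.
    index = pd.MultiIndex.from_product(
        [[f"{p}{i}" for i in range(n)] for p, n in [("A", 4), ("B", 2), ("C", 4), ("D", 2)]]
    )
    columns = pd.MultiIndex.from_tuples(
        [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
        names=["lvl0", "lvl1"],
    )
    values = np.arange(len(index) * len(columns)).reshape(len(index), len(columns))
    return pd.DataFrame(values, index=index, columns=columns).sort_index().sort_index(axis=1)


dfmi = build_dfmi()
positions = np.arange(len(dfmi))

# A partial key yields a slice (or a boolean mask); either one indexes `positions`.
parts = [positions[dfmi.index.get_loc(key)] for key in [("A0", "B0"), ("A1", "B1")]]
result = dfmi.iloc[np.concatenate(parts)]
```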

PR Checklist

Please ensure that you have done the following:

Open the PR from a feature branch on your fork. Do not open it from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests pass
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

ericmjl · 2022-10-11T01:10:47Z

🚀 Deployed on https://deploy-preview-1173--pyjanitor.netlify.app

codecov · 2022-10-11T01:28:47Z

Codecov Report

Merging #1173 (3168732) into dev (f0c7906) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1173      +/-   ##
==========================================
+ Coverage   97.93%   98.01%   +0.07%     
==========================================
  Files          76       76              
  Lines        3387     3472      +85     
==========================================
+ Hits         3317     3403      +86     
+ Misses         70       69       -1

ericmjl · 2022-10-22T17:53:35Z

@samukweku how do you feel about the PR? It's been a while, but looks like it might be good to merge.

samukweku · 2022-10-22T21:02:02Z

@ericmjl yeah, let's merge

ericmjl · 2022-10-29T17:27:47Z

Okie dokes, @samukweku! I think we should be good to merge here, is that right?

samukweku · 2022-10-29T20:21:10Z

Yes, good 👍

samukweku added 8 commits October 8, 2022 08:31

add changelog

42f4663

select_rows implementation

72ee224

multiindex level selection implementation

e00377e

tests added

6cd01db

updates to docs and tests

e91f107

Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…

f99b937

…devs/pyjanitor into samukweku/select_rows

Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…

6ef79e8

…devs/pyjanitor into samukweku/select_rows

updates to changelog

24f6559

samukweku requested review from ericmjl, Zeroto521 and thatlittleboy October 11, 2022 01:05

samukweku self-assigned this Oct 11, 2022

samukweku added 2 commits October 11, 2022 12:05

Update select_columns.ipynb

1640264

remove unnecessary file

c570af0

samukweku added 4 commits October 11, 2022 01:15

add select_rows to janitor/__init__.py

1bafb32

Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…

d6ae385

…devs/pyjanitor into samukweku/select_rows

update select_rows docs

f7b923e

updates to select links

3ca4e57

samukweku added 4 commits October 11, 2022 02:56

add more tests

c4ccb35

move utils/test__select_columns to functions/test_select_columns

d8f2356

change columns_to_select to cols

c9a426f

remove print

ceba067

samukweku changed the title ~~[ENH] select_rows function implemented~~ [ENH] select_rows function implementation Oct 11, 2022

samukweku added 5 commits October 11, 2022 12:22

updates

18a131e

spelling fix

4522f7c

Update CHANGELOG.md

8fc9d16

Update utils.py

ccb6435

more tests

dd6de85

samukweku added 3 commits October 15, 2022 09:39

logic for when dictionary is used

2944f4f

logic for fnmatch/regex selection on multiindex

e811178

add tests for regex/fnmatch on multiindex

ffb6f76

ericmjl approved these changes Oct 15, 2022

samukweku added 13 commits October 15, 2022 22:11

remove shortcut to loc

f165274

pass responsibility of slice to pandas

4522ab6

remove print

673c35c

keys for dict for multiindex should be strings/integers only

68dbd9a

remove IndexLabel class

a24bcdb

changelog

ce2f9bf

improve error reporting for fnmatch

8312a0d

cleanup docs

99d333f

cleanup docs

abe5002

fix links

9820ae2

Merge branch 'dev' of https://github.com/pyjanitor-devs/pyjanitor int…

4720984

…o samukweku/select_rows

add notes for users

b05645e

fix grammar

c6345d6

ericmjl and others added 7 commits October 24, 2022 15:02

Merge branch 'dev' into samukweku/select_rows

88c699c

shortcut to get_indexer for performance, if possible

f410945

Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…

7b1caa9

…devs/pyjanitor into samukweku/select_rows

undo last commit

7a12a7e

add dispatch for range

c07464e

fix grammar

8de787f

update docs

3168732

ericmjl merged commit 8445dc0 into dev Oct 31, 2022

ericmjl deleted the samukweku/select_rows branch October 31, 2022 00:06
