[ENH] select_rows function implementation #1173

Merged 83 commits on Oct 31, 2022

Commits
42f4663
add changelog
samukweku Oct 8, 2022
72ee224
select_rows implementation
samukweku Oct 9, 2022
e00377e
multiindex level selection implementation
samukweku Oct 10, 2022
6cd01db
tests added
samukweku Oct 11, 2022
e91f107
updates to docs and tests
samukweku Oct 11, 2022
f99b937
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 11, 2022
6ef79e8
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 11, 2022
24f6559
updates to changelog
samukweku Oct 11, 2022
1640264
Update select_columns.ipynb
samukweku Oct 11, 2022
c570af0
remove unnecessary file
samukweku Oct 11, 2022
1bafb32
add select_rows to janitor/__init__.py
samukweku Oct 11, 2022
d6ae385
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 11, 2022
f7b923e
update select_rows docs
samukweku Oct 11, 2022
3ca4e57
updates to select links
samukweku Oct 11, 2022
c4ccb35
add more tests
samukweku Oct 11, 2022
d8f2356
move utils/test__select_columns to functions/test_select_columns
samukweku Oct 11, 2022
c9a426f
change columns_to_select to cols
samukweku Oct 11, 2022
ceba067
remove print
samukweku Oct 11, 2022
18a131e
updates
samukweku Oct 11, 2022
4522f7c
spelling fix
samukweku Oct 11, 2022
8fc9d16
Update CHANGELOG.md
samukweku Oct 11, 2022
ccb6435
Update utils.py
samukweku Oct 11, 2022
dd6de85
more tests
samukweku Oct 11, 2022
bbcd6a8
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 11, 2022
fd8852d
Merge branch 'dev' of https://github.com/pyjanitor-devs/pyjanitor int…
samukweku Oct 11, 2022
47bb446
explicit label selection in pivot_longer and pivot_wider
samukweku Oct 12, 2022
3969440
spelling fix
samukweku Oct 12, 2022
6ed8805
tuple selection added
samukweku Oct 12, 2022
c522f0a
update logic for pivot_wider
samukweku Oct 12, 2022
d99eb57
improve performance when single value passed to select_*
samukweku Oct 12, 2022
de26df3
fix for boolean array for single select_*
samukweku Oct 12, 2022
f976041
dict support for MultiIndex indexing
samukweku Oct 12, 2022
232882f
changelog
samukweku Oct 12, 2022
576d66b
changelog
samukweku Oct 12, 2022
ec869a0
changelog
samukweku Oct 12, 2022
8a906e1
changelog
samukweku Oct 12, 2022
7ea8d7a
fix column selection via dictionary in conditional_join
samukweku Oct 12, 2022
615c026
Update pivot.py
samukweku Oct 12, 2022
03d0d66
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 12, 2022
822e994
updates to docs
samukweku Oct 12, 2022
2b51275
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 12, 2022
bc04697
simplify logic
samukweku Oct 13, 2022
b6fd700
simplify level_labels logic
samukweku Oct 13, 2022
45f14fe
add regex and callable options to dict
samukweku Oct 13, 2022
97d7911
cleanup
samukweku Oct 13, 2022
50f8c73
test for callable errors
samukweku Oct 13, 2022
6c5fe73
callable applied across entire dataframe for performance
samukweku Oct 13, 2022
4915d68
add tests for MultiIndex dictionary
samukweku Oct 13, 2022
00d26b9
explicit support for pandas/numpy objects
samukweku Oct 14, 2022
1f437f6
add test for boolean callable length mismatch
samukweku Oct 14, 2022
d0be77f
fix test fails for conditional_join
samukweku Oct 14, 2022
b74c441
Update select_columns.ipynb
samukweku Oct 14, 2022
aeb4f98
edit on conditional join; improve on Pandas/numpy object selection on…
samukweku Oct 14, 2022
3244205
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 14, 2022
ad88fe8
update
samukweku Oct 14, 2022
fcc0c1b
spelling fix
samukweku Oct 14, 2022
0c18a52
strip irrelevance from slice dispatch
samukweku Oct 14, 2022
fb5b5a5
fix for IndexLabel and dict
samukweku Oct 14, 2022
8152aa5
use loc directly if possible, else pass to _select_index
samukweku Oct 15, 2022
e5ab39c
keep dict as-is in conditional_join
samukweku Oct 15, 2022
2944f4f
logic for when dictionary is used
samukweku Oct 15, 2022
e811178
logic for fnmatch/regex selection on multiindex
samukweku Oct 15, 2022
ffb6f76
add tests for regex/fnmatch on multiindex
samukweku Oct 15, 2022
f165274
remove shortcut to loc
samukweku Oct 15, 2022
4522ab6
pass responsibility of slice to pandas
samukweku Oct 16, 2022
673c35c
remove print
samukweku Oct 16, 2022
68dbd9a
keys for dict for multiindex should be strings/integers only
samukweku Oct 16, 2022
a24bcdb
remove IndexLabel class
samukweku Oct 17, 2022
ce2f9bf
changelog
samukweku Oct 17, 2022
8312a0d
improve error reporting for fnmatch
samukweku Oct 17, 2022
99d333f
cleanup docs
samukweku Oct 17, 2022
abe5002
cleanup docs
samukweku Oct 17, 2022
9820ae2
fix links
samukweku Oct 17, 2022
4720984
Merge branch 'dev' of https://github.com/pyjanitor-devs/pyjanitor int…
samukweku Oct 22, 2022
b05645e
add notes for users
samukweku Oct 22, 2022
c6345d6
fix grammar
samukweku Oct 22, 2022
88c699c
Merge branch 'dev' into samukweku/select_rows
ericmjl Oct 24, 2022
f410945
shortcut to get_indexer for performance, if possible
samukweku Oct 24, 2022
7b1caa9
Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-…
samukweku Oct 24, 2022
7a12a7e
undo last commit
samukweku Oct 24, 2022
c07464e
add dispatch for range
samukweku Oct 24, 2022
8de787f
fix grammar
samukweku Oct 25, 2022
3168732
update docs
samukweku Oct 25, 2022
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -5,21 +5,20 @@
 - [DOC] Updated developer guide docs.
 - [ENH] Allow column selection/renaming within conditional_join. Issue #1102. Also allow first or last match. Issue #1020 @samukweku.
 - [ENH] New decorator `deprecated_kwargs` for breaking API. #1103 @Zeroto521
-- [ENH] Extend select_columns to support non-string columns. Also allow selection on MultiIndex columns via level parameter. Issue #1105 @samukweku
+- [ENH] Extend select_columns to support non-string columns. Issue #1105 @samukweku
 - [ENH] Performance improvement for groupby_topk. Issue #1093 @samukweku
 - [ENH] `min_max_scale` drop `old_min` and `old_max` to fit sklearn's method API. Issue #1068 @Zeroto521
 - [ENH] Add `jointly` option for `min_max_scale` support to transform each column values or entire values. Default transform each column, similar behavior to `sklearn.preprocessing.MinMaxScaler`. (Issue #1067, PR #1112, PR #1123) @Zeroto521
 - [INF] Require pyspark minimal version is v3.2.0 to cut duplicates codes. Issue #1110 @Zeroto521
-- [ENH] Added support for extension arrays in `expand_grid`. Issue #1121 @samukweku
+- [ENH] Add support for extension arrays in `expand_grid`. Issue #1121 @samukweku
 - [ENH] Add `names_expand` and `index_expand` parameters to `pivot_wider` for exposing missing categoricals. Issue #1108 @samukweku
-- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
+- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
 - [ENH] `dropna` parameter added to `pivot_longer`. Issue #1132 @samukweku
 - [INF] Update `mkdocstrings` version and to fit its new coming features. PR #1138 @Zeroto521
 - [BUG] Force `math.softmax` returning `Series`. PR #1139 @Zeroto521
 - [INF] Set independent environment for building documentation. PR #1141 @Zeroto521
 - [DOC] Add local documentation preview via github action artifact. PR #1149 @Zeroto521
 - [ENH] Enable `encode_categorical` handle 2 (or more ) dimensions array. PR #1153 @Zeroto521
-- [ENH] Faster computation for a single non-equi join, with a numba engine. Issue #1102 @samukweku
 - [TST] Fix testcases failing on Window. Issue #1160 @Zeroto521, and @samukweku
 - [INF] Cancel old workflow runs via Github Action `concurrency`. PR #1161 @Zeroto521
 - [ENH] Faster computation for non-equi join, with a numba engine. Speed improvement for left/right joins when `sort_by_appearance` is False. Issue #1102 @samukweku
@@ -28,6 +27,7 @@
 - [ENH] Fix error when `sort_by_appearance=True` is combined with `dropna=True`. Issue #1168 @samukweku
 - [ENH] Add explicit default parameter to `case_when` function. Issue #1159 @samukweku
 - [BUG] pandas 1.5.x `_MergeOperation` doesn't have `copy` keyword anymore. Issue #1174 @Zeroto521
+- [ENH] `select_rows` function added for flexible row selection. Add support for MultiIndex selection via dictionary. Issue #1124 @samukweku
 - [TST] Compat with macos and window, to fix `FailedHealthCheck` Issue #1181 @Zeroto521

 ## [v0.23.1] - 2022-05-03
2 changes: 1 addition & 1 deletion examples/notebooks/select_columns.ipynb
@@ -433,7 +433,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.10"
+"version": "3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]"
 },
 "orig_nbformat": 4
 },
2 changes: 1 addition & 1 deletion janitor/functions/__init__.py
@@ -64,7 +64,7 @@
 from .reorder_columns import reorder_columns
 from .round_to_fraction import round_to_fraction
 from .row_to_names import row_to_names
-from .select_columns import select_columns
+from .select import select_columns, select_rows
 from .shuffle import shuffle
 from .sort_column_value_order import sort_column_value_order
 from .sort_naturally import sort_naturally
7 changes: 4 additions & 3 deletions janitor/functions/coalesce.py
@@ -4,7 +4,7 @@
 import pandas_flavor as pf

 from janitor.utils import check, deprecated_alias
-from janitor.functions.utils import _select_column_names
+from janitor.functions.utils import _select_index


 @pf.register_dataframe_method
@@ -95,7 +95,8 @@ def coalesce(
             "The number of columns to coalesce should be a minimum of 2."
         )

-    column_names = _select_column_names([*column_names], df)
+    indices = _select_index([*column_names], df, axis="columns")
+    column_names = df.columns[indices]

     if target_column_name:
         check("target_column_name", target_column_name, [str])
@@ -106,7 +107,7 @@ def coalesce(
     if target_column_name is None:
         target_column_name = column_names[0]

-    outcome = df.filter(column_names).bfill(axis="columns").iloc[:, 0]
+    outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]
     if outcome.hasnans and (default_value is not None):
         outcome = outcome.fillna(default_value)
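
The changed `outcome` line can be tried in isolation. The following is only a minimal sketch: the sample frame, the hard-coded positions standing in for whatever `_select_index` returns, and the `default_value` are all assumptions made for illustration and are not part of the PR.

```python
import pandas as pd

# Hypothetical sample data; not taken from the PR.
df = pd.DataFrame(
    {
        "a": [1.0, None, None],
        "b": [None, 2.0, None],
        "c": [7.0, 8.0, 9.0],
    }
)

# Assumption: positions 0 and 1 stand in for the result of _select_index.
indices = [0, 1]
column_names = df.columns[indices]

# Back-fill across the selected columns, then keep the first column, so
# each row ends up with its first non-null value among "a" and "b".
outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]

default_value = 0.0  # hypothetical fallback for rows that are entirely null
if outcome.hasnans and (default_value is not None):
    outcome = outcome.fillna(default_value)

print(outcome.tolist())
```

One behavioural difference between the two spellings is worth keeping in mind: `DataFrame.filter` with a list of items silently skips labels that are absent, whereas `.loc` with a list of labels raises a `KeyError` if any label is missing.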

9 changes: 5 additions & 4 deletions janitor/functions/conditional_join.py
@@ -47,7 +47,7 @@ def conditional_join(
     especially if the intervals do not overlap.

     Column selection in `df_columns` and `right_columns` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     For strictly non-equi joins,
     involving either `>`, `<`, `>=`, `<=` operators,
@@ -143,7 +143,7 @@ def conditional_join(
     :param keep: Choose whether to return the first match,
         last match or all matches. Default is `all`.
     :param use_numba: Use numba, if installed, to accelerate the computation.
-        Default is `False`.
+        Applicable only to strictly non-equi joins. Default is `False`.
     :returns: A pandas DataFrame of the two merged Pandas objects.
     """

@@ -1214,10 +1214,11 @@ def _cond_join_select_columns(columns: Any, df: pd.DataFrame):
     Returns a Pandas DataFrame.
     """

-    df = df.select_columns(columns)
-
     if isinstance(columns, dict):
+        df = df.select_columns([*columns])
         df.columns = [columns.get(name, name) for name in df]
+    else:
+        df = df.select_columns(columns)

     return df
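
To make the dictionary branch concrete, here is a small sketch of the select-then-rename pattern. It is not the PR's code: `select_and_rename` is a hypothetical helper, plain `.loc` stands in for `select_columns` (so only exact labels are supported here), and the sample frame is invented.

```python
import pandas as pd


def select_and_rename(columns, df):
    """Hypothetical stand-in for the dict handling shown in the diff."""
    if isinstance(columns, dict):
        # Select by the dictionary's keys, then rename via its values.
        df = df.loc[:, [*columns]]
        df.columns = [columns.get(name, name) for name in df]
    else:
        # Assumption: plain label selection replaces select_columns.
        df = df.loc[:, columns]
    return df


# Invented sample data.
df = pd.DataFrame({"id": [1, 2], "start": [10, 20], "end": [15, 25]})
renamed = select_and_rename({"id": "left_id", "end": "left_end"}, df)
plain = select_and_rename(["start", "end"], df)

print(list(renamed.columns))
print(list(plain.columns))
```

In this sketch the keys pick the columns and the values supply the new names; the original frame keeps its labels because `.loc` with a list returns a new object.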

130 changes: 107 additions & 23 deletions janitor/functions/pivot.py
@@ -15,7 +15,7 @@
 from pandas.core.dtypes.concat import concat_compat

 from janitor.functions.utils import (
-    _select_column_names,
+    _select_index,
     _computations_expand_grid,
 )
 from janitor.utils import check
@@ -52,7 +52,7 @@ def pivot_longer(
     row axis.

     Column selection in `index` and `column_names` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     Example:

@@ -382,17 +382,35 @@ def _data_checks_pivot_longer(
             "when the columns are a MultiIndex."
         )

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if column_names is not None:
-        if is_list_like(column_names):
-            column_names = list(column_names)
-        column_names = _select_column_names(column_names, df)
-        column_names = list(column_names)
+        if is_multi_index:
+            column_names = _check_tuples_multiindex(
+                df.columns, column_names, "column_names"
+            )
+        else:
+            if is_list_like(column_names):
+                column_names = list(column_names)
+            indices = _select_index(column_names, df, axis="columns")
+            column_names = df.columns[indices]
+            if not is_list_like(column_names):
+                column_names = [column_names]
+            else:
+                column_names = list(column_names)

     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if index is None:
         if column_names is None:
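
The non-MultiIndex branch above converts positions back into a plain list of labels. The sketch below isolates that step. What `_select_index` actually returns is not visible in this diff, so the assumption here is that it may be a single integer, a slice, or a list of positions; `labels_at` is a hypothetical name introduced only for illustration.

```python
import pandas as pd
from pandas.api.types import is_list_like


def labels_at(columns, indices):
    """Return the labels found at `indices` as a plain list."""
    selected = columns[indices]
    # A scalar position yields a scalar label, so wrap it in a list.
    if not is_list_like(selected):
        return [selected]
    return list(selected)


# Invented column labels.
columns = pd.Index(["id", "start", "end"])

single = labels_at(columns, 1)             # one integer position
several = labels_at(columns, [0, 2])       # a list of positions
sliced = labels_at(columns, slice(0, 2))   # a slice of positions

print(single, several, sliced)
```

Under those assumptions, every input shape ends up as a list, which is what the remaining checks in `_data_checks_pivot_longer` appear to expect.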
@@ -1181,7 +1199,7 @@ def pivot_wider(

     Column selection in `index`, `names_from` and `values_from`
     is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     A ValueError is raised if the combination
     of the `index` and `names_from` is not unique.
@@ -1455,27 +1473,69 @@ def _data_checks_pivot_wider(
     checking happens.
     """

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            if not isinstance(index, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the index argument."
+                )
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if names_from is None:
         raise ValueError(
             "pivot_wider() is missing 1 required argument: 'names_from'"
         )

-    if is_list_like(names_from):
-        names_from = list(names_from)
-    names_from = _select_column_names(names_from, df)
-    names_from = list(names_from)
+    if is_multi_index:
+        if not isinstance(names_from, list):
+            raise TypeError(
+                "For a MultiIndex column, pass a list of tuples "
+                "to the names_from argument."
+            )
+        names_from = _check_tuples_multiindex(
+            df.columns, names_from, "names_from"
+        )
+    else:
+        if is_list_like(names_from):
+            names_from = list(names_from)
+        indices = _select_index(names_from, df, axis="columns")
+        names_from = df.columns[indices]
+        if not is_list_like(names_from):
+            names_from = [names_from]
+        else:
+            names_from = list(names_from)

     if values_from is not None:
-        if is_list_like(values_from):
-            values_from = list(values_from)
-        out = _select_column_names(values_from, df)
-        out = list(out)
+        if is_multi_index:
+            if not isinstance(values_from, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the values_from argument."
+                )
+            out = _check_tuples_multiindex(
+                df.columns, values_from, "values_from"
+            )
+        else:
+            if is_list_like(values_from):
+                values_from = list(values_from)
+            indices = _select_index(values_from, df, axis="columns")
+            out = df.columns[indices]
+            if not is_list_like(out):
+                out = [out]
+            else:
+                out = list(out)
         # hack to align with pd.pivot
         if values_from == out[0]:
             values_from = out[0]
@@ -1550,3 +1610,27 @@ def _expand(indexer, retain_categories):
             ordered=indexer.ordered,
         )
     return indexer
+
+
+def _check_tuples_multiindex(indexer, args, param):
+    """
+    Check entries for tuples,
+    if indexer is a MultiIndex.
+
+    Returns a list of tuples.
+    """
+    all_tuples = (isinstance(arg, tuple) for arg in args)
+    if not all(all_tuples):
+        raise TypeError(
+            f"{param} must be a list of tuples "
+            "when the columns are a MultiIndex."
+        )
+
+    not_found = set(args).difference(indexer)
+    if any(not_found):
+        raise KeyError(
+            f"Tuples {*not_found,} in the {param} "
+            "argument do not exist in the dataframe's columns."
+        )
+
+    return args
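
A quick way to see how the new helper behaves is to run it against a small MultiIndex. The function body below is copied from the diff; the column labels and the `error_name` wrapper are invented for this sketch and are not part of the PR.

```python
import pandas as pd


def _check_tuples_multiindex(indexer, args, param):
    """Copied from the diff: validate tuple labels against a MultiIndex."""
    all_tuples = (isinstance(arg, tuple) for arg in args)
    if not all(all_tuples):
        raise TypeError(
            f"{param} must be a list of tuples "
            "when the columns are a MultiIndex."
        )

    not_found = set(args).difference(indexer)
    if any(not_found):
        raise KeyError(
            f"Tuples {*not_found,} in the {param} "
            "argument do not exist in the dataframe's columns."
        )

    return args


def error_name(args):
    """Hypothetical wrapper: report which exception, if any, is raised."""
    try:
        _check_tuples_multiindex(columns, args, "index")
    except (TypeError, KeyError) as exc:
        return type(exc).__name__
    return None


# Invented two-level column labels.
columns = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "x")])

valid = _check_tuples_multiindex(columns, [("A", "x"), ("B", "x")], "index")
not_tuples = error_name(["A", "B"])    # plain strings instead of tuples
missing = error_name([("C", "z")])     # tuple absent from the columns

print(valid, not_tuples, missing)
```

Note that `any(not_found)` tests the truthiness of the missing tuples rather than whether the set is empty; for non-empty tuples the two checks give the same result.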