[ENH] select_rows function implementation (#1173)
* add changelog

* select_rows implementation

* multiindex level selection implementation

* tests added

* updates to docs and tests

* Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-devs/pyjanitor into samukweku/select_rows

* updates to changelog

* Update select_columns.ipynb

* remove unnecessary file

* add select_rows to janitor/__init__.py

* update select_rows docs

* updates to select links

* add more tests

* move utils/test__select_columns to functions/test_select_columns

* change columns_to_select to cols

* remove print

* updates

* spelling fix

* Update CHANGELOG.md

* Update utils.py

* more tests

* explicit label selection in pivot_longer and pivot_wider

* spelling fix

* tuple selection added

* update logic for pivot_wider

* improve performance when single value passed to select_*

* fix for boolean array for single select_*

* dict support for MultiIndex indexing

* changelog

* changelog

* changelog

* changelog

* fix column selection via dictionary in conditional_join

* Update pivot.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates to docs

* simplify logic

* simplify level_labels logic

* add regex and callable options to dict

* cleanup

* test for callable errors

* callable applied across entire dataframe for performance

* add tests for MultiIndex dictionary

* explicit support for pandas/numpy objects

* add test for boolean callable length mismatch

* fix test fails for conditional_join

* Update select_columns.ipynb

* edit on conditional join; improve on Pandas/numpy object selection on a multiindex

* update

* spelling fix

* strip irrelevance from slice dispatch

* fix for IndexLabel and dict

* use loc directly if possible, else pass to _select_index

* keep dict as-is in conditional_join

* logic for when dictionary is used

* logic for fnmatch/regex selection on multiindex

* add tests for regex/fnmatch on multiindex

* remove shortcut to loc

* pass responsibility of slice to pandas

* remove print

* keys for dict for multiindex should be strings/integers only

* remove IndexLabel class

* changelog

* improve error reporting for fnmatch

* cleanup docs

* cleanup docs

* fix links

* add notes for users

* fix grammar

* shortcut to get_indexer for performance, if possible

* undo last commit

* add dispatch for range

* fix grammar

* update docs

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>
3 people authored Oct 31, 2022
1 parent 5ebf799 commit 8445dc0
Showing 14 changed files with 1,249 additions and 819 deletions.
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -6,21 +6,20 @@
 - [DOC] Updated developer guide docs.
 - [ENH] Allow column selection/renaming within conditional_join. Issue #1102. Also allow first or last match. Issue #1020 @samukweku.
 - [ENH] New decorator `deprecated_kwargs` for breaking API. #1103 @Zeroto521
-- [ENH] Extend select_columns to support non-string columns. Also allow selection on MultiIndex columns via level parameter. Issue #1105 @samukweku
+- [ENH] Extend select_columns to support non-string columns. Issue #1105 @samukweku
 - [ENH] Performance improvement for groupby_topk. Issue #1093 @samukweku
 - [ENH] `min_max_scale` drop `old_min` and `old_max` to fit sklearn's method API. Issue #1068 @Zeroto521
 - [ENH] Add `jointly` option for `min_max_scale` support to transform each column values or entire values. Default transform each column, similar behavior to `sklearn.preprocessing.MinMaxScaler`. (Issue #1067, PR #1112, PR #1123) @Zeroto521
 - [INF] Require pyspark minimal version is v3.2.0 to cut duplicates codes. Issue #1110 @Zeroto521
-- [ENH] Added support for extension arrays in `expand_grid`. Issue #1121 @samukweku
+- [ENH] Add support for extension arrays in `expand_grid`. Issue #1121 @samukweku
 - [ENH] Add `names_expand` and `index_expand` parameters to `pivot_wider` for exposing missing categoricals. Issue #1108 @samukweku
-- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
+- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
 - [ENH] `dropna` parameter added to `pivot_longer`. Issue #1132 @samukweku
 - [INF] Update `mkdocstrings` version and to fit its new coming features. PR #1138 @Zeroto521
 - [BUG] Force `math.softmax` returning `Series`. PR #1139 @Zeroto521
 - [INF] Set independent environment for building documentation. PR #1141 @Zeroto521
 - [DOC] Add local documentation preview via github action artifact. PR #1149 @Zeroto521
 - [ENH] Enable `encode_categorical` handle 2 (or more ) dimensions array. PR #1153 @Zeroto521
-- [ENH] Faster computation for a single non-equi join, with a numba engine. Issue #1102 @samukweku
 - [TST] Fix testcases failing on Window. Issue #1160 @Zeroto521, and @samukweku
 - [INF] Cancel old workflow runs via Github Action `concurrency`. PR #1161 @Zeroto521
 - [ENH] Faster computation for non-equi join, with a numba engine. Speed improvement for left/right joins when `sort_by_appearance` is False. Issue #1102 @samukweku
@@ -29,6 +28,7 @@
 - [ENH] Fix error when `sort_by_appearance=True` is combined with `dropna=True`. Issue #1168 @samukweku
 - [ENH] Add explicit default parameter to `case_when` function. Issue #1159 @samukweku
 - [BUG] pandas 1.5.x `_MergeOperation` doesn't have `copy` keyword anymore. Issue #1174 @Zeroto521
+- [ENH] `select_rows` function added for flexible row selection. Add support for MultiIndex selection via dictionary. Issue #1124 @samukweku
 - [TST] Compat with macos and window, to fix `FailedHealthCheck` Issue #1181 @Zeroto521
 - [INF] Merge two docs CIs (`docs-preview.yml` and `docs.yml`) to one. And add `documentation` pytest mark. PR #1183 @Zeroto521

2 changes: 1 addition & 1 deletion examples/notebooks/select_columns.ipynb
@@ -433,7 +433,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]"
   },
   "orig_nbformat": 4
  },
2 changes: 1 addition & 1 deletion janitor/functions/__init__.py
@@ -64,7 +64,7 @@
 from .reorder_columns import reorder_columns
 from .round_to_fraction import round_to_fraction
 from .row_to_names import row_to_names
-from .select_columns import select_columns
+from .select import select_columns, select_rows
 from .shuffle import shuffle
 from .sort_column_value_order import sort_column_value_order
 from .sort_naturally import sort_naturally
7 changes: 4 additions & 3 deletions janitor/functions/coalesce.py
@@ -4,7 +4,7 @@
 import pandas_flavor as pf

 from janitor.utils import check, deprecated_alias
-from janitor.functions.utils import _select_column_names
+from janitor.functions.utils import _select_index


 @pf.register_dataframe_method
@@ -95,7 +95,8 @@ def coalesce(
             "The number of columns to coalesce should be a minimum of 2."
         )

-    column_names = _select_column_names([*column_names], df)
+    indices = _select_index([*column_names], df, axis="columns")
+    column_names = df.columns[indices]

     if target_column_name:
         check("target_column_name", target_column_name, [str])
@@ -106,7 +107,7 @@ def coalesce(
     if target_column_name is None:
         target_column_name = column_names[0]

-    outcome = df.filter(column_names).bfill(axis="columns").iloc[:, 0]
+    outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]
     if outcome.hasnans and (default_value is not None):
         outcome = outcome.fillna(default_value)

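The changed line in `coalesce` selects the columns by label with `.loc`, back-fills across each row, and keeps the first column, so each row ends up with its first non-null value. A minimal sketch of that step in plain pandas, using made-up data and a hand-built `column_names` index in place of the result of `_select_index` (both are assumptions for illustration, not taken from the commit):

```python
import numpy as np
import pandas as pd

# Hypothetical example data; not part of the commit.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, 5.0, np.nan],
        "c": [3.0, 6.0, 9.0],
    }
)

# Stand-in for `df.columns[indices]` as computed in the diff.
column_names = pd.Index(["a", "b", "c"])

# Back-fill along each row, then take the first selected column:
# every row now carries its first non-null value across the columns.
outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]

print(outcome.tolist())  # [1.0, 5.0, 9.0]
```
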
9 changes: 5 additions & 4 deletions janitor/functions/conditional_join.py
@@ -47,7 +47,7 @@ def conditional_join(
     especially if the intervals do not overlap.

     Column selection in `df_columns` and `right_columns` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     For strictly non-equi joins,
     involving either `>`, `<`, `>=`, `<=` operators,
@@ -143,7 +143,7 @@ def conditional_join(
     :param keep: Choose whether to return the first match,
         last match or all matches. Default is `all`.
     :param use_numba: Use numba, if installed, to accelerate the computation.
-        Default is `False`.
+        Applicable only to strictly non-equi joins. Default is `False`.
     :returns: A pandas DataFrame of the two merged Pandas objects.
     """

@@ -1214,10 +1214,11 @@ def _cond_join_select_columns(columns: Any, df: pd.DataFrame):
     Returns a Pandas DataFrame.
     """

-    df = df.select_columns(columns)
-
     if isinstance(columns, dict):
+        df = df.select_columns([*columns])
         df.columns = [columns.get(name, name) for name in df]
+    else:
+        df = df.select_columns(columns)

     return df

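In the updated `_cond_join_select_columns`, a dictionary is treated as a select-and-rename mapping: its keys are unpacked for the selection, and the dictionary is then used to rename the resulting columns. The sketch below imitates that branch with plain pandas; `df[[*columns]]` stands in for `df.select_columns([*columns])`, and the data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical example data; not part of the commit.
df = pd.DataFrame({"id": [1, 2], "value": [10, 20], "extra": [0, 0]})

# Keys name the columns to keep; values give the new names.
columns = {"id": "left_id", "value": "left_value"}

# Plain label selection stands in for `df.select_columns([*columns])`.
selected = df[[*columns]]

# Same renaming expression as in the diff: fall back to the
# original name when a column has no entry in the mapping.
selected.columns = [columns.get(name, name) for name in selected]

print(list(selected.columns))  # ['left_id', 'left_value']
```
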
130 changes: 107 additions & 23 deletions janitor/functions/pivot.py
@@ -15,7 +15,7 @@
 from pandas.core.dtypes.concat import concat_compat

 from janitor.functions.utils import (
-    _select_column_names,
+    _select_index,
     _computations_expand_grid,
 )
 from janitor.utils import check
@@ -52,7 +52,7 @@ def pivot_longer(
     row axis.

     Column selection in `index` and `column_names` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     Example:
@@ -382,17 +382,35 @@ def _data_checks_pivot_longer(
             "when the columns are a MultiIndex."
         )

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if column_names is not None:
-        if is_list_like(column_names):
-            column_names = list(column_names)
-        column_names = _select_column_names(column_names, df)
-        column_names = list(column_names)
+        if is_multi_index:
+            column_names = _check_tuples_multiindex(
+                df.columns, column_names, "column_names"
+            )
+        else:
+            if is_list_like(column_names):
+                column_names = list(column_names)
+            indices = _select_index(column_names, df, axis="columns")
+            column_names = df.columns[indices]
+            if not is_list_like(column_names):
+                column_names = [column_names]
+            else:
+                column_names = list(column_names)

     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if index is None:
         if column_names is None:
@@ -1181,7 +1199,7 @@ def pivot_wider(
     Column selection in `index`, `names_from` and `values_from`
     is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     A ValueError is raised if the combination
     of the `index` and `names_from` is not unique.
@@ -1455,27 +1473,69 @@ def _data_checks_pivot_wider(
     checking happens.
     """

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            if not isinstance(index, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the index argument."
+                )
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if names_from is None:
         raise ValueError(
             "pivot_wider() is missing 1 required argument: 'names_from'"
         )

-    if is_list_like(names_from):
-        names_from = list(names_from)
-    names_from = _select_column_names(names_from, df)
-    names_from = list(names_from)
+    if is_multi_index:
+        if not isinstance(names_from, list):
+            raise TypeError(
+                "For a MultiIndex column, pass a list of tuples "
+                "to the names_from argument."
+            )
+        names_from = _check_tuples_multiindex(
+            df.columns, names_from, "names_from"
+        )
+    else:
+        if is_list_like(names_from):
+            names_from = list(names_from)
+        indices = _select_index(names_from, df, axis="columns")
+        names_from = df.columns[indices]
+        if not is_list_like(names_from):
+            names_from = [names_from]
+        else:
+            names_from = list(names_from)

     if values_from is not None:
-        if is_list_like(values_from):
-            values_from = list(values_from)
-        out = _select_column_names(values_from, df)
-        out = list(out)
+        if is_multi_index:
+            if not isinstance(values_from, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the values_from argument."
+                )
+            out = _check_tuples_multiindex(
+                df.columns, values_from, "values_from"
+            )
+        else:
+            if is_list_like(values_from):
+                values_from = list(values_from)
+            indices = _select_index(values_from, df, axis="columns")
+            out = df.columns[indices]
+            if not is_list_like(out):
+                out = [out]
+            else:
+                out = list(out)
         # hack to align with pd.pivot
         if values_from == out[0]:
             values_from = out[0]
@@ -1550,3 +1610,27 @@ def _expand(indexer, retain_categories):
             ordered=indexer.ordered,
         )
     return indexer
+
+
+def _check_tuples_multiindex(indexer, args, param):
+    """
+    Check entries for tuples,
+    if indexer is a MultiIndex.
+
+    Returns a list of tuples.
+    """
+    all_tuples = (isinstance(arg, tuple) for arg in args)
+    if not all(all_tuples):
+        raise TypeError(
+            f"{param} must be a list of tuples "
+            "when the columns are a MultiIndex."
+        )
+
+    not_found = set(args).difference(indexer)
+    if any(not_found):
+        raise KeyError(
+            f"Tuples {*not_found,} in the {param} "
+            "argument do not exist in the dataframe's columns."
+        )
+
+    return args

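The new `_check_tuples_multiindex` helper accepts only tuples and requires each tuple to be present in the MultiIndex columns. A short sketch of how it might behave follows; the helper body is copied from the diff under the name `check_tuples_multiindex` (renamed here so it is not mistaken for the library's private function), and the MultiIndex and arguments are hypothetical:

```python
import pandas as pd


def check_tuples_multiindex(indexer, args, param):
    """Copy of the helper added in the diff, renamed for illustration."""
    all_tuples = (isinstance(arg, tuple) for arg in args)
    if not all(all_tuples):
        raise TypeError(
            f"{param} must be a list of tuples "
            "when the columns are a MultiIndex."
        )

    # Iterating a MultiIndex yields tuples, so a set difference
    # leaves only the requested tuples that are not in the columns.
    not_found = set(args).difference(indexer)
    if any(not_found):
        raise KeyError(
            f"Tuples {*not_found,} in the {param} "
            "argument do not exist in the dataframe's columns."
        )

    return args


# Hypothetical MultiIndex columns; not part of the commit.
columns = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "x")])

# Valid tuples are returned unchanged.
ok = check_tuples_multiindex(columns, [("A", "x"), ("B", "x")], "index")

# A non-tuple entry raises TypeError.
try:
    check_tuples_multiindex(columns, ["A"], "index")
except TypeError as err:
    type_error = err

# A tuple that is missing from the columns raises KeyError.
try:
    check_tuples_multiindex(columns, [("C", "z")], "index")
except KeyError as err:
    key_error = err
```
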