[ENH] select_rows function implementation (#1173)

* add changelog
* select_rows implementation
* multiindex level selection implementation
* tests added
* updates to docs and tests
* Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-devs/pyjanitor into samukweku/select_rows
* updates to changelog
* Update select_columns.ipynb
* remove unnecessary file
* add select_rows to janitor/__init__.py
* update select_rows docs
* updates to select links
* add more tests
* move utils/test__select_columns to functions/test_select_columns
* change columns_to_select to cols
* remove print
* updates
* spelling fix
* Update CHANGELOG.md
* Update utils.py
* more tests
* explicit label selection in pivot_longer and pivot_wider
* spelling fix
* tuple selection added
* update logic for pivot_wider
* improve performance when single value passed to select_*
* fix for boolean array for single select_*
* dict support for MultiIndex indexing
* changelog
* changelog
* changelog
* changelog
* fix column selection via dictionary in conditional_join
* Update pivot.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* updates to docs
* simplify logic
* simplify level_labels logic
* add regex and callable options to dict
* cleanup
* test for callable errors
* callable applied across entire dataframe for performance
* add tests for MultiIndex dictionary
* explicit support for pandas/numpy objects
* add test for boolean callable length mismatch
* fix test fails for conditional_join
* Update select_columns.ipynb
* edit on conditional join; improve on Pandas/numpy object selection on a multiindex
* update
* spelling fix
* strip irrelevance from slice dispatch
* fix for IndexLabel and dict
* use loc directly if possible, else pass to _select_index
* keep dict as-is in conditional_join
* logic for when dictionary is used
* logic for fnmatch/regex selection on multiindex
* add tests for regex/fnmatch on multiindex
* remove shortcut to loc
* pass responsibility of slice to pandas
* remove print
* keys for dict for multiindex should be strings/integers only
* remove IndexLabel class
* changelog
* improve error reporting for fnmatch
* cleanup docs
* cleanup docs
* fix links
* add notes for users
* fix grammar
* shortcut to get_indexer for performance, if possible
* undo last commit
* add dispatch for range
* fix grammar
* update docs

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>
pyjanitor-devs · Oct 31, 2022 · 8445dc0
1 parent 5ebf799
Showing 14 changed files with 1,249 additions and 819 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,21 +6,20 @@
 -   [DOC] Updated developer guide docs.
 -   [ENH] Allow column selection/renaming within conditional_join. Issue #1102. Also allow first or last match. Issue #1020 @samukweku.
 -   [ENH] New decorator `deprecated_kwargs` for breaking API. #1103 @Zeroto521
--   [ENH] Extend select_columns to support non-string columns. Also allow selection on MultiIndex columns via level parameter. Issue #1105 @samukweku
+-   [ENH] Extend select_columns to support non-string columns. Issue #1105 @samukweku
 -   [ENH] Performance improvement for groupby_topk. Issue #1093 @samukweku
 -   [ENH] `min_max_scale` drop `old_min` and `old_max` to fit sklearn's method API. Issue #1068 @Zeroto521
 -   [ENH] Add `jointly` option for `min_max_scale` support to transform each column values or entire values. Default transform each column, similar behavior to `sklearn.preprocessing.MinMaxScaler`. (Issue #1067, PR #1112, PR #1123) @Zeroto521
 -   [INF] Require pyspark minimal version is v3.2.0 to cut duplicates codes. Issue #1110 @Zeroto521
--   [ENH] Added support for extension arrays in `expand_grid`. Issue #1121 @samukweku
+-   [ENH] Add support for extension arrays in `expand_grid`. Issue #1121 @samukweku
 -   [ENH] Add `names_expand` and `index_expand` parameters to `pivot_wider` for exposing missing categoricals. Issue #1108 @samukweku
--   [ENH] Add fix  for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
+-   [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
 -   [ENH] `dropna` parameter added to `pivot_longer`. Issue #1132 @samukweku
 -   [INF] Update `mkdocstrings` version and to fit its new coming features. PR #1138 @Zeroto521
 -   [BUG] Force `math.softmax` returning `Series`. PR #1139 @Zeroto521
 -   [INF] Set independent environment for building documentation. PR #1141 @Zeroto521
 -   [DOC] Add local documentation preview via github action artifact. PR #1149 @Zeroto521
 -   [ENH] Enable `encode_categorical` handle 2 (or more ) dimensions array. PR #1153 @Zeroto521
--   [ENH] Faster computation for a single non-equi join, with a numba engine. Issue #1102 @samukweku
 -   [TST] Fix testcases failing on Window. Issue #1160 @Zeroto521, and @samukweku
 -   [INF] Cancel old workflow runs via Github Action `concurrency`. PR #1161 @Zeroto521
 -   [ENH] Faster computation for non-equi join, with a numba engine. Speed improvement for left/right joins when `sort_by_appearance` is False. Issue #1102 @samukweku
@@ -29,6 +28,7 @@
 -   [ENH] Fix error when `sort_by_appearance=True` is combined with `dropna=True`. Issue #1168 @samukweku
 -   [ENH] Add explicit default parameter to `case_when` function. Issue #1159 @samukweku
 -   [BUG] pandas 1.5.x `_MergeOperation` doesn't have `copy` keyword anymore. Issue #1174 @Zeroto521
+-   [ENH] `select_rows` function added for flexible row selection. Add support for MultiIndex selection via dictionary. Issue #1124 @samukweku
 -   [TST] Compat with macos and window, to fix `FailedHealthCheck` Issue #1181 @Zeroto521
 -   [INF] Merge two docs CIs (`docs-preview.yml` and `docs.yml`) to one. And add `documentation` pytest mark. PR #1183 @Zeroto521
 

diff --git a/examples/notebooks/select_columns.ipynb b/examples/notebooks/select_columns.ipynb
@@ -433,7 +433,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]"
   },
   "orig_nbformat": 4
  },

diff --git a/janitor/functions/__init__.py b/janitor/functions/__init__.py
@@ -64,7 +64,7 @@
 from .reorder_columns import reorder_columns
 from .round_to_fraction import round_to_fraction
 from .row_to_names import row_to_names
-from .select_columns import select_columns
+from .select import select_columns, select_rows
 from .shuffle import shuffle
 from .sort_column_value_order import sort_column_value_order
 from .sort_naturally import sort_naturally

diff --git a/janitor/functions/coalesce.py b/janitor/functions/coalesce.py
@@ -4,7 +4,7 @@
 import pandas_flavor as pf
 
 from janitor.utils import check, deprecated_alias
-from janitor.functions.utils import _select_column_names
+from janitor.functions.utils import _select_index
 
 
 @pf.register_dataframe_method
@@ -95,7 +95,8 @@ def coalesce(
             "The number of columns to coalesce should be a minimum of 2."
         )
 
-    column_names = _select_column_names([*column_names], df)
+    indices = _select_index([*column_names], df, axis="columns")
+    column_names = df.columns[indices]
 
     if target_column_name:
         check("target_column_name", target_column_name, [str])
@@ -106,7 +107,7 @@ def coalesce(
     if target_column_name is None:
         target_column_name = column_names[0]
 
-    outcome = df.filter(column_names).bfill(axis="columns").iloc[:, 0]
+    outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]
     if outcome.hasnans and (default_value is not None):
         outcome = outcome.fillna(default_value)
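
To make the changed `outcome` line concrete, here is a minimal sketch in plain pandas (no pyjanitor import) of the back-fill step, using a made-up three-column frame. It also shows one observable difference between the old `df.filter(...)` call and the new `df.loc(axis=1)[...]` call: `filter` quietly skips labels that do not exist, while `.loc` raises. Whether that difference motivated the change is an assumption; the commit does not say.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [np.nan, 2.0, np.nan],
        "c": [7.0, 8.0, np.nan],
    }
)
column_names = ["a", "b", "c"]

# Same expression as the new line in the diff: back-fill across the
# selected columns, then keep the first column, which now holds the
# first non-null value of each row.
outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]

# Rows that are null in every column fall back to a default.
default_value = 0.0
if outcome.hasnans and (default_value is not None):
    outcome = outcome.fillna(default_value)
coalesced = outcome.tolist()

# `filter` keeps only the labels that exist and ignores the rest ...
silently_dropped = df.filter(["a", "missing"]).columns.tolist()

# ... whereas `.loc` raises a KeyError for a missing label.
loc_raises = False
try:
    df.loc(axis=1)[["a", "missing"]]
except KeyError:
    loc_raises = True
```

With this input, each row ends up with its first non-null value from left to right, and the all-null row receives the default.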
 

diff --git a/janitor/functions/conditional_join.py b/janitor/functions/conditional_join.py
@@ -47,7 +47,7 @@ def conditional_join(
     especially if the intervals do not overlap.
 
     Column selection in `df_columns` and `right_columns` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.
 
     For strictly non-equi joins,
     involving either `>`, `<`, `>=`, `<=` operators,
@@ -143,7 +143,7 @@ def conditional_join(
     :param keep: Choose whether to return the first match,
         last match or all matches. Default is `all`.
     :param use_numba: Use numba, if installed, to accelerate the computation.
-        Default is `False`.
+        Applicable only to strictly non-equi joins. Default is `False`.
     :returns: A pandas DataFrame of the two merged Pandas objects.
     """
 
@@ -1214,10 +1214,11 @@ def _cond_join_select_columns(columns: Any, df: pd.DataFrame):
     Returns a Pandas DataFrame.
     """
 
-    df = df.select_columns(columns)
-
     if isinstance(columns, dict):
+        df = df.select_columns([*columns])
         df.columns = [columns.get(name, name) for name in df]
+    else:
+        df = df.select_columns(columns)
 
     return df
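
The dictionary branch above selects by the dictionary's keys and then renames using its values. A plausible reading, based on the "dict support for MultiIndex indexing" and "keep dict as-is in conditional_join" entries in the commit message, is that `select_columns` now gives dictionaries their own meaning, so the helper must unpack the keys before delegating. The sketch below reproduces the select-then-rename behaviour with plain pandas; `select_and_rename` is a hypothetical name, and `df.loc[:, ...]` stands in for `df.select_columns(...)`, so none of the richer selection syntax is covered.

```python
import pandas as pd


def select_and_rename(columns, df):
    """Hypothetical stand-in for the helper in the diff (plain pandas only)."""
    if isinstance(columns, dict):
        # Select by the dictionary keys ...
        df = df.loc[:, [*columns]]
        # ... then rename each selected column to the mapped value.
        df.columns = [columns.get(name, name) for name in df]
    else:
        # Assumes a list of existing labels in this simplified version.
        df = df.loc[:, columns]
    return df


frame = pd.DataFrame({"id": [1, 2], "value": [10, 20], "other": [0, 0]})

renamed = select_and_rename({"id": "left_id", "value": "left_value"}, frame)
plain = select_and_rename(["id", "other"], frame)
```

The original frame is left untouched because `.loc` with a list of labels returns a new object.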
 

diff --git a/janitor/functions/pivot.py b/janitor/functions/pivot.py
@@ -15,7 +15,7 @@
 from pandas.core.dtypes.concat import concat_compat
 
 from janitor.functions.utils import (
-    _select_column_names,
+    _select_index,
     _computations_expand_grid,
 )
 from janitor.utils import check
@@ -52,7 +52,7 @@ def pivot_longer(
     row axis.
 
     Column selection in `index` and `column_names` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.
 
     Example:
 
@@ -382,17 +382,35 @@ def _data_checks_pivot_longer(
                 "when the columns are a MultiIndex."
             )
 
+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if column_names is not None:
-        if is_list_like(column_names):
-            column_names = list(column_names)
-        column_names = _select_column_names(column_names, df)
-        column_names = list(column_names)
+        if is_multi_index:
+            column_names = _check_tuples_multiindex(
+                df.columns, column_names, "column_names"
+            )
+        else:
+            if is_list_like(column_names):
+                column_names = list(column_names)
+            indices = _select_index(column_names, df, axis="columns")
+            column_names = df.columns[indices]
+            if not is_list_like(column_names):
+                column_names = [column_names]
+            else:
+                column_names = list(column_names)
 
     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)
 
     if index is None:
         if column_names is None:
@@ -1181,7 +1199,7 @@ def pivot_wider(
 
     Column selection in `index`, `names_from` and `values_from`
     is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.
 
     A ValueError is raised if the combination
     of the `index` and `names_from` is not unique.
@@ -1455,27 +1473,69 @@ def _data_checks_pivot_wider(
     checking happens.
     """
 
+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            if not isinstance(index, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the index argument."
+                )
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)
 
     if names_from is None:
         raise ValueError(
             "pivot_wider() is missing 1 required argument: 'names_from'"
         )
 
-    if is_list_like(names_from):
-        names_from = list(names_from)
-    names_from = _select_column_names(names_from, df)
-    names_from = list(names_from)
+    if is_multi_index:
+        if not isinstance(names_from, list):
+            raise TypeError(
+                "For a MultiIndex column, pass a list of tuples "
+                "to the names_from argument."
+            )
+        names_from = _check_tuples_multiindex(
+            df.columns, names_from, "names_from"
+        )
+    else:
+        if is_list_like(names_from):
+            names_from = list(names_from)
+        indices = _select_index(names_from, df, axis="columns")
+        names_from = df.columns[indices]
+        if not is_list_like(names_from):
+            names_from = [names_from]
+        else:
+            names_from = list(names_from)
 
     if values_from is not None:
-        if is_list_like(values_from):
-            values_from = list(values_from)
-        out = _select_column_names(values_from, df)
-        out = list(out)
+        if is_multi_index:
+            if not isinstance(values_from, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the values_from argument."
+                )
+            out = _check_tuples_multiindex(
+                df.columns, values_from, "values_from"
+            )
+        else:
+            if is_list_like(values_from):
+                values_from = list(values_from)
+            indices = _select_index(values_from, df, axis="columns")
+            out = df.columns[indices]
+            if not is_list_like(out):
+                out = [out]
+            else:
+                out = list(out)
         # hack to align with pd.pivot
         if values_from == out[0]:
             values_from = out[0]
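
The same normalisation block appears for `index`, `names_from` and `values_from`: positions come back from `_select_index`, `df.columns[indices]` turns them into labels, and a scalar result is wrapped in a list. What `_select_index` returns is not visible in this diff, so the sketch below uses `Index.get_loc` and `Index.get_indexer` as stand-ins for "one position" and "several positions"; `labels_from_positions` is a hypothetical name.

```python
import pandas as pd
from pandas.api.types import is_list_like

df = pd.DataFrame({"id": [1, 2], "x": [3, 4], "y": [5, 6]})


def labels_from_positions(indices, columns):
    """Map positions to labels and always return a list (illustrative)."""
    labels = columns[indices]
    if not is_list_like(labels):
        # A single integer position yields a scalar label.
        return [labels]
    # An array of positions yields an Index of labels.
    return list(labels)


# Stand-ins for whatever `_select_index` returns (an assumption).
single = labels_from_positions(df.columns.get_loc("x"), df.columns)
several = labels_from_positions(
    df.columns.get_indexer(["id", "y"]), df.columns
)
```

Either way the caller gets a plain list of labels, which is what the rest of the checks in this function operate on.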
@@ -1550,3 +1610,27 @@ def _expand(indexer, retain_categories):
                 ordered=indexer.ordered,
             )
     return indexer
+
+
+def _check_tuples_multiindex(indexer, args, param):
+    """
+    Check entries for tuples,
+    if indexer is a MultiIndex.
+
+    Returns a list of tuples.
+    """
+    all_tuples = (isinstance(arg, tuple) for arg in args)
+    if not all(all_tuples):
+        raise TypeError(
+            f"{param} must be a list of tuples "
+            "when the columns are a MultiIndex."
+        )
+
+    not_found = set(args).difference(indexer)
+    if any(not_found):
+        raise KeyError(
+            f"Tuples {*not_found,} in the {param} "
+            "argument do not exist in the dataframe's columns."
+        )
+
+    return args
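
As a quick illustration of how the new helper behaves, the sketch below copies `_check_tuples_multiindex` locally and calls it with valid and invalid arguments. A plain list of tuples stands in for `df.columns`; this is an assumption made to keep the example dependency-free, and it works because the helper only iterates over `indexer` (a pandas MultiIndex iterates as tuples as well).

```python
def _check_tuples_multiindex(indexer, args, param):
    """Local copy of the helper added in this diff."""
    all_tuples = (isinstance(arg, tuple) for arg in args)
    if not all(all_tuples):
        raise TypeError(
            f"{param} must be a list of tuples "
            "when the columns are a MultiIndex."
        )

    not_found = set(args).difference(indexer)
    if any(not_found):
        raise KeyError(
            f"Tuples {*not_found,} in the {param} "
            "argument do not exist in the dataframe's columns."
        )

    return args


# Stand-in for MultiIndex columns (hypothetical labels).
columns = [("A", "x"), ("A", "y"), ("B", "x")]

# Every entry is a tuple and exists: the input is returned unchanged.
ok = _check_tuples_multiindex(columns, [("A", "x"), ("B", "x")], "index")

# A non-tuple entry is rejected before any lookup happens.
try:
    _check_tuples_multiindex(columns, ["A", ("B", "x")], "index")
except TypeError as err:
    type_error = str(err)

# A tuple that is absent from the columns raises a KeyError.
try:
    _check_tuples_multiindex(columns, [("C", "z")], "names_from")
except KeyError as err:
    key_error = err.args[0]
```

Note that the helper validates but does not expand anything: unlike the non-MultiIndex path, no pattern, slice or callable selection is applied to the tuples.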