
Initial draft: from_dummies #41902

Merged
merged 121 commits into from
Jun 30, 2022
Commits (121)
f3e6afe
Initial draft: from_dummies
pckSF Jun 9, 2021
c7c5588
Clean-up tests with fixtures
pckSF Jun 9, 2021
d06540f
Make tests more elegant
pckSF Jun 14, 2021
1fa4e8a
Remove variable argument
pckSF Jun 22, 2021
c7f8ec8
Remove dummy_na argument
pckSF Jun 22, 2021
3cc98ca
Remove loop over df rows
pckSF Jun 30, 2021
0e131c6
Add fillna and basic tests
pckSF Jul 2, 2021
9f74dc7
Fix testnames regarding nan and unassigned
pckSF Jul 2, 2021
442b340
Remove fillna
pckSF Jul 3, 2021
38cf04d
Add from_dummies docstring
pckSF Jul 11, 2021
8eccfab
Add docstring to _from_dummies_1d
pckSF Jul 11, 2021
fd027c5
Fix column behaviour
pckSF Jul 11, 2021
106ff3c
Update handling of unassigned rows
pckSF Jul 11, 2021
2019228
Start user_guide entry
pckSF Jul 17, 2021
be39c05
Draft reshaping user_guide entry
pckSF Jul 19, 2021
d406227
Fix: remove temp workspace separation
pckSF Jul 19, 2021
61a25e0
Add raise ValueError on unassigned values
pckSF Aug 5, 2021
1d104f8
Merge updates from upstream/master
pckSF Aug 11, 2021
5bcfbb4
Fix mypy issues
pckSF Aug 11, 2021
ca6200e
Fix docstring multi-line statements
pckSF Aug 11, 2021
bf17cdb
Add TypeError for wrong dropped_first type
pckSF Aug 29, 2021
92b5dae
Add tests for incomplete separators
pckSF Sep 6, 2021
c2cd747
Add tests for complex prefix separators
pckSF Sep 6, 2021
dc50464
Remove magic handling of non-dummy columns
pckSF Sep 9, 2021
4d9cfd0
Removed to_series argument
pckSF Sep 9, 2021
82d6743
Renamed column argument to subset
pckSF Sep 9, 2021
153202d
Renamed tests to reflect the removal of to_series
pckSF Sep 9, 2021
d3dd9f7
Fix input data NA value test to account for subset
pckSF Sep 9, 2021
e6ec175
Renamed argument prefix_sep to just sep
pckSF Sep 9, 2021
ee6025d
Improve docstring for sep
pckSF Sep 9, 2021
4e741c8
Update user guide entry
pckSF Sep 9, 2021
1b4a8e9
Fix wrong variable name in docstring: d to df
pckSF Sep 9, 2021
90177be
Fix mypy issues
pckSF Sep 9, 2021
d58c668
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Sep 10, 2021
46457fa
Fix post upstream merge mypy issues
pckSF Sep 10, 2021
131f42b
Fix errors in user guide
pckSF Sep 10, 2021
1af65ac
Merge 'upstream/master' into add-from_dummies
pckSF Oct 6, 2021
6dacf53
Allow hashable categories
pckSF Oct 7, 2021
61edd30
Add None category to mixed_cats_basic test
pckSF Oct 16, 2021
04f360c
Add index to argument types and fix resulting mypy issues
pckSF Oct 21, 2021
7ff2f3b
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 16, 2021
56ea182
Remove list from dropped_first args
pckSF Nov 20, 2021
39a0199
Remove list from sep args
pckSF Nov 20, 2021
e05fe3f
Remove default category name
pckSF Nov 20, 2021
23f6c07
Adapt docstring examples to removal of list from sep and dropped_firs…
pckSF Nov 20, 2021
7190879
Update docstring: Remove default category name
pckSF Nov 20, 2021
012a1dd
Update rst: Add missing word
pckSF Nov 20, 2021
52ed909
Add from_dummies to reshaping api
pckSF Nov 20, 2021
d8e4743
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 20, 2021
0cf35d8
Add: allow dropped_first to be any hashable type
pckSF Nov 20, 2021
b9303bc
Add: Temporary mypy fix
pckSF Nov 20, 2021
3207534
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 22, 2021
8089fe5
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 24, 2021
55ad274
Add from_dummies to pandas __init__ file
pckSF Nov 27, 2021
1b17815
Add from_dummies to test_api tests
pckSF Nov 27, 2021
00c7b05
Fix docstring examples
pckSF Nov 27, 2021
07ba536
Adapt documentation to account for removal of list arguments
pckSF Nov 27, 2021
bbe41d0
Fix wrong parenthesis in docstring
pckSF Nov 27, 2021
329394b
Fix docstring example expected return
pckSF Nov 28, 2021
b83ac6a
Simplify from_dummies
pckSF Nov 29, 2021
1f5e1dc
Update user guide entry
pckSF Nov 29, 2021
8a3421b
Change arg dropped_first to implied_value
pckSF Nov 29, 2021
16cdaa0
Add docstring note and test for boolean dummy values
pckSF Nov 29, 2021
174df1f
Fix docstring typo
pckSF Nov 29, 2021
e45d3f8
Change arg implied_value to implied_category
pckSF Nov 29, 2021
e83faed
Fix docstring format mistakes
pckSF Dec 4, 2021
1e12e6a
Replace argmax/min with idxmax/min
pckSF Dec 4, 2021
24e9899
Reduce complexity by using defaultdict
pckSF Dec 4, 2021
c8e7a7d
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Dec 4, 2021
0ac8fff
Ignore dependency based mypy errors
pckSF Dec 16, 2021
6af6cad
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Dec 16, 2021
54fdcbd
Add Raises section to docstring
pckSF Dec 29, 2021
ced3ed0
Change implied_category to base_category
pckSF Jan 5, 2022
6db7744
Add proper reference to get_dummies in docstring
pckSF Jan 5, 2022
c84d973
Remove unnecessary copy of input data
pckSF Jan 5, 2022
842d335
Merge upstream master
pckSF Jan 5, 2022
8f91012
Fix docstring section order
pckSF Jan 5, 2022
84d5bd8
Remove redundant f-strings
pckSF Jan 10, 2022
fd0f985
Add check for 'data' type
pckSF Jan 10, 2022
6230d0f
Add TypeError for wrong data type to docstring
pckSF Jan 14, 2022
84a60f7
Add roundtrip tests get_dummies from_dummies
pckSF Jan 14, 2022
c78ef2a
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jan 14, 2022
52a9dea
Move from_dummies to encoding.py
pckSF Jan 29, 2022
bc658ba
Fix from_dummies import in test file
pckSF Jan 29, 2022
9fbca72
Update userguide versionadded to 1.5
pckSF Jan 30, 2022
2581fc9
Draft whats-new entry
pckSF Jan 31, 2022
85a0ed8
Change code-block to ipython
pckSF Jan 31, 2022
5b74039
Improve test names and organization
pckSF Feb 1, 2022
015ee94
Show DataFrames used in docstring examples
pckSF Feb 1, 2022
66c0292
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Feb 18, 2022
30b8ff1
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 3, 2022
b261656
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 18, 2022
555825b
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 22, 2022
9d6e571
Merge from upstream/main
pckSF Apr 1, 2022
9f1bb8e
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 2, 2022
dc52985
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 3, 2022
e7d6828
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 20, 2022
ae9f3d2
Fix whatsnew entry typo
pckSF Apr 20, 2022
a59ed4e
Fix whats-new
pckSF Apr 28, 2022
66c7a64
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 28, 2022
76221f8
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 30, 2022
7fa66b3
Change base_category to default_category
pckSF Jun 3, 2022
536f9c5
Merge updates from upstream/main
pckSF Jun 3, 2022
530889e
Add double ticks to render code in docstring
pckSF Jun 3, 2022
6536c65
Fix docstring typos and alignments
pckSF Jun 3, 2022
1272a23
Inline the check_len check for the default_category
pckSF Jun 3, 2022
fd3b115
Fix mypy issues by removing fixed ignores
pckSF Jun 3, 2022
bd5a118
Fix error encountered during docstring parsing
pckSF Jun 4, 2022
f7d08d0
Fix redundant backticks following :func:
pckSF Jun 4, 2022
c32e514
Add space before colon for numpydoc
pckSF Jun 4, 2022
0fda02f
Added pd.Categorical to See Also
pckSF Jun 6, 2022
62b09ae
Add version added
pckSF Jun 6, 2022
1dcdd9a
Add from_dummies to get_dummies see also
pckSF Jun 6, 2022
3c00690
Fix see also missing period error
pckSF Jun 6, 2022
4425b4a
Fix See Also of get_dummies
pckSF Jun 6, 2022
dc144f7
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jun 6, 2022
15503b0
Fix docs compiler error
pckSF Jun 22, 2022
61a348b
Merge from master
pckSF Jun 22, 2022
f06a45c
Fix default_category=0 bug and add corresponding tests
pckSF Jun 25, 2022
f3a0f83
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jun 25, 2022
23c133f
Use .loc[:, prefix_slice] instead of [prefix_slice]
pckSF Jun 25, 2022
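The final commit switches column selection from `[prefix_slice]` to `.loc[:, prefix_slice]`. A minimal sketch of the distinction (column names are illustrative): for a plain list of labels both spellings select the same columns, but `.loc` is explicit about the row axis and is the form the pandas docs recommend when the selection later feeds assignment.

```python
import pandas as pd

# Illustrative dummy frame with one prefix block plus an unrelated column.
df = pd.DataFrame({"a_x": [1, 0], "a_y": [0, 1], "other": [5, 6]})
prefix_slice = ["a_x", "a_y"]

via_getitem = df[prefix_slice]        # label-based __getitem__
via_loc = df.loc[:, prefix_slice]     # explicit: all rows, these columns
```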
162 changes: 162 additions & 0 deletions pandas/core/reshape/reshape.py
@@ -1053,6 +1053,168 @@ def get_empty_frame(data) -> DataFrame:
return DataFrame(dummy_mat, index=index, columns=dummy_cols)


def from_dummies(
Contributor:
we should consider moving get_dummies / from_dummies to a separate file (in /reshape), could be a precursor PR.

pckSF (Author), Dec 20, 2021:
I like that idea; it would improve clarity. What would be an elegant and obvious name for a collection of "reshape operations that change the data representation"? Maybe transform? Or would we rather collect more categorical/dummy-specific operations instead? The first option seems more intuitive to me. I will think about a name, since /reshape/transform.py could cause confusion with the .transform method.

Member:
one_hot_encoding.py?

pckSF (Author):
Or, if it's supposed to be a dummy-operations file: dummy_coding.py?

data,
to_series: bool = False,
variables: None | str | list[str] | dict[str, str] = None,
prefix_sep: str | list[str] | dict[str, str] = "_",
dummy_na: bool = False,
columns: None | list[str] = None,
dropped_first: None | str | list[str] | dict[str, str] = None,
) -> Series | DataFrame:
Contributor:
let's just always return a DataFrame, much simpler

pckSF (Author), Sep 9, 2021:
Good idea, and in line with that perspective, since the Series can always be obtained ex post.
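The "obtained ex post" point can be sketched as follows (the `decoded` frame and its column name are hypothetical, standing in for a single-category decode result):

```python
import pandas as pd

# A one-column DataFrame, as from_dummies would return for a single
# category block, can be reduced to a Series after the fact.
decoded = pd.DataFrame({"animal": ["cat", "dog", "cat"]})

as_series = decoded["animal"]       # explicit column selection
squeezed = decoded.squeeze(axis=1)  # or squeeze the single column
```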

"""
soon
"""
from pandas.core.reshape.concat import concat

if to_series:
return _from_dummies_1d(data, dummy_na, dropped_first)

data_to_decode: DataFrame
if columns is None:
# index data with a list of all columns that are dummies
cat_columns = []
non_cat_columns = []
for col in data.columns:
if any(ps in col for ps in prefix_sep):
cat_columns.append(col)
else:
non_cat_columns.append(col)
data_to_decode = data[cat_columns]
non_cat_data = data[non_cat_columns]
elif not is_list_like(columns):
raise TypeError("Input must be a list-like for parameter 'columns'")
else:
data_to_decode = data[columns]
non_cat_data = data[[col for col in data.columns if col not in columns]]

# get separator for each prefix and lists to slice data for each prefix
Contributor:
Hmm, this is very complicated. What are you actually trying to do here?

pckSF (Author):
I want to get all columns that correspond to a specific prefix such that I can extract the values for each block. I do this here to avoid deep nesting (and checking whether or not a column belongs to a prefix) later on, when the value for each entry is determined.
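The idea described in that reply can be sketched in isolation (column names and the prefix are illustrative): pull out the block of columns belonging to one prefix once, so per-row decoding never has to re-test membership.

```python
import pandas as pd

# Two dummy columns for prefix "col1" plus an unrelated column.
df = pd.DataFrame({"col1_a": [1, 0], "col1_b": [0, 1], "other": [5, 6]})

# Collect the columns for one prefix, then slice its value block.
prefix_slice = [c for c in df.columns if c.startswith("col1_")]
block = df[prefix_slice]
```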

if isinstance(prefix_sep, dict):
variables_slice = {prefix: [] for prefix in prefix_sep}
for col in data_to_decode.columns:
for prefix in prefix_sep:
if prefix in col:
variables_slice[prefix].append(col)
else:
sep_for_prefix = {}
variables_slice = {}
Contributor:
Could remove the if/else below by using a defaultdict(list) here.

pckSF (Author):
Awesome advice, thank you very much :)
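A minimal sketch of the reviewer's suggestion, outside the PR code (column names and separator are illustrative): `defaultdict(list)` removes the explicit "if prefix not in variables_slice" branch that appears in the loop below.

```python
from collections import defaultdict

columns = ["col1_a", "col1_b", "col2_a"]
sep = "_"

sep_for_prefix = {}
variables_slice = defaultdict(list)  # missing keys start as []
for col in columns:
    prefix = col.split(sep)[0]
    sep_for_prefix.setdefault(prefix, sep)
    variables_slice[prefix].append(col)  # no membership check needed
```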

for col in data_to_decode.columns:
ps = [ps for ps in prefix_sep if ps in col][0]
prefix = col.split(ps)[0]
if prefix not in sep_for_prefix:
sep_for_prefix[prefix] = ps
if prefix not in variables_slice:
variables_slice[prefix] = [col]
else:
variables_slice[prefix].append(col)
prefix_sep = sep_for_prefix

# validate number of passed arguments
def check_len(item, name) -> None:
if not len(item) == len(variables_slice):
len_msg = (
f"Length of '{name}' ({len(item)}) did not match the "
"length of the columns being encoded "
f"({len(variables_slice)})."
)
raise ValueError(len_msg)

# obtain prefix to category mapping
variables: dict[str, str]
if isinstance(variables, dict):
check_len(variables, "variables")
variables = variables
elif is_list_like(variables):
check_len(variables, "variables")
variables = dict(zip(variables_slice, variables))
elif isinstance(variables, str):
variables = dict(
zip(
variables_slice,
(f"{variables}{i}" for i in range(len(variables_slice))),
)
)
else:
variables = dict(zip(variables_slice, variables_slice))

if dropped_first:
if isinstance(dropped_first, dict):
check_len(dropped_first, "dropped_first")
elif is_list_like(dropped_first):
check_len(dropped_first, "dropped_first")
dropped_first = dict(zip(variables_slice, dropped_first))
else:
dropped_first = dict(
zip(variables_slice, [dropped_first] * len(variables_slice))
)

cat_data = {var: [] for _, var in variables.items()}
for index, row in data.iterrows():
Member:
Iterating over rows in Python will be too slow - can you have a look at how the (now closed) PR did it?

pckSF (Author), Jun 30, 2021:
Removed the row iteration. At the moment this results in a problem with NaN values in the output DataFrame, which I am currently looking into. I can mirror the method of the old PR if it is more efficient (or if it provides an easy solution for the NaN issue).
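The vectorized shape the thread converges on (a later commit replaces argmax with idxmax) can be sketched like this; column names and the separator are illustrative, and the sketch ignores the multi-assignment and NaN handling discussed above:

```python
import pandas as pd

# One prefix block of clean one-hot rows.
df = pd.DataFrame({"col_a": [1, 0, 0], "col_b": [0, 1, 0], "col_c": [0, 0, 1]})

# idxmax along axis=1 picks, per row, the column label holding the 1;
# the category is the suffix after the separator.
winners = df.idxmax(axis=1)
categories = winners.str.split("_").str[-1]
```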

for prefix, prefix_slice in variables_slice.items():
slice_sum = row[prefix_slice].sum()
if slice_sum > 1:
raise ValueError(
f"Dummy DataFrame contains multi-assignment(s) for prefix: "
f"'{prefix}' in row {index}."
)
elif slice_sum == 0:
if dropped_first:
category = dropped_first[prefix]
elif not dummy_na:
category = np.nan
else:
raise ValueError(
f"Dummy DataFrame contains no assignment for prefix: "
f"'{prefix}' in row {index}."
)
else:
cat_index = row[prefix_slice].argmax()
category = prefix_slice[cat_index].split(prefix_sep[prefix])[1]
if dummy_na and category == "NaN":
category = np.nan
cat_data[variables[prefix]].append(category)

if columns:
return DataFrame(cat_data)
else:
return concat([non_cat_data, DataFrame(cat_data)], axis=1)


def _from_dummies_1d(
data,
dummy_na: bool = False,
dropped_first: None | str = None,
) -> Series:
"""
soon
"""
if dropped_first and not isinstance(dropped_first, str):
raise ValueError("Only one dropped first value possible in 1D dummy DataFrame.")

cat_data = []
for index, row in data.iterrows():
row_sum = row.sum()
if row_sum > 1:
raise ValueError(
Contributor:
Couldn't you check this much earlier with a row sum after the conversion to boolean, e.g. if (data_to_decode.sum(1) > 1).any()?

pckSF (Author), Sep 10, 2021:
Hmm, that only works if there are no prefixes/multiple variables, as each prefix slice has to be checked individually and data_to_decode includes the entire subset.
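The per-prefix variant of the reviewer's check can be sketched as follows (frame, prefixes, and slices are illustrative): validate each prefix block separately, since a row legitimately carries one 1 per prefix.

```python
import pandas as pd

# Prefix "a" has a multi-assigned second row; prefix "b" is clean.
df = pd.DataFrame({"a_x": [1, 1], "a_y": [0, 1], "b_x": [1, 0], "b_y": [0, 1]})
slices = {"a": ["a_x", "a_y"], "b": ["b_x", "b_y"]}

# A prefix is invalid if any of its rows sums to more than one.
bad_prefixes = [
    prefix
    for prefix, cols in slices.items()
    if (df[cols].sum(axis=1) > 1).any()
]
```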

f"Dummy DataFrame contains multi-assignment in row {index}."
)
elif row_sum == 0:
if dropped_first:
category = dropped_first
elif not dummy_na:
category = np.nan
else:
raise ValueError(
f"Dummy DataFrame contains no assignment in row {index}."
)
else:
category = data.columns[row.argmax()]
if dummy_na and category == "NaN":
category = np.nan
cat_data.append(category)
return Series(cat_data)
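For reference, after the renamings in the commit history (prefix_sep to sep, dropped_first to default_category, removal of the Series path), the function landed in pandas 1.5 as pd.from_dummies. A minimal usage sketch, assuming pandas >= 1.5; the data is illustrative:

```python
import pandas as pd

# One-hot frame with a single "col1" prefix block.
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0]})

# Decode back to categories; always returns a DataFrame.
decoded = pd.from_dummies(df, sep="_")
```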


def _reorder_for_extension_array_stack(
arr: ExtensionArray, n_rows: int, n_columns: int
) -> ExtensionArray: