Initial draft: from_dummies #41902

pckSF · 2021-06-09T17:01:24Z

Converts dummy codes to categorical variables.

Closes #8745
Passes atomic options tests
Passes linting tests
Whatsnew entry

Tests will be expanded with composite options and edge cases as soon as the final scope is defined.
~~I tried to mirror the get_dummies function wherever possible to make this as close an inverse of it as possible~~

The primary goal of the draft is to demonstrate the options I had in mind; optimization is the next step once we have decided what to keep, cut, or add.
(Docstrings etc. will be added based on results)

Integration as a method of Categorical, as proposed in #34426, can be considered as a next step if moving in that direction is planned.

Current Options:

no arguments

Assumes all columns to be categories of the same variable, which is called categories.

example:

>>> dummies
      a   b   c   d
0     1   0   0   0      
1     0   1   0   0      
2     1   0   0   0
3     0   0   0   1      
4     0   0   1   0         

>>> from_dummies(dummies)
    categories
0       a      
1       b       
2       a
3       d
4       c
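
For comparison, here is the same input run against the function as it was eventually released. This is only a sketch and assumes pandas 1.5 or later, where `pd.from_dummies` is available. In that version the single output column appears to be named with the empty string rather than `categories` as in this draft, so the example reads the column by position.

```python
import pandas as pd

# Same dummy-coded frame as in the example above.
dummies = pd.DataFrame(
    {
        "a": [1, 0, 1, 0, 0],
        "b": [0, 1, 0, 0, 0],
        "c": [0, 0, 0, 0, 1],
        "d": [0, 0, 0, 1, 0],
    }
)

# Assumes pandas >= 1.5. Without `sep`, all columns are treated as
# categories of one single variable.
decoded = pd.from_dummies(dummies)

# Read the only column by position instead of relying on its name.
categories = decoded.iloc[:, 0].tolist()
print(categories)
```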

subset -- REMOVED --

Select which columns of the input DataFrame to include for decoding.
None:
- Assumes the entire passed DataFrame consists of dummy coded variables

List or Index:

Just returns the decoded subset passed in the list.

example:

>>> dummies
   C    col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1

>>> from_dummies(dummies, sep="_", subset=["col1_a", "col1_b", "col2_a", "col2_b", "col2_c"])
  col1    col2 
0  a       b
1  b       a 
2  a       c

variables -- REMOVED --

Inverts the prefix argument of get_dummies

Using:

>>> dummies
  col1_a  col1_b  col2_a  col2_b  col2_c
0    1       0       0       1       0
1    0       1       1       0       0
2    1       0       0       0       1

None:

Variable names are taken from prefix.

example:

>>> from_dummies(dummies, sep="_", variables=None)
  col1    col2 
0   a       b
1   b       a 
2   a       c

str:

Variable names are numbered from the passed string.

example:

>>> from_dummies(dummies, sep="_", variables="Varname")
   Varname1   Varname2 
0     a          b
1     b          a 
2     a          c

List:

Variable names are obtained (in order) from the passed list.

example:

>>> from_dummies(dummies, sep="_", variables=["One", "Two"])
    One     Two 
0    a       b
1    b       a 
2    a       c

Dict:
- Variable names are obtained from passed prefix to variable mapping.
- Currently orders the returned columns by their order in the passed dictionary; not sure whether that should be changed to strictly follow the order of prefixes in the dummy data.
- example:
```
>>> from_dummies(dummies, sep="_", variables={"col2": "One", "col1": "Two"})
    One     Two   
0    b       a
1    a       b
2    c       a
```

sep

Separator used in the column names of the dummy categories: the character(s) separating the category names from the prefixes. For example, if your column names are prefix_A and prefix_B, you can split on the underscore by specifying sep='_'.
Required argument if prefixes are to be separated, as there is no default separator.

str:

Splits columns by first instance of passed string to obtain variables and elements.

example:

>>> dummies
   col1_a  col1_b  col2_a  col2_b  col2_c
0    1       0       0       1       0
1    0       1       1       0       0
2    1       0       0       0       1

>>> from_dummies(dummies, sep="_")
   col1    col2 
0   a       b
1   b       a 
2   a       c
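
The "split by first instance" rule can be sketched as follows. `decode_by_prefix` is a hypothetical helper written for illustration, not the code from this PR, and it assumes every row has exactly one 1 per variable.

```python
from collections import defaultdict

import pandas as pd


def decode_by_prefix(dummies, sep):
    """Hypothetical helper: decode dummy columns grouped by their prefix."""
    # Group column names by everything before the FIRST occurrence of `sep`.
    groups = defaultdict(list)
    for col in dummies.columns:
        prefix, _, _ = col.partition(sep)
        groups[prefix].append(col)

    decoded = {}
    for prefix, cols in groups.items():
        # Column label holding the 1 in each row (assumes one 1 per row).
        winners = dummies.loc[:, cols].idxmax(axis=1)
        # Keep only the part after the first `sep`, i.e. the category name.
        decoded[prefix] = winners.map(lambda name: name.partition(sep)[2])
    return pd.DataFrame(decoded)


dummies = pd.DataFrame(
    {
        "col1_a": [1, 0, 1],
        "col1_b": [0, 1, 0],
        "col2_a": [0, 1, 0],
        "col2_b": [1, 0, 0],
        "col2_c": [0, 0, 1],
    }
)
result = decode_by_prefix(dummies, sep="_")
print(result)
```

Because `str.partition` splits only on the first separator, a column such as `col1_a_b` would decode to the category `a_b` under prefix `col1`.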

list: -- REMOVED --

Splits columns by first instance of any of the separators passed in the list.

example:

>>> dummies
  col1_a  col1_b  col2-a  col2-b  col2-c
0   1       0       0       1       0
1   0       1       1       0       0
2   1       0       0       0       1

>>> from_dummies(dummies, sep=["_", "-"]) 
   col1    col2 
0   a       b
1   b       a 
2   a       c

dict: -- REMOVED --

Splits columns with prefix by first instance of corresponding separator.

example:

>>> dummies
   col1_a-a  col1_b-b  col_2-a   col_2-b   col_2-c
0     1         0         0         1         0
1     0         1         1         0         0
2     1         0         0         0         1

>>> from_dummies(dummies, sep={"col1": "_", "col_2": "-"})
    col1     col_2 
0    a-a       b
1    b-b       a 
2    a-a       c

dummy_na -- REMOVED --

Mirrors the dummy_na argument of get_dummies

False but contains NaN:

Considers rows with no assignments as NaN.

example:

>>> dummies
   col1_a  col1_b  col2_a  col2_b  col2_c
0    1       0       0       0       0
1    0       1       1       0       0
2    0       0       0       0       1

>>> from_dummies(dummies, sep="_", dummy_na=False)
    col1    col2 
0    a      NaN
1    b       a 
2   NaN      c

True:

Considers rows with no assignments as NaN.

example:

>>> dummies
   col1_a  col1_b  col1_NaN  col2_a  col2_b  col2_c col2_NaN
0    1       0        0        0       0       0       1
1    0       1        0        1       0       0       0
2    0       0        1        0       0       1       0

>>> from_dummies(dummies, sep="_", dummy_na=True)
   col1    col2 
0   a      NaN
1   b       a 
2  NaN      c

base_category

Inverts the drop_first argument of get_dummies

Using:

>>> dummies
   col1_a  col1_b  col2_a  col2_b  col2_c
0     1       0       0       0       0
1     0       1       1       0       0
2     0       0       0       0       1

None: -- REMOVED --

Without dummy_na=True, non-assigned rows are considered NaN.

example:

>>> from_dummies(dummies, sep="_", base_category=None)
   col1    col2 
0    a      NaN
1    b       a 
2   NaN      c

str:

Assumes every row with a missing assignment belongs to the passed base category.

example:

>>> from_dummies(dummies, sep="_", base_category="x")
   col1    col2 
0    a       x
1    b       a 
2    x       c

List: -- REMOVED --

The base categories for variables are obtained (in order) from the passed list.

example:

>>> from_dummies(dummies, sep="_", base_category=["x", "y"])
   col1    col2 
0    a       y
1    b       a 
2    x       c

Dict:

The base categories for variables are obtained from the passed prefix to base category mapping.

example:

>>> from_dummies(dummies, sep="_", base_category={"col2": "x", "col1": "y"})
   col1    col2 
0    a       x
1    b       a 
2    y       c
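
The str and dict behaviours described above can be sketched like this. `decode_with_base` is a hypothetical helper for illustration only, not the PR's implementation; it treats a row as unassigned when all of a variable's dummy columns are 0.

```python
import pandas as pd


def decode_with_base(dummies, sep, base_category):
    """Hypothetical helper: give unassigned rows a base category."""
    # Group columns by the prefix before the first `sep`.
    prefixes = {}
    for col in dummies.columns:
        prefixes.setdefault(col.partition(sep)[0], []).append(col)

    decoded = {}
    for prefix, cols in prefixes.items():
        block = dummies.loc[:, cols]
        # A str applies to every variable; a dict maps prefix -> base category.
        if isinstance(base_category, dict):
            base = base_category[prefix]
        else:
            base = base_category
        names = block.idxmax(axis=1).map(lambda name: name.partition(sep)[2])
        # Rows with no 1 in any of the variable's columns get the base category.
        unassigned = block.sum(axis=1) == 0
        decoded[prefix] = names.mask(unassigned, base)
    return pd.DataFrame(decoded)


dummies = pd.DataFrame(
    {
        "col1_a": [1, 0, 0],
        "col1_b": [0, 1, 0],
        "col2_a": [0, 1, 0],
        "col2_b": [0, 0, 0],
        "col2_c": [0, 0, 1],
    }
)
with_str = decode_with_base(dummies, "_", "x")
with_dict = decode_with_base(dummies, "_", {"col2": "x", "col1": "y"})
print(with_dict)
```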

fillna -- REMOVED --

Using:

>>> dummies
   col1_a  col1_b  col2_a  col2_b  col2_c
0     1       0       0      NaN      0
1     0       1       1       0       0
2     0      NaN      0       0       1

True:

Can result in double assignment, which will raise the respective error.

example:

>>> from_dummies(dummies, sep="_", fillna=True)
   col1    col2 
0    a       b
1    b       a 
2    b       c

False:

Can result in an unassigned row, which will be treated as any other unassigned row.

example:

>>> from_dummies(dummies, sep="_", fillna=False)
   col1    col2 
0    a      NaN
1    b       a 
2   NaN      c

To Discuss:

Change, remove, or add options?

To-Do

pandas/tests/reshape/test_from_dummies.py

MarcoGorelli · 2021-06-13T19:05:23Z

Wow, some serious effort here!

The examples you've put in the body of the PR are kind of hard to follow - perhaps if you write them from a Python REPL they're easier to understand?

e.g.:

>>> dummies
   col_a  col_b  col_c
0      1      0      0
1      0      1      0
2      0      0      1
3      1      0      0
>>> pd.from_dummies(dummies)
   col 
0    a
1    b
2    c
3    a

pandas/tests/reshape/test_from_dummies.py

MarcoGorelli · 2021-06-15T18:24:58Z

pandas/core/reshape/reshape.py

+            )
+
+    cat_data = {var: [] for _, var in variables.items()}
+    for index, row in data.iterrows():


Iterating over rows in Python will be too slow - can you have a look at how the (now closed) PR did it?

Removed the row iteration. At the moment this results in a problem with NaN values in the output DF, which I am currently looking into. I can mirror the method of the old PR if it is more efficient (or if it provides an easy solution for the NaN issue).
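
To illustrate the performance point, here is a minimal sketch contrasting a row-by-row loop with a vectorized lookup. It is not the approach of the closed PR or of the code that was merged here; it only shows that the same decoding can be done without a Python-level loop over rows, and it assumes each row contains exactly one 1.

```python
import pandas as pd

dummies = pd.DataFrame(
    {"col_a": [1, 0, 0, 1], "col_b": [0, 1, 0, 0], "col_c": [0, 0, 1, 0]}
)

# Row-by-row version: one Python-level iteration (and one Series) per row.
looped = []
for _, row in dummies.iterrows():
    looped.append(row.idxmax())

# Vectorized version: a single argmax over the underlying NumPy array,
# then one positional lookup into the column labels.
positions = dummies.to_numpy().argmax(axis=1)
vectorized = dummies.columns[positions].tolist()

print(looped == vectorized)
```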

pandas/core/reshape/reshape.py

MarcoGorelli · 2021-07-03T11:20:49Z

This is still marked as "draft" - just checking, in case you think it's ready for review

pckSF · 2021-07-03T11:41:39Z

This is still marked as "draft" - just checking, in case you think it's ready for review

Good point, I will write the documentation etc. and add more tests for the features we decided to keep, then I will set it ready for review.

pandas/core/reshape/reshape.py

pckSF · 2021-07-19T22:55:37Z

There are still some work-in-progress points, such as the user guide and more tests, but these will incorporate the incoming feedback.

mroeschke

Looks fairly good, just some code checks failing in the CI

jreback

minor comments, ping on green.

@pandas-dev/pandas-core if any objections here?

pandas/core/reshape/encoding.py

jreback · 2022-06-05T23:03:01Z

@pckSF some doc-string validation issues: https://github.com/pandas-dev/pandas/runs/6738570255?check_suite_focus=true

jreback · 2022-06-06T12:04:45Z

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build

pckSF · 2022-06-06T12:30:14Z

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build

Jep, it seems like I always break something when trying to fix something else ... I will stay on it until everything is green.

jreback · 2022-06-06T12:51:20Z

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build

Jep, it seems like I always break something when trying to fix something else ... I will stay on it until everything is green.

:-> also can try to build that page locally and isolate things
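
The thread does not show what exactly was wrong with the See Also section, so the following is only an assumed illustration of the shape numpydoc generally expects: each entry is `name : Description.` with a space on both sides of the colon and a trailing period. `decode_example` is a hypothetical function, not code from this PR.

```python
def decode_example(data):
    """
    Return the input unchanged (hypothetical function, for illustration).

    See Also
    --------
    get_dummies : Convert Series or DataFrame to dummy codes.
    Categorical : Represent a categorical variable in classic.
    """
    return data


# The docstring is plain text, so the section can be inspected directly.
doc_lines = [line.strip() for line in decode_example.__doc__.splitlines()]
print(doc_lines)
```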

mroeschke

Thanks LGTM. Timeout and doc failure are unrelated

mroeschke · 2022-06-24T21:36:18Z

@bashtage @MarcoGorelli @fangchenli When you have the chance to review and approve if all looks good, it would be appreciated.

MarcoGorelli

Sorry, one more thing I hadn't picked up on before

pandas/core/reshape/encoding.py

MarcoGorelli

Awesome, looks good to me, and thanks for your patience here

Only slight concern I'd have is using data_to_decode[prefix_slice] instead of data_to_decode.loc[:, prefix_slice]. I just remembered there being some cases where selecting columns using __getitem__ doesn't work as one would expect, although I haven't been able to construct one that would fail here.

But if that's not an issue, then this looks good to me, happy to see it land!
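
One well-known case where `__getitem__` and `.loc[:, ...]` diverge is a boolean list. This is not necessarily the case meant above, and it would not apply to a list of column names like `prefix_slice`; it is only a sketch of why `.loc` is the more explicit choice.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
mask = [True, False, True]

# __getitem__ with a boolean list filters ROWS ...
by_getitem = df[mask]

# ... while .loc[:, mask] explicitly selects COLUMNS.
by_loc = df.loc[:, mask]

print(by_getitem)
print(by_loc)
```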

pckSF · 2022-06-25T10:45:38Z

.loc[:, prefix_slice]

I just changed that real quick to be on the safe side. Also, thanks for your and the team's patience :)

jreback · 2022-06-30T19:25:30Z

thanks @pckSF really nice

Initial draft: from_dummies

f3e6afe

pckSF commented Jun 9, 2021

pandas/tests/reshape/test_from_dummies.py

fangchenli requested changes Jun 9, 2021

pandas/tests/reshape/test_from_dummies.py

Clean-up tests with fixtures

c7c5588

MarcoGorelli requested changes Jun 13, 2021

Make tests more elegant

d06540f

MarcoGorelli requested changes Jun 15, 2021

MarcoGorelli requested changes Jun 19, 2021

pandas/core/reshape/reshape.py

pckSF added 3 commits June 22, 2021 23:19

Remove variable argument

1fa4e8a

Remove dummy_na argument

c7f8ec8

Remove loop over df rows

3cc98ca

pckSF commented Jun 30, 2021

pandas/core/reshape/reshape.py

pckSF added 2 commits July 3, 2021 00:24

Add fillna and basic tests

0e131c6

Fix testnames regarding nan and unassigned

9f74dc7

MarcoGorelli requested changes Jul 3, 2021

pandas/core/reshape/reshape.py

Remove fillna

442b340

pckSF commented Jul 6, 2021

pandas/core/reshape/reshape.py

pckSF added 7 commits July 11, 2021 16:42

Add from_dummies docstring

38cf04d

Add docstring to _from_dummies_1d

8eccfab

Fix column behaviour

fd027c5

Update handling of unassigned rows

106ff3c

Start user_guide entry

2019228

Draft reshaping user_guide entry

be39c05

Fix: remove temp workspace separation

d406227

pckSF commented Jul 19, 2021

pandas/core/reshape/reshape.py

pckSF marked this pull request as ready for review July 19, 2021 22:52

Add space before colon for numpydoc

c32e514

mroeschke reviewed Jun 5, 2022

jreback requested changes Jun 5, 2022

pandas/core/reshape/encoding.py

pandas/core/reshape/encoding.py

pckSF added 4 commits June 6, 2022 11:17

Added pd.Categorical to See Also

0fda02f

Add version added

62b09ae

Add from_dummies to get_dummies see also

1dcdd9a

Fix see also missing period error

3c00690

pckSF added 2 commits June 6, 2022 14:26

Fix See Also of get_dummies

4425b4a

Merge remote-tracking branch 'upstream/main' into add-from_dummies

dc144f7

pckSF added 2 commits June 22, 2022 21:49

Fix docs compiler error

15503b0

Merge from master

61a348b

mroeschke approved these changes Jun 23, 2022

jreback approved these changes Jun 24, 2022

fangchenli approved these changes Jun 24, 2022

MarcoGorelli requested changes Jun 25, 2022

pandas/core/reshape/encoding.py

pckSF added 2 commits June 25, 2022 12:04

Fix default_category=0 bug and add corresponding tests

f06a45c

Merge remote-tracking branch 'upstream/main' into add-from_dummies

f3a0f83

MarcoGorelli approved these changes Jun 25, 2022

MarcoGorelli requested a review from bashtage June 25, 2022 10:40

Use .loc[:, prefix_slice] instead of [prefix_slice]

23c133f

jreback merged commit ed55bdf into pandas-dev:main Jun 30, 2022

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

Initial draft: from_dummies (pandas-dev#41902)

0de6f26

mroeschke mentioned this pull request Aug 15, 2022

API/ENH: from_dummies #8745

Closed
