Initial draft: from_dummies #41902

Merged
merged 121 commits into from
Jun 30, 2022


pckSF
Contributor

@pckSF pckSF commented Jun 9, 2021

Converts dummy codes to categorical variables.

  • Closes #8745
  • Passes atomic options tests
  • Passes linting tests
  • Whatsnew entry

Tests will be expanded with composite options and edge cases as soon as the definition of the final scope is complete.
I tried to mirror the get_dummies function wherever possible so that this is as close an inverse of it as possible.

The primary goal of the draft is to demonstrate the options I had in mind; optimization is the next step once we have decided what we want to keep, cut, or add.
(Docstrings etc. will be added based on the results.)

Integration as a method of Categorical, as proposed in #34426, can be considered as a next step if moving in that direction is planned.

Current Options:

no arguments

  • Assumes all columns to be categories of the same variable, which is named categories
    • example:
      >>> dummies
            a   b   c   d
      0     1   0   0   0      
      1     0   1   0   0      
      2     1   0   0   0
      3     0   0   0   1      
      4     0   0   1   0         
      
      >>> from_dummies(dummies)
          categories
      0       a      
      1       b       
      2       a
      3       d
      4       c     
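
A minimal sketch of the no-argument case, assuming every row has exactly one 1. It does not use the PR's implementation; it only reproduces the behavior described above with idxmax, and the column name "categories" is taken from the draft description.

```python
import pandas as pd

# Dummy-coded frame from the example above: one column per category.
dummies = pd.DataFrame(
    {
        "a": [1, 0, 1, 0, 0],
        "b": [0, 1, 0, 0, 0],
        "c": [0, 0, 0, 0, 1],
        "d": [0, 0, 0, 1, 0],
    }
)

# idxmax(axis=1) returns, per row, the label of the column holding the 1.
# The column name "categories" follows the draft described above (assumption).
decoded = dummies.idxmax(axis=1).to_frame("categories")
print(decoded["categories"].tolist())
```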

subset -- REMOVED --

  • Select which columns of the input DataFrame to include for decoding.
  • None:
    • Assumes the entire passed DataFrame consists of dummy coded variables
  • List or Index:
    • Just returns the decoded subset passed in the list.
    • example:
      >>> dummies
         C    col1_a  col1_b  col2_a  col2_b  col2_c
      0  1       1       0       0       1       0
      1  2       0       1       1       0       0
      2  3       1       0       0       0       1
      
      >>> from_dummies(dummies, sep="_", subset=["col1_a", "col1_b", "col2_a", "col2_b", "col2_c"])
        col1    col2 
      0  a       b
      1  b       a 
      2  a       c

variables -- REMOVED --

  • Inverts the prefix argument of get_dummies
  • Using:
    >>> dummies
      col1_a  col1_b  col2_a  col2_b  col2_c
    0    1       0       0       1       0
    1    0       1       1       0       0
    2    1       0       0       0       1
  • None:
    • Variable names are taken from prefix.
    • example:
      >>> from_dummies(dummies, sep="_", variables=None)
        col1    col2 
      0   a       b
      1   b       a 
      2   a       c
  • str:
    • Variable names are numbered from the passed string.
    • example:
      >>> from_dummies(dummies, sep="_", variables="Varname")
         Varname1   Varname2 
      0     a          b
      1     b          a 
      2     a          c
  • List:
    • Variable names are obtained (in order) from the passed list.
    • example:
      >>> from_dummies(dummies, sep="_", variables=["One", "Two"])
          One     Two 
      0    a       b
      1    b       a 
      2    a       c
  • Dict:
    • Variable names are obtained from passed prefix to variable mapping.
    • Currently orders the returned columns by index in the passed dictionary, not sure if that should be changed to strictly follow the order of prefixes in the dummy-data.
    • example:
      >>> from_dummies(dummies, sep="_", variables={"col2": "One", "col1": "Two"})
          One     Two   
      0    b       a
      1    a       b
      2    c       a
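
As a rough illustration of the Dict case, the same renaming and ordering can be sketched on an already decoded frame with DataFrame.rename. The decoded frame below is built by hand and is hypothetical; this is not the code from the PR.

```python
import pandas as pd

# Hypothetical decoded result whose columns are still named after the prefixes.
decoded = pd.DataFrame({"col1": ["a", "b", "a"], "col2": ["b", "a", "c"]})

# Prefix-to-variable mapping, as in the Dict example above.
variables = {"col2": "One", "col1": "Two"}

# Rename, then order the columns by their position in the mapping.
renamed = decoded.rename(columns=variables)[list(variables.values())]
print(renamed.columns.tolist())
```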

sep

  • Separator used in the column names of the dummy categories: the character(s) separating the category names from the prefixes. For example, if your column names are prefix_A and prefix_B, specifying sep='_' splits them into the prefix and the categories A and B.
  • Required if prefixes are to be separated, as there is no default separator.
  • str:
    • Splits columns by first instance of passed string to obtain variables and elements.
    • example:
      >>> dummies
         col1_a  col1_b  col2_a  col2_b  col2_c
      0    1       0       0       1       0
      1    0       1       1       0       0
      2    1       0       0       0       1
      
      >>> from_dummies(dummies, sep="_")
         col1    col2 
      0   a       b
      1   b       a 
      2   a       c
  • list: -- REMOVED --
    • Splits columns by first instance of any of the separators passed in the list.
    • example:
      >>> dummies
        col1_a  col1_b  col2-a  col2-b  col2-c
      0   1       0       0       1       0
      1   0       1       1       0       0
      2   1       0       0       0       1
      
      >>> from_dummies(dummies, sep=["_", "-"]) 
         col1    col2 
      0   a       b
      1   b       a 
      2   a       c
  • dict: -- REMOVED --
    • Splits columns with prefix by first instance of corresponding separator.
    • example:
      >>> dummies
         col1_a-a  col1_b-b  col_2-a   col_2-b   col_2-c
      0     1         0         0         1         0
      1     0         1         1         0         0
      2     1         0         0         0         1
      
      >>> from_dummies(dummies, sep={"col1": "_", "col_2": "-"})
          col1     col_2 
      0    a-a       b
      1    b-b       a 
      2    a-a       c
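
The "first instance" rule for a str separator can be sketched in plain Python. The helper name group_by_prefix is hypothetical and only illustrates the splitting step, not the decoding itself.

```python
from collections import defaultdict


def group_by_prefix(columns, sep):
    """Group dummy column names by prefix, splitting on the first `sep` only."""
    groups = defaultdict(list)
    for col in columns:
        # maxsplit=1 splits at the first instance of the separator only;
        # a column name without the separator raises ValueError here.
        prefix, category = col.split(sep, 1)
        groups[prefix].append(category)
    return dict(groups)


columns = ["col1_a", "col1_b", "col2_a", "col2_b", "col2_c"]
print(group_by_prefix(columns, "_"))
# A category that itself contains the separator stays intact.
print(group_by_prefix(["col1_a_x", "col1_b"], "_"))
```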

dummy_na -- REMOVED --

  • mirrors the dummy_na argument of get_dummies
  • False but contains NaN:
    • Considers row with no assignments as NaN.
    • example:
      >>> dummies
         col1_a  col1_b  col2_a  col2_b  col2_c
      0    1       0       0       0       0
      1    0       1       1       0       0
      2    0       0       0       0       1
      
      >>> from_dummies(dummies, sep="_", dummy_na=False)
          col1    col2 
      0    a      NaN
      1    b       a 
      2   NaN      c
  • True:
    • Decodes rows assigned to the NaN dummy column as NaN.
    • example:
      >>> dummies
         col1_a  col1_b  col1_NaN  col2_a  col2_b  col2_c col2_NaN
      0    1       0        0        0       0       0       1
      1    0       1        0        1       0       0       0
      2    0       0        1        0       0       1       0
      
      >>> from_dummies(dummies, sep="_", dummy_na=True)
         col1    col2 
      0   a      NaN
      1   b       a 
      2  NaN      c

base_category

  • Inverts the drop_first argument of get_dummies
  • Using:
    >>> dummies
       col1_a  col1_b  col2_a  col2_b  col2_c
    0     1       0       0       0       0
    1     0       1       1       0       0
    2     0       0       0       0       1
  • None: -- REMOVED --
    • Without dummy_na=True, unassigned rows are considered NaN.
    • example:
      >>> from_dummies(dummies, sep="_", base_category=None)
         col1    col2 
      0    a      NaN
      1    b       a 
      2   NaN      c
  • str:
    • Treats every row without an assignment as belonging to the passed base category.
    • example:
      >>> from_dummies(dummies, sep="_", base_category="x")
         col1    col2 
      0    a       x
      1    b       a 
      2    x       c
  • List: -- REMOVED --
    • The base categories for variables are obtained (in order) from the passed list.
    • example:
      >>> from_dummies(dummies, sep="_", base_category=["x", "y"])
         col1    col2 
      0    a       y
      1    b       a 
      2    x       c
  • Dict:
    • The base categories for variables are obtained from the passed prefix to base category mapping.
    • example:
      >>> from_dummies(dummies, sep="_", base_category={"col2": "x", "col1": "y"})
         col1    col2 
      0    a       x
      1    b       a 
      2    y       c
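
The Dict variant of base_category can be sketched as follows. decode_with_base is a hypothetical helper written for this illustration, not the function from the PR, and it assumes each row has at most one 1 per prefix.

```python
import pandas as pd

dummies = pd.DataFrame(
    {
        "col1_a": [1, 0, 0],
        "col1_b": [0, 1, 0],
        "col2_a": [0, 1, 0],
        "col2_b": [0, 0, 0],
        "col2_c": [0, 0, 1],
    }
)


def decode_with_base(data, sep, base_category):
    """Hypothetical helper: decode each prefix group, filling unassigned rows."""
    # dict.fromkeys keeps the prefixes unique and in first-seen order.
    prefixes = dict.fromkeys(col.split(sep, 1)[0] for col in data.columns)
    out = {}
    for prefix in prefixes:
        cols = [c for c in data.columns if c.startswith(prefix + sep)]
        sub = data.loc[:, cols]
        # Strip "prefix<sep>" so only the category name remains.
        cats = sub.idxmax(axis=1).str[len(prefix + sep):]
        # Rows without any 1 get the base category for this prefix.
        out[prefix] = cats.where(sub.sum(axis=1) > 0, base_category[prefix])
    return pd.DataFrame(out)


decoded = decode_with_base(dummies, "_", {"col2": "x", "col1": "y"})
print(decoded)
```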

fillna -- REMOVED --

  • Using:
    >>> dummies
       col1_a  col1_b  col2_a  col2_b  col2_c
    0     1       0       0      NaN      0
    1     0       1       1       0       0
    2     0      NaN      0       0       1
  • True:
    • Can result in double assignment, which will raise the respective error.
    • example:
      >>> from_dummies(dummies, sep="_", fillna=True)
         col1    col2 
      0    a       b
      1    b       a 
      2    b       c
  • False:
    • Can result in an unassigned row, which is treated like any other unassigned row.
    • example:
      >>> from_dummies(dummies, sep="_", fillna=False)
         col1    col2 
      0    a      NaN
      1    b       a 
      2   NaN      c

To Discuss:

  1. Change, remove, or add options?

To-Do

  • Add handling of NA values in input
  • Adapt discussed changes (C1, C2)
  • Expand tests
  • Work on optimization (O1)
  • Raise relevant errors
  • Whatsnew entry

@MarcoGorelli
Member

Wow, some serious effort here!

The examples you've put in the body of the PR are kind of hard to follow - perhaps if you write them from a Python REPL they're easier to understand?

e.g.:

>>> dummies
   col_a  col_b  col_c
0      1      0      0
1      0      1      0
2      0      0      1
3      1      0      0
>>> pd.from_dummies(dummies)
   col 
0    a
1    b
2    c
3    a

)

cat_data = {var: [] for _, var in variables.items()}
for index, row in data.iterrows():
Member

Iterating over rows in Python will be too slow - can you have a look at how the (now closed) PR did it?

Contributor Author

@pckSF pckSF Jun 30, 2021

Removed the row iteration. At the moment this has resulted in a problem with NaN values in the output DF, which I am currently looking into. I can mirror the method of the old PR if it is more efficient (or if it provides an easy solution for the NaN issue).
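
For illustration only, here is a sketch contrasting row iteration with a vectorized decode. It is not the code from this PR or from the closed one. It also shows one way NaN handling can go wrong in a vectorized version: idxmax returns the first column for an all-zero row, so unassigned rows need an explicit mask. Whether this is the NaN issue mentioned above is an assumption.

```python
import numpy as np
import pandas as pd

dummies = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 0]})

# Row-by-row decoding: simple, but a Python-level loop over every row.
slow = []
for _, row in dummies.iterrows():
    hits = row[row == 1].index
    slow.append(hits[0] if len(hits) else np.nan)

# Vectorized decoding. idxmax alone would label the all-zero row "a",
# so unassigned rows are masked to NaN explicitly.
fast = dummies.idxmax(axis=1).where(dummies.sum(axis=1) > 0)

print(fast.tolist())
```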

@MarcoGorelli
Member

This is still marked as "draft" - just checking, in case you think it's ready for review

@pckSF
Contributor Author

pckSF commented Jul 3, 2021

This is still marked as "draft" - just checking, in case you think it's ready for review

Good point, I will write the documentation etc. and add more tests for the features we decided to keep, then I will set it ready for review.

@pckSF pckSF marked this pull request as ready for review July 19, 2021 22:52
@pckSF
Contributor Author

pckSF commented Jul 19, 2021

There are still some work-in-progress points, such as the user guide and more tests, but these will incorporate the incoming feedback.

Member

@mroeschke mroeschke left a comment

Looks fairly good, just some code checks failing in the CI

Contributor

@jreback jreback left a comment

minor comments, ping on green.

@pandas-dev/pandas-core if any objections here?

@jreback
Contributor

jreback commented Jun 5, 2022

@jreback
Contributor

jreback commented Jun 6, 2022

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build
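
For reference, a sketch of a See Also section in the numpydoc layout, where each entry has the form "name : description". An entry consisting only of a description, such as 'DataFrame of dummy variables.', has no name to parse; that this was the cause of the failure here is an assumption. The functions encode and decode are hypothetical.

```python
def encode(data):
    """
    Hypothetical encoder, used only to illustrate the section layout.

    See Also
    --------
    decode : Convert dummy codes back to categorical values.
    """
    return data


def decode(data):
    """Hypothetical inverse of `encode`."""
    return data
```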

@pckSF
Contributor Author

pckSF commented Jun 6, 2022

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build

Jep, it seems like I always break something when trying to fix something else ... I will stay on it until everything is green.

@jreback
Contributor

jreback commented Jun 6, 2022

exception: Error parsing See Also entry 'DataFrame of dummy variables.' in the docstring of get_dummies in /home/runner/work/pandas/pandas/pandas/core/reshape/encoding.py.)

still something with the doc build

Jep, it seems like I always break something when trying to fix something else ... I will stay on it until everything is green.

:-> also can try to build that page locally and isolate things

Member

@mroeschke mroeschke left a comment

Thanks LGTM. Timeout and doc failure are unrelated

@mroeschke
Member

@bashtage @MarcoGorelli @fangchenli When you have the chance to review and approve if all looks good, it would be appreciated.

Member

@MarcoGorelli MarcoGorelli left a comment

Sorry, one more thing I hadn't picked up on before

Member

@MarcoGorelli MarcoGorelli left a comment

Awesome, looks good to me, and thanks for your patience here

Only slight concern I'd have is using data_to_decode[prefix_slice] instead of data_to_decode.loc[:, prefix_slice]: I just remembered there being some cases when selecting columns using __getitem__ doesn't work as one would expect, although I haven't been able to construct one that would fail here

But if that's not an issue, then this looks good to me, happy to see it land!
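
One known case where the two spellings differ is a boolean key: __getitem__ treats it as a row mask, while .loc[:, key] applies it to the columns. The sketch below uses a made-up frame; what prefix_slice actually holds in this PR is not stated here, and a plain list of column labels selects the same columns either way.

```python
import pandas as pd

df = pd.DataFrame({"col1_a": [1, 0], "col1_b": [0, 1]})
mask = [True, False]

by_getitem = df[mask]      # boolean list filters ROWS
by_loc = df.loc[:, mask]   # boolean list filters COLUMNS

print(by_getitem.shape)  # (1, 2)
print(by_loc.shape)      # (2, 1)
```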

@MarcoGorelli MarcoGorelli requested a review from bashtage June 25, 2022 10:40
@pckSF
Contributor Author

pckSF commented Jun 25, 2022

.loc[:, prefix_slice]

I just changed that real quick to be on the safe side, also thanks for your and the team's patience :)

@jreback jreback merged commit ed55bdf into pandas-dev:main Jun 30, 2022
@jreback
Contributor

jreback commented Jun 30, 2022

thanks @pckSF really nice

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@mroeschke mroeschke mentioned this pull request Aug 15, 2022
Labels
Categorical Categorical Data Type Enhancement