-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial draft: from_dummies #41902
Initial draft: from_dummies #41902
Conversation
Wow, some serious effort here! The examples you've put in the body of the PR are kind of hard to follow - perhaps if you write them from a Python REPL they're easier to understand? e.g.: >>> dummies
col_a col_b col_c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
>>> pd.from_dummies(dummies)
col
0 a
1 b
2 c
3 a |
pandas/core/reshape/reshape.py
Outdated
) | ||
|
||
cat_data = {var: [] for _, var in variables.items()} | ||
for index, row in data.iterrows(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterating over rows in Python will be too slow - can you have a look at how the (now closed) PR did it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the row iteration. At the moment this resulted in a problem with NaN values in the output DF which I am currently looking into.. I can mirror the method of the old PR if its method is more efficient (or if it provides an easy solution for the NaN issue).
This is still marked as "draft" - just checking, in case you think it's ready for review |
Good point, I will write the documentation etc. and add more tests for the features we decided to keep, then I will set it ready for review. |
There are still some work in progress points, such as user guide and more tests etc., but these will incorporate the incoming feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fairly good, just some code checks failing in the CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, ping on green.
@pandas-dev/pandas-core if any objections here?
@pckSF some doc-string validation issues: https://github.com/pandas-dev/pandas/runs/6738570255?check_suite_focus=true |
still something with the doc build |
Jep, it seems like I always break something when trying to fix something else ... I will stay on it until everything is green. |
:-> also can try to build that page locally and isolate things |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks LGTM. Timeout and doc failure are unrelated
@bashtage @MarcoGorelli @fangchenli When you have the chance to review and approve if all looks good, it would be appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one more thing I hadn't picked up on before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, looks good to me, and thanks for your patience here
Only slight concern I'd have is using
data_to_decode[prefix_slice]
instead of
data_to_decode.loc[:, prefix_slice]
, I just remembered there being some cases when selecting columns using __getitem__
doesn't work as one would expect, although I haven't been able to construct one that would fail here
But if that's not an issue, then this looks good to me, happy to see it land!
I just changed that real quick to be on the safe side, also thanks for your and the teams patience :) |
thanks @pckSF really nice |
Converts dummy codes to categorical variables.
Tests will be expaneded with composite options and edgecases as soon as definition of final scope is complete.
I tried to mirror theget_dummes
function wherever possible to ensure an as close inverse of the function as possiblePrimary goal of the draft is to demonstrate the options I had in mind, optimization is the next step as soon as we decided what we want to keep, cut, or add.
(Docstrings etc. will be added based on results)
Integration as method of
categorical
, as proposed in #34426, can be considerd as a next step if moving in that direction is planed.Current Options:
no arguments
categories
subset -- RMOEVED --
DataFrame
to include for decoding.None
:DataFrame
consists of dummy coded variablesList
orIndex
:variables -- REMOVED --
prefix
argument ofget_dummies
None
:str
:List
:Dict
:sep
prefix_A
andprefix_B
, you can strip the underscore by specifyingsep='_'
.str
:list
: -- REMOVED --dict
: -- REMOVED --dummy_na -- REMOVED --
dummy_na
argument ofget_dummies
False
but contains NaN:True
:base_category
drop_first
argument ofget_dummies
None
: -- REMOVED --dummy_na=True
non-asigned rows are considered NaN.str
:List
: -- REMOVED --Dict
:fillna -- REMOVED --
True
:>>> from_dummies(dummies, sep="_", fillna=True) col1 col2 0 a b 1 b a 2 b c
False
:>>> from_dummies(dummies, sep="_", fillna=False) col1 col2 0 a NaN 1 b a 2 NaN c
To Discuss:
To-Do
C1,C2)