-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical.(get|from)_dummies #34426
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks @clbarnes !
pandas/core/arrays/categorical.py
Outdated
@@ -379,6 +382,110 @@ def __init__( | |||
self._dtype = self._dtype.update_dtype(dtype) | |||
self._codes = coerce_indexer_dtype(codes, dtype.categories) | |||
|
|||
@classmethod | |||
def from_dummies(cls, dummies: "DataFrame", ordered=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate ordered
, as well as the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a preference for how to annotate the self return type? As I understand it, you can do it with a TypeVar
, with a string, or by using from __future__ import annotations
(although only in 3.7+).
pandas/core/arrays/categorical.py
Outdated
codes = (df.astype(int) * mult_by).sum(axis=1) - 1 | ||
codes[codes.isna()] = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have misunderstood what's happening here, but would it work to use pd.factorize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing what factorize
does, but I don't think it would save any effort here - factorize wants a 1D array (which this isn't until codes
is instantiated), and then doesn't produce a Categorical
unless given one, so the last line wouldn't change either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK yes, I was thinking something like
codes, _ = factorize(df.idxmax(axis=1))
, but then it'd be necessary to deal with the case when there's a row without any ones, which'd require more code and would be no shorter/clearer than what you've already got. So, ignore the factorize
suggestion :)
Could of things then:
- is
astype(int)
necessary? It should work without it - perhaps there could be a comment explaining why you do
-1
at the end of this line
codes = (df.astype(int) * mult_by).sum(axis=1) - 1
?
Now that I've run the code, it's obvious what it does, but it wasn't when I was just reading it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the astype
wasn't necessary. There's a comment now, best way I could think of explaining it but I can go for prose if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will look soon
likely would want a doc-example of both of these near in https://pandas.pydata.org/docs/dev/user_guide/reshaping.html#computing-indicator-dummy-variables and/or links to categorical.rst |
52b6900
to
5ae8517
Compare
pandas/core/arrays/categorical.py
Outdated
-------- | ||
:func:`pandas.get_dummies` | ||
""" | ||
from pandas import DataFrame, CategoricalIndex, Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not using the existing get_dummies
implementation? This shold be identical to pd.get_dummies(Categorical)
, right?
pandas/core/arrays/categorical.py
Outdated
codes[codes.isna()] = -1 | ||
return cls.from_codes(codes, df.columns.values, ordered=ordered) | ||
|
||
def to_dummies(self, na_column=None) -> "DataFrame": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be called get_dummies
I don't think the justification for deviating from that name is strong enough to warrant a different name.
I also think it should exactly match the signature of get_dummies
.
pandas/core/arrays/categorical.py
Outdated
A column whose header is NA will be dropped; | ||
any row with a NA value will be uncategorised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need examples showing these.
pandas/core/arrays/categorical.py
Outdated
|
||
Parameters | ||
---------- | ||
dummies : DataFrame of bool-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the type on the first line. The structure can go on the second line.
def test_from_dummies_gt1(self): | ||
# GH 8745 | ||
dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"]) | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure we're raising an informative error message here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
|
||
def test_from_dummies_nan(self): | ||
# GH 8745 | ||
raw = ["a", "a", "b", "c", "c", "a", np.nan] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is nan
the only supported NA value that's dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()]
, and update parametrize this test over [np.nan, pd.NA, pd.NaT, None]
.
@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self): | |||
c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object)) | |||
expected_index = pd.Index([("a", "b"), ("b", "a"), "c"]) | |||
assert c.categories.equals(expected_index) | |||
|
|||
def test_from_dummies(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize this over sparse = True/False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments @TomAugspurger ! I've addressed most of them but think the to_dummies
and get_dummies
shared functionality is probably worth discussion on the main thread.
|
||
def test_from_dummies_nan(self): | ||
# GH 8745 | ||
raw = ["a", "a", "b", "c", "c", "a", np.nan] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll change the selection of columns to drop to be dummies.columns[dummies.columns.isna()]
, and update parametrize this test over [np.nan, pd.NA, pd.NaT, None]
.
@@ -643,3 +645,55 @@ def test_constructor_string_and_tuples(self): | |||
c = pd.Categorical(np.array(["c", ("a", "b"), ("b", "a"), "c"], dtype=object)) | |||
expected_index = pd.Index([("a", "b"), ("b", "a"), "c"]) | |||
assert c.categories.equals(expected_index) | |||
|
|||
def test_from_dummies(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of simplicity I have not made an attempt to support loading from a sparse array but I can give it a go!
def test_from_dummies_gt1(self): | ||
# GH 8745 | ||
dummies = DataFrame([[1, 0, 1], [0, 1, 0], [0, 0, 1]], columns=["a", "b", "c"]) | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
@TomAugspurger My initial implementation was a thin wrapper around I do feel more strongly about keeping the name and API different to My justifications for not supporting all of the Would be interested for other parties to weigh in. |
Yeah, your reasons for
In my opinion, 2 wins out.
That would indeed be helpful. Which parameters / behaviors are especially hard to support? I'm happy to raise a |
Some operations, like regression and classification, | ||
encodes a single categorical variable as a column for each category, | ||
with each row having False in all but one column (True). | ||
These are called dummy variables, or one-hot encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add a link to a wiki page (or scikit lean ok too)
|
||
The :meth:`pandas.Categorical.from_dummies` class method accepts a dataframe | ||
whose dtypes are coercible to boolean, and an ``ordered`` argument | ||
for whether the resulting ``Categorical`` should be considered ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to have some good cross links to the current get_dummies section
otherwise this is very confusing
i would prefer that these are actually in the get_dummies with just a small note here
@@ -379,6 +382,137 @@ def __init__( | |||
self._dtype = self._dtype.update_dtype(dtype) | |||
self._codes = coerce_indexer_dtype(codes, dtype.categories) | |||
|
|||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with the others
either remove this or make it identical to get_dummies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the discussion around this was more about the necessity/ implementation of to_dummies
- there currently isn't an equivalent of from_dummies
in pandas, which is what originally precipitated the issue. But note taken re. to_dummies
if that was the intention.
pandas/core/arrays/categorical.py
Outdated
# 010 020 2 1 | ||
# 001 * 1,2,3 => 003 -> 3 -> 2 = correct codes | ||
# 100 100 1 0 | ||
codes = ((df * mult_by).sum(axis=1, skipna=False) - 1).astype("Int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually a memory heavy impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I guess the alternative would be something like
codes = pd.Series(np.full(len(df), np.nan), dtype="Int64")
row_totals = df.sum(axis=1, skipna=False)
... # multicat_rows check
codes[row_totals == 0] = -1
row_idx, code = np.nonzero(row_totals > 0)
codes[row_idx] = code
@clbarnes Is this still active? Can you fix merge conflicts if so? |
Sorry I haven't been able to get back to this, I'll see what I can do today. The tl;dr of the changes requested seems to be
|
b857b9b
to
3f9573c
Compare
I'm calling |
Still to do:
|
pandas/core/arrays/categorical.py
Outdated
1 0.0 1.0 0.0 | ||
2 0.0 0.0 1.0 | ||
""" | ||
# Would be better to use pandas.core.reshape.reshape._get_dummies_1d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't leave comments like this. why cant you use _get_dummies_1d ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing a function with a leading underscore fails a lint. It's one of the home-rolled lints, so it can't be ignored with # noqa
.
if fillna is not None: | ||
df = df.fillna(fillna, inplace=copied) | ||
|
||
row_totals = df.sum(axis=1, skipna=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why skipna?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no explicit fillna
policy given, and there are still NA values in the data, I'd prefer to raise an error rather than silently pass bunk data through. Therefore nans should not be skipped in this step so that they can be checked for in the next line.
|
||
row_totals = df.sum(axis=1, skipna=False) | ||
if row_totals.isna().any(): | ||
raise ValueError("Unhandled NA values in dummy array") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some holes left in the tests, on the to do list
Simplistic implementation to go between dummy variables and Categoricals.
765f001
to
0facec6
Compare
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@clbarnes how's this going? If you're interested in continuing can you merge master & resolve conflicts and address comments? |
Sure, will do - did the first 90% but things got busy before I could tackle the last 90%. I'll try to get to this next week. |
Sounds good (& no rush ofc). Ping us when you're ready for the next round of reviews |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@clbarnes still active? I think this was pretty close. |
Thanks for the PR, but it appears to have gotten stale. Going to close for now but let us know if you can merge master and update and we'd be happy to reopen. |
Simply converting categorical variables to and from dummy variables.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Intentionally smaller-scoped than #31795 (and indeed
get_dummies
) as a broadly useful MVP which can be chained with other basic functionality. The tests are fairly rudimentary and I welcome any edge cases which should be picked out.For discussion
from_dummies
to_dummies
Categorical.to_dummies
instead of matching the free functionget_dummies
. The symmetry of to/from aids understanding, and usingget_
might imply A) that something is being got, which it isn't or B) signature/ feature parity with the existing method, which wasn't a design goal for me.to_dummies
return type:cls.to_dummies
returns bools, whereget_dummies
returns uint8s by default, which doesn't make a lot of sense to me as we are representing boolean data (and they're the same in memory anyway). Dummy variables are generally used for regression where a continuous variable is required, so ints don't get us any closer to what we may want, and being able to index into the categories using the row may be valuable.dtype
argument: I didn't see any benefit ofcls.to_dummies(dtype=float)
overcls.to_dummies().astype(float)
. The latter is more explicit, no slower, and minimises API surface.prefix
,prefix_sep
arguments: these unnecessarily assume string column headers. If someone wants to rename their columns, they can: IMO it's not a core requirement of this method.dumma_na
replaced withna_column
: if we're including the argument, we may as well let the user decide what they want to call their column, using the dtype they prefer (e.g."other"
,-1
etc). They can always supplynp.nan
if they wantget_dummies
-like behaviour.sparse
argument: This I regret. Producing a sparse array would be valuable. However, it would drastically complicate the method, so I left it out for the MVP. The caller can always sparsify it after it's produced, so long as RAM isn't an issue for the temporary df.drop_first
argument: If someone wants to drop one of their columns, they can (cls.to_dummies().drop(columns="my_col")
): again, not a core requirement of this method. Broadly speaking, I'm not in favour of adding arguments to save the caller <=1 line of their own code unless there are e.g. speed gains.