Add normalization to crosstab #12578
fbc15c5
to
0f835b7
Compare
Thanks for the PR:) As pointed out in #12569, I prefer adding new
@sinhrks Yes, combining an (Or rather, I added two exceptions -- one for passing Since I can't think of an actual function you could pass to
@nickeubank, no you simply allow string args to what @sinhrks suggests prob makes the most sense
    table = table / table.sum(axis=1).sum(axis=0)

elif normalize == 'rows':
    table = table.apply(lambda x: x / x.sum(), axis=1)
in any event this is just: table/x.sum(1). we almost never actually want to use .apply
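A minimal sketch of the point about avoiding `.apply`, assuming a small made-up frequency table (the data and variable names are illustrative, not from the PR). It shows that dividing each row by its row sum with `DataFrame.div` gives the same result as the `.apply` version in the diff:

```python
import numpy as np
import pandas as pd

# Hypothetical frequency table, for illustration only.
table = pd.DataFrame({3: [1, 0], 4: [2, 3]}, index=[1, 2])

# Row normalization via apply, as written in the diff under review.
via_apply = table.apply(lambda x: x / x.sum(), axis=1)

# Vectorized alternative: divide each row by its own row sum.
via_div = table.div(table.sum(axis=1), axis=0)

# Both approaches produce the same row shares.
assert np.allclose(via_apply.values, via_div.values)
```

The explicit `axis=0` in `div` matters here: it aligns the row sums with the row index rather than with the column labels.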
@sinhrks little better?
If no values array is passed, computes a frequency table
If 'pct' is passed, will normalize frequency table.
add a versionadded tag
It may be better to describe that normalization is performed for the entire level (not for sub-levels) in the MultiIndex case.
I think it is nice if
I am slightly -1 on adding this functionality to
Thanks @jorisvandenbossche -- I think that's well said. Perhaps we should solicit a few more opinions to see if we can move towards consensus on this?
@jorisvandenbossche soln seems reasonable. you might want to enable
That would be my preference
+1
My opinion is based on the understanding that
I don't have strong opposition against @jorisvandenbossche's option. One concern is that the API gets more complex than required. It's less likely that anyone needs to normalize values other than count and sum.
That's what I said above as well.
Indeed, in general I am very reluctant to add new keyword arguments. But in this case, I think it makes it more complex to explain what
That's indeed true.
Sounds like we're agreed I think my main two thoughts are:
Regarding complexity, I think of crosstab as a tool that may be used to generate analysis outputs to potentially put in papers (that's my interest at least). Given that, I think making it as flexible and powerful as possible is highly desirable, even at the cost of an extra keyword.
so we are talking about doing these ops:
This kind of feels like a post-processing step rather than a result of a single operation.
You can indeed rather simply do this after
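For reference, a rough sketch of what normalizing as a post-processing step could look like, assuming a toy DataFrame with columns `a` and `b` (made up for illustration). It is not the PR's implementation, just the three divisions done by hand on a plain `pd.crosstab` result:

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": [3, 4, 3, 4, 4]})
counts = pd.crosstab(df["a"], df["b"])

# Share of the grand total.
overall = counts / counts.values.sum()

# Share within each row (each row sums to 1).
by_row = counts.div(counts.sum(axis=1), axis=0)

# Share within each column (each column sums to 1).
by_col = counts.div(counts.sum(axis=0), axis=1)
```

This only covers the plain case; how subtotals from `margins=True` should be treated is a separate question that the hand-rolled version does not address.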
yeah, just thinking if we have a kw (which is fine), then we should explain what it is actually doing in the doc-string, as it's not completely obvious. another option is maybe to actually have a
Yes, it's post-processing. Indeed, that's how it's implemented. The one complication with a stand-alone normalize is that it can't be easily designed to deal with the
@nickeubank well one could argue the So that's really another abstraction and just being shoved into the current one.
e7e8c19
to
300ed92
Compare
c70f569
to
39718b8
Compare
                        dtype='int64'),
                        columns=pd.Index([3, 4], name='b'))
calculated = pd.crosstab(df.a, df.b, values=df.c, aggfunc='count',
                         normalize=True)
something like this: just loop over a list, as the tests are then clearer
for arg in ['index', True, 'columns']:
    result = ....
    tm.assert_frame.....
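One way the loop-style test suggested above might look, as a sketch only. It assumes a current pandas where `pandas.testing.assert_frame_equal` is available, and uses made-up data plus hand-computed expectations (the `expected` dict is hypothetical, not taken from the PR's test file):

```python
import pandas as pd
import pandas.testing as tm

# Toy data, for illustration only.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": [3, 4, 3, 4, 4]})
counts = pd.crosstab(df["a"], df["b"])

# Expected frames computed by hand from the raw counts.
expected = {
    "index": counts.div(counts.sum(axis=1), axis=0),
    "columns": counts.div(counts.sum(axis=0), axis=1),
    True: counts / counts.values.sum(),
}

# One loop covers every normalize option instead of three copied tests.
for arg in ["index", True, "columns"]:
    result = pd.crosstab(df["a"], df["b"], normalize=arg)
    tm.assert_frame_equal(result, expected[arg])
```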
this too
looks pretty good.
@@ -18,8 +18,7 @@ Highlights include:
New features
~~~~~~~~~~~~

- ``pd.crosstab()`` has gained ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in updated docs :ref:`here <reshaping.crosstabulations>`.
why don't you add a mini-example here (same one)
@jreback. Thanks. I'm going to be traveling away from my computer for a
little over two weeks. I'm not sure what the timing is for 0.18.1. I'm
happy to make changes when I get back, but won't be able to do anything
till then.
On Mon, Apr 4, 2016 at 10:59 AM Jeff Reback wrote, in doc/source/whatsnew/v0.18.1.txt:
> why don't you add a mini-example here (same one)
@nickeubank when you are back, pls rebase. This looked pretty good.
39718b8
to
5d27469
Compare
@jreback rebased!
5d27469
to
d9764ec
Compare
@jreback tweaked
d9764ec
to
69df06c
Compare
@@ -402,6 +404,9 @@ It takes a number of arguments
- ``colnames``: sequence, default None, if passed, must match number of column
  arrays passed
- ``margins``: boolean, default False, Add row/column margins (subtotals)
- ``normalize``: boolean, {'all', 'index', 'columns'}, or {0,1}, default False.
  Normalize by dividing all values by the sum of values.
Very small detail, but I think there is one space too much at the beginning of this line
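To make the accepted values in that docs line concrete, here is a small sketch using made-up data; the helper `shares` is hypothetical and only wraps `pd.crosstab`. It checks, under the assumption that the documented behaviour holds, that `True` matches `'all'` and that `0` and `1` behave like `'index'` and `'columns'`:

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": [3, 4, 3, 4, 4]})


def shares(normalize):
    """Hypothetical helper: crosstab of a vs b with the given normalize value."""
    return pd.crosstab(df["a"], df["b"], normalize=normalize)


# True and 'all' both divide every cell by the grand total.
assert np.allclose(shares(True), shares("all"))

# 'index' makes each row sum to 1; 'columns' makes each column sum to 1.
assert np.allclose(shares("index").sum(axis=1), 1.0)
assert np.allclose(shares("columns").sum(axis=0), 1.0)

# 0 and 1 are the axis-number spellings of 'index' and 'columns'.
assert np.allclose(shares(0), shares("index"))
assert np.allclose(shares(1), shares("columns"))
```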
69df06c
to
f2474c3
Compare
@jorisvandenbossche all integrated

# Actual Normalizations
normalizers = {
    False: lambda x: x,
Never used, correct?
f2474c3
to
c4b5847
Compare
@sinhrks ok! updated
c4b5847
to
e5015f8
Compare
thanks @nickeubank
Closes #12569
Note does NOT address #12577