
Add normalization to crosstab #12578

Closed

Conversation

nickeubank
Contributor

Closes #12569

Note: does NOT address #12577.

@nickeubank nickeubank force-pushed the improve_crosstab branch 2 times, most recently from fbc15c5 to 0f835b7 Compare March 9, 2016 22:00
@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 9, 2016
@sinhrks
Member

sinhrks commented Mar 10, 2016

Thanks for the PR :)

As pointed out in #12569, I prefer adding a new aggfunc over a new option (normalize). For example, isn't aggfunc="std" combined with normalize=True meaningless?

@nickeubank
Contributor Author

@sinhrks Yes, combining an aggfunc with normalize would be a problem, which is why I added an exception for that.

(Or rather, I added two exceptions: one for passing aggfunc without values, which I think should raise since aggfunc is otherwise silently ignored, and one for passing normalize with values. Together, they ensure no one tries to combine normalize with values or an aggfunc.)

I can't think of an actual function you could pass to aggfunc that would do row or column normalization (unless I'm mistaken?), so I'm not in favor of treating it like just a special aggfunc. Since it does something you can't do with an aggfunc, it seems like it should stand alone.

@jreback
Contributor

jreback commented Mar 10, 2016

@nickeubank, no, you simply allow string args like aggfunc='normalize' or whatever; it's less confusing (and you can of course always pass an actual function).

What @sinhrks suggests probably makes the most sense.

pct, pct_row, pct_col; pretty clear what they mean (and the doc-string will reinforce these). These are convenience aggregators.

    table = table / table.sum(axis=1).sum(axis=0)

elif normalize == 'rows':
    table = table.apply(lambda x: x / x.sum(), axis=1)
Contributor

in any event this is just a vectorized division of the table by its row sums; we almost never actually want to use .apply
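
For reference, a minimal sketch of the vectorized equivalents (not taken from the PR; note that the row-wise case needs .div(..., axis=0) so the row sums align on the index):

import pandas as pd

table = pd.DataFrame({3: [1, 1], 4: [0, 3]}, index=[1, 2])

# Row-normalize without .apply: divide each row by its row sum.
row_normalized = table.div(table.sum(axis=1), axis=0)

# Column-normalize: plain division aligns the column sums on the columns.
col_normalized = table / table.sum(axis=0)

# Normalize over the whole table.
all_normalized = table / table.values.sum()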

@nickeubank
Contributor Author

@jreback @sinhrks ok, will update!

@nickeubank
Contributor Author

@sinhrks a little better?

If no values array is passed, computes a frequency table.
If 'pct' is passed, will normalize the frequency table.
Contributor

add a versionadded tag

Member

It may be better to describe that normalization is performed for the entire level (not for sub-levels) in the MultiIndex case.

@sinhrks
Member

sinhrks commented Mar 14, 2016

I think it would be nice if pct_xxx could normalize the sum of values when values is specified (and all of them are positive).

import numpy as np
import pandas as pd

a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])
pd.crosstab(a, b, c, aggfunc=np.sum)
# col_0   1   2
# row_0        
# 1      10  20
# 2      40  80

pd.crosstab(a, b, c, aggfunc='pct_cols')
# col_0   1    2
# row_0        
# 1     0.2  0.2
# 2     0.8  0.8

@jorisvandenbossche
Member

I am slightly -1 on adding this functionality to aggfunc:

  • First, having both the possibility of passing a func or a string (one that does not map directly to an existing function) seems a bit unpythonic, and it really concerns two different questions: how to aggregate multiple values, and whether or not to normalize the final result.
  • Currently, you can already pass strings to aggfunc to specify the aggregation function (like in groupby.agg), so the current implementation checking for those specific strings will break that functionality.
    E.g. something like pd.crosstab(index, columns, values, aggfunc='sum')
  • value_counts already has the normalize keyword, so there is a precedent for this (and I think it is equivalent functionality, so it would be more consistent to use that?)
  • As @sinhrks just mentioned, you may also want to normalize the result of an aggregation, e.g. pd.crosstab(index, columns, values, aggfunc='sum', normalize=True) (see the sketch below). That would not be possible with the currently proposed API.
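
For illustration, a minimal sketch of that last combination, reusing @sinhrks's arrays from above (this assumes the normalize keyword behaves as it does in later pandas releases; it is not code from the PR):

import numpy as np
import pandas as pd

a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])

# Aggregate the values with sum, then express each cell as a share of the
# grand total (10 + 20 + 40 + 80 = 150) instead of as raw sums.
shares = pd.crosstab(a, b, values=c, aggfunc='sum', normalize=True)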

@nickeubank
Contributor Author

Thanks @jorisvandenbossche, I think that's well said. Perhaps we should solicit a few more opinions to see if we can move towards consensus on this?

@jreback
Contributor

jreback commented Mar 14, 2016

@jorisvandenbossche's solution seems reasonable.

You might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> pct_cols.
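
For concreteness, a minimal sketch of how such a normalize argument could dispatch to the three operations (illustrative only; the function name and exact structure are not taken from the PR):

import pandas as pd

def normalize_table(table, normalize):
    """Normalize a crosstab-style frame over all cells, rows, or columns."""
    normalizers = {
        'all': lambda t: t / t.values.sum(),
        'index': lambda t: t.div(t.sum(axis=1), axis=0),   # row percentages
        'columns': lambda t: t / t.sum(axis=0),            # column percentages
    }
    normalizers[True] = normalizers['all']
    if normalize not in normalizers:
        raise ValueError("normalize must be True, 'all', 'index' or 'columns'")
    return normalizers[normalize](table)

Under this mapping, normalize=True corresponds to the pct behaviour above, 'index' to pct_rows, and 'columns' to pct_cols.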

@nickeubank
Contributor Author

That would be my preference

@jorisvandenbossche
Member

you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'

+1

@sinhrks
Member

sinhrks commented Mar 15, 2016

My opinion is based on the understanding that crosstab is similar to pivot_table. Thus:

  • Because pivot_table accepts str, I think it is OK to add other str options to aggfunc.
  • aggfunc="pct", "pct_cols", "pct_rows" would output normalized counts if values is not specified, and normalized values if values is specified.
  • I don't distinguish aggregate/normalize, because an aggfunc can be defined for normalization (mean is (category sum)/(category count), pct is (category sum)/(total count)).

I don't have a strong objection to @jorisvandenbossche's option. One concern is that the API gets more complex than required. It's unlikely you'd want to normalize values other than counts and sums.

@jorisvandenbossche
Member

Because pivot_table accepts str, I think it is OK to add other str options to aggfunc

That's what I said above as well. crosstab's aggfunc already accepts strings, just as groupby().agg() accepts strings. But those are strings that match an existing function/method; that is not the case with 'pct_cols'/'pct_rows'.

One concern is API gets complex more than required.

Indeed, in general I am very reluctant to add new keyword arguments. But in this case, I think overloading aggfunc makes it more complex to explain what aggfunc does. It seems cleaner to separate the two notions.

It's less likely to normalize other values than count and sum.

That's indeed true.

@nickeubank
Contributor Author

Sounds like we're agreed that aggfunc can take strings (I'll remove that exception!).

I think my two main thoughts are:

  • There is no function that could do row-normalization. Since it's functionally doing something different from anything an explicit aggfunc could do, it should be its own option, even though I agree with @sinhrks that there are few situations where you'd want to normalize the output of a crosstab generated with a custom aggfunc. Indeed, other programs like Stata that do something similar (accept aggregation functions for tables) also split normalization out into its own syntax.
  • As @jorisvandenbossche has pointed out, normalize is already a keyword in a core function (value_counts), so I think having it is pretty consistent with the API.

Regarding complexity, I think of crosstab as a tool that may be used to generate analysis outputs to put in papers (that's my interest at least). Given that, I think making it as flexible and powerful as possible is highly desirable, even at the cost of an extra keyword.

@jreback
Contributor

jreback commented Mar 15, 2016

so we are talking about doing these ops:

In [20]: r = pd.crosstab(df.a,df.b)

In [21]: r
Out[21]: 
b  3  4
a      
1  1  0
2  1  3

In [22]: r/r.sum()
Out[22]: 
b    3    4
a          
1  0.5  0.0
2  0.5  1.0

In [23]: r/r.sum().sum()
Out[23]: 
b    3    4
a          
1  0.2  0.0
2  0.2  0.6

In [24]: r.div(r.sum(1),axis='index')
Out[24]: 
b     3     4
a            
1  1.00  0.00
2  0.25  0.75

This kind of feels like a post-processing step rather than a result of a single operation.

@jorisvandenbossche
Member

You can indeed do this rather simply after crosstab (although the row-wise case is not that trivial anymore), but I would still be OK with adding a convenience keyword for this.

@jreback
Contributor

jreback commented Mar 16, 2016

Yeah, just thinking that if we have a keyword (which is fine), then we should explain what it is actually doing in the doc-string, as it's not completely obvious.

Another option is maybe to actually have a .normalize(...) method on frames:

pd.crosstab(df.a, df.b).normalize('columns')

@nickeubank
Contributor Author

Yes, it's post-processing. Indeed, that's how it's implemented.

The one complication with a stand-alone normalize is that it can't easily be designed to deal with the margin columns: if you do crosstab(df.a, df.b, margins=True), you have to do a special normalization procedure to deal with the fact that the bottom row is a sum of all the rows above it. With a method that just divides all values by the total sum, you end up off by a factor of 1/2. I really think there's value in having this as an integrated method.
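
A minimal sketch of the problem being described (hypothetical data, not taken from the PR):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [3, 4, 3, 4]})

t = pd.crosstab(df.a, df.b, margins=True)
# The 'All' row and column repeat the body counts, so summing every cell
# counts each observation more than once.
naive = t / t.values.sum()           # shares come out too small
better = t / t.loc['All', 'All']     # divide by the grand-total cell instead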

@jreback
Contributor

jreback commented Mar 16, 2016

@nickeubank well, one could argue that margins itself should be an external method as well! E.g. it's kind of a 'footer' concept, almost like a collection of DataFrame + footer Series + footer Series (rhs).

So that's really another abstraction that is just being shoved into the current one.

@nickeubank
Contributor Author

@jreback I think I've addressed all your concerns. The only thing missing is that I wasn't sure how to write some of the requested tests for dropna=False, because I'm a little confused about the expected behavior (see new comment on #10772).

@nickeubank nickeubank force-pushed the improve_crosstab branch 2 times, most recently from c70f569 to 39718b8 Compare April 4, 2016 16:17
dtype='int64'),
columns=pd.Index([3, 4], name='b'))
calculated = pd.crosstab(df.a, df.b, values=df.c, aggfunc='count',
                         normalize=True)
Contributor

Something like this, just do it in a list, as the tests are clearer:

for arg in ['index', True, 'columns']:
    result = ....
    tm.assert_frame.....
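
A minimal sketch of what such a loop might look like, checking the normalize semantics discussed above rather than the PR's actual expected frames (illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

for arg in ['index', True, 'columns']:
    result = pd.crosstab(df.a, df.b, normalize=arg)
    if arg == 'index':
        sums = result.sum(axis=1)                 # each row should sum to 1
    elif arg == 'columns':
        sums = result.sum(axis=0)                 # each column should sum to 1
    else:
        sums = pd.Series([result.values.sum()])   # whole table should sum to 1
    assert np.allclose(sums, 1.0)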

Contributor

this too

@jreback
Contributor

jreback commented Apr 4, 2016

looks pretty good.

@sinhrks @jorisvandenbossche
cc @rosnfeld

@jreback jreback added this to the 0.18.1 milestone Apr 4, 2016
@@ -18,8 +18,7 @@ Highlights include:
New features
~~~~~~~~~~~~



- ``pd.crosstab()`` has gained a ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in the updated docs :ref:`here <reshaping.crosstabulations>`.
Contributor

why don't you add a mini-example here (the same one)?
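
The kind of mini-example being asked for might look like this (a sketch assuming the final normalize API; not necessarily the example that ended up in the whatsnew):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

pd.crosstab(df.a, df.b, normalize='columns')
# b      3    4
# a
# 1    0.5  0.0
# 2    0.5  1.0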

Contributor Author

@jreback Thanks. I'm going to be traveling away from my computer for a little over two weeks. I'm not sure what the timing is for 0.18.1. I'm happy to make changes when I get back, but won't be able to do anything till then.

@jreback
Contributor

jreback commented Apr 17, 2016

@nickeubank when you are back, please rebase. This looked pretty good.

@nickeubank
Contributor Author

@jreback rebased!

@nickeubank
Contributor Author

@jreback tweaked

@@ -402,6 +404,9 @@ It takes a number of arguments
- ``colnames``: sequence, default None, if passed, must match number of column
arrays passed
- ``margins``: boolean, default False, Add row/column margins (subtotals)
- ``normalize``: boolean, {'all', 'index', 'columns'}, or {0,1}, default False.
Normalize by dividing all values by the sum of values.
Member

Very small detail, but I think there is one space too many at the beginning of this line.

@nickeubank
Contributor Author

@jorisvandenbossche all integrated


# Actual Normalizations
normalizers = {
False: lambda x: x,
Member

Never used, correct?

@nickeubank
Contributor Author

@sinhrks ok! updated

@jreback
Contributor

jreback commented Apr 25, 2016

thanks @nickeubank
