
Add normalization to crosstab #12578

Closed

Conversation

nickeubank
Contributor

Closes #12569

Note: does NOT address #12577.

@nickeubank nickeubank force-pushed the improve_crosstab branch 2 times, most recently from fbc15c5 to 0f835b7 Compare March 9, 2016 22:00
@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 9, 2016
@sinhrks
Member

sinhrks commented Mar 10, 2016

Thanks for the PR :)

As pointed out in #12569, I prefer adding a new aggfunc over a new option (normalize). For example, isn't aggfunc="std" combined with normalize=True meaningless?

@nickeubank
Contributor Author

@sinhrks Yes, combining an aggfunc with normalize would be a problem, which is why I added an exception for that.

(Or rather, I added two exceptions: one for passing aggfunc without values, which I think should raise since aggfunc is otherwise silently ignored, and one for passing normalize with values. Together, they ensure no one tries to combine normalize with values or an aggfunc.)

I can't think of an actual function you could pass to aggfunc that would do row or column normalization (unless I'm mistaken?), so I'm not in favor of treating it like just a special aggfunc. Since it does something you can't do with an aggfunc, it seems like it should stand alone.

@jreback
Contributor

jreback commented Mar 10, 2016

@nickeubank, no, you simply allow string args like aggfunc='normalize' or whatever; it's less confusing (and you can of course always pass an actual function).

What @sinhrks suggests probably makes the most sense.

pct, pct_row, pct_col; pretty clear what they mean (and the doc-string will reinforce these). These are convenience aggregators.

    table = table / table.sum(axis=1).sum(axis=0)

elif normalize == 'rows':
    table = table.apply(lambda x: x / x.sum(), axis=1)
Contributor

in any event this is just a vectorized division of the table by its row sums; we almost never actually want to use .apply
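
For reference, a minimal sketch of the vectorized equivalents (not taken from the PR; note that the row-wise case needs .div(..., axis=0) so the row sums align on the index):

import pandas as pd

table = pd.DataFrame({3: [1, 1], 4: [0, 3]}, index=[1, 2])

# Row-normalize without .apply: divide each row by its row sum.
row_normalized = table.div(table.sum(axis=1), axis=0)

# Column-normalize: plain division aligns the column sums on the columns.
col_normalized = table / table.sum(axis=0)

# Normalize over the whole table.
all_normalized = table / table.values.sum()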

@nickeubank
Contributor Author

@jreback @sinhrks ok, will update!

@nickeubank
Contributor Author

@sinhrks a little better?

If no values array is passed, computes a frequency table.
If 'pct' is passed, will normalize the frequency table.
Contributor

add a versionadded tag

Member

It may be better to describe that normalization is performed for the entire level (not for sub-levels) in the MultiIndex case.

@sinhrks
Member

sinhrks commented Mar 14, 2016

I think it would be nice if pct_xxx could normalize the sum of values when values is specified (and all of them are positive).

import numpy as np
import pandas as pd

a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])
pd.crosstab(a, b, c, aggfunc=np.sum)
# col_0   1   2
# row_0        
# 1      10  20
# 2      40  80

pd.crosstab(a, b, c, aggfunc='pct_cols')
# col_0   1    2
# row_0        
# 1     0.2  0.2
# 2     0.8  0.8

@jorisvandenbossche
Member

I am slightly -1 on adding this functionality to aggfunc:

  • First, having both the possibility of passing a func or a string (one that does not map directly to an existing function) seems a bit unpythonic, and it really concerns two different questions: how to aggregate multiple values, and whether or not to normalize the final result.
  • Currently, you can already pass strings to aggfunc to specify the aggregation function (like in groupby.agg), so the current implementation checking for those specific strings will break that functionality.
    E.g. something like pd.crosstab(index, columns, values, aggfunc='sum')
  • value_counts already has the normalize keyword, so there is a precedent for this (and I think it is equivalent functionality, so it would be more consistent to use that?)
  • As @sinhrks just mentioned, you may also want to normalize the result of an aggregation, e.g. pd.crosstab(index, columns, values, aggfunc='sum', normalize=True) (see the sketch below). That would not be possible with the currently proposed API.
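
For illustration, a minimal sketch of that last combination, reusing @sinhrks's arrays from above (this assumes the normalize keyword behaves as it does in later pandas releases; it is not code from the PR):

import numpy as np
import pandas as pd

a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])

# Aggregate the values with sum, then express each cell as a share of the
# grand total (10 + 20 + 40 + 80 = 150) instead of as raw sums.
shares = pd.crosstab(a, b, values=c, aggfunc='sum', normalize=True)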

@nickeubank
Contributor Author

Thanks @jorisvandenbossche, I think that's well said. Perhaps we should solicit a few more opinions to see if we can move towards consensus on this?

@jreback
Contributor

jreback commented Mar 14, 2016

@jorisvandenbossche's solution seems reasonable.

You might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> pct_cols.
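
For concreteness, a minimal sketch of how such a normalize argument could dispatch to the three operations (illustrative only; the function name and exact structure are not taken from the PR):

import pandas as pd

def normalize_table(table, normalize):
    """Normalize a crosstab-style frame over all cells, rows, or columns."""
    normalizers = {
        'all': lambda t: t / t.values.sum(),
        'index': lambda t: t.div(t.sum(axis=1), axis=0),   # row percentages
        'columns': lambda t: t / t.sum(axis=0),            # column percentages
    }
    normalizers[True] = normalizers['all']
    if normalize not in normalizers:
        raise ValueError("normalize must be True, 'all', 'index' or 'columns'")
    return normalizers[normalize](table)

Under this mapping, normalize=True corresponds to the pct behaviour above, 'index' to pct_rows, and 'columns' to pct_cols.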

@nickeubank
Contributor Author

That would be my preference

@jorisvandenbossche
Member

you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'

+1

@sinhrks
Member

sinhrks commented Mar 15, 2016

My opinion is based on the understanding that crosstab is similar to pivot_table. Thus:

  • Because pivot_table accepts str, I think it is OK to add other str options to aggfunc.
  • aggfunc="pct", "pct_cols", "pct_rows" would output normalized counts if values is not specified, and normalized values if values is specified.
  • I don't distinguish aggregate/normalize, because an aggfunc can be defined for normalization (mean is (category sum)/(category count), pct is (category sum)/(total count)).

I don't have a strong objection to @jorisvandenbossche's option. One concern is that the API gets more complex than required. It's unlikely you'd want to normalize values other than counts and sums.

@jorisvandenbossche
Member

Because pivot_table accepts str, I think it is OK to add other str options to aggfunc

That's what I said above as well. crosstab's aggfunc already accepts strings, just as groupby().agg() accepts strings. But those are strings that match an existing function/method; that is not the case with 'pct_cols'/'pct_rows'.

One concern is API gets complex more than required.

Indeed, in general I am very reluctant to add new keyword arguments. But in this case, I think overloading aggfunc makes it more complex to explain what aggfunc does. It seems cleaner to separate the two notions.

It's less likely to normalize other values than count and sum.

That's indeed true.

@nickeubank
Contributor Author

Sounds like we're agreed that aggfunc can take strings (I'll remove that exception!).

I think my two main thoughts are:

  • There is no function that could do row-normalization. Since it's functionally doing something different from anything an explicit aggfunc could do, it should be its own option, even though I agree with @sinhrks that there are few situations where you'd want to normalize the output of a crosstab generated with a custom aggfunc. Indeed, other programs like Stata that do something similar (accept aggregation functions for tables) also split normalization out into its own syntax.
  • As @jorisvandenbossche has pointed out, normalize is already a keyword in a core function (value_counts), so I think having it is pretty consistent with the API.

Regarding complexity, I think of crosstab as a tool that may be used to generate analysis outputs to put in papers (that's my interest at least). Given that, I think making it as flexible and powerful as possible is highly desirable, even at the cost of an extra keyword.

@jreback
Contributor

jreback commented Mar 15, 2016

so we are talking about doing these ops:

In [20]: r = pd.crosstab(df.a,df.b)

In [21]: r
Out[21]: 
b  3  4
a      
1  1  0
2  1  3

In [22]: r/r.sum()
Out[22]: 
b    3    4
a          
1  0.5  0.0
2  0.5  1.0

In [23]: r/r.sum().sum()
Out[23]: 
b    3    4
a          
1  0.2  0.0
2  0.2  0.6

In [24]: r.div(r.sum(1),axis='index')
Out[24]: 
b     3     4
a            
1  1.00  0.00
2  0.25  0.75

This kind of feels like a post-processing step rather than a result of a single operation.

@jorisvandenbossche
Member

You can indeed do this rather simply after crosstab (although the row-wise case is not that trivial anymore), but I would still be OK with adding a convenience keyword for this.

@jreback
Contributor

jreback commented Mar 16, 2016

Yeah, just thinking that if we have a keyword (which is fine), then we should explain what it is actually doing in the doc-string, as it's not completely obvious.

Another option is maybe to actually have a .normalize(...) method on frames:

pd.crosstab(df.a, df.b).normalize('columns')

@nickeubank
Contributor Author

Yes, it's post-processing. Indeed, that's how it's implemented.

The one complication with a stand-alone normalize is that it can't easily be designed to deal with the margin columns: if you do crosstab(df.a, df.b, margins=True), you have to do a special normalization procedure to deal with the fact that the bottom row is a sum of all the rows above it. With a method that just divides all values by the total sum, you end up off by a factor of 1/2. I really think there's value in having this as an integrated method.
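
A minimal sketch of the problem being described (hypothetical data, not taken from the PR):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [3, 4, 3, 4]})

t = pd.crosstab(df.a, df.b, margins=True)
# The 'All' row and column repeat the body counts, so summing every cell
# counts each observation more than once.
naive = t / t.values.sum()           # shares come out too small
better = t / t.loc['All', 'All']     # divide by the grand-total cell instead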

@jreback
Contributor

jreback commented Mar 16, 2016

@nickeubank well, one could argue that margins itself should be an external method as well! E.g. it's kind of a 'footer' concept, almost like a collection of DataFrame + footer Series + footer Series (rhs).

So that's really another abstraction that is just being shoved into the current one.

@nickeubank
Contributor Author

@jreback I think I've addressed all your concerns. The only thing missing is that I wasn't sure how to write some of the requested tests for dropna=False, because I'm a little confused about the expected behavior (see new comment on #10772).

@nickeubank nickeubank force-pushed the improve_crosstab branch 2 times, most recently from c70f569 to 39718b8 Compare April 4, 2016 16:17
dtype='int64'),
columns=pd.Index([3, 4], name='b'))
calculated = pd.crosstab(df.a, df.b, values=df.c, aggfunc='count',
                         normalize=True)
Contributor

Something like this, just do it in a list, as the tests are clearer:

for arg in ['index', True, 'columns']:
    result = ....
    tm.assert_frame.....
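
A minimal sketch of what such a loop might look like, checking the normalize semantics discussed above rather than the PR's actual expected frames (illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

for arg in ['index', True, 'columns']:
    result = pd.crosstab(df.a, df.b, normalize=arg)
    if arg == 'index':
        sums = result.sum(axis=1)                 # each row should sum to 1
    elif arg == 'columns':
        sums = result.sum(axis=0)                 # each column should sum to 1
    else:
        sums = pd.Series([result.values.sum()])   # whole table should sum to 1
    assert np.allclose(sums, 1.0)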

Contributor

this too

@jreback
Contributor

jreback commented Apr 4, 2016

looks pretty good.

@sinhrks @jorisvandenbossche
cc @rosnfeld

@jreback jreback added this to the 0.18.1 milestone Apr 4, 2016
@@ -18,8 +18,7 @@ Highlights include:
New features
~~~~~~~~~~~~



- ``pd.crosstab()`` has gained a ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in the updated docs :ref:`here <reshaping.crosstabulations>`.
Contributor

why don't you add a mini-example here (the same one)?
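
The kind of mini-example being asked for might look like this (a sketch assuming the final normalize API; not necessarily the example that ended up in the whatsnew):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

pd.crosstab(df.a, df.b, normalize='columns')
# b      3    4
# a
# 1    0.5  0.0
# 2    0.5  1.0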

Contributor Author

@jreback Thanks. I'm going to be traveling away from my computer for a little over two weeks. I'm not sure what the timing is for 0.18.1. I'm happy to make changes when I get back, but won't be able to do anything till then.

@jreback
Contributor

jreback commented Apr 17, 2016

@nickeubank when you are back, please rebase. This looked pretty good.

@nickeubank
Contributor Author

@jreback rebased!

@nickeubank
Contributor Author

@jreback tweaked

@@ -402,6 +404,9 @@ It takes a number of arguments
- ``colnames``: sequence, default None, if passed, must match number of column
arrays passed
- ``margins``: boolean, default False, Add row/column margins (subtotals)
- ``normalize``: boolean, {'all', 'index', 'columns'}, or {0,1}, default False.
Normalize by dividing all values by the sum of values.
Member

Very small detail, but I think there is one space too many at the beginning of this line.

@nickeubank
Contributor Author

@jorisvandenbossche all integrated


# Actual Normalizations
normalizers = {
False: lambda x: x,
Member

Never used, correct?

@nickeubank
Contributor Author

@sinhrks ok! updated

@jreback
Contributor

jreback commented Apr 25, 2016

thanks @nickeubank
