ENH: Add groupby().ngroup() method to count groups (#11642) #14026

Merged
merged 1 commit into from
Jun 1, 2017
1 change: 1 addition & 0 deletions doc/source/api.rst
@@ -1707,6 +1707,7 @@ Computations / Descriptive Stats
GroupBy.mean
GroupBy.median
GroupBy.min
GroupBy.ngroup
GroupBy.nth
GroupBy.ohlc
GroupBy.prod
63 changes: 57 additions & 6 deletions doc/source/groupby.rst
@@ -1122,12 +1122,36 @@ To see the order in which each row appears within its group, use the

.. ipython:: python

df = pd.DataFrame(list('aaabba'), columns=['A'])
df
dfg = pd.DataFrame(list('aaabba'), columns=['A'])
dfg

dfg.groupby('A').cumcount()

dfg.groupby('A').cumcount(ascending=False)

.. _groupby.ngroup:

Enumerate groups
~~~~~~~~~~~~~~~~

.. versionadded:: 0.20.2

To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use the ``ngroup``
method.

Note that the numbers given to the groups match the order in which the
groups would be seen when iterating over the groupby object, not the
order they are first observed.

.. ipython:: python

df.groupby('A').cumcount()
dfg = pd.DataFrame(list('aaabba'), columns=['A'])
dfg

df.groupby('A').cumcount(ascending=False) # kwarg only
dfg.groupby('A').ngroup()

dfg.groupby('A').ngroup(ascending=False)
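
The ordering rule stated above (group numbers follow iteration order, not first-observed order) is easiest to see when the keys do not appear in sorted order. The following is a minimal sketch; the frame name ``df_order`` and its data are illustrative assumptions and are not part of this PR. It also assumes that ``sort=False`` makes iteration order follow first appearance, as it does for a single grouping key.

```python
import pandas as pd

# Hypothetical data: the key 'b' is observed before 'a'.
df_order = pd.DataFrame({"A": list("bbaab")})

# Default (sort=True): groups are iterated in sorted key order,
# so 'a' is group 0 and 'b' is group 1 even though 'b' comes first.
sorted_ids = df_order.groupby("A").ngroup()

# sort=False: groups are iterated in order of first appearance,
# so 'b' is group 0 and 'a' is group 1.
unsorted_ids = df_order.groupby("A", sort=False).ngroup()

print(sorted_ids.tolist())
print(unsorted_ids.tolist())
```

If first-observed numbering is what you need, passing ``sort=False`` to ``groupby`` is one way to get it.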

Plotting
~~~~~~~~
@@ -1176,14 +1200,41 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
df
df.groupby(df.sum(), axis=1).sum()

.. _groupby.multicolumn_factorization:

Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~

By using ``.ngroup()``, we can extract information about the groups in
a way similar to :func:`factorize` (as described further in the
:ref:`reshaping API <reshaping.factorize>`) but which applies
naturally to multiple columns of mixed type and different
sources. This can be useful as an intermediate categorical-like step
in processing, when the relationships between the group rows are more
important than their content, or as input to an algorithm which only
accepts the integer encoding. (For more information about support in
pandas for full categorical data, see the :ref:`Categorical
introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.)

.. ipython:: python

dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

dfg

dfg.groupby(["A", "B"]).ngroup()

dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
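
To make the comparison with ``factorize`` concrete, the sketch below puts the two side by side on the same frame as above. The variable names ``codes``, ``uniques`` and ``pair_ids`` are illustrative only. Note one difference that this sketch relies on: ``pd.factorize`` numbers values in order of first appearance by default, while ``ngroup`` follows the (sorted, by default) group order.

```python
import pandas as pd

dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})

# Single column: factorize returns integer codes plus the unique values.
codes, uniques = pd.factorize(dfg["A"])

# Several columns of mixed type: ngroup assigns one integer to each
# distinct (A, B) combination, numbered in sorted key order.
pair_ids = dfg.groupby(["A", "B"]).ngroup()

print(codes.tolist(), list(uniques))
print(pair_ids.tolist())
```

Here the two encodings happen to coincide because each value of ``A`` is paired with a single value of ``B`` and ``A`` already first appears in sorted order; in general they need not agree.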

Groupby by Indexer to 'resample' data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Resampling produces new hypothetical samples(resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.

In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized.

In the following examples, **df.index // 5** returns a binary array which is used to determine what get's selected for the groupby operation.
In the following examples, **df.index // 5** returns an array of integer bin labels which is used to determine what gets selected for the groupby operation.

.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
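
A minimal sketch of the binning described in the note, assuming a default integer index; the frame ``df_bins`` and its column ``x`` are made up for illustration and do not come from the documentation being changed. Integer division of the index by 5 gives one integer label per row (0 for rows 0 to 4, 1 for rows 5 to 9), and ``std()`` then reduces each bin to a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten samples on a plain integer index.
df_bins = pd.DataFrame({"x": np.arange(10, dtype=float)})

# One integer bin label per row: 0,0,0,0,0,1,1,1,1,1.
bin_labels = df_bins.index // 5

# Downsample: ten rows become two, one standard deviation per bin.
result = df_bins.groupby(bin_labels).std()

print(list(bin_labels))
print(result)
```

Changing the divisor changes the bin width, so ``df_bins.index // 2`` would produce five bins of two rows each.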

2 changes: 1 addition & 1 deletion doc/source/reshaping.rst
@@ -636,7 +636,7 @@ When a column contains only one level, it will be omitted in the result.

pd.get_dummies(df, drop_first=True)


.. _reshaping.factorize:

Factorizing values
------------------
5 changes: 5 additions & 0 deletions doc/source/whatsnew/v0.20.2.txt
@@ -23,6 +23,11 @@ Enhancements
- ``Series`` provides a ``to_latex`` method (:issue:`16180`)
- Added :attr:`Index.is_strictly_monotonic_increasing` and :attr:`Index.is_strictly_monotonic_decreasing` properties (:issue:`16515`)

- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`,
parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`,
has been added to return the group order (:issue:`11642`); see
:ref:`here <groupby.ngroup>`.

.. _whatsnew_0202.performance:

Performance Improvements
75 changes: 74 additions & 1 deletion pandas/core/groupby.py
@@ -150,7 +150,7 @@
'last', 'first',
'head', 'tail', 'median',
'mean', 'sum', 'min', 'max',
'cumcount',
'cumcount', 'ngroup',
'resample',
'rank', 'quantile',
'fillna',
@@ -1437,6 +1437,75 @@ def nth(self, n, dropna=None):

return result

@Substitution(name='groupby')
@Appender(_doc_template)
def ngroup(self, ascending=True):
"""
Number each group from 0 to the number of groups - 1.

This is the enumerative complement of cumcount. Note that the
numbers given to the groups match the order in which the groups
Contributor (review comment): versionadded tag

would be seen when iterating over the groupby object, not the
order they are first observed.
Contributor (review comment): this comment is good, add something like this to the docs (in groupby.rst) where you show an example


.. versionadded:: 0.20.2

Parameters
----------
ascending : bool, default True
If False, number in reverse, from number of groups - 1 to 0.

Examples
--------

>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df
A
0 a
1 a
2 a
3 b
4 b
5 a
>>> df.groupby('A').ngroup()
0 0
1 0
2 0
3 1
4 1
5 0
dtype: int64
>>> df.groupby('A').ngroup(ascending=False)
0 1
1 1
2 1
3 0
4 0
5 1
dtype: int64
>>> df.groupby(["A", [1,1,2,3,2,1]]).ngroup()
0 0
1 0
2 1
3 3
4 2
5 0
dtype: int64

See also
--------
.cumcount : Number the rows in each group.

"""

self._set_group_selection()

index = self._selected_obj.index
result = Series(self.grouper.group_info[0], index)
if not ascending:
result = self.ngroups - 1 - result
return result

@Substitution(name='groupby')
@Appender(_doc_template)
def cumcount(self, ascending=True):
@@ -1481,6 +1550,10 @@ def cumcount(self, ascending=True):
4 0
5 0
dtype: int64

See also
--------
.ngroup : Number the groups themselves.
"""

self._set_group_selection()