BUG: groupby fast_apply vs python apply handles same-indexed result differently #40446

jorisvandenbossche · 2021-03-15T13:12:32Z

From #39146 (comment), discovered while investigating a benchmark difference: in groupby/ops.py, fast_apply (which uses libreduction) and the generic Python apply give different results when the applied function returns output with the same index as its input.

Using a small example dataframe and a function to be applied which simply copies the input:

import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 3, size=N),
        "value1": np.random.randn(N),
        "value2": ["foo", "bar"] * (N // 2),
    }
)

def df_copy_function(g):
    # ensure that the group name is available (see GH #15062)
    g.name
    # return an unmodified copy, so the output has the same index as the input group
    return g.copy()

By default you get this result:

In [3]: df.groupby("key").apply(df_copy_function)
Out[3]: 
       key    value1 value2
key                        
0   8    0 -0.149534    foo
    9    0 -0.391135    bar
1   1    1 -0.581107    bar
    2    1 -0.338278    foo
    3    1  0.768924    bar
    6    1 -0.778718    foo
2   0    2  0.196477    foo
    4    2 -0.364822    foo
    5    2 -0.976079    bar
    7    2 -2.671668    bar

But if I force the fast apply path to be skipped (in this case by making one column an extension dtype), the result is different:

In [4]: df['value2'] = df["value2"].astype("string")

In [5]: df.groupby("key").apply(df_copy_function)
Out[5]: 
   key    value1 value2
0    2  0.196477    foo
1    1 -0.581107    bar
2    1 -0.338278    foo
3    1  0.768924    bar
4    2 -0.364822    foo
5    2 -0.976079    bar
6    1 -0.778718    foo
7    2 -2.671668    bar
8    0 -0.149534    foo
9    0 -0.391135    bar

This might be another manifestation of #34998 and the issues linked from that PR.
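
The two outputs above differ only in the index: in one the group key is prepended as an extra level, in the other the original index and row order are kept. Below is a minimal sketch of how to request each shape explicitly instead of depending on which internal path is taken. It assumes pandas 1.5 or later, where `group_keys` is respected for same-indexed results; it does not reproduce the behaviour of the version this issue was reported against. The name `copy_group` and the column selection are illustrative only; selecting the value columns keeps the grouping column out of the applied function, which newer pandas versions expect.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an addition, not from the report).
rng = np.random.default_rng(0)
N = 10
df = pd.DataFrame(
    {
        "key": rng.integers(0, 3, size=N),
        "value1": rng.standard_normal(N),
        "value2": ["foo", "bar"] * (N // 2),
    }
)


def copy_group(g):
    # Hypothetical stand-in for df_copy_function: returns a same-indexed copy.
    return g.copy()


cols = ["value1", "value2"]

# group_keys=True: the group key is prepended as an outer index level.
with_keys = df.groupby("key", group_keys=True)[cols].apply(copy_group)

# group_keys=False: the result keeps the original index and row order.
without_keys = df.groupby("key", group_keys=False)[cols].apply(copy_group)
```

Passing `group_keys` explicitly makes the intended index shape visible at the call site, regardless of column dtypes.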

jorisvandenbossche added Bug Groupby labels Mar 15, 2021

jorisvandenbossche mentioned this issue Mar 15, 2021

Refactor - ArrayManager overview issue #39146

Closed

11 tasks

jbrockmendel mentioned this issue Aug 11, 2021

REF: remove libreduction.apply_frame_axis0 #42992

Merged

4 tasks

jreback added this to the 1.4 milestone Aug 12, 2021

jreback closed this as completed in #42992 Aug 12, 2021
