BUG: groupby fast_apply vs python apply handles same-indexed result differently #40446

Closed
jorisvandenbossche opened this issue Mar 15, 2021 · 0 comments · Fixed by #42992

From #39146 (comment), discovered while investigating a benchmark difference. In groupby/ops.py, the fast_apply path (using libreduction) and the generic Python apply path give different results when the applied function returns output with the same index as its input group.

Using a small example dataframe and a function to be applied which simply copies the input:

import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 3, size=N),
        "value1": np.random.randn(N),
        "value2": ["foo", "bar"] * (N // 2),
    }
)

def df_copy_function(g):
    # access g.name to ensure that the group name is available (see GH #15062)
    g.name
    return g.copy()

By default you get this result:

In [3]: df.groupby("key").apply(df_copy_function)
Out[3]: 
       key    value1 value2
key                        
0   8    0 -0.149534    foo
    9    0 -0.391135    bar
1   1    1 -0.581107    bar
    2    1 -0.338278    foo
    3    1  0.768924    bar
    6    1 -0.778718    foo
2   0    2  0.196477    foo
    4    2 -0.364822    foo
    5    2 -0.976079    bar
    7    2 -2.671668    bar

But if I force the fast apply path to be skipped (in this case by making one column an extension dtype), I get a different result:

In [4]: df['value2'] = df["value2"].astype("string")

In [5]: df.groupby("key").apply(df_copy_function)
Out[5]: 
   key    value1 value2
0    2  0.196477    foo
1    1 -0.581107    bar
2    1 -0.338278    foo
3    1  0.768924    bar
4    2 -0.364822    foo
5    2 -0.976079    bar
6    1 -0.778718    foo
7    2 -2.671668    bar
8    0 -0.149534    foo
9    0 -0.391135    bar

This might be another manifestation of #34998 and the issues linked from that PR.
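
To check which behaviour a given pandas version shows, here is a minimal sketch that runs the same copy function on an object-dtype frame and on a "string"-dtype frame, then reports the number of index levels of each result. The helper names (`make_frame`, `copy_group`, `index_levels_by_path`) are made up for illustration and are not part of pandas. Whether the two dtypes actually take different code paths depends on the installed version, so treat the output as a diagnostic rather than a guaranteed reproduction.

```python
import numpy as np
import pandas as pd


def make_frame(n=10, seed=0):
    # Hypothetical helper: seeded variant of the example frame above.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "key": rng.integers(0, 3, size=n),
            "value1": rng.standard_normal(n),
            "value2": ["foo", "bar"] * (n // 2),
        }
    )


def copy_group(g):
    # Returns a result with the same index as the input group.
    return g.copy()


def index_levels_by_path(df):
    # Object-dtype frame (the fast_apply candidate in the versions discussed here).
    object_result = df.groupby("key").apply(copy_group)
    # Same data with an extension dtype column, as in the report above.
    ext_df = df.assign(value2=df["value2"].astype("string"))
    ext_result = ext_df.groupby("key").apply(copy_group)
    return object_result.index.nlevels, ext_result.index.nlevels


df = make_frame()
levels = index_levels_by_path(df)
print(levels)
```

If the two numbers differ, one result has the group key prepended as an extra index level and the other kept the original index, which is the inconsistency described here.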
