
[SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions #24926

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

When running `FlatMapGroupsInPandasExec` or `AggregateInPandasExec`, the shuffle uses the default of 200 partitions from "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are processed just the same.

This PR checks that the `mapPartitionsInternal` iterator is non-empty before calling `ArrowPythonRunner` to start computation on the iterator.
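The idea behind the guard can be sketched in plain Python: peek at each partition's iterator, skip it if it yields nothing, and otherwise re-attach the peeked row before handing the iterator off. This is an illustrative sketch of the pattern, not the actual Scala change:

```python
from itertools import chain

def non_empty_partitions(partitions):
    """Yield only partitions whose iterator produces at least one row."""
    for part in partitions:
        it = iter(part)
        try:
            first = next(it)          # peek: consumes one element
        except StopIteration:
            continue                  # empty partition: skip it entirely
        yield chain([first], it)      # re-attach the peeked element

parts = [iter([]), iter([1, 2]), iter([]), iter([3])]
surviving = [list(p) for p in non_empty_partitions(parts)]
# only the two non-empty partitions survive
```

Empty partitions never reach the downstream consumer, which is what saves the per-partition startup cost of `ArrowPythonRunner`.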

How was this patch tested?

Existing tests. Ran the following benchmark on a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```
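As a side note, `normalize` standardizes each group using pandas's sample standard deviation (ddof=1). For the id=2 group above, the arithmetic can be checked without a Spark session (a plain-Python sketch):

```python
import math

v = [3.0, 5.0, 10.0]                     # the id=2 group from the DataFrame above
mean = sum(v) / len(v)                   # 6.0
# sample standard deviation (ddof=1), matching pandas's Series.std() default
std = math.sqrt(sum((x - mean) ** 2 for x in v) / (len(v) - 1))
normalized = [(x - mean) / std for x in v]
# standardized values sum to (approximately) zero by construction
```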

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

@BryanCutler
Member Author

This probably won't have an impact if the data is not small, but given how many tests there are for these, it should hopefully reduce some testing time.

Strangely, AggregateInPandasExec did not show an improvement, although the code is very similar. I'll have to look at that one in more detail.

```scala
  (key, rows.map(prunedProj))
}
// Only execute on non-empty partitions
if (iter.nonEmpty) {
```
Member Author

The diff is a little off, this is really the only change

@BryanCutler
Member Author

cc @HyukjinKwon @icexelloss

@SparkQA

SparkQA commented Jun 21, 2019

Test build #106744 has finished for PR 24926 at commit 4eaef85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon left a comment

Yea, I think I made the same change separately in the R vectorization as well. Looks good as is too. Although I am sure we have one, just to be sure, do we have a test that processes empty partitions?

@BryanCutler
Member Author

> do we have a test that processes empty partitions?

We have one for pandas scalar udfs, but not specifically for these. Although I believe a bunch of empty partitions are created during the shuffle, it would still be good to test starting with one. Let me go ahead and add that.
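Independent of the Spark-side test being added here, the UDF body itself can be sanity-checked against an empty group locally. This is a pandas-only sketch (no Spark session involved), not the new test in this PR:

```python
import pandas as pd

def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

# an empty frame with the same schema as the benchmark DataFrame
empty = pd.DataFrame({"id": pd.Series(dtype="int64"),
                      "v": pd.Series(dtype="float64")})
result = normalize(empty)
# no rows in, no rows out, and the columns are preserved
```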

@BryanCutler
Member Author

Looks like this cut almost 30s off the test time lol, happy Friday @shaneknapp!

GroupedMapPandasUDFTests:
Ran 17 tests in 53.236s (before) vs.
Ran 18 tests in 24.102s (after)

@SparkQA

SparkQA commented Jun 21, 2019

Test build #106779 has finished for PR 24926 at commit 9be0110.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@BryanCutler
Member Author

Thanks @dongjoon-hyun and @HyukjinKwon!

@BryanCutler BryanCutler deleted the pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128 branch June 24, 2019 17:15
@shaneknapp
Contributor

shaneknapp commented Jun 24, 2019 via email

kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
Closes apache#24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>