[MINOR][BUILD] Exclude pyspark-coverage-site/ dir from RAT #24950

Closed
srowen wants to merge 1 commit into apache:master from srowen:pysparkcoveragesite

Conversation

@srowen srowen commented Jun 24, 2019

What changes were proposed in this pull request?

Looks like a directory pyspark-coverage-site/ is now (?) generated and fails RAT checks. It should just be excluded. See: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6029/console
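
For context, the fix itself amounts to a one-line exclusion. A minimal sketch, assuming RAT exclusion patterns are listed one per line in `dev/.rat-excludes` (the exact file is an assumption here, not something this thread confirms):

```
pyspark-coverage-site/
```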

How was this patch tested?

N/A

@srowen srowen self-assigned this Jun 24, 2019
srowen referenced this pull request Jun 24, 2019
## What changes were proposed in this pull request?

When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle uses a default number of partitions of 200 in "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.

This PR checks the `mapPartitionsInternal` iterator to be non-empty before calling `ArrowPythonRunner` to start computation on the iterator.
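
For illustration, a minimal Python sketch of the peek-then-skip idea (the actual change is in Spark's Scala internals; `run_arrow_python` below is a hypothetical stand-in for starting the Arrow-based Python runner):

```python
from itertools import chain

def run_arrow_python(rows):
    # Hypothetical stand-in for launching the Arrow-based Python runner;
    # here it simply materializes the rows.
    return iter(list(rows))

def process_partition(rows):
    # Peek one element: an empty partition never starts a runner at all.
    first = next(rows, None)
    if first is None:
        return iter([])
    # Re-attach the peeked element before handing the iterator off.
    return run_arrow_python(chain([first], rows))

# Empty partitions are skipped; non-empty ones pass through unchanged.
assert list(process_partition(iter([]))) == []
assert list(process_partition(iter([1, 2, 3]))) == [1, 2, 3]
```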

## How was this patch tested?

Existing tests. Ran the following benchmark, a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon (Member) commented:

This is fine too. Alternatively, maybe we should explicitly remove pyspark-coverage-site/ entirely after each run. pyspark-coverage-site/ was added in spark-master-sbt-hadoop-2.7 specifically as of #23117.

That job specifically runs the PySpark tests with coverage and pushes the results to https://github.com/spark-test/pyspark-coverage-site.

I hit this issue before; see #23729. pyspark-coverage-site should be removed after each run, but somehow it wasn't.

The RAT check happens after the PySpark tests, so it's likely the same thing happening again.

@shaneknapp and @srowen, maybe I have to reopen #23729 and manually remove the directory to make sure this doesn't happen again.
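
A minimal sketch of that cleanup idea, assuming it would run in the job's workspace after the coverage report is published (the path and placement are illustrative, not the job's actual script):

```python
import shutil

# Remove the generated coverage site so later steps (like the RAT check)
# never see it; ignore_errors covers the case where it was never created.
shutil.rmtree("pyspark-coverage-site", ignore_errors=True)
```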

@HyukjinKwon (Member) left a comment:

We can merge this one and see if the builds become happy as well. I don't understand why this directory is left over, and it looks like the only way to verify is to merge and see if it passes on the SBT Hadoop 2.7 build.

SparkQA commented Jun 24, 2019

Test build #106834 has finished for PR 24950 at commit 63acb5a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp (Contributor) commented:

This change is fine, but another option could be to move the RAT test to the beginning of dev/run-tests.py:main().
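
A rough sketch of that alternative ordering; the function names below are hypothetical stand-ins, not a verified excerpt of dev/run-tests.py:

```python
def run_apache_rat_checks():
    # Stand-in for the real license/RAT check.
    print("RAT checks passed.")

def run_pyspark_tests_with_coverage():
    # Stand-in for the coverage run that generates pyspark-coverage-site/.
    print("PySpark tests finished.")

def main():
    # Suggested reordering: run the RAT check first, before any step that
    # can leave generated files (like the coverage site) in the tree.
    run_apache_rat_checks()
    run_pyspark_tests_with_coverage()

if __name__ == "__main__":
    main()
```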

SparkQA commented Jun 24, 2019

Test build #4807 has finished for PR 24950 at commit 63acb5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member, Author) commented Jun 24, 2019

Merged to master

@srowen srowen closed this in 67042e9 Jun 24, 2019
@shaneknapp (Contributor) commented:

yay! happiness ensues:

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
RAT checks passed.
```

kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
## What changes were proposed in this pull request?

Looks like a directory `pyspark-coverage-site/` is now (?) generated and fails RAT checks. It should just be excluded. See: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6029/console

## How was this patch tested?

N/A

Closes apache#24950 from srowen/pysparkcoveragesite.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen srowen deleted the pysparkcoveragesite branch August 9, 2019 16:06