[MINOR][BUILD] Exclude pyspark-coverage-site/ dir from RAT #24950

Closed
srowen wants to merge 1 commit into apache:master from srowen:pysparkcoveragesite

Conversation

@srowen srowen commented Jun 24, 2019

What changes were proposed in this pull request?

Looks like a directory pyspark-coverage-site/ is now (?) generated and fails RAT checks. It should just be excluded. See: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6029/console
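
For context, the fix itself amounts to a one-line exclusion. A minimal sketch, assuming RAT exclusion patterns are listed one per line in `dev/.rat-excludes` (the exact file is an assumption here, not something this thread confirms):

```
pyspark-coverage-site/
```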

How was this patch tested?

N/A

@srowen srowen self-assigned this Jun 24, 2019
srowen referenced this pull request Jun 24, 2019
## What changes were proposed in this pull request?

When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle uses a default number of partitions of 200 in "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.

This PR checks the `mapPartitionsInternal` iterator to be non-empty before calling `ArrowPythonRunner` to start computation on the iterator.
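
For illustration, a minimal Python sketch of the peek-then-skip idea (the actual change is in Spark's Scala internals; `run_arrow_python` below is a hypothetical stand-in for starting the Arrow-based Python runner):

```python
from itertools import chain

def run_arrow_python(rows):
    # Hypothetical stand-in for launching the Arrow-based Python runner;
    # here it simply materializes the rows.
    return iter(list(rows))

def process_partition(rows):
    # Peek one element: an empty partition never starts a runner at all.
    first = next(rows, None)
    if first is None:
        return iter([])
    # Re-attach the peeked element before handing the iterator off.
    return run_arrow_python(chain([first], rows))

# Empty partitions are skipped; non-empty ones pass through unchanged.
assert list(process_partition(iter([]))) == []
assert list(process_partition(iter([1, 2, 3]))) == [1, 2, 3]
```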

## How was this patch tested?

Existing tests. Ran the following benchmark, a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon (Member) commented:

This is fine too. Alternatively, maybe we should explicitly remove pyspark-coverage-site/ entirely after each run. pyspark-coverage-site/ was added in spark-master-sbt-hadoop-2.7 specifically as of #23117.

That job specifically runs the PySpark tests with coverage and pushes the results to https://github.com/spark-test/pyspark-coverage-site.

I hit this issue before; see #23729. pyspark-coverage-site should be removed after each run, but somehow it wasn't.

The RAT check happens after the PySpark tests, so it's likely the same thing happening again.

@shaneknapp and @srowen, maybe I have to reopen #23729 and manually remove the directory to make sure this doesn't happen again.
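
A minimal sketch of that cleanup idea, assuming it would run in the job's workspace after the coverage report is published (the path and placement are illustrative, not the job's actual script):

```python
import shutil

# Remove the generated coverage site so later steps (like the RAT check)
# never see it; ignore_errors covers the case where it was never created.
shutil.rmtree("pyspark-coverage-site", ignore_errors=True)
```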

@HyukjinKwon (Member) left a comment:

We can merge this one and see if the builds become happy as well. I don't understand why this directory is left over, and it looks like the only way to verify is to merge and see if it passes on the SBT Hadoop 2.7 build.

SparkQA commented Jun 24, 2019

Test build #106834 has finished for PR 24950 at commit 63acb5a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp (Contributor) commented:

This change is fine, but another option could be to move the RAT test to the beginning of dev/run-tests.py:main().
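
A rough sketch of that alternative ordering; the function names below are hypothetical stand-ins, not a verified excerpt of dev/run-tests.py:

```python
def run_apache_rat_checks():
    # Stand-in for the real license/RAT check.
    print("RAT checks passed.")

def run_pyspark_tests_with_coverage():
    # Stand-in for the coverage run that generates pyspark-coverage-site/.
    print("PySpark tests finished.")

def main():
    # Suggested reordering: run the RAT check first, before any step that
    # can leave generated files (like the coverage site) in the tree.
    run_apache_rat_checks()
    run_pyspark_tests_with_coverage()

if __name__ == "__main__":
    main()
```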

SparkQA commented Jun 24, 2019

Test build #4807 has finished for PR 24950 at commit 63acb5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member, Author) commented Jun 24, 2019

Merged to master

@srowen srowen closed this in 67042e9 Jun 24, 2019
@shaneknapp (Contributor) commented:

yay! happiness ensues:

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
RAT checks passed.
```

kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
## What changes were proposed in this pull request?

Looks like a directory `pyspark-coverage-site/` is now (?) generated and fails RAT checks. It should just be excluded. See: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6029/console

## How was this patch tested?

N/A

Closes apache#24950 from srowen/pysparkcoveragesite.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen srowen deleted the pysparkcoveragesite branch August 9, 2019 16:06