
[SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins #23117

Closed
wants to merge 34 commits into apache:master from HyukjinKwon:SPARK-7721

Conversation

@HyukjinKwon (Member) commented Nov 22, 2018

What changes were proposed in this pull request?

Background

As for the current status, the test script that generates coverage information was already merged into Spark in #20204.

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like the `run-tests` script in `./python`.
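As background on how the measurement can reach the many Python processes the tests launch: coverage.py supports a process-startup hook driven by the `COVERAGE_PROCESS_START` environment variable (the same variable that appears in the flaky-test skip further down this thread). A minimal, hedged sketch of that mechanism — not necessarily how `run-tests-with-coverage` wires it up:

```python
# sitecustomize.py, placed on the sys.path of each interpreter under test.
# When COVERAGE_PROCESS_START points at a coverage config file, coverage.py
# starts measuring in every newly launched Python process.
import coverage

coverage.process_startup()
```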

Proposed change

The next step is to host this coverage report on github.io, updated automatically
by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, @spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in spark-master-test-sbt-hadoop-2.7 (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).

In that specific job, it will clone the coverage site repository and replace its contents with the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone the PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy the newly generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Move into the site checkout before rewriting its branch.
cd pyspark-coverage-site

# Check out a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit the current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to gh-pages.
git branch -m gh-pages

# Finally, force-push to our repository.
git push -f origin gh-pages
```

So, one single up-to-date coverage report can be shown on the github.io page. The commands above were manually tested.
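As a hedged illustration only: the force-push above has to authenticate as @spark-test inside Jenkins, presumably via the hidden `SPARK_TEST_KEY` variable from the TODOs below. The script shape and names here are illustrative, not the actual job configuration:

```python
# Sketch of the authenticated force-push step (illustrative, not the real job).
import os
import subprocess
import sys

key = os.environ.get("SPARK_TEST_KEY")
if not key:
    # Fail loudly rather than attempting to push without credentials.
    sys.exit("[error] 'SPARK_TEST_KEY' environment variable was not set. "
             "Unable to post PySpark coverage results.")

# Embed the credential in the push URL; Jenkins masks it in console output.
url = "https://spark-test:%s@github.com/spark-test/pyspark-coverage-site.git" % key
subprocess.check_call(["git", "push", "-f", url, "gh-pages"])
```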

TODOs

  • Write a draft - @HyukjinKwon
  • pip install coverage to all python implementations (pypy, python2, python3) in Jenkins workers - @shaneknapp
  • Set hidden SPARK_TEST_KEY for @spark-test's password in Jenkins via Jenkins's password feature.
    This should be set in both the PR builder and spark-master-test-sbt-hadoop-2.7 so that other PRs can later test and fix the bugs - @shaneknapp
  • Set an environment variable that indicates spark-master-test-sbt-hadoop-2.7 so that that specific build can report and update the coverage site - @shaneknapp
  • Make the PR builder's tests pass - @HyukjinKwon
  • Fix the flaky test related to coverage - @HyukjinKwon
    • 6 consecutive passes out of 7 runs

This PR will be co-authored by me and @shaneknapp.

How was this patch tested?

It will be tested via Jenkins.

@HyukjinKwon (Member Author) commented Nov 22, 2018

@shaneknapp
@shaneknapp (Contributor)

i'll try and take a look at this over the next couple of days, but it's a holiday weekend and i may not be able to get to this until monday.

@HyukjinKwon (Member Author)

It's not urgent :) so it's okay. Actually I'm on vacation for a week as well. Thanks for taking a look, @shaneknapp!!

@HyukjinKwon (Member Author)

Hey @shaneknapp, have you found some time to take a look at this?

@shaneknapp (Contributor)

not yet, but i will carve out some time today and wednesday to look closer.

@squito (Contributor) left a comment

it will be great to have a test coverage report!

does this add time to running the tests?

@HyukjinKwon (Member Author)

> does this add time to running the tests?

Given my local tests, the time diff looked like a slight increase. I want to see how it works in Jenkins.

@HyukjinKwon (Member Author)

Hey @shaneknapp, mind taking a look when you're available?

@shaneknapp (Contributor) commented Dec 20, 2018 via email

@HyukjinKwon (Member Author)

gentle ping @shaneknapp :D.

@HyukjinKwon (Member Author)

gentle ping .. @shaneknapp

@shaneknapp (Contributor) commented Jan 5, 2019 via email

@HyukjinKwon (Member Author)

@shaneknapp, I updated the PR description to make the action items clear. Please let me know if there's any question when you start working on it. Thank you so much!

@shaneknapp (Contributor)

will coverage version 4.5.2 be sufficient (same across pypy/py2.7/py3.4)?

@shaneknapp (Contributor)

alright, coverage==4.5.2 is installed on all workers, across all python/pypy envs.
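(A trivial, hedged way to sanity-check each env, if useful:)

```python
# Run under each of pypy, python2.7, and python3.4 on every worker.
import coverage

assert coverage.__version__ == "4.5.2", coverage.__version__
print("coverage %s OK" % coverage.__version__)
```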

@shaneknapp (Contributor)

i'm currently looking into how to get the python coverage tests to deploy solely to the spark-master-test-sbt-hadoop-2.7 build via the jenkins job configs... setting this up w/the PRB will be much easier.

@HyukjinKwon (Member Author)

Yea coverage versions look good! Thanks.

Ah, @shaneknapp, I think I can handle running the python coverage tests within dev/run-tests.py if there's an environment variable set in the spark-master-test-sbt-hadoop-2.7 build (for instance, let's say SPARK_MASTER_SBT_HADOOP_2_7).

I can check that environment variable (SPARK_MASTER_SBT_HADOOP_2_7) within run-tests.py and run the python coverage tests accordingly, as sketched below.
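A minimal sketch of that check (assuming the job exports SPARK_MASTER_SBT_HADOOP_2_7=1; the surrounding run-tests.py plumbing is elided):

```python
import os

# Only the spark-master-test-sbt-hadoop-2.7 job would export this variable,
# so every other build keeps running the plain test script.
if os.environ.get("SPARK_MASTER_SBT_HADOOP_2_7") == "1":
    python_test_script = "./python/run-tests-with-coverage"
else:
    python_test_script = "./python/run-tests"
```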

@shaneknapp (Contributor)

just a quick FYI, i'll get back to my part of this early next week. i'm out of the office at our lab's retreat.

@HyukjinKwon (Member Author)

Thanks, @shaneknapp. (I'm just checking the installation of coverage FWIW)

@HyukjinKwon (Member Author)

I filed SPARK-26646 for the flaky test.


import os
import platform
import unittest

from pyspark import SparkConf, SparkContext, RDD
from pyspark.streaming import StreamingContext
from pyspark.testing.streamingutils import PySparkStreamingTestCase


# Applied to the DStream test classes in this module.
@unittest.skipIf(
    "pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
    "PyPy implementation causes to hang DStream tests forever when Coverage report is used.")
@HyukjinKwon (Member Author) commented on the diff

Hm, I am not sure why, but those tests hang forever when coverage is used.

@HyukjinKwon (Member Author)

FWIW, @shaneknapp, I added an empty commit authored by you at 82732eaded312b0cae6ec4876d0b5791dd4faa54, so this PR will be committed with you as co-author (for instance, like 51bee7a).

@apache deleted 4 comments from SparkQA Jan 21, 2019

@HyukjinKwon (Member Author)

retest this please

@HyukjinKwon (Member Author)

IIRC, @rxin, @JoshRosen, and @shaneknapp agreed on this approach in general a long time ago when we discussed it over email. Let me get this in within a few days if there are no notable comments, and start to monitor spark-master-test-sbt-hadoop-2.7.

@SparkQA commented Jan 30, 2019

Test build #101884 has finished for PR 23117 at commit a1c0601.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 31, 2019

Test build #101927 has finished for PR 23117 at commit 426ef11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

I'm going to merge this since it's not going to affect Spark itself or the PR builders. I'm going to monitor the SBT master job and see if it works.

Merged to master.

@asfgit closed this in cdd694c Feb 1, 2019
@HyukjinKwon (Member Author) commented Feb 1, 2019

Argh, it looks like it causes an error when it pushes the HTMLs into https://spark-test.github.io/pyspark-coverage-site/, apparently due to an old git version on spark-master-test-sbt-hadoop-2.7:

  error: The requested URL returned error: 403 Forbidden while accessing https://spark-test:****@github.com/spark-test/pyspark-coverage-site.git/info/refs

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5465/console

  Please upgrade your git client.
  GitHub.com no longer supports git over dumb-http: https://github.com/blog/809-git-dumb-http-transport-to-be-turned-off-in-90-days
  https://github.com/spark-test/pyspark-coverage-site.git/info/refs

I am going to work with @shaneknapp in a private channel to speed up the fix.

@shaneknapp (Contributor)

actually, no, it shouldn't be the git version. that build is using git 2.7.2, and according to the blog post we only need 1.6.6.

@HyukjinKwon (Member Author)

Thanks, @shaneknapp. At least I checked that the build is now passing: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5468/

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins

(The commit message mirrors the PR description above.)
Closes apache#23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon deleted the SPARK-7721 branch March 3, 2020 01:20
@dongjoon-hyun (Member) commented Nov 30, 2020

Hi, @HyukjinKwon and @shaneknapp.
It seems that AMPLab Jenkins lost SPARK_TEST_KEY during the recent transition and has been failing for the last 5 days:

  Generating HTML files for PySpark coverage under /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python/test_coverage/htmlcov
  /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3
  [error] 'SPARK_TEST_KEY' environment variable was not set. Unable to post PySpark coverage results.

@dongjoon-hyun (Member)

In addition to that, it breaks branch-3.0 too, because its Jenkins configuration has SPARK_MASTER_SBT_HADOOP_2_7=1.

@HyukjinKwon (Member Author)

Thank you, @dongjoon-hyun. Let me set the key accordingly.

@dongjoon-hyun (Member)

Thank you~

@HyukjinKwon (Member Author) commented Nov 30, 2020

@shaneknapp,

  • I removed SPARK_MASTER_SBT_HADOOP_2_7 in
    • spark-branch-3.0-test-sbt-hadoop-2.7-hive-2.3
    • spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2
    • spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3
  • and added SPARK_TEST_KEY as a password in spark-master-test-sbt-hadoop-2.7-hive-2.3:
    [Screenshot: SPARK_TEST_KEY configured as a hidden password in the Jenkins job, Nov 30, 2020]

@dongjoon-hyun (Member)

@HyukjinKwon (Member Author)

Sure! I will check the other jobs too.

@dongjoon-hyun (Member)

Thank you so much!
