[SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins #23117

Closed · wants to merge 34 commits

Commits (34)
d503128
Run and generate test coverage report from Python via Jenkins
HyukjinKwon Nov 22, 2018
135e7ae
Add a logic to remove existing reports
HyukjinKwon Nov 22, 2018
08ab742
Fold the functions
HyukjinKwon Nov 27, 2018
125019d
Debug #1
HyukjinKwon Jan 20, 2019
e1a3f9d
Debug 2
HyukjinKwon Jan 21, 2019
3eb7611
Debug 3
HyukjinKwon Jan 21, 2019
88954cd
Revert "Debug #1"
HyukjinKwon Jan 21, 2019
37fcd3c
Revert "Debug 2"
HyukjinKwon Jan 21, 2019
dafe7de
Revert "Debug 3"
HyukjinKwon Jan 21, 2019
643246b
Disable DStream tests when PyPy is used with coverage
HyukjinKwon Jan 21, 2019
8133a08
Update run-tests.py
HyukjinKwon Jan 21, 2019
6d46f8c
Make the tests less flaky
HyukjinKwon Jan 29, 2019
ad006e7
Skip scala tests for now (debug)
HyukjinKwon Jan 29, 2019
92d74f0
Use SPARK_MATER_SBT_HADOOP_2_7
HyukjinKwon Jan 29, 2019
ac6efef
debug 2
HyukjinKwon Jan 29, 2019
f34fd8d
debug 3
HyukjinKwon Jan 29, 2019
9ea948d
debug 4
HyukjinKwon Jan 29, 2019
8afab85
Revert "debug 4"
HyukjinKwon Jan 29, 2019
9b4a5b8
Revert "debug 3"
HyukjinKwon Jan 29, 2019
23398d3
Revert "debug 2"
HyukjinKwon Jan 29, 2019
f3c7b71
Avoid shell interpretation
HyukjinKwon Jan 29, 2019
a88c16f
newlines pretty
HyukjinKwon Jan 29, 2019
b7d3cef
Pretty comment
HyukjinKwon Jan 29, 2019
c09ddb8
Avoid shell interpreting
HyukjinKwon Jan 29, 2019
0a66669
D'oh!
HyukjinKwon Jan 29, 2019
c2412f6
Work around by `symbolic-ref`
HyukjinKwon Jan 29, 2019
c660dcf
Fix some comments accordingly
HyukjinKwon Jan 29, 2019
5132334
Fix comments and save Py4J access
HyukjinKwon Jan 30, 2019
0a1216a
Remove workarounds to speed up tests
HyukjinKwon Jan 30, 2019
efb0299
typo
HyukjinKwon Jan 30, 2019
d4e30f4
Add a badge for PySpark coverage
HyukjinKwon Jan 30, 2019
f538410
Match it
HyukjinKwon Jan 30, 2019
a1c0601
Adding Shanke as co-author
shaneknapp Nov 20, 2018
426ef11
Fix a typo
HyukjinKwon Jan 31, 2019
1 change: 1 addition & 0 deletions README.md
@@ -2,6 +2,7 @@

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
HyukjinKwon (Member, Author) commented:

[screenshot: "screen shot 2019-01-30 at 2 01 20 pm"]

It looks like this, and the badge links to the coverage site https://spark-test.github.io/pyspark-coverage-site/.

HyukjinKwon (Member, Author) commented:

I should keep this link in README.md either way; I think a badge with a link is more effective.
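For reference, this is a shields.io "dynamic/xml" badge: shields.io fetches the page given in `url`, evaluates the XPath in `query` against it, and renders the result next to `label`. A quick, illustrative decode of the badge URL above using only the standard library:

```python
# Decode the badge's query parameters (illustrative only).
from urllib.parse import parse_qs, urlparse

badge = ("https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage"
         "&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site"
         "&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan"
         "&colorB=brightgreen&style=plastic")
for key, (value,) in sorted(parse_qs(urlparse(badge).query).items()):
    print("%s = %s" % (key, value))
# `query` decodes to the XPath /html/body/div[1]/div/h1/span, which appears to
# target the total coverage percentage shown at the top of the HTML report.
```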


Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
63 changes: 60 additions & 3 deletions dev/run-tests.py
@@ -25,6 +25,8 @@
import re
import sys
import subprocess
import glob
import shutil
from collections import namedtuple

from sparktestsupport import SPARK_HOME, USER_HOME, ERROR_CODES
@@ -400,15 +402,66 @@ def run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags):
        run_scala_tests_sbt(test_modules, test_profiles)


def run_python_tests(test_modules, parallelism, with_coverage=False):
    set_title_and_block("Running PySpark tests", "BLOCK_PYSPARK_UNIT_TESTS")

    if with_coverage:
        # Coverage makes the PySpark tests flaky due to heavy parallelism.
        # As a workaround, cap the parallelism at 4 when running with coverage.
        parallelism = 4
        script = "run-tests-with-coverage"
    else:
        script = "run-tests"
    command = [os.path.join(SPARK_HOME, "python", script)]
    if test_modules != [modules.root]:
        command.append("--modules=%s" % ','.join(m.name for m in test_modules))
    command.append("--parallelism=%i" % parallelism)
    run_cmd(command)

    if with_coverage:
        post_python_tests_results()


def post_python_tests_results():
    if "SPARK_TEST_KEY" not in os.environ:
        print("[error] 'SPARK_TEST_KEY' environment variable was not set. Unable to post "
              "PySpark coverage results.")
        sys.exit(1)
    spark_test_key = os.environ.get("SPARK_TEST_KEY")
    # The steps below upload the HTMLs to 'github.com/spark-test/pyspark-coverage-site'.
    # 1. Clone the PySpark coverage site.
    run_cmd([
        "git",
        "clone",
        "https://spark-test:%s@github.com/spark-test/pyspark-coverage-site.git" % spark_test_key])
    # 2. Remove the existing HTMLs.
    run_cmd(["rm", "-fr"] + glob.glob("pyspark-coverage-site/*"))
    # 3. Copy the generated coverage HTMLs.
    for f in glob.glob("%s/python/test_coverage/htmlcov/*" % SPARK_HOME):
        shutil.copy(f, "pyspark-coverage-site/")
    os.chdir("pyspark-coverage-site")
    try:
        # 4. Check out a temporary branch.
        run_cmd(["git", "symbolic-ref", "HEAD", "refs/heads/latest_branch"])
        # 5. Add all the files.
        run_cmd(["git", "add", "-A"])
        # 6. Commit the current HTMLs.
        run_cmd([
            "git",
            "commit",
            "-am",
            "Coverage report at latest commit in Apache Spark",
            '--author="Apache Spark Test Account <sparktestacc@gmail.com>"'])
        # 7. Delete the old branch.
        run_cmd(["git", "branch", "-D", "gh-pages"])
        # 8. Rename the temporary branch to gh-pages.
        run_cmd(["git", "branch", "-m", "gh-pages"])
        # 9. Finally, force-push to the repository.
        run_cmd(["git", "push", "-f", "origin", "gh-pages"])
    finally:
        os.chdir("..")


def run_python_packaging_tests():
    set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS")
@@ -567,7 +620,11 @@ def main():

    modules_with_python_tests = [m for m in test_modules if m.python_test_goals]
    if modules_with_python_tests:
        # We only run PySpark tests with a coverage report in one specific Jenkins
        # job: the Spark master build with SBT (Hadoop 2.7).
        is_sbt_master_job = "SPARK_MASTER_SBT_HADOOP_2_7" in os.environ
        run_python_tests(
            modules_with_python_tests, opts.parallelism, with_coverage=is_sbt_master_job)
        run_python_packaging_tests()
    if any(m.should_run_r_tests for m in test_modules):
        run_sparkr_tests()
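A note on step 4 above: `git symbolic-ref HEAD refs/heads/latest_branch` points HEAD at a branch that does not exist yet, so the next commit is created without a parent. Replacing `gh-pages` with that single-commit branch and force-pushing keeps the published site's history from growing on every Jenkins run. A minimal standalone sketch of the same pattern (the function and argument names here are illustrative, not part of this PR):

```python
import subprocess

def publish_as_single_commit(site_dir, branch="gh-pages", temp="latest_branch"):
    """Replace `branch` with one fresh, parentless commit of `site_dir`."""
    def git(*args):
        subprocess.check_call(("git",) + args, cwd=site_dir)

    git("symbolic-ref", "HEAD", "refs/heads/%s" % temp)  # unborn branch: next commit has no parent
    git("add", "-A")
    git("commit", "-m", "Coverage report at latest commit")
    git("branch", "-D", branch)   # drop the previous single-commit branch
    git("branch", "-m", branch)   # rename the temporary branch over it
    git("push", "-f", "origin", branch)
```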
10 changes: 10 additions & 0 deletions python/pyspark/streaming/tests/test_dstream.py
@@ -22,12 +22,16 @@
import unittest
from functools import reduce
from itertools import chain
import platform

from pyspark import SparkConf, SparkContext, RDD
from pyspark.streaming import StreamingContext
from pyspark.testing.streamingutils import PySparkStreamingTestCase


@unittest.skipIf(
    "pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
    "The PyPy implementation causes DStream tests to hang forever when the coverage report is used.")
HyukjinKwon (Member, Author) commented:

Hm, I am not sure why, but these tests hang forever when coverage is used.

class BasicOperationTests(PySparkStreamingTestCase):

    def test_map(self):
@@ -389,6 +393,9 @@ def failed_func(i):
        self.fail("a failed func should throw an error")


@unittest.skipIf(
    "pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
    "The PyPy implementation causes DStream tests to hang forever when the coverage report is used.")
class WindowFunctionTests(PySparkStreamingTestCase):

    timeout = 15
@@ -466,6 +473,9 @@ def func(dstream):
        self._test_func(input, func, expected)


@unittest.skipIf(
    "pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
    "The PyPy implementation causes DStream tests to hang forever when the coverage report is used.")
class CheckpointTests(unittest.TestCase):

    setupCalled = False
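Background on the `COVERAGE_PROCESS_START` check in the decorators above: it is coverage.py's standard hook for measuring subprocesses. When the variable points at a coverage config file and `coverage.process_startup()` runs at interpreter startup (typically from a `sitecustomize.py` on `PYTHONPATH`), every spawned Python worker is measured as well, which is presumably what `run-tests-with-coverage` sets up. A minimal sketch of that wiring (the file location is illustrative):

```python
# sitecustomize.py -- place on PYTHONPATH so every interpreter imports it.
import coverage

# No-op unless COVERAGE_PROCESS_START is set; otherwise starts measuring
# this process using the config file the variable points to.
coverage.process_startup()
```

An invocation would then look like `COVERAGE_PROCESS_START=/path/to/.coveragerc python -m unittest ...`, with each subprocess writing its own data file for a later `coverage combine`.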