Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3809][SQL] Fixes test suites in hive-thriftserver #2675

Closed
wants to merge 6 commits into from

Conversation

liancheng
Copy link
Contributor

As @scwf pointed out, HiveThriftServer2Suite isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR #2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues.

In this PR, instead of watching start-thriftserver.sh output, the test code start a tail process to watch the log file. A Thread.sleep has to be introduced because the kill command used in stop-thriftserver.sh is not synchronous.

As for the root cause of the mysterious Jenkins build failure. Please refer to this comment below for details.


(Copied from PR description of #2214)

This PR fixes two issues of HiveThriftServer2Suite and brings 1 enhancement:

  1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
  2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
  3. When the forked server process ends prematurely (e.g., fails to start), the serverRunning promise is completed with a failure, preventing the test code to keep waiting until timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have started for PR 2675 at commit 5094bb4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2675 at commit 5094bb4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21328/Test FAILed.

@liancheng
Copy link
Contributor Author

The console output suggests that the CLI process and the Thrift server process were started and executed successfully but the timeout was too tight. Try relaxing the timeout.

@liancheng liancheng changed the title [3809][SQL] Fixes test suites in hive-thriftserver [SPARK-3809][SQL] Fixes test suites in hive-thriftserver Oct 6, 2014
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21332/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have started for PR 2675 at commit 962f5f8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2675 at commit 962f5f8.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf
Copy link
Contributor

scwf commented Oct 7, 2014

Hi @liancheng, I think i get the root cause here. In TestHive.scala(or another place we reset?) we reset the log4j level

      // HACK: Hive is too noisy by default.
      org.apache.log4j.LogManager.getCurrentLoggers.foreach { log =>
        log.asInstanceOf[org.apache.log4j.Logger].setLevel(org.apache.log4j.Level.WARN)
      }

So here the level is WARN and the process will not loginfo "ThriftBinaryCLIService listening on", that leads to time out exception and test failed. From the jenkins output it only print WARN logs

Maybe we should reset log4j level here to test this:)

@liancheng
Copy link
Contributor Author

But why it always passes locally? Or can you reproduce it locally? Anyway, I think this is really good catch, I'll try to verify. BIG THANKS! This issue have been bothering me a lot :)

@scwf
Copy link
Contributor

scwf commented Oct 7, 2014

Yes, i have reproduced locally. i have tested use this patch and if test suites of hive subproject run before that of thriftserver it will failed and the failed info is as same as the jenkins output

@liancheng
Copy link
Contributor Author

Awesome :)

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2675 at commit f279261.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2675 at commit f279261.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21446/Test FAILed.

@liancheng
Copy link
Contributor Author

@scwf Jenkins still fails... Actually I just realized that HiveThriftServer2Suite calls start-thriftserver.sh to start a separate server process, and the log level of the new process can't be controlled by the test code. On the other hand, I can reproduce this issue by putting a log4j.properties under conf/ and adding this line to surpress INFO logs:

log4j.rootCategory=WARN, console

@scwf
Copy link
Contributor

scwf commented Oct 8, 2014

I also realized that. But i think there is some unkown reason here, i moved

    // HACK: Hive is too noisy by default.
    org.apache.log4j.LogManager.getCurrentLoggers.foreach { log =>
      log.asInstanceOf[org.apache.log4j.Logger].setLevel(org.apache.log4j.Level.WARN)
    }

into method startThriftServerWithin, then
sbt/sbt -Phive,hadoop-2.3 "test-only org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite" failed.

So can you remove the hack in TestHive.scala and then retest this? I think the problem is still there.

@scwf
Copy link
Contributor

scwf commented Oct 8, 2014

Wow, very very strange!!! If i add a print in the hack, then HiveThriftServer2Suite success.

    // HACK: Hive is too noisy by default.
    org.apache.log4j.LogManager.getCurrentLoggers.foreach { log =>
      println(s"################## ${log}")
      log.asInstanceOf[org.apache.log4j.Logger].setLevel(org.apache.log4j.Level.WARN)
    }

@liancheng
Copy link
Contributor Author

OK, I removed the log level hack from TestHiveContext, let's see what happens.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2675 at commit a4219c4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2675 at commit a4219c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21463/Test FAILed.

@scwf
Copy link
Contributor

scwf commented Oct 8, 2014

Failed again, this time even warn log did not printed.

@liancheng
Copy link
Contributor Author

Had offline discussion with @scwf, the root cause has been found. Jenkins defines SPARK_TESTING=1 when executing the tests. This adds a bunch of test-classes directory into CLASSPATH (done by bin/compute-classpath.sh). Unfortunately, a log4j.properties under sql/hive/target/scala-2.10/test-classes defines the following line:

log4j.appender.CA.Threshold = WARN

which surpresses INFO level logs. HiveThriftServer2Suite checks a INFO level log line to assert whether the server process has been started successfully. That's why it always fails on Jenkins. This process matches my local tests perfectly, but doesn't explain why @scwf can sometimes reproduce this issue locally. The 1 minute timeout is long enough, I don't think machine performance matters here.

Big thanks to @scwf for investigating this issue and finally found out the cause!

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have started for PR 2675 at commit e79b2d7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2675 at commit e79b2d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21531/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have started for PR 2675 at commit e79e1cd.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@liancheng liancheng force-pushed the fix-thriftserver-tests branch from e79e1cd to 5f6b796 Compare October 10, 2014 00:21
@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have started for PR 2675 at commit 5f6b796.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2675 at commit 5f6b796.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21560/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have started for PR 2675 at commit 5f6b796.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2675 at commit 5f6b796.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor Author

Ah, finally it passes! This mysterious Jenkins failure had been bothering me for two months. @scwf Thanks again for debugging this!

@scwf
Copy link
Contributor

scwf commented Oct 10, 2014

Welcome :)

@liancheng liancheng force-pushed the fix-thriftserver-tests branch from 5f6b796 to 1c384b7 Compare October 11, 2014 05:25
@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2675 at commit 1c384b7.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor Author

@marmbrus This should be ready to go once Jenkins nods.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have finished for PR 2675 at commit 1c384b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21626/Test PASSed.

@marmbrus
Copy link
Contributor

Thanks guys! Merged to master.

@asfgit asfgit closed this in 9eb49d4 Oct 13, 2014
@liancheng liancheng deleted the fix-thriftserver-tests branch October 14, 2014 00:54
liancheng added a commit to liancheng/spark that referenced this pull request Nov 9, 2014
As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR apache#2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues.

In this PR, instead of watching `start-thriftserver.sh` output, the test code start a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.

As for the root cause of the mysterious Jenkins build failure. Please refer to [this comment](apache#2675 (comment)) below for details.

----

(Copied from PR description of apache#2214)

This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:

1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code to keep waiting until timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: wangfei <wangfei1@huawei.com>

Closes apache#2675 from liancheng/fix-thriftserver-tests and squashes the following commits:

1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
7805c33 [wangfei]  reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
d116405 [wangfei] make sure that log4j level is INFO
ee92a82 [Cheng Lian] Relaxes timeout
7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver

Conflicts:
	sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Nov 11, 2014
This PR backports apache#2843 to branch-1.1. The key difference is that this one doesn't support Hive 0.13.1 and thus always returns `0.12.0` when `spark.sql.hive.version` is queried.

6 other commits on which apache#2843 depends were also backported, they are:

- apache#2887 for `SessionState` lifecycle control
- apache#2675, apache#2823 & apache#3060 for major test suite refactoring and bug fixes
- apache#2164, for Parquet test suites updates
- apache#2493, for reading `spark.sql.*` configurations

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3113 from liancheng/get-info-for-1.1 and squashes the following commits:

d354161 [Cheng Lian] Provides Spark and Hive version in HiveThriftServer2 for branch-1.1
0c2a244 [Michael Armbrust] [SPARK-3646][SQL] Copy SQL configuration from SparkConf when a SQLContext is created.
3202a36 [Michael Armbrust] [SQL] Decrease partitions when testing
7f395b7 [Cheng Lian] [SQL] Fixes race condition in CliSuite
0dd28ec [Cheng Lian] [SQL] Fixes the race condition that may cause test failure
5928b39 [Cheng Lian] [SPARK-3809][SQL] Fixes test suites in hive-thriftserver
faeca62 [Cheng Lian] [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants