[SPARK-3809][SQL] Fixes test suites in hive-thriftserver #2675
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Test FAILed.
The console output suggests that the CLI process and the Thrift server process were started and executed successfully but the timeout was too tight. Try relaxing the timeout.
Test FAILed.
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Hi @liancheng, I think I've found the root cause here. In TestHive.scala (or another place we reset?) we reset the log4j level
So here the level is WARN and the process will not log the INFO-level message "ThriftBinaryCLIService listening on", which leads to the timeout exception and the test failure. The Jenkins output only prints WARN logs. Maybe we should reset the log4j level here to test this :)
But why does it always pass locally? Or can you reproduce it locally? Anyway, I think this is a really good catch, and I'll try to verify it. BIG THANKS! This issue has been bothering me a lot :)
Yes, I have reproduced it locally. I tested with this patch: if the test suites of the hive subproject run before those of thriftserver, it fails, and the failure info is the same as the Jenkins output.
Awesome :)
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Test FAILed.
@scwf Jenkins still fails... Actually I just realized that
I also realized that. But I think there is some unknown reason here. I moved
into method So can you remove the hack in TestHive.scala and then retest this? I think the problem is still there.
Wow, very very strange!!! If I add a print in the hack, then HiveThriftServer2Suite succeeds.
OK, I removed the log level hack from
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Test FAILed.
Failed again; this time even the WARN logs were not printed.
Had an offline discussion with @scwf, and the root cause has been found. Jenkins defines
which suppresses INFO level logs. Big thanks to @scwf for investigating this issue and finally finding the cause!
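The failure mode described above can be illustrated with a small sketch. This is an analogy only: it uses Python's standard `logging` module rather than Log4J, and the logger names and the address in the message are hypothetical. It shows that a readiness check scanning for an INFO-level line never succeeds when the effective threshold is WARN, even though the server itself is fine.

```python
import io
import logging

# Marker text quoted in the thread; everything else here is illustrative.
READY_MARKER = "ThriftBinaryCLIService listening on"


def captured_output(level):
    """Emit the readiness line at INFO through a logger with the given threshold."""
    buffer = io.StringIO()
    logger = logging.getLogger("demo." + logging.getLevelName(level))
    logger.handlers.clear()
    logger.propagate = False  # keep the demo output out of the root logger
    logger.setLevel(level)
    logger.addHandler(logging.StreamHandler(buffer))
    logger.info("%s 0.0.0.0:10000", READY_MARKER)  # hypothetical address
    logger.warning("some unrelated warning")
    return buffer.getvalue()


def server_looks_ready(output):
    """The kind of check a test would run against captured server output."""
    return READY_MARKER in output


info_output = captured_output(logging.INFO)
warn_output = captured_output(logging.WARNING)

print(server_looks_ready(info_output))  # marker present at INFO threshold
print(server_looks_ready(warn_output))  # marker filtered out at WARN threshold
```

With the WARN threshold, only the warning reaches the output, so a test that waits for the marker keeps waiting until its timeout expires.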
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Test FAILed.
QA tests have started for PR 2675 at commit
Can one of the admins verify this patch?
e79e1cd
to
5f6b796
Compare
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Test FAILed.
QA tests have started for PR 2675 at commit
QA tests have finished for PR 2675 at commit
Ah, finally it passes! This mysterious Jenkins failure had been bothering me for two months. @scwf Thanks again for debugging this!
Welcome :)
5f6b796
to
1c384b7
Compare
QA tests have started for PR 2675 at commit
@marmbrus This should be ready to go once Jenkins nods.
QA tests have finished for PR 2675 at commit
Test PASSed.
Thanks guys! Merged to master.
As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known to be flaky; PR apache#2214 tried to fix them but failed because of an unknown Jenkins build error. This PR fixes both sets of issues.

In this PR, instead of watching `start-thriftserver.sh` output, the test code starts a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.

As for the root cause of the mysterious Jenkins build failure, please refer to [this comment](apache#2675 (comment)) below for details.

----

(Copied from PR description of apache#2214)

This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:

1. Although the metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test cases is doomed to fail.
2. We caught any exceptions thrown from a test case and printed diagnostic information, but forgot to re-throw the exception...
3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code from waiting until the timeout.

So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: wangfei <wangfei1@huawei.com>

Closes apache#2675 from liancheng/fix-thriftserver-tests and squashes the following commits:

1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
7805c33 [wangfei] reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
d116405 [wangfei] make sure that log4j level is INFO
ee92a82 [Cheng Lian] Relaxes timeout
7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver

Conflicts:
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
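
The log-watching approach described in this commit message can be sketched roughly as follows. This is not the Scala test code from the PR: it is a Python analogy that polls the log file directly instead of spawning a `tail` process, and the names `wait_for_marker`, `ServerExited`, `is_alive`, and the file names are all hypothetical. It shows the two behaviours the description mentions: succeed once the readiness line appears, and fail early (rather than wait for the timeout) when the server process has already ended.

```python
import os
import tempfile
import time

MARKER = "ThriftBinaryCLIService listening on"  # readiness line quoted in the thread


class ServerExited(Exception):
    """Raised when the server process ends before the marker shows up."""


def wait_for_marker(log_path, marker, timeout_s, is_alive, poll_s=0.05):
    """Poll a log file until `marker` appears, the process dies, or time runs out."""
    deadline = time.monotonic() + timeout_s
    offset = 0
    seen = ""
    while time.monotonic() < deadline:
        try:
            with open(log_path, "rb") as f:
                f.seek(offset)
                data = f.read()  # only the bytes appended since the last poll
            offset += len(data)
            seen += data.decode("utf-8", errors="replace")
        except FileNotFoundError:
            pass  # the server has not created its log file yet
        if marker in seen:
            return True
        if not is_alive():
            raise ServerExited("server process ended before becoming ready")
        time.sleep(poll_s)
    raise TimeoutError("marker not found within %.1f s" % timeout_s)


with tempfile.TemporaryDirectory() as tmp:
    # Case 1: the log already contains the readiness line.
    ready_log = os.path.join(tmp, "thriftserver.log")
    with open(ready_log, "w") as f:
        f.write("INFO " + MARKER + " 0.0.0.0:10000\n")  # hypothetical address
    ready = wait_for_marker(ready_log, MARKER, 1.0, lambda: True)

    # Case 2: empty log and a dead process, so the wait fails immediately.
    empty_log = os.path.join(tmp, "empty.log")
    open(empty_log, "w").close()
    try:
        wait_for_marker(empty_log, MARKER, 1.0, lambda: False)
        early_failure = False
    except ServerExited:
        early_failure = True

    # Case 3: empty log and a live process, so the wait runs into the timeout.
    try:
        wait_for_marker(empty_log, MARKER, 0.2, lambda: True)
        timed_out = False
    except TimeoutError:
        timed_out = True
```

Checking the marker before the liveness probe is a deliberate choice in this sketch: a start script that daemonizes the server and then exits should still count as a successful start if the readiness line made it into the log.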
This PR backports apache#2843 to branch-1.1. The key difference is that this one doesn't support Hive 0.13.1 and thus always returns `0.12.0` when `spark.sql.hive.version` is queried.

6 other commits on which apache#2843 depends were also backported, they are:

- apache#2887 for `SessionState` lifecycle control
- apache#2675, apache#2823 & apache#3060 for major test suite refactoring and bug fixes
- apache#2164, for Parquet test suites updates
- apache#2493, for reading `spark.sql.*` configurations

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3113 from liancheng/get-info-for-1.1 and squashes the following commits:

d354161 [Cheng Lian] Provides Spark and Hive version in HiveThriftServer2 for branch-1.1
0c2a244 [Michael Armbrust] [SPARK-3646][SQL] Copy SQL configuration from SparkConf when a SQLContext is created.
3202a36 [Michael Armbrust] [SQL] Decrease partitions when testing
7f395b7 [Cheng Lian] [SQL] Fixes race condition in CliSuite
0dd28ec [Cheng Lian] [SQL] Fixes the race condition that may cause test failure
5928b39 [Cheng Lian] [SPARK-3809][SQL] Fixes test suites in hive-thriftserver
faeca62 [Cheng Lian] [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2