
[SPARK-7758][SQL]Override more configs to avoid failure when connect to a postgre sql #6314

Closed
wants to merge 5 commits

Conversation

WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-7758

When initializing executionHive, we only mask
javax.jdo.option.ConnectionURL to override the metastore location. However,
other properties that relate to the actual Hive metastore data source are not
masked. For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, executionHive actually tries to use settings read from
hive-site.xml, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of Hive's ObjectStore.getDataSourceProps()
method, all properties whose names mention "jdo" or "datanucleus" must be
included.

Tested using PostgreSQL as the metastore; it worked fine.
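The masking described above can be sketched as a small pure function. This is a hypothetical sketch only: `maskMetastoreProps` and its parameters are illustrative names, not Spark's actual API.

```scala
// Hedged sketch of the fix described above: reset every property whose name
// mentions "jdo" or "datanucleus" (the condition used by Hive's
// ObjectStore.getDataSourceProps()) to Hive's default value, then force the
// connection URL to a throwaway local Derby database.
object MaskMetastoreProps {
  def maskMetastoreProps(
      siteProps: Map[String, String],    // properties loaded from hive-site.xml
      hiveDefaults: Map[String, String], // Hive's built-in default values
      localMetastore: String): Map[String, String] = {
    def isDataSourceProp(name: String): Boolean =
      name.contains("jdo") || name.contains("datanucleus")
    val masked = siteProps.map {
      // PostgreSQL-specific data source settings are replaced by defaults...
      case (k, _) if isDataSourceProp(k) => k -> hiveDefaults.getOrElse(k, "")
      // ...while unrelated settings pass through untouched.
      case kv => kv
    }
    masked + ("javax.jdo.option.ConnectionURL" ->
      s"jdbc:derby:;databaseName=$localMetastore;create=true")
  }
}
```

Masking with defaults (rather than simply deleting the keys) keeps the config map complete, so the temporary client never falls back to reading hive-site.xml for a missing entry.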

@SparkQA

SparkQA commented May 21, 2015

Test build #33228 has finished for PR 6314 at commit 21bf202.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented May 21, 2015

Test build #33241 has finished for PR 6314 at commit 21bf202.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

From the failed test cases, it looks like this isn't a good fix. I'm out of ideas, since I'm not a Spark SQL expert. Could anyone offer some help?

@marmbrus
Contributor

Yeah, we definitely need the execution Hive for Hive to function, but what we likely want to do is filter out any configuration telling it to use PostgreSQL (it should always use a dummy local Derby instance).

logInfo(s"Initializing execution hive, version $hiveExecutionVersion")
new ClientWrapper(
  version = IsolatedClientLoader.hiveVersion(hiveExecutionVersion),
  config = newTemporaryConfiguration())
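The filtering suggested here could look roughly like the following. This is an illustrative sketch only; `dropDataSourceProps` is a hypothetical helper, not code from this PR.

```scala
// Illustrative sketch: strip the JDO/DataNucleus entries read from
// hive-site.xml so the temporary execution client never sees the PostgreSQL
// settings and stays on its local Derby defaults.
object DropDataSourceProps {
  def dropDataSourceProps(conf: Map[String, String]): Map[String, String] =
    conf.filterNot { case (name, _) =>
      name.contains("jdo") || name.contains("datanucleus")
    }
}
```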

I would try overriding more settings here.


@marmbrus, I don't understand this part. Can you explain why we need executionHive? Does Hive also do this (use a local dummy metastore for temporary functions)?


@scwf This is a Spark SQL specific design, which enables us to connect to multiple versions of the Hive metastore with a single code base. The executionHive is used for internal jobs within the Spark SQL framework, while the metadataHive is used to talk to the external Hive metastore.

@SparkQA

SparkQA commented May 22, 2015

Test build #33318 has finished for PR 6314 at commit e3e683d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33321 has finished for PR 6314 at commit 92a81fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33337 has finished for PR 6314 at commit 86caf2c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33336 has finished for PR 6314 at commit e4f0feb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

@marmbrus I overrode more configs with their default values in HiveConf (refer to https://github.com/pwendell/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288), plus datanucleus.rdbms.datastoreAdapterClassName, which is not included in HiveConf.

Tested using PostgreSQL as the metastore; it worked fine.

@WangTaoTheTonic WangTaoTheTonic changed the title [SPARK-7758][SQL]delete the executionHive to avoid failure when connect to a postgre sql [SPARK-7758][SQL]Override more configs to avoid failure when connect to a postgre sql May 22, 2015
@yhuai
Contributor

yhuai commented May 22, 2015

@WangTaoTheTonic It seems newTemporaryConfiguration is used by executionHive. Can you explain why your change fixes the problem with metadataHive?

@liancheng
Contributor

@yhuai We headed in the wrong direction at first. It's not that metadataHive can't find proper PostgreSQL configurations. The failure occurs because executionHive only overrides the metastore location without overriding the other JDO and DataNucleus properties, so those properties are read from hive-site.xml and point to PostgreSQL. This makes executionHive try to connect to the temporary Derby metastore with PostgreSQL settings, which causes the error. What @WangTaoTheTonic did is override all related properties with Hive default values (gathered from ConfVars). That's why updating the executionHive-related code path corrects the behavior.

"javax.jdo.option.ConnectionURL" -> s"jdbc:derby:;databaseName=$localMetastore;create=true")
val propMap: HashMap[String, String] = HashMap()
HiveConf.ConfVars.values().foreach { confvar =>
  if (confvar.varname.contains("datanucleus") || confvar.varname.contains("jdo")) {

Could you please add a comment explaining where this if condition comes from, for future reference? Thanks.

@liancheng
Contributor

@WangTaoTheTonic Would you mind updating the PR description to the following? Thanks!

When initializing `executionHive`, we only mask
`javax.jdo.option.ConnectionURL` to override the metastore location.  However,
other properties that relate to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()`
method][1], all properties whose names mention "jdo" or "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

@liancheng
Contributor

Forgot to say: LGTM, and thanks for fixing this!

@WangTaoTheTonic
Contributor Author

@yhuai I think @liancheng has explained everything clearly. Thanks for your comments and explanation, @liancheng.

@yhuai
Contributor

yhuai commented May 22, 2015

@WangTaoTheTonic Can you update your PR description per Cheng's suggestion?

@SparkQA

SparkQA commented May 22, 2015

Test build #33345 has finished for PR 6314 at commit ca7ae7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #851 has finished for PR 6314 at commit ca7ae7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 22, 2015
…t to a postgre sql

https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only mask
`javax.jdo.option.ConnectionURL` to override the metastore location.  However,
other properties that relate to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()`
method][1], all properties whose names mention "jdo" or "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Tested using PostgreSQL as the metastore; it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql

(cherry picked from commit 31d5d46)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 31d5d46 May 22, 2015
@marmbrus
Contributor

Thanks for fixing this! Merged to master and 1.4.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015