
[SPARK-7758][SQL]Override more configs to avoid failure when connect to a postgre sql #6314

Closed
wants to merge 5 commits

Conversation

WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-7758

When initializing executionHive, we only mask
javax.jdo.option.ConnectionURL to override the metastore location. However,
other properties that relate to the actual Hive metastore data source are not
masked. For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, executionHive actually tries to use settings read from
hive-site.xml, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of Hive's ObjectStore.getDataSourceProps()
method, all properties whose names mention "jdo" or "datanucleus" must be
included.

Tested using PostgreSQL as the metastore; it worked fine.
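The masking described above can be sketched as a small pure function. This is a hypothetical sketch only: `maskMetastoreProps` and its parameters are illustrative names, not Spark's actual API.

```scala
// Hedged sketch of the fix described above: reset every property whose name
// mentions "jdo" or "datanucleus" (the condition used by Hive's
// ObjectStore.getDataSourceProps()) to Hive's default value, then force the
// connection URL to a throwaway local Derby database.
object MaskMetastoreProps {
  def maskMetastoreProps(
      siteProps: Map[String, String],    // properties loaded from hive-site.xml
      hiveDefaults: Map[String, String], // Hive's built-in default values
      localMetastore: String): Map[String, String] = {
    def isDataSourceProp(name: String): Boolean =
      name.contains("jdo") || name.contains("datanucleus")
    val masked = siteProps.map {
      // PostgreSQL-specific data source settings are replaced by defaults...
      case (k, _) if isDataSourceProp(k) => k -> hiveDefaults.getOrElse(k, "")
      // ...while unrelated settings pass through untouched.
      case kv => kv
    }
    masked + ("javax.jdo.option.ConnectionURL" ->
      s"jdbc:derby:;databaseName=$localMetastore;create=true")
  }
}
```

Masking with defaults (rather than simply deleting the keys) keeps the config map complete, so the temporary client never falls back to reading hive-site.xml for a missing entry.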

@SparkQA

SparkQA commented May 21, 2015

Test build #33228 has finished for PR 6314 at commit 21bf202.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented May 21, 2015

Test build #33241 has finished for PR 6314 at commit 21bf202.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

From the failed test cases, it looks like this isn't a good fix. I'm out of ideas, since I'm not a Spark SQL expert. Could anyone offer some help?

@marmbrus
Contributor

Yeah, we definitely need the execution Hive for Hive to function, but what we likely want to do is filter out any configuration telling it to use PostgreSQL (it should always use a dummy local Derby instance).

logInfo(s"Initializing execution hive, version $hiveExecutionVersion")
new ClientWrapper(
  version = IsolatedClientLoader.hiveVersion(hiveExecutionVersion),
  config = newTemporaryConfiguration())
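The filtering suggested here could look roughly like the following. This is an illustrative sketch only; `dropDataSourceProps` is a hypothetical helper, not code from this PR.

```scala
// Illustrative sketch: strip the JDO/DataNucleus entries read from
// hive-site.xml so the temporary execution client never sees the PostgreSQL
// settings and stays on its local Derby defaults.
object DropDataSourceProps {
  def dropDataSourceProps(conf: Map[String, String]): Map[String, String] =
    conf.filterNot { case (name, _) =>
      name.contains("jdo") || name.contains("datanucleus")
    }
}
```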

I would try overriding more settings here.


@marmbrus, I don't understand this part. Can you explain why we need executionHive? Does Hive also do this (use a local dummy metastore for temporary functions)?


@scwf This is a Spark SQL specific design, which enables us to connect to multiple versions of the Hive metastore with a single code base. The executionHive is used for internal jobs within the Spark SQL framework, while the metadataHive is used to talk to the external Hive metastore.

@SparkQA

SparkQA commented May 22, 2015

Test build #33318 has finished for PR 6314 at commit e3e683d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33321 has finished for PR 6314 at commit 92a81fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33337 has finished for PR 6314 at commit 86caf2c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #33336 has finished for PR 6314 at commit e4f0feb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

@marmbrus I overrode more configs with their default values in HiveConf (refer to https://github.com/pwendell/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288), plus datanucleus.rdbms.datastoreAdapterClassName, which is not included in HiveConf.

Tested using PostgreSQL as the metastore; it worked fine.

@WangTaoTheTonic WangTaoTheTonic changed the title [SPARK-7758][SQL]delete the executionHive to avoid failure when connect to a postgre sql [SPARK-7758][SQL]Override more configs to avoid failure when connect to a postgre sql May 22, 2015
@yhuai
Contributor

yhuai commented May 22, 2015

@WangTaoTheTonic It seems newTemporaryConfiguration is used by executionHive. Can you explain why your change fixes the problem with metadataHive?

@liancheng
Contributor

@yhuai We headed in the wrong direction at first. It's not that metadataHive can't find proper PostgreSQL configurations. The failure occurs because executionHive only overrides the metastore location without overriding the other JDO and DataNucleus properties, so those properties are read from hive-site.xml and point to PostgreSQL. This makes executionHive try to connect to the temporary Derby metastore with PostgreSQL settings, which causes the error. What @WangTaoTheTonic did is override all related properties with Hive default values (gathered from ConfVars). That's why updating the executionHive-related code path corrects the behavior.

"javax.jdo.option.ConnectionURL" -> s"jdbc:derby:;databaseName=$localMetastore;create=true")
val propMap: HashMap[String, String] = HashMap()
HiveConf.ConfVars.values().foreach { confvar =>
  if (confvar.varname.contains("datanucleus") || confvar.varname.contains("jdo")) {

Could you please add a comment explaining where this if condition comes from, for future reference? Thanks.

@liancheng
Contributor

@WangTaoTheTonic Would you mind updating the PR description to the following? Thanks!

When initializing `executionHive`, we only mask
`javax.jdo.option.ConnectionURL` to override the metastore location.  However,
other properties that relate to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()`
method][1], all properties whose names mention "jdo" or "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

@liancheng
Contributor

Forgot to say: LGTM, and thanks for fixing this!

@WangTaoTheTonic
Contributor Author

@yhuai I think @liancheng has explained everything clearly. Thanks for your comments and explanation, @liancheng.

@yhuai
Contributor

yhuai commented May 22, 2015

@WangTaoTheTonic Can you update your PR description per Cheng's suggestion?

@SparkQA

SparkQA commented May 22, 2015

Test build #33345 has finished for PR 6314 at commit ca7ae7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2015

Test build #851 has finished for PR 6314 at commit ca7ae7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 22, 2015
…t to a postgre sql

https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only mask
`javax.jdo.option.ConnectionURL` to override the metastore location.  However,
other properties that relate to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which point to PostgreSQL, to connect to the temporary
Derby metastore, thus causing errors.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()`
method][1], all properties whose names mention "jdo" or "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Tested using PostgreSQL as the metastore; it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql

(cherry picked from commit 31d5d46)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 31d5d46 May 22, 2015
@marmbrus
Contributor

Thanks for fixing this! Merged to master and 1.4.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015