[SPARK-7758][SQL]Override more configs to avoid failure when connect to a postgre sql #6314
Conversation
Test build #33228 has finished for PR 6314 at commit
Jenkins, retest this please.
Test build #33241 has finished for PR 6314 at commit
Judging from the failed test cases, this doesn't look like a good fix. I'm out of ideas, as I'm not a Spark SQL expert. Could anyone offer some help?
Yeah, we definitely need the execution Hive for Hive to function, but what we likely want to do is filter out any configuration that tells it to use Postgres (it should always use a dummy local Derby instance).
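A minimal sketch of that filtering idea: drop every user-supplied metastore data source setting, then pin the connection URL to a throwaway Derby database. `maskMetastoreProps` is a hypothetical helper for illustration, not Spark's actual API.

```scala
// Sketch: strip user-provided metastore data source settings so the
// execution Hive client always falls back to a local dummy Derby instance.
// `maskMetastoreProps` is a hypothetical helper, not Spark's actual API.
object MaskConfSketch {
  def maskMetastoreProps(
      userConf: Map[String, String],
      localMetastore: String): Map[String, String] = {
    // Drop anything pointing the client at an external metastore
    // (e.g. PostgreSQL settings inherited from hive-site.xml)...
    val cleaned = userConf.filterNot { case (k, _) =>
      k.contains("jdo") || k.contains("datanucleus")
    }
    // ...and pin the connection URL to a throwaway Derby database.
    cleaned + ("javax.jdo.option.ConnectionURL" ->
      s"jdbc:derby:;databaseName=$localMetastore;create=true")
  }

  def main(args: Array[String]): Unit = {
    val conf = Map(
      "javax.jdo.option.ConnectionURL" -> "jdbc:postgresql://db-host/metastore",
      "datanucleus.autoCreateSchema" -> "false",
      "hive.exec.scratchdir" -> "/tmp/hive")
    val masked = maskMetastoreProps(conf, "/tmp/exec-metastore")
    assert(masked("javax.jdo.option.ConnectionURL").startsWith("jdbc:derby:"))
    assert(!masked.contains("datanucleus.autoCreateSchema"))
    assert(masked("hive.exec.scratchdir") == "/tmp/hive")
    println("ok")
  }
}
```

Note the filter is by property *name*, so unrelated Hive settings (like scratch directories) survive untouched.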
logInfo(s"Initilizing execution hive, version $hiveExecutionVersion")
new ClientWrapper(
  version = IsolatedClientLoader.hiveVersion(hiveExecutionVersion),
  config = newTemporaryConfiguration())
I would try overriding more settings here.
@marmbrus, I don't understand this part. Can you explain why we need `executionHive`? Does Hive also do something like this (use a local dummy metastore for temporary functions)?
@scwf This is a Spark SQL specific design, which enables us to connect to multiple versions of the Hive metastore with a single code base. The `executionHive` is used for internal jobs within the Spark SQL framework, while the `metadataHive` is used to talk to the external Hive metastore.
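A rough sketch of that two-client split, under the assumption of two independently configured clients (the class and value names below are illustrative, not Spark's actual types):

```scala
// Sketch: Spark SQL keeps two Hive clients with separate configurations.
// "executionHive" always points at a throwaway local Derby metastore,
// while "metadataHive" talks to whatever hive-site.xml configures.
// HiveClientSketch is an illustrative stand-in, not a real Spark class.
case class HiveClientSketch(role: String, connectionUrl: String)

object TwoClientSketch {
  def main(args: Array[String]): Unit = {
    val executionHive = HiveClientSketch(
      "execution", "jdbc:derby:;databaseName=/tmp/exec-metastore;create=true")
    val metadataHive = HiveClientSketch(
      "metadata", "jdbc:postgresql://db-host/metastore") // from hive-site.xml
    // The bug discussed here amounts to executionHive accidentally
    // inheriting metadataHive's connection settings.
    assert(executionHive.connectionUrl.startsWith("jdbc:derby:"))
    assert(executionHive.connectionUrl != metadataHive.connectionUrl)
    println("ok")
  }
}
```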
Force-pushed from 21bf202 to e3e683d.
Test build #33318 has finished for PR 6314 at commit
Test build #33321 has finished for PR 6314 at commit
Test build #33337 has finished for PR 6314 at commit
Test build #33336 has finished for PR 6314 at commit
@marmbrus I overrode more configs with their default values in HiveConf (refer to https://github.com/pwendell/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288). I have also tested using PostgreSQL as the metastore; it worked fine.
@WangTaoTheTonic Seems
@yhuai We headed in the wrong direction at first. It's not because
"javax.jdo.option.ConnectionURL" -> s"jdbc:derby:;databaseName=$localMetastore;create=true")
val propMap: HashMap[String, String] = HashMap()
HiveConf.ConfVars.values().foreach { confvar =>
  if (confvar.varname.contains("datanucleus") || confvar.varname.contains("jdo")) {
Could you please add a comment to explain where this `if` condition comes from, for future reference? Thanks.
@WangTaoTheTonic Would you mind helping to update the PR description to the following? Thanks!
Forgot to say: LGTM, and thanks for fixing this!
@yhuai I think @liancheng explains it all clearly. Thanks for your comments, and thanks for the explanation, @liancheng.
@WangTaoTheTonic Can you update your PR description per Cheng's suggestion?
Test build #33345 has finished for PR 6314 at commit
Test build #851 has finished for PR 6314 at commit
[SPARK-7758][SQL] Override more configs to avoid failure when connect to a postgre sql

https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which point to PostgreSQL, to connect to the temporary Derby metastore, thus causing an error.

To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()` method][1], all properties whose names mention "jdo" and "datanucleus" must be included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Tested using PostgreSQL as the metastore; it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql

(cherry picked from commit 31d5d46)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Thanks for fixing this! Merged to master and 1.4.
https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which point to PostgreSQL, to connect to the temporary Derby metastore, thus causing an error.

To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the Hive `ObjectStore.getDataSourceProps()` method, all properties whose names mention "jdo" and "datanucleus" must be included.

Tested using PostgreSQL as the metastore; it worked fine.
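The masking rule described above boils down to a simple name predicate, mirroring the `contains("datanucleus") || contains("jdo")` check in the fix. The sketch below is illustrative (the object and method names are not Spark's actual code); the sample property names are representative HiveConf variables.

```scala
// Sketch of the property-name predicate described above: a property belongs
// to the metastore data source configuration iff its name mentions "jdo" or
// "datanucleus" (following Hive's ObjectStore.getDataSourceProps()).
object DataSourcePropSketch {
  def isDataSourceProp(name: String): Boolean =
    name.contains("datanucleus") || name.contains("jdo")

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "javax.jdo.option.ConnectionURL",    // data source: must be masked
      "datanucleus.connectionPoolingType", // data source: must be masked
      "hive.metastore.warehouse.dir")      // unrelated: left alone
    assert(names.filter(isDataSourceProp) ==
      Seq("javax.jdo.option.ConnectionURL", "datanucleus.connectionPoolingType"))
    println("ok")
  }
}
```

Matching on substrings of the property name (rather than an explicit allow-list) is what lets the fix cover every DataNucleus/JDO setting without enumerating them.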