[SPARK-47482] Add HiveDialect to sql module #45644

Closed

xleoken
Member

@xleoken xleoken commented Mar 21, 2024

What changes were proposed in this pull request?

Add HiveDialect to sql module

Why are the changes needed?

In scenarios with multiple Hive catalogs, queries fail with a ParseException.

SQL

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

Exception

24/03/19 21:58:25 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/f15a5434-6356-455b-aa8e-4ce9903c1b81
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Running query with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO DAGScheduler: Asked to cancel job group a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 ERROR SparkExecuteStatementOperation: Error executing query with a7459d6d-2a5c-4b56-945c-3159e58d12fd, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
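
The rejected statement in the log is a schema probe whose table name was quoted with double quotes, which the server's parser reports as a syntax error. The standalone sketch below (plain Scala, no Spark dependency) rebuilds that statement and, for comparison, the backtick-quoted form that a Hive-compatible dialect would be expected to send. The names QuotingSketch, ansiQuote, backtickQuote and probeQuery are hypothetical and exist only for this illustration; they are not part of Spark or of this PR.

```scala
// Standalone illustration only; all names here are hypothetical.
object QuotingSketch {
  // Double-quote style, as seen in the failing probe query in the log.
  def ansiQuote(name: String): String = "\"" + name + "\""

  // Backtick style, which Hive and Spark SQL accept for identifiers.
  def backtickQuote(name: String): String = "`" + name + "`"

  // Build the "WHERE 1=0" schema probe for a multi-part table name.
  def probeQuery(parts: Seq[String], quote: String => String): String =
    s"SELECT * FROM ${parts.map(quote).mkString(".")} WHERE 1=0"

  def main(args: Array[String]): Unit = {
    val table = Seq("data", "data_part")
    println(probeQuery(table, ansiQuote))     // the statement the server rejected
    println(probeQuery(table, backtickQuote)) // the form a Hive dialect would send
  }
}
```

The only difference between the two statements is the quoting character, which is why the fix can be limited to the dialect's identifier quoting.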

Does this PR introduce any user-facing change?

no

How was this patch tested?

local test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 21, 2024
@xleoken
Member Author

xleoken commented Mar 21, 2024

@xleoken I think you can implement the catalog plugin and register two custom Hive JDBC dialects.

Just FYI, SPARK-47496 makes loading a custom dialect much easier.

This is too heavy for users and there's no need for it.

As Daniel Fernandez said in https://issues.apache.org/jira/browse/SPARK-22016, only two functions need to be overridden.

https://issues.apache.org/jira/browse/SPARK-21063
https://issues.apache.org/jira/browse/SPARK-22016
https://issues.apache.org/jira/browse/SPARK-31457

Member

@dongjoon-hyun dongjoon-hyun left a comment

@xleoken
Member Author

xleoken commented Mar 22, 2024

Hi @dongjoon-hyun @yaooqinn @HyukjinKwon, please take a serious look at this issue. The older related PRs have been inactive for a long time, so we can discuss it here.

When we hit this issue, the client reported Table or view not found, while the server reported org.apache.spark.sql.catalyst.parser.ParseException. We spent a lot of time analyzing the issue before we solved it.

By the way, could Spark directly throw an exception stating that jdbc:hive2 is not supported? Or could the docs be updated to tell users that a custom dialect is required?

To summarize:

  • From the exception stack traces in the steps below, it takes a lot of time to work out that the root cause is JdbcDialects#quoteIdentifier.
  • The dialect could be provided as a third-party library or through a custom catalog plugin, but that is too heavy for users.
    As yaooqinn said, it is difficult to register a custom JDBC dialect. See [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering #45626
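
For readers who need a workaround in the meantime, here is a minimal sketch of a user-side dialect, assuming spark-sql is on the classpath. It overrides only canHandle and quoteIdentifier on org.apache.spark.sql.jdbc.JdbcDialect and registers the result through JdbcDialects.registerDialect. HiveJdbcDialectSketch and RegisterHiveDialect are hypothetical names, this is not the code proposed in this PR, and the backtick-doubling escape is an assumption about what the server accepts.

```scala
import java.util.Locale

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical user-side dialect; not the HiveDialect from this PR.
object HiveJdbcDialectSketch extends JdbcDialect {
  // Claim only URLs that start with jdbc:hive2.
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

  // Quote identifiers with backticks instead of double quotes.
  // Doubling embedded backticks is an assumed escaping rule.
  override def quoteIdentifier(colName: String): String =
    "`" + colName.replace("`", "``") + "`"
}

object RegisterHiveDialect {
  // Call once in the client JVM before the JDBC catalog is first used.
  def register(): Unit =
    JdbcDialects.registerDialect(HiveJdbcDialectSketch)
}
```

In spark-shell this would mean calling RegisterHiveDialect.register() before the first query against the aaa catalog, so that the schema probe is built with backtick quoting.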

1. Start the Thrift server

sbin/start-thriftserver.sh

2. Start spark-shell

bin/spark-shell \
--conf spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog \
--conf spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data \
--conf spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver

3. Query

select * from aaa.data.data_part limit 1

4. Client exception (Table or view not found: aaa.data.data_part)

scala> spark.sql("select * from aaa.data.data_part limit 1").show();
24/03/22 08:35:53 WARN HiveConnection: Failed to connect to 172.16.10.12:10000
org.apache.spark.sql.AnalysisException: Table or view not found: aaa.data.data_part; line 1 pos 14;
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [*]
      +- 'UnresolvedRelation [aaa, data, data_part], [], false

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:131)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:102)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:367)

5. Server exception (org.apache.spark.sql.catalyst.parser.ParseException)

24/03/22 08:45:42 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V10
24/03/22 08:45:42 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/4d373392-cc24-45fd-b9b7-4e27eeb48292
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Running query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO DAGScheduler: Asked to cancel job group b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 ERROR SparkExecuteStatementOperation: Error executing query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)

@charlesy6
Contributor

This patch works for me too.

@xleoken xleoken force-pushed the patch branch 2 times, most recently from 4eda53e to bb58cab Compare March 25, 2024 10:35
@xleoken
Member Author

xleoken commented Mar 26, 2024

@xleoken xleoken force-pushed the patch branch 7 times, most recently from ebb3a3e to c3ebf30 Compare April 3, 2024 07:08
@xleoken xleoken force-pushed the patch branch 2 times, most recently from 1a9a6f7 to 9fdb990 Compare April 8, 2024 01:02
@xleoken xleoken force-pushed the patch branch 11 times, most recently from b20f84b to 80d5e4f Compare April 10, 2024 08:51
@xleoken xleoken force-pushed the patch branch 12 times, most recently from b4ed745 to 2e9fc8b Compare May 15, 2024 00:34
@xleoken xleoken force-pushed the patch branch 4 times, most recently from 52dd798 to a100cc9 Compare May 24, 2024 02:26
@xleoken xleoken requested a review from kristgpt May 24, 2024 09:38
@xleoken xleoken force-pushed the patch branch 2 times, most recently from 90b0e05 to 1859d90 Compare May 28, 2024 01:13
@xleoken xleoken force-pushed the patch branch 2 times, most recently from 36c61b3 to 5f2d815 Compare June 17, 2024 01:30
@xleoken xleoken force-pushed the patch branch 2 times, most recently from e9e5778 to 3ac43da Compare June 26, 2024 05:19
github-actions bot commented Dec 3, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 3, 2024
@github-actions github-actions bot closed this Dec 4, 2024