[SPARK-21354] [SQL] INPUT FILE related functions do not support more than one sources #18580
Conversation
Test build #79426 has finished for PR 18580 at commit
@@ -100,6 +104,10 @@ trait CheckAnalysis extends PredicateHelper {
          failAnalysis(
            s"invalid cast from ${c.child.dataType.simpleString} to ${c.dataType.simpleString}")

        case e @ (_: InputFileName | _: InputFileBlockLength | _: InputFileBlockStart)
          if getNumLeafNodes(operator) > 1 =>
          e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
nit: "one sources" -> "one source"
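(For context: getNumLeafNodes is referenced in the diff above but its body is not shown in this thread. A minimal sketch of a plausible definition, assuming it simply counts leaf scan nodes in the logical plan, follows; this is an illustration, not the PR's code.)

// Illustrative sketch only; the real helper is later renamed, as the
// fragments further down show.
private def getNumLeafNodes(operator: LogicalPlan): Int = operator match {
  case _: LeafNode => 1                            // each scan node is one potential source
  case o => o.children.map(getNumLeafNodes).sum    // inner nodes: sum over children
}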
    val e = intercept[AnalysisException] {
      sql(s"SELECT *, $func() FROM tab1 JOIN tab2 ON tab1.id = tab2.id")
    }.getMessage
    assert(e.contains(s"'$func' does not support more than one sources"))
ditto.
Oops, sorry. It seems we need to consider UNION ALL; UNION ALL worked correctly before:
sql("select *, input_file_name() from ((select * from tab1) union all (select * from tab2))").show(false)
@dongjoon-hyun What is the output of Hive for your case?
The following is the output of the current Spark.

scala> spark.range(10).write.saveAsTable("t1")
scala> spark.range(100,110).write.saveAsTable("t2")
scala> sql("select *, input_file_name() from t1").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|3 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
scala> sql("select *, input_file_name() from t2").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
scala> sql("select *, input_file_name() from ((select * from t1) union all (select * from t2))").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|3 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
In case of
Does Hive report an error?
Let me check that. BTW, I think Spark is better than Hive. :)
This is the error and the difference.

HIVE:

SPARK:
For union, does Hive output an error?
It was the same error,
cc @cloud-fan
@@ -74,6 +74,15 @@ trait CheckAnalysis extends PredicateHelper {
      }
    }

  private def getNumInputFileBlockSources(operator: LogicalPlan): Int = {
    operator match {
      case _: LeafNode => 1
Shall we only consider file data source leaf nodes?
Unable to check it in CheckAnalysis. Both HadoopRDD and FileScanRDD have the same issue. To block both, we need to add the check as another rule.
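(For illustration, counting only file data source leaves might have looked like the hedged sketch below; it was presumably rejected because, as noted, Hive table scans backed by HadoopRDD misbehave just like file-source scans backed by FileScanRDD. The LogicalRelation/HadoopFsRelation pattern is an assumption, and its extractor shape varies across Spark versions.)

// Hypothetical narrowing (not adopted): count only file-based data source scans.
private def getNumFileSourceLeaves(operator: LogicalPlan): Int = operator match {
  case LogicalRelation(_: HadoopFsRelation, _, _) => 1  // file data source scan
  case _: LeafNode => 0                                 // would wrongly skip Hive table scans
  case o => o.children.map(getNumFileSourceLeaves).sum
}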
Test build #79496 has finished for PR 18580 at commit
Retest this please.
Test build #79576 has finished for PR 18580 at commit
      case o =>
        val numInputFileBlockSources = o.children.map(checkNumInputFileBlockSources(e, _)).sum
        if (numInputFileBlockSources > 1) {
          e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
Need to check it as early as possible; otherwise, Union might eat it.
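(Piecing the fragments in this thread together, the complete helper could look roughly like the sketch below. The Union branch is reconstructed from the discussion above, where UNION ALL stays allowed because each output row still comes from exactly one file; treat this as an approximation of, not a quote from, the merged code.)

private def checkNumInputFileBlockSources(e: Expression, operator: LogicalPlan): Int = {
  operator match {
    case _: LeafNode => 1
    // Union children never contribute to the same output row, so cap the count
    // at 1 rather than summing; recursing still catches a join under the union.
    case u: Union =>
      if (u.children.map(checkNumInputFileBlockSources(e, _)).sum >= 1) 1 else 0
    case o =>
      val numInputFileBlockSources = o.children.map(checkNumInputFileBlockSources(e, _)).sum
      if (numInputFileBlockSources > 1) {
        // Fail here, as early as possible, before an enclosing Union can
        // swallow the count (per the review comment above).
        e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
      } else {
        numInputFileBlockSources
      }
  }
}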
LGTM, pending Jenkins.
Test build #79651 has finished for PR 18580 at commit
thanks, merging to master!
What changes were proposed in this pull request?
The built-in functions input_file_name, input_file_block_start, and input_file_block_length do not support more than one source, just like in Hive. Currently, Spark does not block such queries, and their output is ambiguous/non-deterministic: it could come from any of the sources. This PR blocks these queries and issues an error.
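(For illustration, a minimal repro of the new behavior, reusing the t1/t2 tables created earlier in this thread; the UNION ALL line assumes the final change keeps unions allowed, as the discussion above indicates.)

spark.range(10).write.saveAsTable("t1")
spark.range(100, 110).write.saveAsTable("t2")

// Two sources via a join: now rejected during analysis.
sql("SELECT *, input_file_name() FROM t1 JOIN t2 ON t1.id = t2.id")
// org.apache.spark.sql.AnalysisException:
//   'input_file_name' does not support more than one sources

// UNION ALL keeps working: each output row still comes from a single file.
sql("SELECT *, input_file_name() FROM ((SELECT * FROM t1) UNION ALL (SELECT * FROM t2))").show(false)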
How was this patch tested?
Added a test case.