
[SPARK-21354] [SQL] INPUT FILE related functions do not support more than one sources #18580

Closed
wants to merge 4 commits

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

The built-in functions input_file_name, input_file_block_start, and input_file_block_length do not support more than one source. Hive rejects such queries, but Spark currently does not block them, so the output is ambiguous/non-deterministic: it could come from either source.

hive> select *, INPUT__FILE__NAME FROM t1, t2;
FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries

This PR blocks such queries and issues an error.

How was this patch tested?

Added a test case
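The analysis-time check can be pictured as counting the leaf (source) relations beneath the operator where the function appears and failing when there is more than one. A minimal self-contained sketch over toy plan classes (hypothetical names, for illustration only; the real rule operates on Spark's LogicalPlan in CheckAnalysis):

```scala
// Toy stand-ins for Spark's logical plan nodes (hypothetical, not Spark's classes).
sealed trait Plan { def children: Seq[Plan] }
case class Relation(name: String) extends Plan { def children: Seq[Plan] = Nil } // a leaf source
case class Join(left: Plan, right: Plan) extends Plan { def children: Seq[Plan] = Seq(left, right) }
case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }

// Count the leaf sources beneath an operator, mirroring the PR's leaf-node counting.
def numLeafSources(p: Plan): Int = p match {
  case _: Relation => 1
  case other       => other.children.map(numLeafSources).sum
}

// input_file_name() is only well-defined when exactly one source produces each row.
def checkInputFileName(plan: Plan): Unit =
  require(numLeafSources(plan) <= 1,
    "'input_file_name' does not support more than one source")

checkInputFileName(Project(Relation("t1")))                      // fine: one source
// checkInputFileName(Join(Relation("t1"), Relation("t2")))      // would fail: two sources
```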

@SparkQA commented Jul 10, 2017

Test build #79426 has finished for PR 18580 at commit 596ea17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -100,6 +104,10 @@ trait CheckAnalysis extends PredicateHelper {
            failAnalysis(
              s"invalid cast from ${c.child.dataType.simpleString} to ${c.dataType.simpleString}")

          case e @ (_: InputFileName | _: InputFileBlockLength | _: InputFileBlockStart)
            if getNumLeafNodes(operator) > 1 =>
            e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
Member

nit: one sources -> one source.

val e = intercept[AnalysisException] {
  sql(s"SELECT *, $func() FROM tab1 JOIN tab2 ON tab1.id = tab2.id")
}.getMessage
assert(e.contains(s"'$func' does not support more than one sources"))
Member

ditto.


@dongjoon-hyun left a comment

Oops, sorry. It seems we need to consider UNION ALL; UNION ALL worked correctly before this change.

sql("select *, input_file_name() from ((select * from tab1) union all (select * from tab2))").show(false)

@gatorsmile
Member Author

@dongjoon-hyun What is the output of Hive for your case?

@dongjoon-hyun
Member

The following is the output of the current Spark.

scala> spark.range(10).write.saveAsTable("t1")

scala> spark.range(100,110).write.saveAsTable("t2")

scala> sql("select *, input_file_name() from t1").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name()                                                                                                  |
+---+-------------------------------------------------------------------------------------------------------------------+
|3  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+


scala> sql("select *, input_file_name() from t2").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name()                                                                                                  |
+---+-------------------------------------------------------------------------------------------------------------------+
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+


scala> sql("select *, input_file_name() from ((select * from t1) union all (select * from t2))").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name()                                                                                                  |
+---+-------------------------------------------------------------------------------------------------------------------+
|3  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 10, 2017

In the case of UNION ALL, everything looks normal, so blocking it would be a regression in Spark.

@gatorsmile
Member Author

Does Hive report an error?

@dongjoon-hyun
Member

Let me check that. BTW, I think Spark is better than Hive. :)

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 10, 2017

Here are the error and the difference between Hive and Spark.

HIVE

hive> select *, INPUT__FILE__NAME from (select * from t1) T;
FAILED: SemanticException [Error 10004]: Line 1:10 Invalid table alias or column reference 'INPUT__FILE__NAME': (possible column names are: _c0)

SPARK

scala> sql("select *, input_file_name() from (select * from t1)").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name()                                                                                                  |
+---+-------------------------------------------------------------------------------------------------------------------+
|3  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7  |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+

@gatorsmile
Member Author

For union, does Hive output an error?

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 10, 2017

It was the same error, Invalid table alias or column reference 'INPUT__FILE__NAME': (possible column names are: _c0). So, I reduced it to the simpler example above.

hive> select *, INPUT__FILE__NAME from (select * from t1 union all select * from t1) T;
FAILED: SemanticException [Error 10004]: Line 1:10 Invalid table alias or column reference 'INPUT__FILE__NAME': (possible column names are: _c0)

@gatorsmile
Member Author

cc @cloud-fan

@@ -74,6 +74,15 @@ trait CheckAnalysis extends PredicateHelper {
    }
  }

  private def getNumInputFileBlockSources(operator: LogicalPlan): Int = {
    operator match {
      case _: LeafNode => 1
Contributor

shall we only consider file data source leaf nodes?

Member Author

It's not possible to check that in CheckAnalysis. Both HadoopRDD and FileScanRDD have the same issue; to block both, we need to add the check as another rule.

@SparkQA commented Jul 11, 2017

Test build #79496 has finished for PR 18580 at commit 6b48a9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA commented Jul 13, 2017

Test build #79576 has finished for PR 18580 at commit 6b48a9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      case o =>
        val numInputFileBlockSources = o.children.map(checkNumInputFileBlockSources(e, _)).sum
        if (numInputFileBlockSources > 1) {
          e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
Member Author

We need to check it as early as possible; otherwise, Union might eat it.
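Because the count recurses through every operator and sums over all children, a UNION ALL of two tables is seen as two sources the moment the check runs, before any later rule rewrites the Union away. A toy illustration (hypothetical plan classes, not Spark's):

```scala
// Hypothetical minimal plan nodes, for illustration only.
sealed trait Plan { def children: Seq[Plan] }
case class Relation(name: String) extends Plan { def children: Seq[Plan] = Nil }
case class Union(inputs: Seq[Plan]) extends Plan { def children: Seq[Plan] = inputs }

// Sum source counts across all children, as the rule above does.
def numSources(p: Plan): Int = p match {
  case _: Relation => 1
  case other       => other.children.map(numSources).sum
}

// A UNION ALL over t1 and t2 feeds rows from two distinct sets of files,
// so the check sees two sources and rejects input_file_name() over it.
val unionPlan = Union(Seq(Relation("t1"), Relation("t2")))
println(numSources(unionPlan)) // 2
```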

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA commented Jul 17, 2017

Test build #79651 has finished for PR 18580 at commit c4de2b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in e398c28 Jul 17, 2017