[SPARK-20854][SQL] Extend hint syntax to support expressions #18086

bogdanrdc · 2017-05-24T11:17:51Z

What changes were proposed in this pull request?

SQL hint syntax:

* support expressions such as strings, numbers, etc. instead of only identifiers, as is currently the case.
* support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:

* support any parameters in DataFrame.hint instead of just strings

How was this patch tested?

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

SparkQA · 2017-05-24T13:41:42Z

Test build #77300 has finished for PR 18086 at commit 5439468.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)

rxin · 2017-05-25T10:12:24Z

cc @gatorsmile @cloud-fan @hvanhovell

cloud-fan · 2017-05-25T11:31:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

@@ -533,13 +533,16 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
  }

  /**
-   * Add a [[UnresolvedHint]] to a logical plan.
+   * Add a [[UnresolvedHint]]s to a logical plan.


cloud-fan · 2017-05-25T11:31:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

   */
  private def withHints(
      ctx: HintContext,
      query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
-    val stmt = ctx.hintStatement
-    UnresolvedHint(stmt.hintName.getText, stmt.parameters.asScala.map(_.getText), query)
+    var plan = query


using foldLeft instead of having a var?

Honestly I think foldLeft is almost always a bad idea ...

I used foldRight somewhere too. Why is it a bad idea?

I always find a loop simpler to reason about ...
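For comparison, the two styles can be put side by side in a minimal sketch. `Plan`, `Table`, and `Hint` here are hypothetical stand-ins, not the Catalyst classes, and the hint is reduced to just a name; the only point is that the `var` loop and `foldLeft` build the same nesting.

```scala
object HintWrapStyles {
  // Hypothetical stand-ins for LogicalPlan and UnresolvedHint.
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Hint(name: String, child: Plan) extends Plan

  // Loop style: mutate a local var, wrapping the plan once per hint name.
  def wrapWithLoop(names: Seq[String], query: Plan): Plan = {
    var plan = query
    for (n <- names) plan = Hint(n, plan)
    plan
  }

  // Fold style: the accumulator plays the role of the var.
  def wrapWithFold(names: Seq[String], query: Plan): Plan =
    names.foldLeft(query)((plan, n) => Hint(n, plan))
}
```

Which of the two reads better is the judgment call being debated in this thread; the sketch takes no side.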

cloud-fan · 2017-05-25T11:33:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
 * should be removed This node will be eliminated post analysis.
 * A pair of (name, parameters).
 */
-case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
+case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)


shall we use Expression as type?

If we use Expression, then either:

* Dataset.hint parameters should be Expression too, in which case you can't write df.hint("hint", 1, 2, "c"); you'd have to write df.hint("hint", Literal(1), Literal(2), Literal("c")), or use a shortcut if there is one; or
* Dataset.hint accepts Any but then has to convert Any to Expression. One problem here is that Seq(1, 2, 3) can't be converted to a Literal, so you'd have to use df.hint("hint", Array(1, 2, 3)).

The disadvantage of having Any in UnresolvedHint is that, to resolve the hint, you have to check for both String and Literal(String); the advantage is that the API is easier to use.

We can keep Any in the API (df.hint(xxx)) but use Expression in UnresolvedHint. What do you think?

One useful hint parameter is a list of columns.
Something like df.hint("hint", $"table", Seq($"col1", $"col2", $"col3"))

In this case UnresolvedHint could be called like this:
UnresolvedHint(name: String, parameters: Seq(Expression, Seq[Expression]), child)

But if UnresolvedHint.parameters is Seq[Expression] then it's not possible to have this kind of hint.
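The cost of keeping `Seq[Any]` shows up on the resolution side. The following is only a sketch of that trade-off: `Literal` and `UnresolvedAttribute` are hypothetical simplified stand-ins for the Catalyst classes, and `tableName` is an illustrative helper, not code from this patch.

```scala
object HintParamDemo {
  // Hypothetical stand-ins for Catalyst's Literal and UnresolvedAttribute.
  case class Literal(value: Any)
  case class UnresolvedAttribute(name: String)

  // With Any-typed parameters, the resolver must accept every encoding
  // of a table name: a raw String, a string Literal, or an identifier.
  def tableName(param: Any): Either[String, String] = param match {
    case s: String              => Right(s)
    case Literal(s: String)     => Right(s)
    case UnresolvedAttribute(n) => Right(n)
    case other =>
      Left(s"expected an identifier or string but was $other (${other.getClass})")
  }
}
```

With `Seq[Expression]` the first case would disappear, but nested parameters such as the list of columns above would need some other encoding.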

SparkQA · 2017-05-26T09:00:00Z

Test build #77421 has finished for PR 18086 at commit d386cdf.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class HasMinSupport(Params):
class HasNumPartitions(Params):
class HasMinConfidence(Params):
case class AnalysisBarrier(child: LogicalPlan) extends LeafNode
case class ResolvedHint(child: LogicalPlan, hints: HintInfo = HintInfo())
case class HintInfo(

SparkQA · 2017-05-26T13:32:59Z

Test build #77424 has finished for PR 18086 at commit 6e40301.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-28T02:56:52Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

@@ -371,7 +371,7 @@ querySpecification
       (RECORDREADER recordReader=STRING)?
       fromClause?
       (WHERE where=booleanExpression)?)
-    | ((kind=SELECT hint? setQuantifier? namedExpressionSeq fromClause?
+    | ((kind=SELECT (hints+=hint)* setQuantifier? namedExpressionSeq fromClause?


In Hive and Oracle, multiple hints are put in the same /*+ */.

This patch supports both.

@gatorsmile does hive support multiple /*+ */?

Nope. Hive does not support multiple /*+ */

It does not hurt anything if we support more hint styles, as long as they are user-friendly.

gatorsmile · 2017-05-28T03:35:49Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

@@ -381,12 +381,12 @@ querySpecification
    ;

 hint
-    : '/*+' hintStatement '*/'
+    : '/*+' hintStatements+=hintStatement (hintStatements+=hintStatement)* '*/'


In the same block /*+ */, multiple hints are separated by commas in Hive. However, in Oracle, they are separated by spaces.

I added support for an optional comma.

gatorsmile · 2017-05-28T03:41:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
 * should be removed This node will be eliminated post analysis.
 * A pair of (name, parameters).


This needs an update.

gatorsmile · 2017-05-28T03:42:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala

+            case tableName: String => tableName
+            case tableId: UnresolvedAttribute => tableId.name
+            case unsupported => throw new AnalysisException("Broadcast hint parameter should be " +
+              s" identifier or string but was $unsupported (${unsupported.getClass}")


Nit: s" identifier or string -> s"an identifier or string

gatorsmile · 2017-05-28T03:48:41Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameHintSuite.scala

+ * limitations under the License.
+ */
+
+package org.apache.spark.sql


Normally, we move such a test suite to org.apache.spark.sql.catalyst. We just need to add hint into org.apache.spark.sql.catalyst.dsl.

I added a new test for dsl. I also want a test that calls df.hint

gatorsmile · 2017-05-28T03:49:03Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+      parsePlan("SELECT /*+ HINT1(a, 1) hint2(b, 2) */ * from t"),
+      UnresolvedHint("hint2", Seq($"b", Literal(2)),
+        UnresolvedHint("HINT1", Seq($"a", Literal(1)),
+        table("t").select(star())


Nit: Indent

gatorsmile · 2017-05-28T03:53:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
 * should be removed This node will be eliminated post analysis.
 * A pair of (name, parameters).
 */
-case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
+case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)


To support multiple parameters in hint, does it make sense to do it like df.hint("hint", "1, 2, c")? We can use our Parser to parse this parameter string.

I think that could be something extra. The DF API should accept Scala expressions too, such as function calls: df.hint("hint", getInterestingValues())
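A rough sketch of why a varargs `Any*` signature covers this case. The `UnresolvedHint` here is a hypothetical two-field simplification (no child plan), and `hint` and `getInterestingValues` are illustrative helpers, not Spark code.

```scala
object VarargsHintDemo {
  // Hypothetical simplification: the real node also carries a child plan.
  case class UnresolvedHint(name: String, parameters: Seq[Any])

  // Any* accepts literals, collections, and the results of arbitrary calls.
  def hint(name: String, parameters: Any*): UnresolvedHint =
    UnresolvedHint(name, parameters.toSeq)

  // Illustrative helper standing in for user code that computes parameters.
  def getInterestingValues(): Seq[Int] = Seq(1, 2, 3)

  // The Seq returned by the call is passed through as one nested parameter.
  val example: UnresolvedHint = hint("hint", getInterestingValues(), "c")
}
```

A single string such as "1, 2, c" would have to be re-parsed and could not carry values computed at the call site.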


SparkQA · 2017-05-30T14:38:44Z

Test build #77531 has finished for PR 18086 at commit 8daa05e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-30T16:08:09Z

why rename DataFrameSuite?

gatorsmile · 2017-05-30T17:57:18Z

LGTM pending Jenkins

SparkQA · 2017-05-30T18:50:39Z

Test build #77536 has finished for PR 18086 at commit 09635a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DataFrameSuite extends QueryTest with SharedSQLContext

cloud-fan · 2017-05-31T03:59:38Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DSLHintSuite.scala

+      r1.hint("hint1"),
+      UnresolvedHint("hint1", Seq(),
+        r1
+      )


nit: can we collapse it to the previous line?

cloud-fan · 2017-05-31T03:59:45Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DSLHintSuite.scala

+      r1.hint("hint1", 1, $"a"),
+      UnresolvedHint("hint1", Seq(1, $"a"),
+        r1
+      )


cloud-fan · 2017-05-31T04:00:13Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DSLHintSuite.scala

+      r1.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")),
+      UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")),
+        r1
+      )


cloud-fan · 2017-05-31T04:11:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

@@ -407,7 +407,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
        val withWindow = withDistinct.optionalMap(windows)(withWindows)

        // Hint
-        withWindow.optionalMap(hint)(withHints)
+        hints.asScala.foldRight(withWindow)(withHints)


Why do we construct the hints from right to left?

So that select /*+ hint1() */ /*+ hint2() */ produces Hint1(Hint2(plan)) and not Hint2(Hint1(plan)). withHints adds a Hint on top, so the last one folded is the topmost.
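The ordering can be checked with a small sketch. `Plan`, `Table`, `Hint`, and `withHint` are hypothetical stand-ins that only mirror the shape of `withHints(ctx, query)`; they are not the Catalyst classes.

```scala
object HintOrderDemo {
  // Hypothetical stand-ins for LogicalPlan and UnresolvedHint.
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Hint(name: String, child: Plan) extends Plan

  // Same shape as withHints(ctx, query): put one hint on top of the plan.
  def withHint(name: String, query: Plan): Plan = Hint(name, query)

  val hints: Seq[String] = Seq("hint1", "hint2")
  val base: Plan = Table("t")

  // foldRight applies the last element first, so the first hint is outermost.
  val rightFolded: Plan = hints.foldRight(base)(withHint)

  // foldLeft applies the first element first, so the last hint is outermost.
  val leftFolded: Plan =
    hints.foldLeft(base)((plan, name) => withHint(name, plan))
}
```

Under these assumptions `rightFolded` keeps the textual order of the hints from the outside in, which is the behavior described above.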

cloud-fan · 2017-05-31T04:19:56Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameHintSuite.scala

+
+  private def check(df: Dataset[_], expected: LogicalPlan) = {
+    comparePlans(
+      EliminateBarriers(df.queryExecution.logical),


that PR has been reverted, can you rebase?

SparkQA · 2017-06-01T11:59:04Z

Test build #77636 has finished for PR 18086 at commit 7776ae6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Uuid() extends LeafExpression

cloud-fan · 2017-06-01T22:56:51Z

thanks, merging to master/2.2!

SQL hint syntax:
* support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
* support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:
* support any parameters in DataFrame.hint instead of just strings

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

Author: Bogdan Raducanu <bogdan@databricks.com>

Closes #18086 from bogdanrdc/SPARK-20854.

(cherry picked from commit 2134196)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

wzhfy · 2017-06-01T23:51:12Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+    )
+
+    comparePlans(
+      parsePlan("SELECT /*+ HINT1(a, array(1, 2, 3)) */ * from t"),


Is this test case redundant?

yea, @bogdanrdc can you send a follow-up PR to clean it up?

bogdanrdc added 8 commits April 20, 2017 12:59

fix + test

03a4281

reverted mistake commit

72cf1d1

Merge remote-tracking branch 'upstream/master'

2c96a8d

Merge remote-tracking branch 'upstream/master'

fa11b0b

Merge remote-tracking branch 'upstream/master'

21ad3aa

new syntax + tests

84c0746

merged

dff75c8

multiple hints syntaxes + more tests

5439468

cloud-fan reviewed May 25, 2017

merged

d386cdf

fixed merged + space instead of comma for multiple hints syntax

6e40301

gatorsmile reviewed May 28, 2017

bogdanrdc added 4 commits May 30, 2017 13:59

dsl test and hint(), minor fixes, parser: made comma separating hints optional

14a6150

DSLHintSuite

394d644

merged with master

1e7f95d

comma between hints optional, apply hints in order

8daa05e

reverted mistake rename of DataFrameSuite

09635a9

cloud-fan reviewed May 31, 2017

bogdanrdc added 2 commits June 1, 2017 11:36

merged with master + style fixes

3290970

Merge remote-tracking branch 'upstream/master' into SPARK-20854

7776ae6

asfgit closed this in 2134196 Jun 1, 2017

wzhfy reviewed Jun 1, 2017
