Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20854][SQL] Extend hint syntax to support expressions #18086

Closed
wants to merge 17 commits into from

Conversation

bogdanrdc
Copy link
Contributor

What changes were proposed in this pull request?

SQL hint syntax:

  • support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
  • support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:

  • support any parameters in DataFrame.hint instead of just strings

How was this patch tested?

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

@SparkQA
Copy link

SparkQA commented May 24, 2017

Test build #77300 has finished for PR 18086 at commit 5439468.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)

@rxin
Copy link
Contributor

rxin commented May 25, 2017

@@ -533,13 +533,16 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
}

/**
* Add a [[UnresolvedHint]] to a logical plan.
* Add a [[UnresolvedHint]]s to a logical plan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove a

*/
private def withHints(
ctx: HintContext,
query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
val stmt = ctx.hintStatement
UnresolvedHint(stmt.hintName.getText, stmt.parameters.asScala.map(_.getText), query)
var plan = query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using foldLeft instead of having a var?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think foldLeft is almost always a bad idea ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used foldRight somewhere too. Why is it a bad idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i always find a loop simpler to reason about ...

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
* should be removed This node will be eliminated post analysis.
* A pair of (name, parameters).
*/
case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use Expression as type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Expression then either:

  • Dataset.hint parameters should be Expression too, in which case you can't do df.hint("hint", 1, 2, "c") you'd have to do df.hint("hint", Literal(1), Literal(2), Literal("c")) or a shortcut if there is
  • Dataset.hint accepts Any but then has to convert Any to Expressions. One problem here is that Seq(1,2,3) can't be converted to Literal. So you have to use df.hint("hint", Array(1,2,3))

The disadvantage of have Any in UnresolvedHint is that to resolve the hint you have to check both for String and Literal(String) but the API is easier to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep Any in the API(df.hint(xxx)), but use Expression in UnresolvedHint, what do you think?

Copy link
Contributor Author

@bogdanrdc bogdanrdc May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One useful hint parameter is a list of columns.
Something like df.hint("hint", $"table", Seq($"col1", $"col2", $"col3"))

In this case UnresolvedHint could be called like this:
UnresolvedHint(name: String, parameters: Seq(Expression, Seq[Expression]), child)

But if UnresolvedHint.parameters is Seq[Expression] then it's not possible to have this kind of hint.

@SparkQA
Copy link

SparkQA commented May 26, 2017

Test build #77421 has finished for PR 18086 at commit d386cdf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasMinSupport(Params):
  • class HasNumPartitions(Params):
  • class HasMinConfidence(Params):
  • case class AnalysisBarrier(child: LogicalPlan) extends LeafNode
  • case class ResolvedHint(child: LogicalPlan, hints: HintInfo = HintInfo())
  • case class HintInfo(

@SparkQA
Copy link

SparkQA commented May 26, 2017

Test build #77424 has finished for PR 18086 at commit 6e40301.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -371,7 +371,7 @@ querySpecification
(RECORDREADER recordReader=STRING)?
fromClause?
(WHERE where=booleanExpression)?)
| ((kind=SELECT hint? setQuantifier? namedExpressionSeq fromClause?
| ((kind=SELECT (hints+=hint)* setQuantifier? namedExpressionSeq fromClause?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Hive and Oracle, multiple hints are put in the same /*+ */.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch supports both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile does hive support multiple /*+ */?

Copy link
Member

@gatorsmile gatorsmile May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Hive does not support multiple /*+ */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not hurt anything if we support more hint styles, as long as they are user-friendly.

@@ -381,12 +381,12 @@ querySpecification
;

hint
: '/*+' hintStatement '*/'
: '/*+' hintStatements+=hintStatement (hintStatements+=hintStatement)* '*/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same block /*+ */, multiple hints are separated by commas in Hive. However, in Oracle, it is separated by spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added support for optional comma

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
* should be removed This node will be eliminated post analysis.
* A pair of (name, parameters).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an update.

case tableName: String => tableName
case tableId: UnresolvedAttribute => tableId.name
case unsupported => throw new AnalysisException("Broadcast hint parameter should be " +
s" identifier or string but was $unsupported (${unsupported.getClass}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s" identifier or string -> s"an identifier or string

* limitations under the License.
*/

package org.apache.spark.sql
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we move such a test suite to org.apache.spark.sql.catalyst. We just need to add hint into org.apache.spark.sql.catalyst.dsl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new test for dsl. I also want a test that calls df.hint

parsePlan("SELECT /*+ HINT1(a, 1) hint2(b, 2) */ * from t"),
UnresolvedHint("hint2", Seq($"b", Literal(2)),
UnresolvedHint("HINT1", Seq($"a", Literal(1)),
table("t").select(star())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Indent

@@ -25,7 +25,7 @@ import org.apache.spark.sql.internal.SQLConf
* should be removed This node will be eliminated post analysis.
* A pair of (name, parameters).
*/
case class UnresolvedHint(name: String, parameters: Seq[String], child: LogicalPlan)
case class UnresolvedHint(name: String, parameters: Seq[Any], child: LogicalPlan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support multiple parameters in hint, does it make sense to do it like df.hint("hint", "1, 2, c")? We can use our Parser to parse this parameter string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could be something extra. The DF API should accept scala expressions too: function calls (df.hint("hint", getInterestingValues()))

@SparkQA
Copy link

SparkQA commented May 30, 2017

Test build #77531 has finished for PR 18086 at commit 8daa05e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

why rename DataFrameSuite?

@gatorsmile
Copy link
Member

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented May 30, 2017

Test build #77536 has finished for PR 18086 at commit 09635a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameSuite extends QueryTest with SharedSQLContext

r1.hint("hint1"),
UnresolvedHint("hint1", Seq(),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we collapse it to the previous line?

r1.hint("hint1", 1, $"a"),
UnresolvedHint("hint1", Seq(1, $"a"),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

r1.hint("hint1", Seq(1, 2, 3), Seq($"a", $"b", $"c")),
UnresolvedHint("hint1", Seq(Seq(1, 2, 3), Seq($"a", $"b", $"c")),
r1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -407,7 +407,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
val withWindow = withDistinct.optionalMap(windows)(withWindows)

// Hint
withWindow.optionalMap(hint)(withHints)
hints.asScala.foldRight(withWindow)(withHints)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we construct the hint from right to left?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that select /*+ hint1() /* /*+ hint2() */produces Hint1(Hint2(plan)) and not Hint2(Hint1(plan)). withHints adds a Hint on top so the last one folded is the top most.


private def check(df: Dataset[_], expected: LogicalPlan) = {
comparePlans(
EliminateBarriers(df.queryExecution.logical),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that PR has been reverted, can you rebase?

@SparkQA
Copy link

SparkQA commented Jun 1, 2017

Test build #77636 has finished for PR 18086 at commit 7776ae6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Uuid() extends LeafExpression

@asfgit asfgit closed this in 2134196 Jun 1, 2017
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2!

asfgit pushed a commit that referenced this pull request Jun 1, 2017
SQL hint syntax:
* support expressions such as strings, numbers, etc. instead of only identifiers as it is currently.
* support multiple hints, which was missing compared to the DataFrame syntax.

DataFrame API:
* support any parameters in DataFrame.hint instead of just strings

Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite.

Author: Bogdan Raducanu <bogdan@databricks.com>

Closes #18086 from bogdanrdc/SPARK-20854.

(cherry picked from commit 2134196)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
)

comparePlans(
parsePlan("SELECT /*+ HINT1(a, array(1, 2, 3)) */ * from t"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test case redundant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, @bogdanrdc can you send a follow-up PR to clean it up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants