Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30625][SQL] Support escape as third parameter of the like function #27355

Closed
wants to merge 4 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jan 24, 2020

What changes were proposed in this pull request?

In the PR, I propose to transform the Like expression to TernaryExpression, and add third parameter escape. So, the like function will have feature parity with LIKE ... ESCAPE syntax supported by 187f3c1.

Why are the changes needed?

The like functions can be called with 2 or 3 parameters, and functionally equivalent to LIKE and LIKE ... ESCAPE SQL expressions.

Does this PR introduce any user-facing change?

Yes, before like fails with the exception:

spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7

After:

spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
true

How was this patch tested?

  • Add new example for the like function which is checked by SQLQuerySuite
  • Run RegexpExpressionsSuite and ExpressionParserSuite.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, actually, I expected a smaller change.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 24, 2020

Actually I did bigger change but reverted my refactoring in RLike. The change could be smaller if I didn't rename expression parameters but I thought case class Like(left: Expression, right: Expression, escapeChar: Char) looks slightly confusing since it is not binary expression any more, and right is not most right parameter.

@SparkQA
Copy link

SparkQA commented Jan 24, 2020

Test build #117356 has finished for PR 27355 at commit 203d1fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Like(str: Expression, pattern: Expression, escape: Expression)
  • case class RLike(left: Expression, right: Expression)

override def inputTypes: Seq[DataType] = Seq(StringType, StringType, StringType)
override def children: Seq[Expression] = Seq(str, pattern, escape)

private lazy val escapeChar: Char = if (escape.foldable) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need lazy here? I personally think its better to check error conditions as soon as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I remove lazy, I get the exception:

makeCopy, tree: str#13580 LIKE pattern#13581 ESCAPE '@'
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: str#13580 LIKE pattern#13581 ESCAPE '@'
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:435)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:408)
...
Caused by: org.apache.spark.sql.AnalysisException: The 'escape' parameter must be a string literal.;
	at org.apache.spark.sql.catalyst.expressions.Like.<init>(regexpExpressions.scala:135)

See escape is PrettyAttribute:
Screen Shot 2020-01-26 at 19 51 34

Copy link
Member Author

@MaxGekk MaxGekk Jan 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the error when I run new test in DataFrameFunctionsSuite, and I remember I got the error on other tests as well. PrettyAttribute is not foldable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, ok. Thanks for the check. The current one looks ok to me.

s"The 'escape' parameter must be a string literal of one char but it is $s.")
}
} else {
throw new AnalysisException("The 'escape' parameter must be a string literal.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for this path and line 131 (error cases)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@maropu
Copy link
Member

maropu commented Jan 26, 2020

The refactoring still looks fine to me.

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117428 has finished for PR 27355 at commit 39e4bd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @MaxGekk and @maropu .
Merged to master.

@rednaxelafx
Copy link
Contributor

rednaxelafx commented Feb 10, 2020

Sorry for posting this review post-hoc, but I strongly believe this PR should be reverted for now, and re-land after redesign.

There are both design and implementation issues with this PR:

  1. Design: in order to get to feature parity between the LIKE syntax and the like built-in function, the latter should only accept Char argument for the escapeChar parameter, instead of the full-blown Expression -- the LIKE syntax only accepts a literal anyway. c.f.

    | NOT? kind=LIKE pattern=valueExpression (ESCAPE escapeChar=STRING)?

    By not changing the escapeChar parameter to an Expression, we can avoid most of the complexity (and bugs) in this PR.

  2. Implementation. The shim of escapeChar = ... to adapt the Expression argument back to a Char is very bad. See this example:

Seq(("abc", "%bc", "/")).toDF("s", "p", "esc").selectExpr("like(s, p, esc)").explain(true)

and on a freshly built Apache Spark master, you'll get:

== Parsed Logical Plan ==
'Project [unresolvedalias('like('s, 'p, 'esc), Some(org.apache.spark.sql.Column$$Lambda$2247/782587714@5cf3924c))]
+- Project [_1#3 AS s#10, _2#4 AS p#11, _3#5 AS esc#12]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Analyzed Logical Plan ==
s LIKE p: boolean
org.apache.spark.sql.AnalysisException: The 'escape' parameter must be a string literal.;
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: The 'escape' parameter must be a string literal.;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: The 'escape' parameter must be a string literal.;

It doesn't look like supporting full-blown Expression for escapeChar is worth the cost here.

EDIT: to make my comment clear: I do think it's reasonable to allow the builtin like() function have feature parity with what we support in the syntax, i.e. accepting a literal argument for escapeChar. So the underlying intent of this PR sounds good to me. It's just the details that don't look quite right.

In order to support a 3-arg version of like() builtin function, we don't really have to change the signature of the primary constructor of Like. Instead, adding a constructor overload that takes the 3rd argument as an Expression and asserting that it only supports Literal or foldable would do.
That way we can have feature parity without most of the changes in this PR.

@gatorsmile
Copy link
Member

Which systems support this syntax?

@dongjoon-hyun
Copy link
Member

Thank you for pointing that, @rednaxelafx .

@dongjoon-hyun
Copy link
Member

According to the above comment, @MaxGekk is trying to revert this first from 3.0.0 and will target 3.1.0 later.

s"The 'escape' parameter must be a string literal of one char but it is $s.")
}
} else {
throw new AnalysisException("The 'escape' parameter must be a string literal.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of thing should be done in checkInputDataTypes.

dongjoon-hyun pushed a commit that referenced this pull request Feb 11, 2020
… `like` function

### What changes were proposed in this pull request?

In the PR, I propose to revert the commit 8aebc80.

### Why are the changes needed?
See the concerns #27355 (comment)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020
… `like` function

In the PR, I propose to revert the commit 8aebc80.

See the concerns #27355 (comment)

No

By existing test suites.

Closes #27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… `like` function

### What changes were proposed in this pull request?

In the PR, I propose to revert the commit 8aebc80.

### Why are the changes needed?
See the concerns apache#27355 (comment)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes apache#27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@MaxGekk MaxGekk deleted the like-3-args branch June 5, 2020 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants