[SPARK-22771][SQL] Concatenate binary inputs into a binary output #19977
Test build #84904 has finished for PR 19977 at commit
|
Test build #84910 has finished for PR 19977 at commit
|
Could you confirm whether Hive behaves the same? |
ok |
checked;
override def inputTypes: Seq[AbstractDataType] =
  Seq.fill(children.size)(if (isBinaryMode) BinaryType else StringType)
override def dataType: DataType = if (isBinaryMode) BinaryType else StringType
Shouldn't we worry about backward compatibility?
Yea, we should. Is there an existing option for keeping backward compatibility? Or how about adding a new option?
A conf is needed for sure. We also need a Migration Guide.
ok
Is it ok to add a new option for this case only? If we keep adding a new option for each case, the number of options could blow up.
@@ -50,15 +51,23 @@ import org.apache.spark.unsafe.types.{ByteArray, UTF8String}
  """)
case class Concat(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType)
  override def dataType: DataType = StringType
  private lazy val isBinaryMode = children.nonEmpty && children.forall(_.dataType == BinaryType)
If all inputs are binary, concat also outputs binary.
Is this true in Hive and others?
I'll check some patterns.
pg and hive have the same behavior:
postgres=# create table t1(a bytea, b bytea, c varchar, d varchar);
postgres=# create view v1 as select a || b || c || d from t1;
postgres=# \d v1
View "public.view41_1"
Column | Type | Modifiers
----------+------+-----------
?column? | text |
hive> create table t1(a binary, b binary, c text, d test);
hive> create view v1 as select a || b || c || d from t1;
hive> describe v1;
_c0 string
Thank you for confirming it! Below is the behavior of DB2
aha, thanks for the info!
I checked the db2 behaviour and found that db2 seems to have a slightly different casting rule.
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0000736.html?view=kc
IIUC, in db2, the type of concat(binary, string) is binary?
I also checked mysql: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_concat
recap:
hive, postgresql: concat(binary, string) => string
mysql, db2: concat(binary, string) => binary
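The rule under review in this thread (binary output only when every input is binary, string output otherwise) can be sketched in a few lines. This is a toy model, not Spark code: `SqlType`, `BinaryT`, `StringT`, and `resultType` are hypothetical names that only mirror the `isBinaryMode` check from the diff above.

```scala
object ConcatTypingSketch {
  // Hypothetical stand-ins for Spark's BinaryType and StringType.
  sealed trait SqlType
  case object BinaryT extends SqlType
  case object StringT extends SqlType

  // Binary output only when there is at least one input and all inputs are binary;
  // a mixed argument list or an empty one falls back to string.
  def resultType(inputs: Seq[SqlType]): SqlType =
    if (inputs.nonEmpty && inputs.forall(_ == BinaryT)) BinaryT else StringT
}
```

Under this rule `concat(binary, string)` resolves to string, which matches the Hive and PostgreSQL row of the recap rather than the MySQL and DB2 row.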
Test build #84950 has finished for PR 19977 at commit
|
Test build #84951 has finished for PR 19977 at commit
|
Test build #84953 has finished for PR 19977 at commit
|
Test build #84963 has finished for PR 19977 at commit
|
Test build #84987 has finished for PR 19977 at commit
|
retest this please |
Test build #84996 has finished for PR 19977 at commit
|
retest this please |
Test build #85000 has finished for PR 19977 at commit
|
oh... |
retest this please |
Test build #85005 has finished for PR 19977 at commit
|
retest this please |
Test build #85031 has finished for PR 19977 at commit
|
@@ -1035,6 +1035,12 @@ object SQLConf {
      .booleanConf
      .createWithDefault(true)

  val ConcatBinaryModeEnabled = buildConf("spark.sql.expression.concat.binaryMode.enabled")
-> spark.sql.typeCoercion.concatBinaryAsString
I'll update it after the reviews are finished.
Maybe CONCAT_BINARY_AS_STRING_ENABLED?
-> spark.sql.function.concatBinaryAsString
@maropu No need to re-trigger it. The failure is not caused by this PR. |
Will review it tomorrow. Thanks! |
I found different behaviours in string functions
You mean the MySQL results are unexpected? I think it's common for these DBs to behave differently, while Spark mainly follows Hive.
Test build #85451 has finished for PR 19977 at commit
|
Test build #85454 has finished for PR 19977 at commit
|
@@ -24,3 +24,17 @@ select left("abcd", 2), left("abcd", 5), left("abcd", '2'), left("abcd", null);
select left(null, -2), left("abcd", -2), left("abcd", 0), left("abcd", 'a');
select right("abcd", 2), right("abcd", 5), right("abcd", '2'), right("abcd", null);
select right(null, -2), right("abcd", -2), right("abcd", 0), right("abcd", 'a');

-- turn on concatBinaryAsString
set spark.sql.function.concatBinaryAsString=false;
turn on?
Since most other DBMS-like systems concat binary inputs as binary, IMO turning it off by default is okay.
I meant that the comment (L28) says "turn on".
oh....
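To make the effect of the flag concrete, here is a value-level sketch of what `spark.sql.function.concatBinaryAsString` toggles for all-binary inputs. It is illustrative only: `ConcatFlagSketch` and `concatBinary` are hypothetical names, and decoding the bytes as UTF-8 when the flag is on is an assumption about how the binary-to-string cast behaves.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ConcatFlagSketch {
  // Left  = string result (flag on: binary inputs are treated as strings).
  // Right = binary result (flag off: bytes are concatenated as-is).
  def concatBinary(
      inputs: Seq[Array[Byte]],
      concatBinaryAsString: Boolean): Either[String, Array[Byte]] =
    if (concatBinaryAsString) {
      // Assumption: the cast to string decodes each input as UTF-8.
      Left(inputs.map(bytes => new String(bytes, UTF_8)).mkString)
    } else {
      Right(Array.concat(inputs: _*))
    }
}
```

With the flag set to `false`, as in the test file above, the bytes are concatenated untouched, so the statement turns the string conversion off rather than on, which is what the review comment points out.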
LGTM |
Test build #85473 has finished for PR 19977 at commit
|
Try
|
ah, ok. good catch. I'll fix soon. |
Test build #85495 has finished for PR 19977 at commit
|
@@ -653,7 +660,11 @@ object CombineConcats extends Rule[LogicalPlan] {
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan.transformExpressionsDown {
    case concat: Concat if concat.children.exists(_.isInstanceOf[Concat]) =>
    case concat: Concat if concat.children.exists {
Create a dedicated helper function for the if condition?
ok
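One way the suggested helper could look, sketched on a toy expression tree rather than on Catalyst. `Expr`, `Lit`, `CastToString`, `hasNestedConcats`, and `flatten` are hypothetical names for illustration. The extra cast case is an assumption about why the guard grew beyond `isInstanceOf[Concat]`: a nested concat with binary output may reach the rule wrapped in a cast to string.

```scala
object CombineConcatsSketch {
  // Hypothetical mini expression tree standing in for Catalyst expressions.
  sealed trait Expr
  final case class Lit(value: String) extends Expr
  final case class Concat(children: Seq[Expr]) extends Expr
  final case class CastToString(child: Expr) extends Expr

  // Dedicated helper for the guard: true when some child is a Concat,
  // either directly or wrapped in a cast to string.
  def hasNestedConcats(concat: Concat): Boolean = concat.children.exists {
    case _: Concat => true
    case CastToString(_: Concat) => true
    case _ => false
  }

  // Flatten nested concats into one Concat, preserving argument order.
  // Children pulled out of a cast concat keep the cast individually.
  def flatten(concat: Concat): Concat = {
    def leaves(e: Expr): Seq[Expr] = e match {
      case Concat(children) => children.flatMap(leaves)
      case CastToString(Concat(children)) =>
        children.flatMap(c => leaves(CastToString(c)))
      case other => Seq(other)
    }
    Concat(concat.children.flatMap(leaves))
  }

  // The rule body then reads as a single guarded case.
  def combine(e: Expr): Expr = e match {
    case c: Concat if hasNestedConcats(c) => flatten(c)
    case other => other
  }
}
```

Pulling the predicate into a named function keeps the pattern-match guard to one line and lets the predicate be tested separately from the flattening.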
Test build #85511 has finished for PR 19977 at commit
|
retest this please |
Test build #85535 has finished for PR 19977 at commit
|
LGTM Thanks! Merged to master. |
thanks, I'll fix |
…tDataTypes
## What changes were proposed in this pull request?
This pr is a follow-up to fix a bug left in #19977.
## How was this patch tested?
Added tests in `StringExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #20149 from maropu/SPARK-22771-FOLLOWUP.
(cherry picked from commit 6f68316)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This pr modified `elt` to output binary for binary inputs. `elt` in the current master always outputs data as a string. But, in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary (also, this might be a small surprise). This pr is related to #19977.
## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #20135 from maropu/SPARK-22937.
(cherry picked from commit e8af7e8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This pr modified `concat` to concat binary inputs into a single binary output. `concat` in the current master always outputs data as a string. But, in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary.
## How was this patch tested?
Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`.