Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expression: avoid padding 0 when implicitly cast to binary #35053

Merged
merged 23 commits into from
Jun 21, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
5baa9b2
fix(expression): avoid padding 0 when implicitly cast to binary
Willendless May 30, 2022
1b91a40
fix: avoid doing cast when agg charset is binary
Willendless Jun 6, 2022
54209f3
chore: remove old workaround
Willendless Jun 6, 2022
8a7b314
Merge branch 'master' into fix-issue-34823
zanmato1984 Jun 9, 2022
9800be0
Merge branch 'pingcap:master' into fix-issue-34823
Willendless Jun 11, 2022
2d2990c
fix: use VarString type of cast
Willendless Jun 11, 2022
11b4d2b
test: add between expr binary collation in single side test
Willendless Jun 11, 2022
56a1190
Merge remote-tracking branch 'origin' into fix-issue-34823
Willendless Jun 15, 2022
cd70155
feat: use VarString type of cast for in expression
Willendless Jun 16, 2022
db1248f
test: add unit test for in expression
Willendless Jun 16, 2022
bdb5de1
chore: migrate tests to explain test
Willendless Jun 17, 2022
1f0da0b
Merge branch 'pingcap:master' into fix-issue-34823
Willendless Jun 17, 2022
8cb262c
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
e010a6b
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
bc17196
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
e5b5134
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
078019a
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
8e668ab
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
427c947
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
d1c04e5
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
10588ae
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
bce6632
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 20, 2022
08278cd
Merge branch 'master' into fix-issue-34823
ti-chi-bot Jun 21, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion expression/builtin_cast.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ import (

"github.com/pingcap/errors"
"github.com/pingcap/tidb/parser/ast"
"github.com/pingcap/tidb/parser/charset"
"github.com/pingcap/tidb/parser/model"
"github.com/pingcap/tidb/parser/mysql"
"github.com/pingcap/tidb/parser/terror"
Expand Down Expand Up @@ -1830,7 +1831,10 @@ func BuildCastCollationFunction(ctx sessionctx.Context, expr Expression, ec *Exp
if expr.GetType().EvalType() != types.ETString {
return expr
}
if expr.GetType().GetCollate() == ec.Collation {
// Avoid padding 0 when cast character string to binary,
// which might affect string comparision result.
// See https://github.com/pingcap/tidb/issues/34823 for details.
if expr.GetType().GetCollate() == ec.Collation || ec.Charset == charset.CharsetBin {
Copy link
Contributor

@xiongjiwei xiongjiwei Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can not just return if the target charset is binary.

because we had support charset gbk, if charset of t1.a is gbk it will have a to_binary function, after we return directly, it will be missed I think.

mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| id                          | estRows | task      | access object | operator info                                                                                                                                 |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_8                  | 4.00    | root      |               | CARTESIAN inner join, other cond:ge(cast(to_binary(test.t1.a), binary(20)), test.t2.b), le(cast(to_binary(test.t1.a), binary(20)), test.t2.c) |
| ├─TableReader_13(Build)     | 2.00    | root      |               | data:TableFullScan_12                                                                                                                         |
| │ └─TableFullScan_12        | 2.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                                                                                |
| └─TableReader_11(Probe)     | 2.00    | root      |               | data:TableFullScan_10                                                                                                                         |
|   └─TableFullScan_10        | 2.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                                                                |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

see why we have to_binary #29736

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

The result is wrong because t3.a doesn't convert to binary collation.

Copy link
Contributor Author

@Willendless Willendless Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For gbk, tidb will build a to_binary is because gbk doesn't belong to legacy charset and when argument is legacy charset (for example the default charset utf8mb4), there is no need to build to_binary function according to the code HandleBinaryLiteral.

My change actually will still add the to_binary function when the code build function for le/ge.

mysql> drop table t1;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t1(a char(20)) default charset gbk;
Query OK, 0 rows affected (0.00 sec)

mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| id                            | estRows  | task      | access object | operator info                                                                                             |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| Projection_9                  | 20000.00 | root      |               | test.t1.a, test.t2.b, test.t2.c                                                                           |
| └─HashJoin_11                 | 20000.00 | root      |               | CARTESIAN inner join, other cond:ge(to_binary(test.t1.a), test.t2.b), le(to_binary(test.t1.a), test.t2.c) |
|   ├─TableReader_13(Build)     | 2.00     | root      |               | data:TableFullScan_12                                                                                     |
|   │ └─TableFullScan_12        | 2.00     | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                                            |
|   └─TableReader_15(Probe)     | 10000.00 | root      |               | data:TableFullScan_14                                                                                     |
|     └─TableFullScan_14        | 10000.00 | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                            |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...

The point is that we need to make t3.a <= t4.c use the binary collation.

Copy link
Contributor Author

@Willendless Willendless Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the behavior of to_binary. If it will pad 0 for other charsets then forcing t3.a <= t4.c to use the binary collation might still cause mismatch between tidb and mysql. And mysql actually doesn't do conversion when target charset is binary.

Issue #34823 shows that the return result of mysql is not the same as using cast(test.t1.a, binary(20)) since it will pad 0 at the end of a and affect the comparison result.

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

This one is indeed a problem..

There are only following cases:

a b c result
binary any any cast in BuildCastCollationFunction is not necessary
not binary binary not binary should use binary collation but should not padding 0
not binary not binary binary same as above
not binary binary binary not necessary
not binary not binary not binary not involved in this pr

return expr
}
tp := expr.GetType().Clone()
Expand Down
16 changes: 16 additions & 0 deletions planner/core/expression_rewriter_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -372,6 +372,22 @@ func TestBetweenExprCollation(t *testing.T) {
tk.MustGetErrMsg("select * from t1 where a between 'B' collate utf8mb4_general_ci and c collate utf8mb4_unicode_ci;", "[expression:1270]Illegal mix of collations (latin1_bin,IMPLICIT), (utf8mb4_general_ci,EXPLICIT), (utf8mb4_unicode_ci,EXPLICIT) for operation 'BETWEEN'")
}

func TestIssue34823(t *testing.T) {
store, clean := testkit.CreateMockStore(t)
defer clean()
tk := testkit.NewTestKit(t, store)
tk.MustExec("use test;")
tk.MustExec("drop table if exists t1;")
tk.MustExec("drop table if exists t2;")
tk.MustExec("create table t1(a char(20));")
tk.MustExec("create table t2(b binary(20), c binary(20));")
tk.MustExec("insert into t1 value('-1');")
tk.MustExec("insert into t2 value(0x2D31, 0x67);")
tk.MustExec("insert into t2 value(0x2D31, 0x73);")
tk.MustQuery("select a from t1, t2 where t1.a between t2.b and t2.c;").Check(testkit.Rows())
tk.MustQuery("select a from t1, t2 where cast(t1.a as binary(20)) between t2.b and t2.c;").Check(testkit.Rows("-1", "-1"))
}

func TestInsertOnDuplicateLazyMoreThan1Row(t *testing.T) {
store, clean := testkit.CreateMockStore(t)
defer clean()
Expand Down