Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two table's comparison in char(20) between binary(20) and binary(20) is different from MySQL #34823

Closed
River2000i opened this issue May 19, 2022 · 9 comments · Fixed by #35053
Assignees
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. severity/major sig/execution SIG execution type/bug The issue is confirmed as a bug.

Comments

@River2000i
Copy link
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

drop table if exists t1;drop table if exists t2;
create table t1(a char(20));
create table t2(b binary(20), c binary(20));
insert into t1 value('-1');
insert into t2 value(0x2D31, 0x67);
insert into t2 value(0x2D31, 0x73);
select * from t1, t2 where t1.a between t2.b and t2.c;

2. What did you expect to see? (Required)

mysql> select * from t1, t2 where t1.a between t2.b and t2.c;
Empty set (0.01 sec)

3. What did you see instead (Required)

mysql> select * from t1, t2 where t1.a between t2.b and t2.c;
+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.06 sec)

Tips : If it is a single table query, the result is the same as mysql.

drop table if exists t;
create table t(a char(20), b binary(20), c binary(20));
insert into t value('-1', 0x2D31, 0x67);
insert into t value('-1', 0x2D31, 0x73);
select a from t where a between b and c;
mysql> select a from t where a between b and c;
Empty set (0.05 sec)

4. What is your TiDB version? (Required)

Release Version: v6.1.0-nightly
Edition: Community
Git Commit Hash: 828a255
Git Branch: heads/refs/tags/v6.1.0-nightly
UTC Build Time: 2022-05-17 23:02:16
GoVersion: go1.18.2
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

@River2000i River2000i added the type/bug The issue is confirmed as a bug. label May 19, 2022
@River2000i River2000i changed the title Two table's comparison between char(20) in binary(20) and binary(20) is different from MySQL Two table's comparison in char(20) between binary(20) and binary(20) is different from MySQL May 20, 2022
@zanmato1984
Copy link
Contributor

I'm elevating the severity to major as it seems a typical wrong result.

@ti-chi-bot ti-chi-bot added may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.0 may-affects-6.1 labels May 23, 2022
@guo-shaoge
Copy link
Collaborator

guo-shaoge commented May 23, 2022

We got bug in type deducing for between expr.

As MySQL's behavior, we should not add cast(test.t1.a, binary(20)) when comparing t1.a with t2.b. Non-block issue.

MySQL [test]> explain select * from t1, t2 where t1.a >= t2.b;
+------------------------------+---------+-----------+---------------+-----------------------------------------------------------+
| id                           | estRows | task      | access object | operator info                                             |
+------------------------------+---------+-----------+---------------+-----------------------------------------------------------+
| HashJoin_9                   | 2.00    | root      |               | CARTESIAN inner join, other cond:ge(test.t1.a, test.t2.b) |
| ├─TableReader_12(Build)      | 1.00    | root      |               | data:Selection_11                                         |
| │ └─Selection_11             | 1.00    | cop[tikv] |               | not(isnull(test.t1.a))                                    |
| │   └─TableFullScan_10       | 1.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                            |
| └─TableReader_15(Probe)      | 2.00    | root      |               | data:Selection_14                                         |
|   └─Selection_14             | 2.00    | cop[tikv] |               | not(isnull(test.t2.b))                                    |
|     └─TableFullScan_13       | 2.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                            |
+------------------------------+---------+-----------+---------------+-----------------------------------------------------------+
7 rows in set (0.000 sec)

MySQL [test]> explain select * from t1, t2 where t1.a between t2.b and t2.c;
+-----------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------+
| id                          | estRows | task      | access object | operator info                                                                                                           |
+-----------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------+
| HashJoin_9                  | 2.00    | root      |               | CARTESIAN inner join, other cond:ge(cast(test.t1.a, binary(20)), test.t2.b), le(cast(test.t1.a, binary(20)), test.t2.c) |
| ├─TableReader_11(Build)     | 1.00    | root      |               | data:TableFullScan_10                                                                                                   |
| │ └─TableFullScan_10        | 1.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                                          |
| └─TableReader_13(Probe)     | 2.00    | root      |               | data:TableFullScan_12                                                                                                   |
|   └─TableFullScan_12        | 2.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                                                          |
+-----------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------+
5 rows in set (0.000 sec)

@Willendless
Copy link
Contributor

Willendless commented May 23, 2022

For the single table query above, even though the output is desirable, according to the execution plan

+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
| id                        | estRows | task      | access object | operator info                                                                      |
+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
| Projection_4              | 2.40    | root      |               | test.t.a                                                                           |
| └─TableReader_7           | 2.40    | root      |               | data:Selection_6                                                                   |
|   └─Selection_6           | 2.40    | cop[tikv] |               | ge(cast(test.t.a, binary(20)), test.t.b), le(cast(test.t.a, binary(20)), test.t.c) |
|     └─TableFullScan_5     | 3.00    | cop[tikv] | table:t       | keep order:false, stats:pseudo                                                     |
+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

it can be found that tikv coprocessor is the one who actually executes the ge/le operator, while in the origin query it is the tidb's task. By prohibiting the push down behavior using insert into mysql.expr_pushdown_blacklist values('cast', 'tikv',''); and admin reload expr_pushdown_blacklist;, the execution plan turns to be

+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
| id                        | estRows | task      | access object | operator info                                                                      |
+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
| Projection_4              | 2.40    | root      |               | test.t.a                                                                           |
| └─Selection_7             | 2.40    | root      |               | ge(cast(test.t.a, binary(20)), test.t.b), le(cast(test.t.a, binary(20)), test.t.c) |
|   └─TableReader_6         | 3.00    | root      |               | data:TableFullScan_5                                                               |
|     └─TableFullScan_5     | 3.00    | cop[tikv] | table:t       | keep order:false, stats:pseudo                                                     |
+---------------------------+---------+-----------+---------------+------------------------------------------------------------------------------------+
4 rows in set, 2 warnings (0.00 sec)

And now the result is undesirable again.

mysql> select a from t where a between b and c;

+------+
| a    |
+------+
| -1   |
| -1   |
+------+
2 rows in set (0.01 sec)

Therefore, operator execution in tikv side and tidb side is inconsistent.

@Willendless
Copy link
Contributor

Willendless commented May 23, 2022

On the other hand, by opening a client using --column-type-info, it can be found that the default collation for char(20) is utf8mb4_bin(tidb) instead of utf8mb4_general_ci(mysql). Now the bug might be proved by the following two queries slightly modified from the above single query.

In mysqldb

mysql> select a from t where (a collate utf8mb4_bin) between b and c;
Field   1:  `a`
Catalog:    `def`
Database:   `test`
Table:      `t`
Org_table:  `t`
Type:       STRING
Collation:  utf8mb4_general_ci (45)
Length:     80
Max_length: 2
Decimals:   0
Flags:      BINARY


+------+
| a    |
+------+
| -1   |
| -1   |
+------+
2 rows in set (0.00 sec)

In tidb (push down forbidden)

mysql> select a from t where (a collate utf8mb4_general_ci) between b and c;
Field   1:  `a`
Catalog:    `def`
Database:   `test`
Table:      `t`
Org_table:  `t`
Type:       STRING
Collation:  utf8mb4_bin (46)
Length:     80
Max_length: 0
Decimals:   0
Flags:


0 rows in set (0.04 sec)

Besides, if user sets the default collation to be utf8mb4_bin. According to charset-collation-coercibility, If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. Then the type inference phase may instead cast binary to char, like select a from t where a between cast(b as char(20)) and cast(c as char(20));.

@zanmato1984
Copy link
Contributor

Thanks @Willendless for the deep investigation. Could you summarize your findings in like 2 or 3 sentences?

@guo-shaoge
Copy link
Collaborator

guo-shaoge commented May 24, 2022

Results are different when cast is disabled to pushdown:

drop table if exists t;
create table t(a char(20), b binary(20), c binary(20));
insert into t value('-1', 0x2D31, 0x67);
insert into t value('-1', 0x2D31, 0x73);
select a from t where a between b and c;


MySQL [test]> select a from t where a between b and c;
Empty set (12 min 27.892 sec)

MySQL [test]> insert into mysql.expr_pushdown_blacklist values('cast', 'tikv','');
Query OK, 1 row affected (0.001 sec)

MySQL [test]> admin reload expr_pushdown_blacklist;
Query OK, 0 rows affected (0.001 sec)

MySQL [test]> select a from t where a between b and c;
+------+
| a    |
+------+
| -1   |
| -1   |
+------+
2 rows in set (0.001 sec)

Why :

  1. We use builtinFunc.collationInfo.collation instead of builtinFunc.tp.collate when pushdown cast expr. Check scalarFuncToPBExpr
  2. Unfortunately, the collate info in builtinFunc.collationInfo.collation is wrong for cast expr. Check deriveCollation

In a word, we deduce wrong collation info for cast expr. I open a new issue for this #34921.

@jebter jebter added the affects-6.1 This bug affects the 6.1.x(LTS) versions. label May 24, 2022
@Willendless
Copy link
Contributor

Willendless commented May 24, 2022

Thanks @Willendless for the deep investigation. Could you summarize your findings in like 2 or 3 sentences?

Sure, basically I have three findings.

  1. First, as @guo-shaoge said, ge(cast(test.t.a, binary(20)), test.t.b), le(cast(test.t.a, binary(20)), test.t.c) operator execution is inconsistent in tidb side and tikv coprocessor side.

  2. Secondly, the default collation for char(20) should be utf8mb4_general_ci instead of current utf8mb4_bin. And the default value for both collation_server/collation_database should be utf8mb4_0900_ai_ci.

// tidb

mysql> SELECT @@collation_server, @@collation_database;
+--------------------+----------------------+
| @@collation_server | @@collation_database |
+--------------------+----------------------+
| utf8mb4_bin        | utf8mb4_bin          |
+--------------------+----------------------+
1 row in set (0.00 sec)

// mysql

mysql> SELECT  @@collation_server, @@collation_database;
+--------------------+----------------------+
| @@collation_server | @@collation_database |
+--------------------+----------------------+
| utf8mb4_0900_ai_ci | utf8mb4_0900_ai_ci   |
+--------------------+----------------------+
1 row in set (0.01 sec)
  1. Thirdly, I doubt the implicit cast(test.t.a, binary(20)) is wrong, since charset-collation-coercibility stated that we should cast the the non-Unicode charset to Unicode charset instead of the other way around.

@zanmato1984
Copy link
Contributor

I have some concerns on your bullet 2. I think the default collation is a configurable item in tidb server. So would you investigate more on how to config the sever's default collation and do your experiments again?

@Willendless
Copy link
Contributor

Willendless commented May 26, 2022

summaries

  • In mysql, while executing query select a from t where a between b and c;, it will implicitly do the following type conversion select a from t where cast(a as binary) between b and c; instead of cast(a as binary(20)) as what tidb did.
  • In mysql, while executing query select a from t where a between b and c;, it will not do implicit type conversion since it is not necessary. See String::needs_conversion() in mysql-server for details.
  • The reason why select a from t where (a collate utf8mb4_bin) between b and c; and select a from t where (a collate utf8mb4_general_ci) between b and c; will have different output from the origin query in mysql is that collate clause will force mysql to convert binary string b and c to char string, namely implicitly does cast(b as char) or cast(b as char(20)).

experiments

I basically did the following experiments in mysql to prove my above arguments. And I use concat() to check the implicit type conversion behavior of mysql.

For point 1,

mysql> select concat(a, b) from t;

+------------------------------------------------+
| concat(a, b)                                   |
+------------------------------------------------+
| 0x2D312D31000000000000000000000000000000000000 |
| 0x2D312D31000000000000000000000000000000000000 |
+------------------------------------------------+
2 rows in set (0.01 sec)

mysql> select concat(cast(a as binary(20)), b) from t;

+------------------------------------------------------------------------------------+
| concat(cast(a as binary(20)), b)                                                   |
+------------------------------------------------------------------------------------+
| 0x2D310000000000000000000000000000000000002D31000000000000000000000000000000000000 |
| 0x2D310000000000000000000000000000000000002D31000000000000000000000000000000000000 |
+------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> select concat(cast(a as binary), b) from t;

+------------------------------------------------------------+
| concat(cast(a as binary), b)                               |
+------------------------------------------------------------+
| 0x2D312D31000000000000000000000000000000000000             |
| 0x2D312D31000000000000000000000000000000000000             |
+------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> select * from t where a = b;

0 rows in set (0.00 sec)

mysql> select * from t where cast(a as binary(20)) = b;

+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.00 sec)

mysql> select * from t where cast(a as binary) = b;

0 rows in set (0.00 sec)

Note that I use '=' operator in the last three queries. I will explain this in the final section.

For point 2,

mysql> select concat((a collate utf8mb4_bin), b), length(concat((a collate utf8mb4_bin), b)) from t;

+------------------------------------+--------------------------------------------+
| concat((a collate utf8mb4_bin), b) | length(concat((a collate utf8mb4_bin), b)) |
+------------------------------------+--------------------------------------------+
| -1-1                               |                                         22 |
| -1-1                               |                                         22 |
+------------------------------------+--------------------------------------------+
2 rows in set (0.00 sec)

mysql> select concat((a collate utf8mb4_general_ci), b), length(concat((a collate utf8mb4_general_ci), b)) from t;

+-------------------------------------------+---------------------------------------------------+
| concat((a collate utf8mb4_general_ci), b) | length(concat((a collate utf8mb4_general_ci), b)) |
+-------------------------------------------+---------------------------------------------------+
| -1-1                                      |                                                22 |
| -1-1                                      |                                                22 |
+-------------------------------------------+---------------------------------------------------+
2 rows in set (0.00 sec)

mysql> select concat(a, cast(b as char)), length(concat(a, cast(b as char))) from t;

+----------------------------+------------------------------------+
| concat(a, cast(b as char)) | length(concat(a, cast(b as char))) |
+----------------------------+------------------------------------+
| -1-1                       |                                 22 |
| -1-1                       |                                 22 |
+----------------------------+------------------------------------+
2 rows in set (0.01 sec)

mysql> select concat(a, cast(b as char(20))), length(concat(a, cast(b as char(20)))) from t;

+--------------------------------+----------------------------------------+
| concat(a, cast(b as char(20))) | length(concat(a, cast(b as char(20)))) |
+--------------------------------+----------------------------------------+
| -1-1                           |                                     22 |
| -1-1                           |                                     22 |
+--------------------------------+----------------------------------------+
2 rows in set (0.00 sec)

mysql> select * from t where (a collate utf8mb4_bin)  >= b;

+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.00 sec)

mysql> select * from t where a  >= cast(b as char);

+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.01 sec)

mysql> select * from t where a  >= cast(b as char(20));

+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.00 sec)

explanation

Several questions can be answered based on my above theory

  1. Why using cast(a as binary(20)) will lead to different output from the origin query?

Because by casting string in column a to binary(20), "0x2D31" will be extracted and then padded with "0" until its length is equal to 20, which will be exactly the same as the value in column b. And this is why I use = in experiments of point 1.

  1. Why using a collate utf8mb4_bin will lead to different output from cast(a as binary(20))?

According to mysql documentation, 10.8.4 Collation Coercibility in Expressions, by using collate clause, the coercibility of column a becomes 0, namely automatic character set conversion will be applied to column b before doing comparison between column b and a. Therefore, binary string in b will be converted to character string first, which means that each of its padding 0 will be converted to \0(ASCII NUL character). This can be further proved by

mysql> select * from t where concat(substring(a, 1, 2), '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0') = cast(b as char(20));

+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.01 sec)

@ChenPeng2013 ChenPeng2013 added affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 labels Jun 10, 2022
@ti-chi-bot ti-chi-bot removed may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.0 labels Jun 10, 2022
@wjhuang2016 wjhuang2016 removed may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. labels Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. severity/major sig/execution SIG execution type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants