-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expression: avoid padding 0 when implicitly cast to binary #35053
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/08278cd94532850feb188317977d90f7a19f84ee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/run-mysql-test |
expression/builtin_cast.go
Outdated
// Avoid padding 0 when cast character string to binary, | ||
// which might affect string comparision result. | ||
// See https://github.com/pingcap/tidb/issues/34823 for details. | ||
if expr.GetType().GetCollate() == ec.Collation || ec.Charset == charset.CharsetBin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can not just return if the target charset is binary
.
because we had support charset gbk
, if charset of t1.a
is gbk
it will have a to_binary
function, after we return directly, it will be missed I think.
mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_8 | 4.00 | root | | CARTESIAN inner join, other cond:ge(cast(to_binary(test.t1.a), binary(20)), test.t2.b), le(cast(to_binary(test.t1.a), binary(20)), test.t2.c) |
| ├─TableReader_13(Build) | 2.00 | root | | data:TableFullScan_12 |
| │ └─TableFullScan_12 | 2.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
| └─TableReader_11(Probe) | 2.00 | root | | data:TableFullScan_10 |
| └─TableFullScan_10 | 2.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
see why we have to_binary
#29736
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;
The result is wrong because t3.a doesn't convert to binary collation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For gbk
, tidb will build a to_binary
is because gbk
doesn't belong to legacy charset and when argument is legacy charset (for example the default charset utf8mb4), there is no need to build to_binary
function according to the code HandleBinaryLiteral.
My change actually will still add the to_binary
function when the code build function for le/ge
.
mysql> drop table t1;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t1(a char(20)) default charset gbk;
Query OK, 0 rows affected (0.00 sec)
mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| Projection_9 | 20000.00 | root | | test.t1.a, test.t2.b, test.t2.c |
| └─HashJoin_11 | 20000.00 | root | | CARTESIAN inner join, other cond:ge(to_binary(test.t1.a), test.t2.b), le(to_binary(test.t1.a), test.t2.c) |
| ├─TableReader_13(Build) | 2.00 | root | | data:TableFullScan_12 |
| │ └─TableFullScan_12 | 2.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
| └─TableReader_15(Probe) | 10000.00 | root | | data:TableFullScan_14 |
| └─TableFullScan_14 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for
create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;
It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for
create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci; create table t3(a char(20) ) collate utf8mb4_general_ci; insert into t3 values ("a"); insert into t4 values (0x0, "A"); select * from t3, t4 where t3.a between t4.b and t4.c;It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...
The point is that we need to make t3.a <= t4.c use the binary collation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure the behavior of to_binary
. If it will pad 0 for other charsets then forcing t3.a <= t4.c to use the binary collation might still cause mismatch between tidb and mysql. And mysql actually doesn't do conversion when target charset is binary
.
Issue #34823 shows that the return result of mysql is not the same as using cast(test.t1.a, binary(20))
since it will pad 0 at the end of a and affect the comparison result.
create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;
This one is indeed a problem..
There are only following cases:
a | b | c | result |
---|---|---|---|
binary | any | any | cast in BuildCastCollationFunction is not necessary |
not binary | binary | not binary | should use binary collation but should not padding 0 |
not binary | not binary | binary | same as above |
not binary | binary | binary | not necessary |
not binary | not binary | not binary | not involved in this pr |
PTAL @xiongjiwei @wjhuang2016 |
/run-mysql-test |
/run-mysql-test |
/run-mysql-test |
4 similar comments
/run-mysql-test |
/run-mysql-test |
/run-mysql-test |
/run-mysql-test |
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-5.4 in PR #35557 |
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-6.0 in PR #35558 |
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-6.1 in PR #35559 |
TiDB MergeCI notify🔴 Bad News! [2] CI still failing after this pr merged.
|
* upstream/master: sessionctx: support encoding and decoding session variables (pingcap#35531) planner: add batch_point_get access object (pingcap#35230) sessionctx: set skipInit false for TiDBOptProjectionPushDown and TiDBOptAggPushDown (pingcap#35491) *: add support for disabling noop variables (pingcap#35496) lightning: add store write limiter (pingcap#35193) expression: avoid padding 0 when implicitly cast to binary (pingcap#35053) types: fix creating partition tables fail in ANSI_QUOTES mode (pingcap#35379) executor: add the missed runtime stats when the index merge partial task returns 0 row (pingcap#35297) statistics: batch insert topn and bucket when saving table stats (pingcap#35326) parser: Add support for INTERVAL expr unit + expr (pingcap#30253) (pingcap#35390) config: add missing nodelay example (pingcap#35255) *: Introduce `OptimisticTxnContextProvider` for optimistic txn (pingcap#35131) statistics: fix panic when using wrong globalStats.Indices key (pingcap#35424) *: fix store token is up to the limit in test (pingcap#35374) *: enable more flaky and update bazel config (pingcap#35500) ddl: expose getPrimaryKey() as a public method of model.TableInfo (pingcap#35512) expression, util: add `KeyWithoutTrimRightSpace` for collator (pingcap#35475) infoschema: try on each PD member until one succeeds or all fail (pingcap#35285)
What problem does this PR solve?
Issue Number: close #34823
Problem Summary:
For
between and
expression, implicitly casting string charset to binary string will pad 0 to the origin string, which causes different string comparison results from mysql.We can instead usemysql.TypeVarString
to avoid this problem. In order to match the behavior of mysql, we can instead avoid doing implicit casting here. See String::needs_conversion() in mysql-server for details.Modification rationale:
For one thing, we still need to build
cast
function to retain binary collation information till execution phase. For another, we can not keep using fixed length representation otherwise 0 padding will affect later string comparison result. Therefore, we need to use VarString type of cast here.// previously
// now
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.