Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expression: avoid padding 0 when implicitly cast to binary #35053

Merged
merged 23 commits into from
Jun 21, 2022
Merged

expression: avoid padding 0 when implicitly cast to binary #35053

merged 23 commits into from
Jun 21, 2022

Conversation

Willendless
Copy link
Contributor

@Willendless Willendless commented May 30, 2022

What problem does this PR solve?

Issue Number: close #34823

Problem Summary:

For between and expression, implicitly casting string charset to binary string will pad 0 to the origin string, which causes different string comparison results from mysql. We can instead use mysql.TypeVarStringto avoid this problem. In order to match the behavior of mysql, we can instead avoid doing implicit casting here. See String::needs_conversion() in mysql-server for details.

Modification rationale:

For one thing, we still need to build cast function to retain binary collation information till execution phase. For another, we can not keep using fixed length representation otherwise 0 padding will affect later string comparison result. Therefore, we need to use VarString type of cast here.

drop table if exists t1;drop table if exists t2;
create table t1(a char(20));
create table t2(b binary(20), c binary(20));
insert into t1 value('-1');
insert into t2 value(0x2D31, 0x67);
insert into t2 value(0x2D31, 0x73);
select * from t1, t2 where t1.a between t2.b and t2.c;

// previously

mysql> select * from t1, t2 where t1.a between t2.b and t2.c;
+------+--------------------------------------------+--------------------------------------------+
| a    | b                                          | c                                          |
+------+--------------------------------------------+--------------------------------------------+
| -1   | 0x2D31000000000000000000000000000000000000 | 0x7300000000000000000000000000000000000000 |
| -1   | 0x2D31000000000000000000000000000000000000 | 0x6700000000000000000000000000000000000000 |
+------+--------------------------------------------+--------------------------------------------+
2 rows in set (0.06 sec)

// now

mysql> select * from t1, t2 where t1.a between t2.b and t2.c;
Empty set (0.01 sec)

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented May 30, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • wjhuang2016
  • zanmato1984

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels May 30, 2022
@sre-bot
Copy link
Contributor

sre-bot commented May 31, 2022

@Willendless Willendless changed the title fix(expression): avoid padding 0 when implicitly cast to binary [WIP] fix(expression): avoid padding 0 when implicitly cast to binary May 31, 2022
@ti-chi-bot ti-chi-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2022
@Willendless Willendless changed the title [WIP] fix(expression): avoid padding 0 when implicitly cast to binary fix(expression): avoid padding 0 when implicitly cast to binary Jun 8, 2022
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2022
@Willendless Willendless changed the title fix(expression): avoid padding 0 when implicitly cast to binary expression: avoid padding 0 when implicitly cast to binary Jun 8, 2022
Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 9, 2022
@wjhuang2016
Copy link
Member

/run-mysql-test

// Avoid padding 0 when cast character string to binary,
// which might affect string comparision result.
// See https://github.com/pingcap/tidb/issues/34823 for details.
if expr.GetType().GetCollate() == ec.Collation || ec.Charset == charset.CharsetBin {
Copy link
Contributor

@xiongjiwei xiongjiwei Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can not just return if the target charset is binary.

because we had support charset gbk, if charset of t1.a is gbk it will have a to_binary function, after we return directly, it will be missed I think.

mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| id                          | estRows | task      | access object | operator info                                                                                                                                 |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_8                  | 4.00    | root      |               | CARTESIAN inner join, other cond:ge(cast(to_binary(test.t1.a), binary(20)), test.t2.b), le(cast(to_binary(test.t1.a), binary(20)), test.t2.c) |
| ├─TableReader_13(Build)     | 2.00    | root      |               | data:TableFullScan_12                                                                                                                         |
| │ └─TableFullScan_12        | 2.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                                                                                |
| └─TableReader_11(Probe)     | 2.00    | root      |               | data:TableFullScan_10                                                                                                                         |
|   └─TableFullScan_10        | 2.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                                                                |
+-----------------------------+---------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

see why we have to_binary #29736

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

The result is wrong because t3.a doesn't convert to binary collation.

Copy link
Contributor Author

@Willendless Willendless Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For gbk, tidb will build a to_binary is because gbk doesn't belong to legacy charset and when argument is legacy charset (for example the default charset utf8mb4), there is no need to build to_binary function according to the code HandleBinaryLiteral.

My change actually will still add the to_binary function when the code build function for le/ge.

mysql> drop table t1;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t1(a char(20)) default charset gbk;
Query OK, 0 rows affected (0.00 sec)

mysql> desc select * from t1, t2 where t1.a between t2.b and t2.c;
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| id                            | estRows  | task      | access object | operator info                                                                                             |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
| Projection_9                  | 20000.00 | root      |               | test.t1.a, test.t2.b, test.t2.c                                                                           |
| └─HashJoin_11                 | 20000.00 | root      |               | CARTESIAN inner join, other cond:ge(to_binary(test.t1.a), test.t2.b), le(to_binary(test.t1.a), test.t2.c) |
|   ├─TableReader_13(Build)     | 2.00     | root      |               | data:TableFullScan_12                                                                                     |
|   │ └─TableFullScan_12        | 2.00     | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                                            |
|   └─TableReader_15(Probe)     | 10000.00 | root      |               | data:TableFullScan_14                                                                                     |
|     └─TableFullScan_14        | 10000.00 | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                            |
+-------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

It is the expected behavior that t3.a doesn't convert to binary collation since its charset (utf8mb4) is legacy...

The point is that we need to make t3.a <= t4.c use the binary collation.

Copy link
Contributor Author

@Willendless Willendless Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the behavior of to_binary. If it will pad 0 for other charsets then forcing t3.a <= t4.c to use the binary collation might still cause mismatch between tidb and mysql. And mysql actually doesn't do conversion when target charset is binary.

Issue #34823 shows that the return result of mysql is not the same as using cast(test.t1.a, binary(20)) since it will pad 0 at the end of a and affect the comparison result.

create table t4(b binary(20), c char(20)) collate utf8mb4_general_ci;
create table t3(a char(20) ) collate utf8mb4_general_ci;
insert into t3 values ("a");
insert into t4 values (0x0, "A");
select * from t3, t4 where t3.a between t4.b and t4.c;

This one is indeed a problem..

There are only following cases:

a b c result
binary any any cast in BuildCastCollationFunction is not necessary
not binary binary not binary should use binary collation but should not padding 0
not binary not binary binary same as above
not binary binary binary not necessary
not binary not binary not binary not involved in this pr

@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 11, 2022
@Willendless
Copy link
Contributor Author

Willendless commented Jun 14, 2022

PTAL @xiongjiwei @wjhuang2016

@zanmato1984
Copy link
Contributor

/run-mysql-test

@zanmato1984
Copy link
Contributor

/run-mysql-test

@zanmato1984
Copy link
Contributor

/run-mysql-test

4 similar comments
@zanmato1984
Copy link
Contributor

/run-mysql-test

@zanmato1984
Copy link
Contributor

/run-mysql-test

@zanmato1984
Copy link
Contributor

/run-mysql-test

@zanmato1984
Copy link
Contributor

/run-mysql-test

@ti-chi-bot ti-chi-bot merged commit 49603af into pingcap:master Jun 21, 2022
ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jun 21, 2022
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-5.4 in PR #35557

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jun 21, 2022
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-6.0 in PR #35558

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jun 21, 2022
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-6.1 in PR #35559

@sre-bot
Copy link
Contributor

sre-bot commented Jun 21, 2022

TiDB MergeCI notify

🔴 Bad News! [2] CI still failing after this pr merged.
These failed integration tests don't seem to be introduced by the current PR.

CI Name Result Duration Compare with Parent commit
idc-jenkins-ci-tidb/integration-common-test 🔴 failed 1, success 10, total 11 23 min Existing failure
idc-jenkins-ci-tidb/common-test 🔴 failed 2, success 10, total 12 13 min Existing failure
idc-jenkins-ci/integration-cdc-test 🟢 all 34 tests passed 24 min Existing passed
idc-jenkins-ci-tidb/integration-ddl-test 🟢 all 6 tests passed 7 min 15 sec Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2 🟢 all 28 tests passed 6 min 24 sec Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1 🟢 all 26 tests passed 5 min 54 sec Existing passed
idc-jenkins-ci-tidb/tics-test 🟢 all 1 tests passed 5 min 24 sec Existing passed
idc-jenkins-ci-tidb/mybatis-test 🟢 all 1 tests passed 3 min 14 sec Existing passed
idc-jenkins-ci-tidb/integration-compatibility-test 🟢 all 1 tests passed 3 min 3 sec Existing passed
idc-jenkins-ci-tidb/plugin-test 🟢 build success, plugin test success 4min Existing passed

morgo added a commit to morgo/tidb that referenced this pull request Jun 28, 2022
* upstream/master:
  sessionctx: support encoding and decoding session variables (pingcap#35531)
  planner: add batch_point_get access object (pingcap#35230)
  sessionctx: set skipInit false for TiDBOptProjectionPushDown and TiDBOptAggPushDown (pingcap#35491)
  *: add support for disabling noop variables (pingcap#35496)
  lightning: add store write limiter (pingcap#35193)
  expression: avoid padding 0 when implicitly cast to binary (pingcap#35053)
  types: fix creating partition tables fail in ANSI_QUOTES mode (pingcap#35379)
  executor: add the missed runtime stats when the index merge partial task returns 0 row (pingcap#35297)
  statistics: batch insert topn and bucket when saving table stats (pingcap#35326)
  parser: Add support for INTERVAL expr unit + expr (pingcap#30253) (pingcap#35390)
  config: add missing nodelay example (pingcap#35255)
  *: Introduce `OptimisticTxnContextProvider` for optimistic txn (pingcap#35131)
  statistics: fix panic when using wrong globalStats.Indices key (pingcap#35424)
  *: fix store token is up to the limit in test (pingcap#35374)
  *: enable more flaky and update bazel config (pingcap#35500)
  ddl: expose getPrimaryKey() as a public method of model.TableInfo (pingcap#35512)
  expression, util: add `KeyWithoutTrimRightSpace` for collator (pingcap#35475)
  infoschema: try on each PD member until one succeeds or all fail (pingcap#35285)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Two table's comparison in char(20) between binary(20) and binary(20) is different from MySQL
7 participants