forked from apache/spark
[SPARK-47693][SQL] Add optimization for lowercase comparison of UTF8String used in UTF8_BINARY_LCASE collation

### What changes were proposed in this pull request?

Current collation [benchmarks](https://github.com/apache/spark/blob/e9f204ae93061a862e4da52c128eaf3512a66c7b/sql/core/benchmarks/CollationBenchmark-results.txt) indicate that `UTF8_BINARY_LCASE` collation comparisons are an order of magnitude slower (~7-10x) than plain binary comparisons. This change improves performance by optimizing the lowercase comparison function for `UTF8String` instances instead of performing a full lowercase conversion before the binary comparison.

The optimization is based on a similar method used in `toLowerCase`: we check character by character whether the conversion is valid under ASCII, and fall back to a slow comparison of native strings when it is not. In the latter case, only the suffixes that are left to compare are taken into consideration.

Benchmarks from `CollationBenchmark` run locally show a substantial performance increase:

```
[info] collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE                                     7199           7209          14          0.0       71988.8       1.0X
[info] UNICODE                                               3925           3929           5          0.0       39250.4       1.8X
[info] UTF8_BINARY                                           3935           3950          21          0.0       39351.2       1.8X
[info] UNICODE_CI                                           45248          51404        8706          0.0      452484.7       0.2X
```

### Why are the changes needed?

To improve the performance of string comparisons under the UTF8_BINARY_LCASE collation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests to `UTF8StringSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45816 from nikolamand-db/SPARK-47693.

Authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
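The approach described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from this commit: the class `LcaseCompareSketch` and its methods are hypothetical names, it works on raw UTF-8 `byte[]` rather than `UTF8String`, and the slow path compares lowercased `java.lang.String` suffixes, which may order some non-ASCII inputs differently from Spark's own slow path.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of an ASCII fast path for case-insensitive comparison.
final class LcaseCompareSketch {

    // Lowercase one ASCII byte; anything outside 'A'..'Z' is returned unchanged.
    static int asciiLower(int b) {
        return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
    }

    // Compare two UTF-8 byte sequences as if both had been lowercased first.
    static int compareLowerCase(byte[] left, byte[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            byte l = left[i];
            byte r = right[i];
            // A negative byte is part of a multi-byte UTF-8 sequence. Every
            // earlier byte was ASCII, so index i is a character boundary in
            // both inputs and only the remaining suffixes need the slow path.
            if (l < 0 || r < 0) {
                return slowCompare(left, right, i);
            }
            int diff = asciiLower(l) - asciiLower(r);
            if (diff != 0) {
                return diff;
            }
        }
        // One input is a prefix of the other (ignoring ASCII case).
        return left.length - right.length;
    }

    // Slow path (assumption): decode the suffixes and compare lowercased Strings.
    static int slowCompare(byte[] left, byte[] right, int from) {
        String l = new String(left, from, left.length - from, StandardCharsets.UTF_8);
        String r = new String(right, from, right.length - from, StandardCharsets.UTF_8);
        return l.toLowerCase(Locale.ROOT).compareTo(r.toLowerCase(Locale.ROOT));
    }

    // Convenience overload for String inputs.
    static int compareLowerCase(String a, String b) {
        return compareLowerCase(a.getBytes(StandardCharsets.UTF_8),
                                b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(compareLowerCase("Hello", "hELLO"));       // prints 0
        System.out.println(compareLowerCase("abc", "ABD") < 0);       // prints true
        System.out.println(compareLowerCase("\u00C4bc", "\u00E4BC")); // prints 0
    }
}
```

For pure-ASCII inputs the loop never allocates, which is where a gain over lowercasing both strings up front would be expected to come from; inputs with an early non-ASCII byte still pay for the slow path.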
1 parent d817c9a, commit 627f608

Showing 8 changed files with 223 additions and 96 deletions.
@@ -1,27 +1,27 @@
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                    29904          29937          47          0.0      299036.1       1.0X
-UNICODE                                               3886           3893          10          0.0       38863.0       7.7X
-UTF8_BINARY                                           3945           3945           0          0.0       39449.6       7.6X
-UNICODE_CI                                           45321          45330          12          0.0      453210.3       0.7X
+UTF8_BINARY_LCASE                                     6910           6912           3          0.0       69099.7       1.0X
+UNICODE                                               4367           4368           1          0.0       43669.6       1.6X
+UTF8_BINARY                                           4361           4364           4          0.0       43606.5       1.6X
+UNICODE_CI                                           46480          46526          66          0.0      464795.7       0.1X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ---------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                     29807          29818          17          0.0      298065.0       1.0X
-UNICODE                                               45704          45723          27          0.0      457036.2       0.7X
-UTF8_BINARY                                            6460           6464           7          0.0       64597.9       4.6X
-UNICODE_CI                                            45498          45508          14          0.0      454977.6       0.7X
+UTF8_BINARY_LCASE                                      6522           6526           4          0.0       65223.9       1.0X
+UNICODE                                               45792          45797           7          0.0      457922.3       0.1X
+UTF8_BINARY                                            7092           7112          29          0.0       70921.7       0.9X
+UNICODE_CI                                            47548          47564          22          0.0      475476.7       0.1X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                  23553          23595          59          0.0      235531.8       1.0X
-UNICODE                                           197303         197309           8          0.0     1973034.1       0.1X
-UTF8_BINARY                                        14389          14391           2          0.0      143891.2       1.6X
-UNICODE_CI                                        166880         166885           7          0.0     1668799.5       0.1X
+UTF8_BINARY_LCASE                                  11716          11716           1          0.0      117157.9       1.0X
+UNICODE                                           180133         180137           5          0.0     1801332.1       0.1X
+UTF8_BINARY                                        10476          10477           1          0.0      104757.4       1.1X
+UNICODE_CI                                        148171         148190          28          0.0     1481705.6       0.1X

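The figures in the tables above are consistent with the `Relative` column being the first row's per-row time divided by each row's per-row time, so values above 1.0X mean "faster than `UTF8_BINARY_LCASE`". The following sketch reproduces a few entries under that assumption and also computes the before/after speedup of `UTF8_BINARY_LCASE` itself; the class and method names are hypothetical and are not part of Spark's benchmark utilities.

```java
import java.util.Locale;

// Hypothetical helper that recomputes ratios from the JDK 21 results above.
final class RelativeColumnSketch {

    // Ratio of the baseline row's time to this row's time (assumed definition).
    static double relative(double baselineTime, double rowTime) {
        return baselineTime / rowTime;
    }

    // Format a ratio the way the tables print it, e.g. "7.7X".
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.1fX", ratio);
    }

    public static void main(String[] args) {
        // equalsFunction before the change: UTF8_BINARY_LCASE vs UNICODE (Per Row ns).
        System.out.println(format(relative(299036.1, 38863.0))); // prints 7.7X
        // equalsFunction after the change: same pair of rows.
        System.out.println(format(relative(69099.7, 43669.6)));  // prints 1.6X
        // Speedup of UTF8_BINARY_LCASE itself: old Best Time(ms) over new.
        System.out.println(format(relative(29904, 6910)));       // prints 4.3X
    }
}
```

Read this way, `UTF8_BINARY_LCASE` equality checks on JDK 21 became roughly 4.3 times faster in this run, while the gap to plain `UTF8_BINARY` shrank from 7.6X to 1.6X.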
@@ -1,27 +1,27 @@
-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                    34122          34152          42          0.0      341224.2       1.0X
-UNICODE                                               4520           4522           2          0.0       45201.8       7.5X
-UTF8_BINARY                                           4524           4526           2          0.0       45243.0       7.5X
-UNICODE_CI                                           52706          52711           7          0.0      527056.1       0.6X
+UTF8_BINARY_LCASE                                     7692           7731          55          0.0       76919.2       1.0X
+UNICODE                                               4378           4379           0          0.0       43784.6       1.8X
+UTF8_BINARY                                           4382           4396          19          0.0       43821.6       1.8X
+UNICODE_CI                                           48344          48360          23          0.0      483436.5       0.2X

-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ---------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                     33467          33474          10          0.0      334671.7       1.0X
-UNICODE                                               51168          51168           1          0.0      511677.4       0.7X
-UTF8_BINARY                                            5561           5593          45          0.0       55610.9       6.0X
-UNICODE_CI                                            51929          51955          36          0.0      519291.8       0.6X
+UTF8_BINARY_LCASE                                      9819           9820           0          0.0       98194.9       1.0X
+UNICODE                                               49507          49518          17          0.0      495066.2       0.2X
+UTF8_BINARY                                            7354           7365          17          0.0       73536.3       1.3X
+UNICODE_CI                                            52149          52163          20          0.0      521489.4       0.2X

-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                  22079          22083           5          0.0      220786.7       1.0X
-UNICODE                                           177636         177709         103          0.0     1776363.9       0.1X
-UTF8_BINARY                                        11954          11956           3          0.0      119536.7       1.8X
-UNICODE_CI                                        158014         158038          35          0.0     1580135.7       0.1X
+UTF8_BINARY_LCASE                                  18110          18127          24          0.0      181103.9       1.0X
+UNICODE                                           171375         171435          85          0.0     1713752.3       0.1X
+UTF8_BINARY                                        14012          14030          26          0.0      140116.7       1.3X
+UNICODE_CI                                        153847         153901          76          0.0     1538471.1       0.1X

sql/core/benchmarks/CollationNonASCIIBenchmark-jdk21-results.txt
27 changes: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                    18244          18258          20          0.0      456096.4       1.0X
+UNICODE                                                498            498           0          0.1       12440.3      36.7X
+UTF8_BINARY                                            499            500           1          0.1       12467.7      36.6X
+UNICODE_CI                                           13429          13443          19          0.0      335725.4       1.4X
+
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+---------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                     18377          18399          31          0.0      459430.5       1.0X
+UNICODE                                               14238          14240           3          0.0      355957.4       1.3X
+UTF8_BINARY                                             975            976           1          0.0       24371.3      18.9X
+UNICODE_CI                                            13819          13826          10          0.0      345482.6       1.3X
+
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                   9183           9230          67          0.0      229564.0       1.0X
+UNICODE                                            38937          38952          22          0.0      973421.3       0.2X
+UTF8_BINARY                                         1376           1376           0          0.0       34397.5       6.7X
+UNICODE_CI                                         32881          32882           1          0.0      822027.4       0.3X
+
sql/core/benchmarks/CollationNonASCIIBenchmark-results.txt
27 changes: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - equalsFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                    17881          17885           6          0.0      447017.7       1.0X
+UNICODE                                                493            495           2          0.1       12328.9      36.3X
+UTF8_BINARY                                            493            494           1          0.1       12331.4      36.3X
+UNICODE_CI                                           13731          13737           8          0.0      343284.6       1.3X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - compareFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+---------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                     18041          18047           8          0.0      451030.2       1.0X
+UNICODE                                               14023          14047          34          0.0      350573.9       1.3X
+UTF8_BINARY                                            1387           1397          14          0.0       34680.4      13.0X
+UNICODE_CI                                            14232          14242          14          0.0      355808.4       1.3X
+
+OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1017-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks - hashFunction:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+UTF8_BINARY_LCASE                                  10494          10499           6          0.0      262360.0       1.0X
+UNICODE                                            40410          40422          17          0.0     1010261.8       0.3X
+UTF8_BINARY                                         2035           2035           1          0.0       50877.8       5.2X
+UNICODE_CI                                         31470          31493          32          0.0      786752.4       0.3X
+