Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096
Comments
Please add comments if you find other needed items / issues |
To begin this project so that we can implement it incrementally, I suggest we create a new config option like |
Thank you @alamb, this is a great suggestion! And we can finally make it default to true once we finish all the tasks! |
I will try to create more sub-tasks related to this effort! |
I also tested TPC-H with Utf8View as the default. Here is the result:
./benchmarks/bench.sh compare main issue_14909
Comparing main and issue_14909
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ issue_14909 ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 72.78ms │ 70.32ms │ no change │
│ QQuery 2 │ 27.64ms │ 26.30ms │ no change │
│ QQuery 3 │ 37.75ms │ 38.22ms │ no change │
│ QQuery 4 │ 27.97ms │ 28.27ms │ no change │
│ QQuery 5 │ 52.59ms │ 49.36ms │ +1.07x faster │
│ QQuery 6 │ 20.81ms │ 20.66ms │ no change │
│ QQuery 7 │ 70.44ms │ 75.06ms │ 1.07x slower │
│ QQuery 8 │ 48.32ms │ 49.02ms │ no change │
│ QQuery 9 │ 62.60ms │ 63.14ms │ no change │
│ QQuery 10 │ 55.94ms │ 58.75ms │ 1.05x slower │
│ QQuery 11 │ 19.44ms │ 21.21ms │ 1.09x slower │
│ QQuery 12 │ 36.59ms │ 37.42ms │ no change │
│ QQuery 13 │ 34.05ms │ 34.88ms │ no change │
│ QQuery 14 │ 26.50ms │ 26.77ms │ no change │
│ QQuery 15 │ 42.97ms │ 45.06ms │ no change │
│ QQuery 16 │ 19.25ms │ 20.02ms │ no change │
│ QQuery 17 │ 73.64ms │ 68.81ms │ +1.07x faster │
│ QQuery 18 │ 96.62ms │ 95.08ms │ no change │
│ QQuery 19 │ 46.77ms │ 45.75ms │ no change │
│ QQuery 20 │ 45.54ms │ 40.98ms │ +1.11x faster │
│ QQuery 21 │ 95.29ms │ 95.19ms │ no change │
│ QQuery 22 │ 18.34ms │ 17.99ms │ no change │
└──────────────┴─────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main) │ 1031.87ms │
│ Total Time (issue_14909) │ 1028.26ms │
│ Average Time (main) │ 46.90ms │
│ Average Time (issue_14909) │ 46.74ms │
│ Queries Faster │ 3 │
│ Queries Slower │ 3 │
│ Queries with No Change │ 16 │
└────────────────────────────┴───────────┘ |
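The labels in the Change column can be reproduced from the two timing columns. Below is a minimal sketch, assuming a 5% noise threshold; that threshold is inferred from the rows in the table above rather than taken from the benchmark script, and `classify_change` and `NOISE_THRESHOLD` are hypothetical names used only for illustration.

```rust
/// ASSUMPTION: differences within 5% are reported as "no change".
/// This value is inferred from the table rows, not read from bench.sh.
const NOISE_THRESHOLD: f64 = 0.05;

/// Classify a timing difference between a baseline run and a candidate run.
/// Both arguments are query times in milliseconds.
fn classify_change(baseline_ms: f64, candidate_ms: f64) -> String {
    // A ratio above 1 means the candidate branch took longer than the baseline.
    let ratio = candidate_ms / baseline_ms;
    if (ratio - 1.0).abs() <= NOISE_THRESHOLD {
        "no change".to_string()
    } else if ratio < 1.0 {
        // Report the speedup as baseline / candidate.
        format!("+{:.2}x faster", 1.0 / ratio)
    } else {
        format!("{:.2}x slower", ratio)
    }
}
```

For example, `classify_change(52.59, 49.36)` corresponds to the Query 5 row and `classify_change(70.44, 75.06)` to the Query 7 row. Under this reading, most of the differences in the table fall inside the noise band.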
Created the ticket for Avro:
|
New sub_task:
PR: #15152 |
New sub_task:
|
New sub_task:
Submitted a PR: |
New sub_task:
|
Yes, 100% |
Submitted the PR for review: |
New sub_task:
|
Update: Most of the tasks are resolved. I am doing more performance investigation and testing of what happens if we change the default mapping to Utf8View for all VARCHAR columns. |
I also updated the latest ClickBench results, comparing current main against the default mapping of VARCHAR to Utf8View. There is a small improvement. I think that is because the data is in Parquet format, so for the benchmark we mostly already load it as Utf8View.
Result using --profile release-nonlto:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ tmp ┃ tmp ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.36ms │ 0.32ms │ +1.13x faster │
│ QQuery 1 │ 27.55ms │ 28.53ms │ no change │
│ QQuery 2 │ 54.73ms │ 58.73ms │ 1.07x slower │
│ QQuery 3 │ 49.67ms │ 51.85ms │ no change │
│ QQuery 4 │ 321.47ms │ 333.88ms │ no change │
│ QQuery 5 │ 374.77ms │ 370.26ms │ no change │
│ QQuery 6 │ 26.06ms │ 27.14ms │ no change │
│ QQuery 7 │ 29.89ms │ 28.19ms │ +1.06x faster │
│ QQuery 8 │ 397.90ms │ 372.24ms │ +1.07x faster │
│ QQuery 9 │ 583.40ms │ 599.98ms │ no change │
│ QQuery 10 │ 148.50ms │ 147.43ms │ no change │
│ QQuery 11 │ 163.20ms │ 165.56ms │ no change │
│ QQuery 12 │ 399.31ms │ 407.08ms │ no change │
│ QQuery 13 │ 568.61ms │ 576.26ms │ no change │
│ QQuery 14 │ 389.05ms │ 374.46ms │ no change │
│ QQuery 15 │ 375.66ms │ 370.85ms │ no change │
│ QQuery 16 │ 720.94ms │ 719.03ms │ no change │
│ QQuery 17 │ 662.21ms │ 638.33ms │ no change │
│ QQuery 18 │ 1694.34ms │ 1507.92ms │ +1.12x faster │
│ QQuery 19 │ 41.26ms │ 42.08ms │ no change │
│ QQuery 20 │ 619.60ms │ 549.74ms │ +1.13x faster │
│ QQuery 21 │ 779.77ms │ 691.91ms │ +1.13x faster │
│ QQuery 22 │ 1411.33ms │ 1375.59ms │ no change │
│ QQuery 23 │ 3891.02ms │ 3946.51ms │ no change │
│ QQuery 24 │ 252.52ms │ 247.12ms │ no change │
│ QQuery 25 │ 252.81ms │ 248.90ms │ no change │
│ QQuery 26 │ 264.57ms │ 276.89ms │ no change │
│ QQuery 27 │ 842.86ms │ 854.72ms │ no change │
│ QQuery 28 │ 6461.67ms │ 6410.47ms │ no change │
│ QQuery 29 │ 379.88ms │ 359.71ms │ +1.06x faster │
│ QQuery 30 │ 352.77ms │ 332.15ms │ +1.06x faster │
│ QQuery 31 │ 366.75ms │ 371.79ms │ no change │
│ QQuery 32 │ 1273.77ms │ 1427.04ms │ 1.12x slower │
│ QQuery 33 │ 1601.21ms │ 1599.55ms │ no change │
│ QQuery 34 │ 1605.00ms │ 1701.18ms │ 1.06x slower │
│ QQuery 35 │ 532.88ms │ 576.30ms │ 1.08x slower │
│ QQuery 36 │ 109.92ms │ 115.27ms │ no change │
│ QQuery 37 │ 57.43ms │ 57.22ms │ no change │
│ QQuery 38 │ 78.78ms │ 80.17ms │ no change │
│ QQuery 39 │ 196.90ms │ 197.13ms │ no change │
│ QQuery 40 │ 26.52ms │ 25.77ms │ no change │
│ QQuery 41 │ 25.57ms │ 25.51ms │ no change │
│ QQuery 42 │ 30.03ms │ 28.90ms │ no change │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (tmp) │ 28442.44ms │
│ Total Time (tmp) │ 28319.66ms │
│ Average Time (tmp) │ 661.45ms │
│ Average Time (tmp) │ 658.60ms │
│ Queries Faster │ 8 │
│ Queries Slower │ 4 │
│ Queries with No Change │ 31 │
└────────────────────────┴────────────┘
Using run --release result:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ default_enable_utf8view ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.34ms │ 0.33ms │ no change │
│ QQuery 1 │ 44.10ms │ 43.49ms │ no change │
│ QQuery 2 │ 77.24ms │ 76.79ms │ no change │
│ QQuery 3 │ 84.97ms │ 77.06ms │ +1.10x faster │
│ QQuery 4 │ 523.60ms │ 554.50ms │ 1.06x slower │
│ QQuery 5 │ 665.56ms │ 661.98ms │ no change │
│ QQuery 6 │ 38.69ms │ 39.05ms │ no change │
│ QQuery 7 │ 47.27ms │ 46.57ms │ no change │
│ QQuery 8 │ 702.86ms │ 682.91ms │ no change │
│ QQuery 9 │ 780.11ms │ 770.69ms │ no change │
│ QQuery 10 │ 194.09ms │ 172.24ms │ +1.13x faster │
│ QQuery 11 │ 199.83ms │ 191.13ms │ no change │
│ QQuery 12 │ 696.13ms │ 688.11ms │ no change │
│ QQuery 13 │ 890.35ms │ 1001.57ms │ 1.12x slower │
│ QQuery 14 │ 732.92ms │ 648.96ms │ +1.13x faster │
│ QQuery 15 │ 689.57ms │ 633.55ms │ +1.09x faster │
│ QQuery 16 │ 1415.16ms │ 1468.50ms │ no change │
│ QQuery 17 │ 1297.66ms │ 1319.06ms │ no change │
│ QQuery 18 │ 3272.62ms │ 2857.06ms │ +1.15x faster │
│ QQuery 19 │ 75.13ms │ 82.66ms │ 1.10x slower │
│ QQuery 20 │ 743.88ms │ 705.83ms │ +1.05x faster │
│ QQuery 21 │ 929.50ms │ 897.88ms │ no change │
│ QQuery 22 │ 2576.76ms │ 2506.14ms │ no change │
│ QQuery 23 │ 4943.09ms │ 4916.55ms │ no change │
│ QQuery 24 │ 392.47ms │ 384.78ms │ no change │
│ QQuery 25 │ 386.58ms │ 388.37ms │ no change │
│ QQuery 26 │ 423.42ms │ 417.96ms │ no change │
│ QQuery 27 │ 1050.88ms │ 976.19ms │ +1.08x faster │
│ QQuery 28 │ 8269.73ms │ 8791.73ms │ 1.06x slower │
│ QQuery 29 │ 439.96ms │ 442.74ms │ no change │
│ QQuery 30 │ 583.71ms │ 541.02ms │ +1.08x faster │
│ QQuery 31 │ 632.33ms │ 629.25ms │ no change │
│ QQuery 32 │ 2580.37ms │ 2523.11ms │ no change │
│ QQuery 33 │ 2810.58ms │ 2848.06ms │ no change │
│ QQuery 34 │ 3075.43ms │ 3108.88ms │ no change │
│ QQuery 35 │ 856.76ms │ 891.39ms │ no change │
│ QQuery 36 │ 152.80ms │ 150.15ms │ no change │
│ QQuery 37 │ 117.99ms │ 118.82ms │ no change │
│ QQuery 38 │ 110.05ms │ 112.33ms │ no change │
│ QQuery 39 │ 267.64ms │ 279.12ms │ no change │
│ QQuery 40 │ 41.34ms │ 45.52ms │ 1.10x slower │
│ QQuery 41 │ 42.04ms │ 41.70ms │ no change │
│ QQuery 42 │ 50.40ms │ 45.41ms │ +1.11x faster │
└──────────────┴────────────────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 43905.93ms │
│ Total Time (default_enable_utf8view) │ 43779.11ms │
│ Average Time (main) │ 1021.07ms │
│ Average Time (default_enable_utf8view) │ 1018.12ms │
│ Queries Faster │ 9 │
│ Queries Slower │ 5 │
│ Queries with No Change │ 29 │
└────────────────────────────────────────┴────────────┘ |
Yes I would expect no change for the clickbench benchmark as it doesn't use SQL |
Is your feature request related to a problem or challenge?
DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow Types. The current mapping for character types is shown in the docs https://datafusion.apache.org/user-guide/sql/data_types.html#character-types
| SQL type | Arrow type |
| --- | --- |
| CHAR | Utf8 |
| VARCHAR | Utf8 |
| TEXT | Utf8 |
| STRING | Utf8 |
So this means that when you do something like
create table foo(x varchar);
the x column is Utf8.
When reading parquet files, however, a different type, Utf8View, is used as it is faster in most cases. This can be seen in this example:
Thus there is a discrepancy when creating external tables with a schema (VARCHAR), as that will use Utf8 rather than Utf8View.
I believe this is the root cause of the issue @zhuqi-lucas filed: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14909
Describe the solution you'd like
I think we should consider changing the default SQL mapping for VARCHAR from Utf8 to Utf8View.
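To make the proposal concrete, here is a minimal, self-contained sketch of a flag-gated mapping. It does not use DataFusion's actual planner types or configuration API: `SqlCharType`, `ArrowStringType`, `map_sql_char_type`, and the `map_varchar_to_view` flag are all hypothetical names for illustration. The sketch also assumes only VARCHAR changes, since that is the only type the proposal names; whether CHAR, TEXT, and STRING should follow is not decided here.

```rust
/// SQL character types from the current mapping table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SqlCharType {
    Char,
    Varchar,
    Text,
    String,
}

/// The two Arrow string types discussed in this issue (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowStringType {
    Utf8,
    Utf8View,
}

/// Map a SQL character type to an Arrow string type.
/// `map_varchar_to_view` is a HYPOTHETICAL flag standing in for the config
/// option suggested in the comments; its real name is not given here.
fn map_sql_char_type(sql_type: SqlCharType, map_varchar_to_view: bool) -> ArrowStringType {
    match (sql_type, map_varchar_to_view) {
        // ASSUMPTION: only VARCHAR is remapped when the flag is on.
        (SqlCharType::Varchar, true) => ArrowStringType::Utf8View,
        // Everything else keeps the current mapping to Utf8.
        _ => ArrowStringType::Utf8,
    }
}
```

With the flag off, every type maps to Utf8 as it does today, which is what allows the change to be rolled out incrementally and flipped to the default once the subtasks are done.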
Describe alternatives you've considered
There are a few subtasks required before we can merge it:
Utf8View) #15403
Additional context
You can see some of the history related to using string view / Utf8View here:
StringView in DataFusion #11752