Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096

Open
3 of 6 tasks
Tracked by #15072 ...
alamb opened this issue Mar 8, 2025 · 17 comments
Open
3 of 6 tasks
Tracked by #15072 ...

Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096

alamb opened this issue Mar 8, 2025 · 17 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 8, 2025

Is your feature request related to a problem or challenge?

DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow Types. The current mapping for character types is shown in the docs https://datafusion.apache.org/user-guide/sql/data_types.html#character-types

SQL DataType Arrow DataType
CHAR Utf8
VARCHAR Utf8
TEXT Utf8
STRING Utf8

So this means that when you do something like create table foo(x varchar); the x column is Utf8

DataFusion CLI v46.0.0
> create table foo(x varchar);

0 row(s) fetched.
Elapsed 0.019 seconds.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

When reading parquet files however, a different type, Utf8View is used as it is faster in most cases.

This can be seen in this example:

DataFusion CLI v46.0.0
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8View  | NO          |
...
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.032 seconds.

Thus there is a discrepancy when creating external tables with a schema (VARCHAR) as that will use Utf8 rather than UTF8View

I believe this is the root cause of the issue @zhuqi-lucas filed:

Describe the solution you'd like

I think we should consider changing the default SQL mapping from VARCHAR --> Utf8View

Describe alternatives you've considered

There are a few subtasks required before we can merge it:

Additional context

You can see some of the history related to using string view / Utf8View here:

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2025

Please add comments if you find other needed items / issues

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2025

To begin this project so that we can implement it incrementally, I suggest we create a new config option like default_sql_to_view_types or something so we can test / work through issues before turning it on by default

@zhuqi-lucas
Copy link
Contributor

To begin this project so that we can implement it incrementally, I suggest we create a new config option like default_sql_to_view_types or something so we can test / work through issues before turning it on by default

Thank you @alamb , this is a great suggestion! And we finally can make it default to true when we finish all tasks!

@zhuqi-lucas
Copy link
Contributor

I will try to create more sub-tasks related to this effort!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 9, 2025

  • Config: Add support default sql varchar to view types

Submitted a PR for the first step of our incremental work:
#15104

cc @alamb Thanks!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 9, 2025

I also testing the tcph when it use the utf8view default, here is the result:

./benchmarks/bench.sh compare main issue_14909
Comparing main and issue_14909
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃    main ┃ issue_14909 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 172.78ms │     70.32ms │     no change │
│ QQuery 227.64ms │     26.30ms │     no change │
│ QQuery 337.75ms │     38.22ms │     no change │
│ QQuery 427.97ms │     28.27ms │     no change │
│ QQuery 552.59ms │     49.36ms │ +1.07x faster │
│ QQuery 620.81ms │     20.66ms │     no change │
│ QQuery 770.44ms │     75.06ms │  1.07x slower │
│ QQuery 848.32ms │     49.02ms │     no change │
│ QQuery 962.60ms │     63.14ms │     no change │
│ QQuery 1055.94ms │     58.75ms │  1.05x slower │
│ QQuery 1119.44ms │     21.21ms │  1.09x slower │
│ QQuery 1236.59ms │     37.42ms │     no change │
│ QQuery 1334.05ms │     34.88ms │     no change │
│ QQuery 1426.50ms │     26.77ms │     no change │
│ QQuery 1542.97ms │     45.06ms │     no change │
│ QQuery 1619.25ms │     20.02ms │     no change │
│ QQuery 1773.64ms │     68.81ms │ +1.07x faster │
│ QQuery 1896.62ms │     95.08ms │     no change │
│ QQuery 1946.77ms │     45.75ms │     no change │
│ QQuery 2045.54ms │     40.98ms │ +1.11x faster │
│ QQuery 2195.29ms │     95.19ms │     no change │
│ QQuery 2218.34ms │     17.99ms │     no change │
└──────────────┴─────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)1031.87ms │
│ Total Time (issue_14909)1028.26ms │
│ Average Time (main)46.90ms │
│ Average Time (issue_14909)46.74ms │
│ Queries Faster3 │
│ Queries Slower3 │
│ Queries with No Change16 │
└────────────────────────────┴───────────┘

@zhuqi-lucas
Copy link
Contributor

Create the ticket for avro:

  • Support Utf8View for avro #7262

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 11, 2025

New sub_task:

  • topk functionality for aggregates should support utf8view

PR: #15152

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 14, 2025

New sub_task:

  • Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 16, 2025

New sub_task:

  • Support Utf8View datatype for range queries

Submitted a PR:
#15257

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 16, 2025

New sub_task:

  • Performance improvement: Binary operators native support for Utf8View

@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2025

erformance improvement: Binary operators native support for Utf8View

Yes, 100%

@zhuqi-lucas
Copy link
Contributor

erformance improvement: Binary operators native support for Utf8View

Yes, 100%

Submitted the PR for review:
#15275

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 21, 2025

New sub_task:

  • Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream

@zhuqi-lucas
Copy link
Contributor

Updated:

Most of the tasks are resolved, i am trying to do more performance investigation and testing if we default change to Utf8View for all varchar.

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 28, 2025

Also updated the latest clickbench for the current main compare the default mapping varchar to utf8view:

Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark:

Using --profile release-nonlto result:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       tmp ┃       tmp ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.36ms │    0.32ms │ +1.13x faster │
│ QQuery 127.55ms │   28.53ms │     no change │
│ QQuery 254.73ms │   58.73ms │  1.07x slower │
│ QQuery 349.67ms │   51.85ms │     no change │
│ QQuery 4321.47ms │  333.88ms │     no change │
│ QQuery 5374.77ms │  370.26ms │     no change │
│ QQuery 626.06ms │   27.14ms │     no change │
│ QQuery 729.89ms │   28.19ms │ +1.06x faster │
│ QQuery 8397.90ms │  372.24ms │ +1.07x faster │
│ QQuery 9583.40ms │  599.98ms │     no change │
│ QQuery 10148.50ms │  147.43ms │     no change │
│ QQuery 11163.20ms │  165.56ms │     no change │
│ QQuery 12399.31ms │  407.08ms │     no change │
│ QQuery 13568.61ms │  576.26ms │     no change │
│ QQuery 14389.05ms │  374.46ms │     no change │
│ QQuery 15375.66ms │  370.85ms │     no change │
│ QQuery 16720.94ms │  719.03ms │     no change │
│ QQuery 17662.21ms │  638.33ms │     no change │
│ QQuery 181694.34ms │ 1507.92ms │ +1.12x faster │
│ QQuery 1941.26ms │   42.08ms │     no change │
│ QQuery 20619.60ms │  549.74ms │ +1.13x faster │
│ QQuery 21779.77ms │  691.91ms │ +1.13x faster │
│ QQuery 221411.33ms │ 1375.59ms │     no change │
│ QQuery 233891.02ms │ 3946.51ms │     no change │
│ QQuery 24252.52ms │  247.12ms │     no change │
│ QQuery 25252.81ms │  248.90ms │     no change │
│ QQuery 26264.57ms │  276.89ms │     no change │
│ QQuery 27842.86ms │  854.72ms │     no change │
│ QQuery 286461.67ms │ 6410.47ms │     no change │
│ QQuery 29379.88ms │  359.71ms │ +1.06x faster │
│ QQuery 30352.77ms │  332.15ms │ +1.06x faster │
│ QQuery 31366.75ms │  371.79ms │     no change │
│ QQuery 321273.77ms │ 1427.04ms │  1.12x slower │
│ QQuery 331601.21ms │ 1599.55ms │     no change │
│ QQuery 341605.00ms │ 1701.18ms │  1.06x slower │
│ QQuery 35532.88ms │  576.30ms │  1.08x slower │
│ QQuery 36109.92ms │  115.27ms │     no change │
│ QQuery 3757.43ms │   57.22ms │     no change │
│ QQuery 3878.78ms │   80.17ms │     no change │
│ QQuery 39196.90ms │  197.13ms │     no change │
│ QQuery 4026.52ms │   25.77ms │     no change │
│ QQuery 4125.57ms │   25.51ms │     no change │
│ QQuery 4230.03ms │   28.90ms │     no change │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (tmp)28442.44ms │
│ Total Time (tmp)28319.66ms │
│ Average Time (tmp)661.45ms │
│ Average Time (tmp)658.60ms │
│ Queries Faster8 │
│ Queries Slower4 │
│ Queries with No Change31 │
└────────────────────────┴────────────┘
```rust


Using run --release result:


```rust
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main                   ┃ default_enable_utf8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.34ms │                  0.33ms │     no change │
│ QQuery 144.10ms │                 43.49ms │     no change │
│ QQuery 277.24ms │                 76.79ms │     no change │
│ QQuery 384.97ms │                 77.06ms │ +1.10x faster │
│ QQuery 4523.60ms │                554.50ms │  1.06x slower │
│ QQuery 5665.56ms │                661.98ms │     no change │
│ QQuery 638.69ms │                 39.05ms │     no change │
│ QQuery 747.27ms │                 46.57ms │     no change │
│ QQuery 8702.86ms │                682.91ms │     no change │
│ QQuery 9780.11ms │                770.69ms │     no change │
│ QQuery 10194.09ms │                172.24ms │ +1.13x faster │
│ QQuery 11199.83ms │                191.13ms │     no change │
│ QQuery 12696.13ms │                688.11ms │     no change │
│ QQuery 13890.35ms │               1001.57ms │  1.12x slower │
│ QQuery 14732.92ms │                648.96ms │ +1.13x faster │
│ QQuery 15689.57ms │                633.55ms │ +1.09x faster │
│ QQuery 161415.16ms │               1468.50ms │     no change │
│ QQuery 171297.66ms │               1319.06ms │     no change │
│ QQuery 183272.62ms │               2857.06ms │ +1.15x faster │
│ QQuery 1975.13ms │                 82.66ms │  1.10x slower │
│ QQuery 20743.88ms │                705.83ms │ +1.05x faster │
│ QQuery 21929.50ms │                897.88ms │     no change │
│ QQuery 222576.76ms │               2506.14ms │     no change │
│ QQuery 234943.09ms │               4916.55ms │     no change │
│ QQuery 24392.47ms │                384.78ms │     no change │
│ QQuery 25386.58ms │                388.37ms │     no change │
│ QQuery 26423.42ms │                417.96ms │     no change │
│ QQuery 271050.88ms │                976.19ms │ +1.08x faster │
│ QQuery 288269.73ms │               8791.73ms │  1.06x slower │
│ QQuery 29439.96ms │                442.74ms │     no change │
│ QQuery 30583.71ms │                541.02ms │ +1.08x faster │
│ QQuery 31632.33ms │                629.25ms │     no change │
│ QQuery 322580.37ms │               2523.11ms │     no change │
│ QQuery 332810.58ms │               2848.06ms │     no change │
│ QQuery 343075.43ms │               3108.88ms │     no change │
│ QQuery 35856.76ms │                891.39ms │     no change │
│ QQuery 36152.80ms │                150.15ms │     no change │
│ QQuery 37117.99ms │                118.82ms │     no change │
│ QQuery 38110.05ms │                112.33ms │     no change │
│ QQuery 39267.64ms │                279.12ms │     no change │
│ QQuery 4041.34ms │                 45.52ms │  1.10x slower │
│ QQuery 4142.04ms │                 41.70ms │     no change │
│ QQuery 4250.40ms │                 45.41ms │ +1.11x faster │
└──────────────┴────────────────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)43905.93ms │
│ Total Time (default_enable_utf8view)43779.11ms │
│ Average Time (main)1021.07ms │
│ Average Time (default_enable_utf8view)1018.12ms │
│ Queries Faster9 │
│ Queries Slower5 │
│ Queries with No Change29 │
└────────────────────────────────────────┴────────────┘

@alamb
Copy link
Contributor Author

alamb commented Mar 28, 2025

Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark:

Yes I would expect no change for the clickbench benchmark as it doesn't use SQL VARCHAR 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants