Introduce Clustered Table to delta spec #2264

dabao521 · 2023-11-02T21:25:01Z

Which Delta project/connector is this regarding?

Description

We propose to introduce a new feature Clustered Table (clustering) to the Delta spec. The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. This enhances query performance when selective filters are applied to these clustering columns through data skipping. More details can be found in the github issue #1874.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

No

rahulsmahadev

LGTM!

PROTOCOL.md

bart-samwel · 2023-11-13T14:12:42Z

PROTOCOL.md

-  - A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered).
+- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action. 
+  If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns.
+- When a clustering implementation clusters files, writers must incorporate a `tag`  with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files.


Is a tag the right mechanism for this? Tags are kind of an escape hatch for things that aren't in the protocol. But if clustering becomes a first-class citizen and it is part of the core protocol, then this can be a proper field of the addFile. WDYT?

That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider into AddFile action. Please take a look it makes more sense now. Thanks!

dabao521

Hi @bart-samwel , @ryan-johnson-databricks , @imback82 , I updated the spec and removed the CLUSTERED_BY tag and use a first-citizen field clusteringProvider in AddFile action. Can you take another look and let me know if anything I am missing. Thanks!

PROTOCOL.md

dabao521 · 2023-11-13T22:56:23Z

PROTOCOL.md

-  - A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered).
+- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action. 
+  If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns.
+- When a clustering implementation clusters files, writers must incorporate a `tag`  with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files.


That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider into AddFile action. Please take a look it makes more sense now. Thanks!

PROTOCOL.md

ryan-johnson-databricks

Sorry for the late review... I was too slow I guess.

ryan-johnson-databricks · 2023-11-14T21:37:29Z

PROTOCOL.md

+
+The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns.
+This enhances query performance when selective filters are applied to these clustering columns through data skipping.
+Clustering columns must be specified during the initial definition of a clustered table, and they can be modified after the table has been created.


Why must clustering columns be specified during initial table definition, if they can anyway be modified after? Seems like either clustering is optional and can be added/changed at any time... or it's required an cannot change?

So should ALTER TABLE ... CLUSTER BY automatically add the "clustering" table feature (only for unpartitioned tables)? I think that's the decision we need to make.

Is this something we can extend later on (basically removing this restriction)?

Does the SQL syntax need to be part of the Delta spec? Seems like more of a Delta client thing? As in, there's nothing in the spec that requires nor forbids changing the clustering after a table was created... and the decision of one client to add/change clustering shouldn't impact other clients' ability to consume the resulting table? If we think there could be interactions, we should capture those in the spec.

For example, maybe it's ok to add/remove clustering columns but not cluster by a completely different set of columns? (but if we can add and remove... it just takes a few steps to completely change the column set).

I think the question here is whether we should allow adding the clustering table feature after table creation. The current wording, I think, is saying that we can only add the clustering table feature during table creation. Any strong opinion here?

As long as the clustering table feature is present, you are free to change clustering columns in any way (as long as they are in the schema).

Hi @ryan-johnson-databricks , let me know what do you think about the @imback82 's comment thanks

On a second thought, we shouldn't restrict adding the "clustering" table feature only during table creation.

I think it should say the clustering columns can be set during the table creation, or they can be altered subsequent to the table creation.

Then, we can update the "Enablement" section and add that the "clustering" feature can be added during creation or later as long as the table has no partition columns.

Thanks @imback82 for the input and I have reworded it in the followup PR. Please take a look.

ryan-johnson-databricks · 2023-11-14T21:44:32Z

PROTOCOL.md

+{
+  "domainMetadata": {
+    "domain": "delta.clustering",
+    "configuration": "{\"clusteringColumns\":[{\"physicalName\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\"]}]}",


This seems... baroque... object of array of object of array of string?

{ "clusteringColumns": [ { "physicalName": [ "col-daadafd7-7c20-4697-98f8-bff70199b1f9" ] } ] }

The spec already says physical column names should be used, so can we just do e.g.:

"configuration": "{\"clusteringColumns\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\", \"col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e\"]}",

Also, it might be more readable to give the actual json form, and then state+show how we embed it (with escaping) as a string for the configuration field?

{ "clusteringColumns": [ "col-daadafd7-7c20-4697-98f8-bff70199b1f9", "col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e" ] }

good idea, will update according to the suggestion!

done, and addressed in the followup PR.

ryan-johnson-databricks · 2023-11-14T21:58:21Z

PROTOCOL.md

@@ -1735,6 +1777,7 @@ The following examples uses a table with two partition columns: "date" and "regi
 |    |-- tags: map<string,string>
 |    |-- baseRowId: long
 |    |-- defaultRowCommitVersion: long
+|    |-- clusteringProvider: string


This example is given for a "table with two partition columns" -- Does the spec allow a table to be both partitioned and clustered? (I would have guessed not, but I don't see any language that forbids it).

If we do want to allow clustering we probably need to define how the two features interact. For example, I don't think it should be legal to cluster by and partition by an overlapping subset of columns?

It should be either partitioned or clustered. I think we should explicitly mention this in the spec. WDYT?

Definitely. Which may mean this example needs to be updated (split) to show both cases?

I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks

imback82 · 2023-11-14T23:59:53Z

PROTOCOL.md

@@ -414,6 +417,7 @@ The following is an example `add` action:
    "dataChange": true,
    "baseRowId": 4071,
    "defaultRowCommitVersion": 41,
+    "clusteringProvider": "liquid",


As https://github.com/delta-io/delta/pull/2264/files#r1393446762 says, we cannot have both clusteringProvider and partitionValues set at the same time.

Sorry, where is that? I can't find it anywhere in this file?
https://github.com/dabao521/delta/blob/fe34ad1764a0ac212ab8c22e3d469e7f6541f598/PROTOCOL.md
(searching for all instances of partitionValues)

Search for "The following is an example add action"

I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks

dabao521 · 2023-11-15T02:11:34Z

Sorry for the late review... I was too slow I guess.

Hi @ryan-johnson-databricks , @imback82 . Thanks for the review. I will create a follow up PR to address pending comments.

#1874 requests Liquid clustering, and this PR starts the first step to introduce ClusteringTableFeature and CLUSTERED_BY tags. When creating a clustered table, The feature clustering must exist in the table protocol's writerFeatures. When a clustering implementation clusters files, writers must incorporate a tag with CLUSTERED_BY as the key and the name of the clustering implementation as the corresponding value in add action. More detail can be found in the Delta protocol change PR #2264 The next step is to pave the way to integrate the table feature and clusterby tags when defining and clustering a clustered table. Closes #2281 GitOrigin-RevId: e210b491a324a0794ec9f3a9236bb1932a6677e3

Follow up comments for ClusteringTable feature #2264 We had late comments after the original PR merged, and this PR is addressing those missed comments. Closes #2294 GitOrigin-RevId: 12b53a4788a8ba7403cb72f60714bd380c742b04

Introduce Clustered Table to delta spec

3b2ca4c

tdas changed the title ~~[DELTA-OSS-EXTERNAL] Introduce Clustered Table to delta spec~~ Introduce Clustered Table to delta spec Nov 2, 2023

rahulsmahadev approved these changes Nov 2, 2023

View reviewed changes

ryan-johnson-databricks approved these changes Nov 3, 2023

View reviewed changes

PROTOCOL.md Outdated Show resolved Hide resolved

ryan-johnson-databricks reviewed Nov 3, 2023

View reviewed changes

PROTOCOL.md Outdated Show resolved Hide resolved

fix comments

43892b6

dabao521 requested a review from imback82 November 5, 2023 22:00

imback82 approved these changes Nov 6, 2023

View reviewed changes

bart-samwel reviewed Nov 7, 2023

View reviewed changes

PROTOCOL.md Show resolved Hide resolved

PROTOCOL.md Outdated Show resolved Hide resolved

PROTOCOL.md Outdated Show resolved Hide resolved

PROTOCOL.md Outdated Show resolved Hide resolved

PROTOCOL.md Outdated Show resolved Hide resolved

dabao521 mentioned this pull request Nov 9, 2023

[Clustered Table] Introduce ClusteringTableFeature and CLUSTERED_BY file tag #2281

Closed

5 tasks

fix comments

f5953b4

dabao521 requested a review from bart-samwel November 9, 2023 22:56

bart-samwel reviewed Nov 13, 2023

View reviewed changes

dabao521 added 2 commits November 13, 2023 19:21

address comments

30134d3

fix link

fe34ad1

dabao521 commented Nov 14, 2023

View reviewed changes

dabao521 requested review from bart-samwel, ryan-johnson-databricks and imback82 November 14, 2023 03:26

tdas closed this in 74cd1c9 Nov 14, 2023

ryan-johnson-databricks reviewed Nov 14, 2023

View reviewed changes

imback82 reviewed Nov 14, 2023

View reviewed changes

dabao521 mentioned this pull request Nov 15, 2023

[Delta Protocol] Follow up comments for ClusteringTable feature #2294

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce Clustered Table to delta spec #2264

Introduce Clustered Table to delta spec #2264

dabao521 commented Nov 2, 2023 •

edited

Loading

rahulsmahadev left a comment

bart-samwel Nov 13, 2023

dabao521 Nov 13, 2023

dabao521 left a comment

dabao521 Nov 13, 2023

ryan-johnson-databricks left a comment

ryan-johnson-databricks Nov 14, 2023

imback82 Nov 14, 2023 •

edited

Loading

ryan-johnson-databricks Nov 14, 2023

imback82 Nov 15, 2023

dabao521 Nov 15, 2023

imback82 Nov 15, 2023

dabao521 Nov 15, 2023

ryan-johnson-databricks Nov 14, 2023

imback82 Nov 14, 2023

dabao521 Nov 15, 2023

ryan-johnson-databricks Nov 14, 2023

imback82 Nov 14, 2023

ryan-johnson-databricks Nov 14, 2023 •

edited

Loading

dabao521 Nov 15, 2023

imback82 Nov 14, 2023

ryan-johnson-databricks Nov 15, 2023

imback82 Nov 15, 2023

dabao521 Nov 15, 2023

dabao521 commented Nov 15, 2023

Introduce Clustered Table to delta spec #2264

Introduce Clustered Table to delta spec #2264

Conversation

dabao521 commented Nov 2, 2023 • edited Loading

Which Delta project/connector is this regarding?

Description

How was this patch tested?

Does this PR introduce any user-facing changes?

rahulsmahadev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dabao521 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

imback82 Nov 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryan-johnson-databricks Nov 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dabao521 commented Nov 15, 2023

dabao521 commented Nov 2, 2023 •

edited

Loading

imback82 Nov 14, 2023 •

edited

Loading

ryan-johnson-databricks Nov 14, 2023 •

edited

Loading