-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Clustered Table to delta spec #2264
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
PROTOCOL.md
Outdated
- A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered). | ||
- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action. | ||
If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns. | ||
- When a clustering implementation clusters files, writers must incorporate a `tag` with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a tag the right mechanism for this? Tags are kind of an escape hatch for things that aren't in the protocol. But if clustering becomes a first-class citizen and it is part of the core protocol, then this can be a proper field of the addFile
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider
into AddFile
action. Please take a look it makes more sense now. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bart-samwel , @ryan-johnson-databricks , @imback82 , I updated the spec and removed the CLUSTERED_BY
tag and use a first-citizen field clusteringProvider
in AddFile
action. Can you take another look and let me know if anything I am missing. Thanks!
PROTOCOL.md
Outdated
- A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered). | ||
- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action. | ||
If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns. | ||
- When a clustering implementation clusters files, writers must incorporate a `tag` with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider
into AddFile
action. Please take a look it makes more sense now. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review... I was too slow I guess.
|
||
The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. | ||
This enhances query performance when selective filters are applied to these clustering columns through data skipping. | ||
Clustering columns must be specified during the initial definition of a clustered table, and they can be modified after the table has been created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why must clustering columns be specified during initial table definition, if they can anyway be modified after? Seems like either clustering is optional and can be added/changed at any time... or it's required an cannot change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should ALTER TABLE ... CLUSTER BY
automatically add the "clustering" table feature (only for unpartitioned tables)? I think that's the decision we need to make.
Is this something we can extend later on (basically removing this restriction)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the SQL syntax need to be part of the Delta spec? Seems like more of a Delta client thing? As in, there's nothing in the spec that requires nor forbids changing the clustering after a table was created... and the decision of one client to add/change clustering shouldn't impact other clients' ability to consume the resulting table? If we think there could be interactions, we should capture those in the spec.
For example, maybe it's ok to add/remove clustering columns but not cluster by a completely different set of columns? (but if we can add and remove... it just takes a few steps to completely change the column set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the question here is whether we should allow adding the clustering
table feature after table creation. The current wording, I think, is saying that we can only add the clustering
table feature during table creation. Any strong opinion here?
As long as the clustering
table feature is present, you are free to change clustering columns in any way (as long as they are in the schema).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ryan-johnson-databricks , let me know what do you think about the @imback82 's comment thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought, we shouldn't restrict adding the "clustering" table feature only during table creation.
I think it should say the clustering columns can be set during the table creation, or they can be altered subsequent to the table creation.
Then, we can update the "Enablement" section and add that the "clustering" feature can be added during creation or later as long as the table has no partition columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
"domainMetadata": { | ||
"domain": "delta.clustering", | ||
"configuration": "{\"clusteringColumns\":[{\"physicalName\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\"]}]}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems... baroque... object of array of object of array of string?
{
"clusteringColumns": [
{
"physicalName": [
"col-daadafd7-7c20-4697-98f8-bff70199b1f9"
]
}
]
}
The spec already says physical column names should be used, so can we just do e.g.:
"configuration": "{\"clusteringColumns\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\", \"col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e\"]}",
Also, it might be more readable to give the actual json form, and then state+show how we embed it (with escaping) as a string for the configuration
field?
{
"clusteringColumns": [
"col-daadafd7-7c20-4697-98f8-bff70199b1f9",
"col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e"
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, will update according to the suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, and addressed in the followup PR.
@@ -1735,6 +1777,7 @@ The following examples uses a table with two partition columns: "date" and "regi | |||
| |-- tags: map<string,string> | |||
| |-- baseRowId: long | |||
| |-- defaultRowCommitVersion: long | |||
| |-- clusteringProvider: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is given for a "table with two partition columns" -- Does the spec allow a table to be both partitioned and clustered? (I would have guessed not, but I don't see any language that forbids it).
If we do want to allow clustering we probably need to define how the two features interact. For example, I don't think it should be legal to cluster by and partition by an overlapping subset of columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be either partitioned or clustered. I think we should explicitly mention this in the spec. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely. Which may mean this example needs to be updated (split) to show both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks
@@ -414,6 +417,7 @@ The following is an example `add` action: | |||
"dataChange": true, | |||
"baseRowId": 4071, | |||
"defaultRowCommitVersion": 41, | |||
"clusteringProvider": "liquid", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As https://github.com/delta-io/delta/pull/2264/files#r1393446762 says, we cannot have both clusteringProvider
and partitionValues
set at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, where is that? I can't find it anywhere in this file?
https://github.com/dabao521/delta/blob/fe34ad1764a0ac212ab8c22e3d469e7f6541f598/PROTOCOL.md
(searching for all instances of partitionValues
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Search for "The following is an example add action"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks
Hi @ryan-johnson-databricks , @imback82 . Thanks for the review. I will create a follow up PR to address pending comments. |
#1874 requests Liquid clustering, and this PR starts the first step to introduce ClusteringTableFeature and CLUSTERED_BY tags. When creating a clustered table, The feature clustering must exist in the table protocol's writerFeatures. When a clustering implementation clusters files, writers must incorporate a tag with CLUSTERED_BY as the key and the name of the clustering implementation as the corresponding value in add action. More detail can be found in the Delta protocol change PR #2264 The next step is to pave the way to integrate the table feature and clusterby tags when defining and clustering a clustered table. Closes #2281 GitOrigin-RevId: e210b491a324a0794ec9f3a9236bb1932a6677e3
Which Delta project/connector is this regarding?
Description
We propose to introduce a new feature Clustered Table (
clustering
) to the Delta spec. The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. This enhances query performance when selective filters are applied to these clustering columns through data skipping. More details can be found in the github issue #1874.How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
No