Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Clustered Table to delta spec #2264

Closed
wants to merge 5 commits into from

Conversation

dabao521
Copy link
Contributor

@dabao521 dabao521 commented Nov 2, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

We propose to introduce a new feature Clustered Table (clustering) to the Delta spec. The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns. This enhances query performance when selective filters are applied to these clustering columns through data skipping. More details can be found in the github issue #1874.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

No

@tdas tdas changed the title [DELTA-OSS-EXTERNAL] Introduce Clustered Table to delta spec Introduce Clustered Table to delta spec Nov 2, 2023
Copy link
Collaborator

@rahulsmahadev rahulsmahadev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
@dabao521 dabao521 requested a review from imback82 November 5, 2023 22:00
PROTOCOL.md Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated Show resolved Hide resolved
@dabao521 dabao521 requested a review from bart-samwel November 9, 2023 22:56
PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated
- A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered).
- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action.
If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns.
- When a clustering implementation clusters files, writers must incorporate a `tag` with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a tag the right mechanism for this? Tags are kind of an escape hatch for things that aren't in the protocol. But if clustering becomes a first-class citizen and it is part of the core protocol, then this can be a proper field of the addFile. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider into AddFile action. Please take a look it makes more sense now. Thanks!

Copy link
Contributor Author

@dabao521 dabao521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bart-samwel , @ryan-johnson-databricks , @imback82 , I updated the spec and removed the CLUSTERED_BY tag and use a first-citizen field clusteringProvider in AddFile action. Can you take another look and let me know if anything I am missing. Thanks!

PROTOCOL.md Outdated Show resolved Hide resolved
PROTOCOL.md Outdated
- A clustering implementation must only cluster files that belong to the implementation or files that do not have the `CLUSTERED_BY` tag (i.e., unclustered).
- Writers must write out [per-file statistics](#per-file-statistics) and per-column statistics for clustering columns in `add` action.
If a new column is included in the clustering columns list, it is required for all table files to have statistics for these added columns.
- When a clustering implementation clusters files, writers must incorporate a `tag` with `CLUSTERED_BY` as the key and the name of the clustering implementation as the corresponding value in the `add` actions for the clustered files.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I just revised the proposal to introduce the first-class field clusteringProvider into AddFile action. Please take a look it makes more sense now. Thanks!

PROTOCOL.md Show resolved Hide resolved
Copy link
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review... I was too slow I guess.


The Clustered Table feature facilitates the physical clustering of rows that share similar values on a predefined set of clustering columns.
This enhances query performance when selective filters are applied to these clustering columns through data skipping.
Clustering columns must be specified during the initial definition of a clustered table, and they can be modified after the table has been created.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why must clustering columns be specified during initial table definition, if they can anyway be modified after? Seems like either clustering is optional and can be added/changed at any time... or it's required an cannot change?

Copy link
Contributor

@imback82 imback82 Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should ALTER TABLE ... CLUSTER BY automatically add the "clustering" table feature (only for unpartitioned tables)? I think that's the decision we need to make.

Is this something we can extend later on (basically removing this restriction)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the SQL syntax need to be part of the Delta spec? Seems like more of a Delta client thing? As in, there's nothing in the spec that requires nor forbids changing the clustering after a table was created... and the decision of one client to add/change clustering shouldn't impact other clients' ability to consume the resulting table? If we think there could be interactions, we should capture those in the spec.

For example, maybe it's ok to add/remove clustering columns but not cluster by a completely different set of columns? (but if we can add and remove... it just takes a few steps to completely change the column set).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question here is whether we should allow adding the clustering table feature after table creation. The current wording, I think, is saying that we can only add the clustering table feature during table creation. Any strong opinion here?

As long as the clustering table feature is present, you are free to change clustering columns in any way (as long as they are in the schema).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ryan-johnson-databricks , let me know what do you think about the @imback82 's comment thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, we shouldn't restrict adding the "clustering" table feature only during table creation.

I think it should say the clustering columns can be set during the table creation, or they can be altered subsequent to the table creation.

Then, we can update the "Enablement" section and add that the "clustering" feature can be added during creation or later as long as the table has no partition columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @imback82 for the input and I have reworded it in the followup PR. Please take a look.

{
"domainMetadata": {
"domain": "delta.clustering",
"configuration": "{\"clusteringColumns\":[{\"physicalName\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\"]}]}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems... baroque... object of array of object of array of string?

{ 
  "clusteringColumns": [
    {
      "physicalName": [
        "col-daadafd7-7c20-4697-98f8-bff70199b1f9"
      ]
    }
  ]
}

The spec already says physical column names should be used, so can we just do e.g.:

"configuration": "{\"clusteringColumns\":[\"col-daadafd7-7c20-4697-98f8-bff70199b1f9\", \"col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e\"]}",

Also, it might be more readable to give the actual json form, and then state+show how we embed it (with escaping) as a string for the configuration field?

{
  "clusteringColumns": [
    "col-daadafd7-7c20-4697-98f8-bff70199b1f9", 
    "col-5abe0e80-cf57-47ac-9ffc-a861a3d1077e"
  ]
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, will update according to the suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, and addressed in the followup PR.

@@ -1735,6 +1777,7 @@ The following examples uses a table with two partition columns: "date" and "regi
| |-- tags: map<string,string>
| |-- baseRowId: long
| |-- defaultRowCommitVersion: long
| |-- clusteringProvider: string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is given for a "table with two partition columns" -- Does the spec allow a table to be both partitioned and clustered? (I would have guessed not, but I don't see any language that forbids it).

If we do want to allow clustering we probably need to define how the two features interact. For example, I don't think it should be legal to cluster by and partition by an overlapping subset of columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be either partitioned or clustered. I think we should explicitly mention this in the spec. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely. Which may mean this example needs to be updated (split) to show both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks

@@ -414,6 +417,7 @@ The following is an example `add` action:
"dataChange": true,
"baseRowId": 4071,
"defaultRowCommitVersion": 41,
"clusteringProvider": "liquid",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As https://github.com/delta-io/delta/pull/2264/files#r1393446762 says, we cannot have both clusteringProvider and partitionValues set at the same time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, where is that? I can't find it anywhere in this file?
https://github.com/dabao521/delta/blob/fe34ad1764a0ac212ab8c22e3d469e7f6541f598/PROTOCOL.md
(searching for all instances of partitionValues)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search for "The following is an example add action"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new writer rule that disallows partitioned table here. Let's discuss in the follow up PR. thanks

@dabao521
Copy link
Contributor Author

Sorry for the late review... I was too slow I guess.

Hi @ryan-johnson-databricks , @imback82 . Thanks for the review. I will create a follow up PR to address pending comments.

allisonport-db pushed a commit that referenced this pull request Nov 16, 2023
#1874 requests Liquid clustering, and this PR starts the first step to introduce ClusteringTableFeature and CLUSTERED_BY tags.

When creating a clustered table, The feature clustering must exist in the table protocol's writerFeatures.

When a clustering implementation clusters files, writers must incorporate a tag with CLUSTERED_BY as the key and the name of the clustering implementation as the corresponding value in add action.

More detail can be found in the Delta protocol change PR #2264

The next step is to pave the way to integrate the table feature and clusterby tags when defining and clustering a clustered table.
Closes #2281

GitOrigin-RevId: e210b491a324a0794ec9f3a9236bb1932a6677e3
allisonport-db pushed a commit that referenced this pull request Nov 21, 2023
Follow up comments for ClusteringTable feature #2264

We had late comments after the original PR merged, and this PR is addressing those missed comments.

Closes #2294

GitOrigin-RevId: 12b53a4788a8ba7403cb72f60714bd380c742b04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants