nested fields count towards limit of stats calculation of first 32 columns #3172
Comments
I believe nested fields are included in the count of fields used for statistics. The Databricks liquid clustering docs describe the data types that can be used as a clustering key, and those columns must have statistics captured: https://learn.microsoft.com/en-us/azure/databricks/delta/clustering#choose-clustering-keys. That is why arrays and maps do not capture stats. There is additional info on choosing stat columns, which can include the nested fields of a struct. I'm not sure if there's any difference with OSS Delta, but I would think not.
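For reference, a minimal sketch (Spark / Databricks side, not delta-rs) of how the stats columns can be chosen explicitly; this assumes an active SparkSession against a Delta table, uses the table name from the snippet below, and the property values are only example assumptions. delta.dataSkippingNumIndexedCols controls the "first N columns" limit, while delta.dataSkippingStatsColumns names the exact columns (nested struct fields allowed) to collect stats for.
# Raise the "first N columns" limit (example value):
spark.sql("""
    ALTER TABLE my_catalog.default.my_temp_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")
# Or, alternatively, list the stats columns explicitly (example columns):
spark.sql("""
    ALTER TABLE my_catalog.default.my_temp_table
    SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'year,month,day')
""")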
I have verified how it is done with Spark-Delta, and it is the first 32 columns, even if a nested column is of ArrayType/StructType. So the problem does not appear on the Spark-Delta side:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
nested_schema = StructType([
StructField("2", IntegerType(), True),
StructField("3", IntegerType(), True),
StructField("4", IntegerType(), True),
StructField("5", IntegerType(), True),
StructField("6", StringType(), True),
StructField("7", StringType(), True),
StructField("8", StringType(), True),
StructField("9", StringType(), True),
StructField("10", StringType(), True),
StructField("11", StringType(), True),
StructField("12", StringType(), True),
StructField("13", StringType(), True),
StructField("14", StringType(), True),
StructField("15", StringType(), True),
StructField("16", StringType(), True),
StructField("17", StringType(), True),
StructField("18", StringType(), True),
StructField("19", StringType(), True),
StructField("20", StringType(), True),
StructField("21", StringType(), True),
StructField("22", StringType(), True),
StructField("23", StringType(), True),
StructField("24", StringType(), True),
StructField("25", StringType(), True),
StructField("26", StringType(), True),
StructField("27", StringType(), True),
StructField("28", StringType(), True),
StructField("29", StringType(), True),
StructField("30", StringType(), True),
StructField("31", StringType(), True),
StructField("32", StringType(), True)
])
schema = StructType([
StructField("1", StringType(), True),
StructField("nested", ArrayType(nested_schema), True),
StructField("year", IntegerType(), True),
StructField("month", IntegerType(), True),
StructField("day", IntegerType(), True)
])
data = [("foo", [], 2024, 12, 1)]
df = spark.createDataFrame(data, schema)
df.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("my_catalog.default.my_temp_table") |
Environment

Delta-rs version: 0.24.0
Binding: Python
Environment:

Bug
What happened:
When table stats are calculated, only the first 32 columns are included. However, if a column has nested fields, those fields count towards the 32-column limit.
Additionally, stats are not calculated at all if the column is a list of structs (but that is maybe intended?).
What you expected to happen:
That a nested field only counts once, and table stats are calculated for the first 32 columns at the root level.
How to reproduce it:
See the sketch below. The interesting part of the resulting transaction file is the stats field of the add transaction.
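A hedged delta-rs sketch of a table with the same shape as the Spark example above (the reporter's exact script is not included here; the local path and string field types are assumptions). Per this report, with delta-rs 0.24 the 31 nested fields use up the 32-column budget, so later root columns are left without stats.
import pyarrow as pa
from deltalake import write_deltalake

# One array-of-struct column with 31 nested fields, plus a few root-level columns.
nested_type = pa.list_(pa.struct([pa.field(str(i), pa.string()) for i in range(2, 33)]))
schema = pa.schema([
    pa.field("1", pa.string()),
    pa.field("nested", nested_type),
    pa.field("year", pa.int32()),
    pa.field("month", pa.int32()),
    pa.field("day", pa.int32()),
])
table = pa.table(
    {"1": ["foo"], "nested": [[]], "year": [2024], "month": [12], "day": [1]},
    schema=schema,
)
write_deltalake("/tmp/my_temp_table", table, mode="overwrite")  # hypothetical local path
# Then inspect the newest _delta_log/*.json file as in the snippet above.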
More details:

Slack conversation: https://delta-users.slack.com/archives/C013LCAEB98/p1738184140820519