[SPARK-48994][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec #47473
Conversation
@harshmotw-db Thanks for the features! I left a few comments.
common/variant/README.md

| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| year-month interval | `19` | YearMonthIntervalType(start_field, end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
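To make the layout in the row above concrete, here is a hedged Python sketch of the year-month interval payload. The field codes `YEAR = 0, MONTH = 1` are assumptions mirroring Spark's `YearMonthIntervalType`, and the variant value's own type tag is omitted; this illustrates only the header byte plus the 4-byte little-endian month count.

```python
import struct

# Assumed field codes, mirroring YearMonthIntervalType (not confirmed by this PR).
YEAR, MONTH = 0, 1

def encode_year_month_interval(months, start_field=YEAR, end_field=MONTH):
    # One header byte packs start_field (bit 0) and end_field (bit 1),
    # followed by a 4-byte little-endian signed month count.
    header = (start_field & 0x1) | ((end_field & 0x1) << 1)
    return struct.pack("<Bi", header, months)

def decode_year_month_interval(buf):
    header, months = struct.unpack("<Bi", buf)
    return header & 0x1, (header >> 1) & 0x1, months
```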
What would the parquet types be for these intervals?
Also, we need to describe what the start/end fields are and how exactly they are encoded in the byte.
I had mistakenly put the equivalent Spark types here earlier. I have removed the Parquet types for now while I investigate them.
The details about the start and end fields are in a paragraph after this table in this PR.
I ran the following Python script on a parquet table containing these interval types and found that these intervals are intrinsically stored as int/long and the type info is stored in the metadata. I'll update the table to reflect this.
```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('/home/harsh.motwani/tables/part-00000-tid-8067172485220669242-1687c1be-9e28-455a-817a-449a862b4a05-0-1-c000.snappy.parquet')
>>> table.schema
ymi0: int32 not null
ymi1: int32 not null
ymi2: int32
dti0: int64
dti1: int64
-- schema metadata --
org.apache.spark.version: '4.0.0'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 375
>>> table.schema.metadata
OrderedDict([(b'org.apache.spark.version', b'4.0.0'), (b'org.apache.spark.sql.parquet.row.metadata', b'{"type":"struct","fields":[{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},{"name":"ymi1","type":"interval year","nullable":false,"metadata":{}},{"name":"ymi2","type":"interval month","nullable":true,"metadata":{}},{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}},{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}')])
```
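As the dump above shows, the logical interval type lives only in the `org.apache.spark.sql.parquet.row.metadata` JSON, while the physical Parquet type is a plain int32/int64. The following sketch (field names and JSON copied from the output above, abridged to two fields) shows how the logical interval types can be recovered from that metadata:

```python
import json

# Schema metadata as dumped above (abridged to two of the five fields).
row_metadata = (
    b'{"type":"struct","fields":['
    b'{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},'
    b'{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}}]}'
)

schema = json.loads(row_metadata)
# Map each column name to its logical Spark type; the interval columns are the
# ones whose physical Parquet storage is a bare int32/int64.
logical_types = {f["name"]: f["type"] for f in schema["fields"]}
interval_cols = {n: t for n, t in logical_types.items() if t.startswith("interval")}
```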
nit:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
```

```java
// Replicating code from SparkIntervalUtils so code in the 'common' space can work with
```
Why replicate? I think other modules depend on common/utils, and it's fine to move the interval utils into this module.
I agree. However, I believe that should be a different PR, as it would be a big change in itself. For the purposes of this PR, these functions only support ANSI style, while the functions in SQL sometimes also expect Hive style.
@harshmotw-db Thanks for this feature! I left a few more questions/comments.
Merged to master.

```java
    .stripTrailingZeros().toPlainString());
  }
  return prefix + String.format(formatBuilder.toString(), formatArgs.toArray()) + postfix;
} catch (SparkException e) {
```
The try-catch here seems a bit redundant. Why do we need to catch the SparkException only to rethrow it?
```java
  rest %= MICROS_PER_MINUTE;
} else if (startField == SECOND) {
  String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
  formatBuilder.append(leadZero + BigDecimal.valueOf(rest, 6)
```
Should use chained calls, like:

```java
formatBuilder.append(leadZero).append(BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString());
```

otherwise `leadZero + BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString()` will result in another string concatenation.
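For reference, the seconds-formatting behavior under discussion (leading zero for single-digit seconds, fractional part with trailing zeros stripped) can be sketched in Python. This is an illustration only, not Spark's code:

```python
from decimal import Decimal

MICROS_PER_SECOND = 1_000_000

def format_seconds(rest_micros):
    # Single-digit seconds get a leading zero; the microsecond count is rendered
    # as a decimal with trailing zeros stripped, analogous to the Java snippet's
    # BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString().
    lead_zero = "0" if rest_micros < 10 * MICROS_PER_SECOND else ""
    plain = format(Decimal(rest_micros).scaleb(-6).normalize(), "f")
    return lead_zero + plain
```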
```java
  }
  if (startField < SECOND && SECOND <= endField) {
    String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
    formatBuilder.append(":" + leadZero + BigDecimal.valueOf(rest, 6)
```
ditto
```java
  }
}
StringBuilder formatBuilder = new StringBuilder(sign);
ArrayList<Long> formatArgs = new ArrayList<>();
```
```java
List<Long> formatArgs = new ArrayList<>();
```
```java
import org.apache.spark.SparkException;

import java.math.BigDecimal;
```
The import order rule in the newly added Java file should be consistent with that of the Scala files.
```java
String firstStr = "-" + (startField == DAY ? Long.toString(MAX_DAY) :
    (startField == HOUR ? Long.toString(MAX_HOUR) :
    (startField == MINUTE ? Long.toString(MAX_MINUTE) :
    Long.toString(MAX_SECOND) + ".775808")));
```
`MAX_SECOND + ".775808"`
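The `.775808` suffix handles the `Long.MIN_VALUE` edge case: the magnitude of the most negative microsecond count does not fit in a signed 64-bit long, so the minimum interval string has to be spelled out from `MAX_SECOND` plus the leftover fraction rather than by negating the value. A quick Python check (constant names here are assumed to mirror the Java code):

```python
MICROS_PER_SECOND = 1_000_000
LONG_MIN = -2**63
LONG_MAX = 2**63 - 1
MAX_SECOND = LONG_MAX // MICROS_PER_SECOND  # whole seconds in Long.MAX_VALUE micros

# |Long.MIN_VALUE| micros == MAX_SECOND whole seconds plus 775808 leftover micros,
# which is why the minimum value prints as "-" + MAX_SECOND + ".775808".
min_interval_str = "-" + str(MAX_SECOND) + ".775808"
```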
@LuciferYang Thanks for the comments! I'll address them in a future follow-up PR.
Hi @LuciferYang, I have made your requested changes in this PR.
What changes were proposed in this pull request?
The minor post-merge comments from #47473 have been addressed.
Why are the changes needed?
Improved code style.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing VariantExpressionSuite passes.
Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47792 from harshmotw-db/harshmotw-db/PR_fix.
Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR adds support for the YearMonthIntervalType and DayTimeIntervalType as new primitive types in the Variant spec. As part of this task, the PR adds support for casting between intervals and variants and support for interval types in all the relevant variant expressions. This PR also adds support for these types on the PySpark side.
Why are the changes needed?
The variant spec should be compatible with all SQL Standard data types.
Does this PR introduce any user-facing change?
Yes, it allows users to cast interval types to variants and vice versa.
How was this patch tested?
Unit tests in VariantExpressionSuite.scala and test_types.py
Was this patch authored or co-authored using generative AI tooling?
Yes, I used perplexity.ai to get guidance on converting some Scala code to Java code and Java code to Python code.
Generated-by: perplexity.ai