
[SPARK-48994][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec #47473

Closed · 17 commits

Conversation

harshmotw-db (Contributor):

What changes were proposed in this pull request?

This PR adds support for the YearMonthIntervalType and DayTimeIntervalType as new primitive types in the Variant spec. As part of this task, it adds support for casting between intervals and variants, and for interval types in all the relevant variant expressions. It also adds support for these types on the PySpark side.

Why are the changes needed?

The variant spec should be compatible with all SQL Standard data types.

Does this PR introduce any user-facing change?

Yes, it allows users to cast interval types to variants and vice versa.
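
For illustration, a minimal round-trip sketch of the new casts (my example, assuming a Spark build that includes this change and an active `SparkSession` named `spark`):

```scala
// Sketch only: cast an interval to VARIANT and back, per the casts this PR adds.
val df = spark.sql("SELECT CAST(INTERVAL '1-2' YEAR TO MONTH AS VARIANT) AS v")
df.selectExpr(
  "v",                                        // variant-typed column
  "CAST(v AS INTERVAL YEAR TO MONTH) AS ym"   // and back to an interval
).show(truncate = false)
```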

How was this patch tested?

Unit tests in VariantExpressionSuite.scala and test_types.py

Was this patch authored or co-authored using generative AI tooling?

Yes, I used perplexity.ai to get guidance on converting some Scala code to Java code and Java code to Python code.
Generated-by: perplexity.ai

@harshmotw-db harshmotw-db marked this pull request as ready for review July 24, 2024 20:03
gene-db (Contributor) left a comment:

@harshmotw-db Thanks for the features! I left a few comments.

| Primitive type | Type ID | Equivalent Spark type | Binary format |
|---|---|---|---|
| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| year-month interval | `19` | YearMonthIntervalType(start_field, end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
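
To make the bit layout concrete, here is a sketch (mine, not from the PR) of how the year-month payload in the row above could be packed, assuming field codes YEAR = 0 and MONTH = 1 and ignoring the enclosing variant value header that carries type ID `19`:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// One byte with the start field at bit 0 and the end field at bit 1 (my reading
// of "starting at LSB"), followed by the month count as a 4-byte little-endian int.
def encodeYearMonthIntervalPayload(months: Int, startField: Int, endField: Int): Array[Byte] =
  ByteBuffer.allocate(5)
    .order(ByteOrder.LITTLE_ENDIAN)
    .put(((endField << 1) | startField).toByte)
    .putInt(months)
    .array()
```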
Contributor:

What would the parquet types be for these intervals?

Also, we need to describe what the start/end fields are and how exactly they are encoded in the byte.

harshmotw-db (Contributor, Author):

I had mistakenly put the equivalent Spark types here earlier. I have removed the Parquet types for now while I investigate what they should be.

The details about the start and end fields are in a paragraph after this table in this PR.

harshmotw-db (Contributor, Author):

I ran the following Python script on a Parquet table containing these interval types and found that the intervals are stored natively as int32/int64, with the type information kept in the schema metadata. I'll update the table to reflect this.

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('/home/harsh.motwani/tables/part-00000-tid-8067172485220669242-1687c1be-9e28-455a-817a-449a862b4a05-0-1-c000.snappy.parquet')
>>> table.schema
ymi0: int32 not null
ymi1: int32 not null
ymi2: int32
dti0: int64
dti1: int64
-- schema metadata --
org.apache.spark.version: '4.0.0'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 375

>>> table.schema.metadata
OrderedDict([(b'org.apache.spark.version', b'4.0.0'), (b'org.apache.spark.sql.parquet.row.metadata', b'{"type":"struct","fields":[{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},{"name":"ymi1","type":"interval year","nullable":false,"metadata":{}},{"name":"ymi2","type":"interval month","nullable":true,"metadata":{}},{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}},{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}')])
```

itholic (Contributor) commented Jul 25, 2024:

nit: [PySpark] in the title should be [PYTHON]. You can see the full list of PR categories at https://spark-prs.appspot.com/ :-)

@harshmotw-db harshmotw-db changed the title [SPARK-45891][SQL][PySpark][VARIANT] Add support for interval types in the Variant Spec [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec Jul 25, 2024
```java
import java.math.BigDecimal;
import java.util.ArrayList;

// Replicating code from SparkIntervalUtils so code in the 'common' space can work with
```
Contributor:

Why replicate? Other modules already depend on common/utils, so I think it would be fine to move the interval utils into this module.

harshmotw-db (Contributor, Author) replied Jul 25, 2024:

I agree. However, I believe that should be a separate PR, as it would be a big change in itself. For the purposes of this PR, these functions only support the ANSI style, while the functions on the SQL side sometimes also expect the Hive style.
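
(For context, ANSI-style interval strings look like the literals below; the values are my own illustration, not from the PR.)

```scala
// Illustrative ANSI-style interval literals (example values are mine):
val ansiYearMonth = "INTERVAL '1-2' YEAR TO MONTH"                // 1 year 2 months
val ansiDayTime   = "INTERVAL '1 02:03:04.123456' DAY TO SECOND"  // 1 day, 2h 3m 4.123456s
```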

gene-db (Contributor) left a comment:

@harshmotw-db Thanks for this feature! I left a few more questions/comments.

@harshmotw-db harshmotw-db requested a review from gene-db July 25, 2024 18:19
HyukjinKwon (Member):

Merged to master.

```java
.stripTrailingZeros().toPlainString());
}
return prefix + String.format(formatBuilder.toString(), formatArgs.toArray()) + postfix;
} catch (SparkException e) {
```
Contributor:

The try-catch here seems a bit redundant. Why do we need to catch the SparkException only to rethrow it?

```java
rest %= MICROS_PER_MINUTE;
} else if (startField == SECOND) {
String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
formatBuilder.append(leadZero + BigDecimal.valueOf(rest, 6)
```
Contributor:

Should use chained calls, like:

```java
formatBuilder.append(leadZero).append(BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString());
```

Otherwise `leadZero + BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString()` results in an extra intermediate string concatenation.
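
As a worked example of what that seconds expression computes (standard `java.math.BigDecimal` behavior; the value is mine):

```scala
import java.math.BigDecimal

// BigDecimal.valueOf(unscaled, scale) reads `rest` microseconds as seconds.
val rest = 7250000L  // 7.25 seconds, in microseconds
val secs = BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString()
// secs == "7.25"; since rest < 10 * MICROS_PER_SECOND, leadZero pads it to "07.25"
```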

```java
}
if (startField < SECOND && SECOND <= endField) {
String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
formatBuilder.append(":" + leadZero + BigDecimal.valueOf(rest, 6)
```
Contributor:

ditto

```java
}
}
StringBuilder formatBuilder = new StringBuilder(sign);
ArrayList<Long> formatArgs = new ArrayList<>();
```
Contributor:

```java
List<Long> formatArgs = new ArrayList<>();
```

(i.e., declare against the `List` interface rather than the concrete `ArrayList` type.)


```java
import org.apache.spark.SparkException;

import java.math.BigDecimal;
```
Contributor:

The import order rule in the newly added Java file should be consistent with that of the Scala files.
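
For reference, my reading of Spark's import-order convention is: java/javax first, then scala, then third-party libraries, then `org.apache.spark`, with blank lines between groups. Under that rule the excerpt above would be reordered like this (the Java file would use the same order, with semicolons):

```scala
// Expected grouping (sketch): java imports first, org.apache.spark last.
import java.math.BigDecimal
import java.util.ArrayList

import org.apache.spark.SparkException
```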

```java
String firstStr = "-" + (startField == DAY ? Long.toString(MAX_DAY) :
(startField == HOUR ? Long.toString(MAX_HOUR) :
(startField == MINUTE ? Long.toString(MAX_MINUTE) :
Long.toString(MAX_SECOND) + ".775808")));
```
Contributor:

```java
MAX_SECOND + ".775808"
```
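
For the record, the `.775808` fraction falls out of `Long.MinValue` arithmetic (my check, assuming `MAX_SECOND` is `Long.MaxValue / MICROS_PER_SECOND`):

```scala
// The most negative day-time interval is Long.MinValue microseconds, and
// |Long.MinValue| = 9223372036854775808 µs = 9223372036854.775808 s, so the
// seconds part of the boundary string ends in ".775808".
val microsMagnitude = BigInt(Long.MinValue).abs
assert(microsMagnitude == BigInt("9223372036854775808"))
assert(Long.MaxValue / 1000000L == 9223372036854L)  // assumed MAX_SECOND
```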

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024

harshmotw-db (Contributor, Author):

@LuciferYang Thanks for the comments! I'll address them in a future follow-up PR.

fusheng9399 pushed a commit to fusheng9399/spark that referenced this pull request Aug 6, 2024

harshmotw-db (Contributor, Author):

Hi @LuciferYang, I have made your requested changes in this PR.

cloud-fan pushed a commit that referenced this pull request Aug 17, 2024
### What changes were proposed in this pull request?

The minor post-merge comments from #47473 have been addressed

### Why are the changes needed?

Improved code style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing VariantExpressionSuite passes

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47792 from harshmotw-db/harshmotw-db/PR_fix.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@harshmotw-db harshmotw-db changed the title [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [SPARK-48994][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec Aug 26, 2024
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024