
[SPARK-48994][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec #47473

Closed · 17 commits

Conversation

harshmotw-db (Contributor):

What changes were proposed in this pull request?

This PR adds support for the YearMonthIntervalType and DayTimeIntervalType as new primitive types in the Variant spec. As part of this task, it adds support for casting between intervals and variants, and for interval types in all the relevant variant expressions. It also adds support for these types on the PySpark side.

Why are the changes needed?

The variant spec should be compatible with all SQL Standard data types.

Does this PR introduce any user-facing change?

Yes, it allows users to cast interval types to variants and vice versa.
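
For illustration, a minimal round-trip sketch of the new casts (my example, assuming a Spark build that includes this change and an active `SparkSession` named `spark`):

```scala
// Sketch only: cast an interval to VARIANT and back, per the casts this PR adds.
val df = spark.sql("SELECT CAST(INTERVAL '1-2' YEAR TO MONTH AS VARIANT) AS v")
df.selectExpr(
  "v",                                        // variant-typed column
  "CAST(v AS INTERVAL YEAR TO MONTH) AS ym"   // and back to an interval
).show(truncate = false)
```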

How was this patch tested?

Unit tests in VariantExpressionSuite.scala and test_types.py

Was this patch authored or co-authored using generative AI tooling?

Yes, I used perplexity.ai to get guidance on converting some Scala code to Java code and Java code to Python code.
Generated-by: perplexity.ai

@harshmotw-db harshmotw-db marked this pull request as ready for review July 24, 2024 20:03
gene-db (Contributor) left a comment:

@harshmotw-db Thanks for the features! I left a few comments.

| Primitive type | Type ID | Equivalent Spark type | Binary format |
|---|---|---|---|
| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| year-month interval | `19` | YearMonthIntervalType(start_field, end_field) | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
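
To make the bit layout concrete, here is a sketch (mine, not from the PR) of how the year-month payload in the row above could be packed, assuming field codes YEAR = 0 and MONTH = 1 and ignoring the enclosing variant value header that carries type ID `19`:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// One byte with the start field at bit 0 and the end field at bit 1 (my reading
// of "starting at LSB"), followed by the month count as a 4-byte little-endian int.
def encodeYearMonthIntervalPayload(months: Int, startField: Int, endField: Int): Array[Byte] =
  ByteBuffer.allocate(5)
    .order(ByteOrder.LITTLE_ENDIAN)
    .put(((endField << 1) | startField).toByte)
    .putInt(months)
    .array()
```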
Contributor:

What would the parquet types be for these intervals?

Also, we need to describe what the start/end fields are and how exactly they are encoded in the byte.

harshmotw-db (Contributor, Author):

I had mistakenly put the equivalent Spark types here earlier. I have removed the Parquet types for now while I investigate what they should be.

The details about the start and end fields are in a paragraph after this table in this PR.

harshmotw-db (Contributor, Author):

I ran the following Python script on a Parquet table containing these interval types and found that the intervals are stored natively as int32/int64, with the type information kept in the schema metadata. I'll update the table to reflect this.

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('/home/harsh.motwani/tables/part-00000-tid-8067172485220669242-1687c1be-9e28-455a-817a-449a862b4a05-0-1-c000.snappy.parquet')
>>> table.schema
ymi0: int32 not null
ymi1: int32 not null
ymi2: int32
dti0: int64
dti1: int64
-- schema metadata --
org.apache.spark.version: '4.0.0'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 375

>>> table.schema.metadata
OrderedDict([(b'org.apache.spark.version', b'4.0.0'), (b'org.apache.spark.sql.parquet.row.metadata', b'{"type":"struct","fields":[{"name":"ymi0","type":"interval year to month","nullable":false,"metadata":{}},{"name":"ymi1","type":"interval year","nullable":false,"metadata":{}},{"name":"ymi2","type":"interval month","nullable":true,"metadata":{}},{"name":"dti0","type":"interval day to second","nullable":true,"metadata":{}},{"name":"dti1","type":"interval hour to minute","nullable":true,"metadata":{}}]}')])
```

itholic (Contributor) commented Jul 25, 2024:

nit: [PySpark] in the title should be [PYTHON]. You can see the full list of PR categories at https://spark-prs.appspot.com/ :-)

@harshmotw-db harshmotw-db changed the title [SPARK-45891][SQL][PySpark][VARIANT] Add support for interval types in the Variant Spec [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec Jul 25, 2024
```java
import java.math.BigDecimal;
import java.util.ArrayList;

// Replicating code from SparkIntervalUtils so code in the 'common' space can work with
```
Contributor:

Why replicate? Other modules already depend on common/utils, so I think it would be fine to move the interval utils into this module.

harshmotw-db (Contributor, Author) replied Jul 25, 2024:

I agree. However, I believe that should be a separate PR, as it would be a big change in itself. For the purposes of this PR, these functions only support the ANSI style, while the functions on the SQL side sometimes also expect the Hive style.
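
(For context, ANSI-style interval strings look like the literals below; the values are my own illustration, not from the PR.)

```scala
// Illustrative ANSI-style interval literals (example values are mine):
val ansiYearMonth = "INTERVAL '1-2' YEAR TO MONTH"                // 1 year 2 months
val ansiDayTime   = "INTERVAL '1 02:03:04.123456' DAY TO SECOND"  // 1 day, 2h 3m 4.123456s
```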

gene-db (Contributor) left a comment:

@harshmotw-db Thanks for this feature! I left a few more questions/comments.

@harshmotw-db harshmotw-db requested a review from gene-db July 25, 2024 18:19
HyukjinKwon (Member):

Merged to master.

```java
.stripTrailingZeros().toPlainString());
}
return prefix + String.format(formatBuilder.toString(), formatArgs.toArray()) + postfix;
} catch (SparkException e) {
```
Contributor:

The try-catch here seems a bit redundant. Why do we need to catch the SparkException only to rethrow it?

```java
rest %= MICROS_PER_MINUTE;
} else if (startField == SECOND) {
String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
formatBuilder.append(leadZero + BigDecimal.valueOf(rest, 6)
```
Contributor:

Should use chained calls, like:

```java
formatBuilder.append(leadZero).append(BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString());
```

Otherwise `leadZero + BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString()` results in an extra intermediate string concatenation.
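
As a worked example of what that seconds expression computes (standard `java.math.BigDecimal` behavior; the value is mine):

```scala
import java.math.BigDecimal

// BigDecimal.valueOf(unscaled, scale) reads `rest` microseconds as seconds.
val rest = 7250000L  // 7.25 seconds, in microseconds
val secs = BigDecimal.valueOf(rest, 6).stripTrailingZeros().toPlainString()
// secs == "7.25"; since rest < 10 * MICROS_PER_SECOND, leadZero pads it to "07.25"
```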

```java
}
if (startField < SECOND && SECOND <= endField) {
String leadZero = rest < 10 * MICROS_PER_SECOND ? "0" : "";
formatBuilder.append(":" + leadZero + BigDecimal.valueOf(rest, 6)
```
Contributor:

ditto

```java
}
}
StringBuilder formatBuilder = new StringBuilder(sign);
ArrayList<Long> formatArgs = new ArrayList<>();
```
Contributor:

```java
List<Long> formatArgs = new ArrayList<>();
```

(i.e., declare against the `List` interface rather than the concrete `ArrayList` type.)


```java
import org.apache.spark.SparkException;

import java.math.BigDecimal;
```
Contributor:

The import order rule in the newly added Java file should be consistent with that of the Scala files.
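
For reference, my reading of Spark's import-order convention is: java/javax first, then scala, then third-party libraries, then `org.apache.spark`, with blank lines between groups. Under that rule the excerpt above would be reordered like this (the Java file would use the same order, with semicolons):

```scala
// Expected grouping (sketch): java imports first, org.apache.spark last.
import java.math.BigDecimal
import java.util.ArrayList

import org.apache.spark.SparkException
```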

```java
String firstStr = "-" + (startField == DAY ? Long.toString(MAX_DAY) :
(startField == HOUR ? Long.toString(MAX_HOUR) :
(startField == MINUTE ? Long.toString(MAX_MINUTE) :
Long.toString(MAX_SECOND) + ".775808")));
```
Contributor:

```java
MAX_SECOND + ".775808"
```
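
For the record, the `.775808` fraction falls out of `Long.MinValue` arithmetic (my check, assuming `MAX_SECOND` is `Long.MaxValue / MICROS_PER_SECOND`):

```scala
// The most negative day-time interval is Long.MinValue microseconds, and
// |Long.MinValue| = 9223372036854775808 µs = 9223372036854.775808 s, so the
// seconds part of the boundary string ends in ".775808".
val microsMagnitude = BigInt(Long.MinValue).abs
assert(microsMagnitude == BigInt("9223372036854775808"))
assert(Long.MaxValue / 1000000L == 9223372036854L)  // assumed MAX_SECOND
```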

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024

harshmotw-db (Contributor, Author):

@LuciferYang Thanks for the comments! I'll address them in a future follow-up PR.

fusheng9399 pushed a commit to fusheng9399/spark that referenced this pull request Aug 6, 2024

harshmotw-db (Contributor, Author):

Hi @LuciferYang, I have made your requested changes in this PR.

cloud-fan pushed a commit that referenced this pull request Aug 17, 2024
### What changes were proposed in this pull request?

The minor post-merge comments from #47473 have been addressed

### Why are the changes needed?

Improved code style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing VariantExpressionSuite passes

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47792 from harshmotw-db/harshmotw-db/PR_fix.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@harshmotw-db harshmotw-db changed the title [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec [SPARK-48994][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec Aug 26, 2024
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024