Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Spark assert_true and raise_error function support #5991

Open
gaoyangxiaozhu opened this issue Jun 5, 2024 · 11 comments
Open

[Core] Spark assert_true and raise_error function support #5991

gaoyangxiaozhu opened this issue Jun 5, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@gaoyangxiaozhu
Copy link
Contributor

gaoyangxiaozhu commented Jun 5, 2024

Description

spark assert_true and raise_error function not compatible with current velox type system and function register interface.
as below query a example,

val df = spark.sql("select assert_true(1 > 0) as v, id from t")
df: org.apache.spark.sql.DataFrame = [v: void, id: bigint]
df.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(v,NullType,true),StructField(id,LongType,true))

it would generate void type and NullType in output schema which would convert to unknown type in velox, for raise_error, it expect void return type but velox function always expect a non-void return type.

One way i tried is convert raise_error from void return type to string type but it need we update not only expression itself but also the output type schema also from NullType to StringType, otherwise would encounter mem allocation issue based on my local test.

Update:
RaiseError's data type is NullType. Ditto for AssertTrue as it is replaced by IF + RaiseError.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L82

@zhouyuan / @PHILO-HE / @FelixYBW

@gaoyangxiaozhu gaoyangxiaozhu added the enhancement New feature or request label Jun 5, 2024
@FelixYBW
Copy link
Contributor

FelixYBW commented Jun 5, 2024

Is the two function implemented in Presto list or Spark list?

@gaoyangxiaozhu
Copy link
Contributor Author

want to use this item to track how do support for those specific spark functions, @zhouyuan / @PHILO-HE / @FelixYBW

could you put your thought here, actully the issue can be summaried as how to support NullType schema and void return type spark function.

@gaoyangxiaozhu
Copy link
Contributor Author

Is the two function implemented in Presto list or Spark list?

spark list https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.assert_true.html

@gaoyangxiaozhu
Copy link
Contributor Author

looks if we can make a generic solution to let gluten always call vanilla spark to project values for the function, it would be nice. like facebookincubator/velox#9957 encounter similar issue

@FelixYBW
Copy link
Contributor

FelixYBW commented Jun 5, 2024

I see. just talked with @zhli1142015 on the interval issue.

We need a fundamental way to test all the Spark functions. @rui-mo Can your velox_Spark fuzzer test achieve this?

@FelixYBW
Copy link
Contributor

FelixYBW commented Jun 5, 2024

looks if we can make a generic solution to let gluten always call vanilla spark to project values for the function, it would be nice. like facebookincubator/velox#9957 encounter similar issue

It's different issue from input_file_name(). If we can implement the function in Velox, we should do that firstly.

@gaoyangxiaozhu
Copy link
Contributor Author

looks if we can make a generic solution to let gluten always call vanilla spark to project values for the function, it would be nice. like facebookincubator/velox#9957 encounter similar issue

It's different issue from input_file_name(). If we can implement the function in Velox, we should do that firstly.

so please check my this try

One way i tried is convert raise_error from void return type to string type but it need we update not only expression itself but also the output type schema also from NullType to StringType, otherwise would encounter mem allocation issue based on my local test.
``

this need one change to apply if we detect there is a `NullType` in output schema, then we need convert it always `StringType`, does that change ok ? 

@FelixYBW
Copy link
Contributor

FelixYBW commented Jun 5, 2024

this need one change to apply if we detect there is a NullType in output schema, then we need convert it always StringType, does that change ok ?

No, we should use the same data type as Spark.

Can assert_true return unknown_type in velox? null_type is implemented as unknown_type in velox.

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jun 5, 2024

@gaoyangxiaozhu, velox doesn't support UnknownType (NullType) for function output? I remember this type is supported for input.

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Jun 5, 2024

@gaoyangxiaozhu, velox doesn't support UnknownType (NullType) for function output? I remember this type is supported for input.

got ! let me check and try! thank you @PHILO-HE / @FelixYBW

@rui-mo
Copy link
Contributor

rui-mo commented Jun 5, 2024

We need a fundamental way to test all the Spark functions. @rui-mo Can your velox_Spark fuzzer test achieve this?

@FelixYBW With this tool, the result of a Velox Spark function can be compared with the result of vanilla Spark. Currently we are working on enabling it in aggregate fuzzer test.

facebook-github-bot pushed a commit to facebookincubator/velox that referenced this issue Jul 3, 2024
Summary:
This PR adds Spark `raise_error` function. It is re-used in Spark `assert_true`
function, so after this PR `assert_true` can also be supported.

Spark document: https://spark.apache.org/docs/latest/api/sql/#raise_error
Spark code: https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L85C1-L92C1
Issue: apache/incubator-gluten#5991

Pull Request resolved: #10110

Reviewed By: kevinwilfong

Differential Revision: D59252851

Pulled By: bikramSingh91

fbshipit-source-id: d41347bed8265825bb415e8604179b38ecc89bc8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants