Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49275][SQL] Fix return type nullness of the xpath expression #47796

Closed
wants to merge 2 commits into from

Conversation

chenhao-db
Copy link
Contributor

What changes were proposed in this pull request?

The xpath expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2).

Why are the changes needed?

It avoids potential failures in queries that uses the xpath expression.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new unit test. It would fail without the change in the PR.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 17, 2024
@chenhao-db
Copy link
Contributor Author

@HyukjinKwon @dongjoon-hyun Could you help review? Thanks!

@HyukjinKwon
Copy link
Member

I believe this was added for Hive compat. Can we check if the behaviour is the same?

@chenhao-db
Copy link
Contributor Author

@HyukjinKwon Unfortunately, the behavior is not the same. Running the query in Hive gives a different result:

0: jdbc:hive2://> select xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b');
+------+
| _c0  |
+------+
| []   |
+------+

This is a day-1 issue since the xpath expression was introduced to Spark. The root cause is that Hive drops null array elements (source), while Spark doesn't (source).

I'm not sure about the next step. We could make Spark consistent with Hive, but that would be a breaking change for Spark.

@HyukjinKwon
Copy link
Member

What's the result before this change in Spark? If it was throwing an exception before, would better match with Hive's for now.

@HyukjinKwon
Copy link
Member

If we can justify the difference behaviour from Hive, that works to me too.

@chenhao-db
Copy link
Contributor Author

In select xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b');, the Spark results before and after the change are the same: [null,null,null]. The change only marks the return type correctly, but doesn't change the return value.

However, in select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2), the query would fail with an NPE in generated code before the change, like:

Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
...

But will return an empty string after the change.

@HyukjinKwon
Copy link
Member

ah gotya so it's an existing behaviour okie gotya

@MaxGekk
Copy link
Member

MaxGekk commented Sep 1, 2024

@chenhao-db Do the previous Spark versions suffer from the issue too like branch-3.5?

@chenhao-db
Copy link
Contributor Author

@MaxGekk Yes, I believe this issue has existed since xpath was introduced.

@MaxGekk
Copy link
Member

MaxGekk commented Sep 2, 2024

+1, LGTM. Merging to master/3.5.
Thank you, @chenhao-db and @HyukjinKwon for review.

@MaxGekk MaxGekk closed this in ec7570e Sep 2, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Sep 2, 2024

@chenhao-db The changes cause conflicts in branch-3.5. Please, open a separate PR with backport.

chenhao-db added a commit to chenhao-db/oss that referenced this pull request Sep 2, 2024
The `xpath` expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.

It avoids potential failures in queries that uses the `xpath` expression.

No.

A new unit test. It would fail without the change in the PR.

No.

Closes apache#47796 from chenhao-db/fix_xpath_nullness.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Sep 4, 2024
### What changes were proposed in this pull request?

This is a cherry-pick of #47796.

The `xpath` expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.

### Why are the changes needed?

It avoids potential failures in queries that uses the `xpath` expression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47959 from chenhao-db/fix_xpath_nullness_3.5.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?

The `xpath` expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.

### Why are the changes needed?

It avoids potential failures in queries that uses the `xpath` expression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47796 from chenhao-db/fix_xpath_nullness.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

The `xpath` expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.

### Why are the changes needed?

It avoids potential failures in queries that uses the `xpath` expression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47796 from chenhao-db/fix_xpath_nullness.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
### What changes were proposed in this pull request?

The `xpath` expression incorrectly marks its return type as array of non-null strings. However, it can actually return an array containing nulls. This can cause NPE in code generation, such as query `select coalesce(xpath(repeat('<a></a>', id), 'a')[0], '') from range(1, 2)`.

### Why are the changes needed?

It avoids potential failures in queries that uses the `xpath` expression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47796 from chenhao-db/fix_xpath_nullness.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants