
[SPARK-49451] Allow duplicate keys in parse_json. #47920

Closed
wants to merge 3 commits into master from chenhao-db:allow_duplicate_keys

Conversation

@chenhao-db (Contributor) commented Aug 28, 2024:

What changes were proposed in this pull request?

Before this change, `parse_json` throws an error if there are duplicate keys in an input JSON object. After this change, `parse_json` keeps the last field with a given key. It doesn't affect the other variant-building expressions (creating a variant from a struct/map/variant), because it is legal for them to contain duplicate keys.

The change is guarded by a flag and disabled by default.
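
For illustration, a minimal spark-shell sketch of the intended behavior. The config key `spark.sql.variant.allowDuplicateKeys` is an assumption for this example (the description above only says the behavior is guarded by a flag), and the exact error and output rendering may differ:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumed config key; the PR only states that the behavior is guarded by a
// flag that is disabled by default.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "true")

// With the flag enabled, the last occurrence of a duplicate key wins.
spark.sql("""SELECT parse_json('{"a": 1, "b": 2, "a": 4}') AS v""").show(truncate = false)
// expected (roughly): {"a":4,"b":2}

// With the flag disabled (the default), parse_json keeps rejecting such
// input with an error, as before this change.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "false")
```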

Why are the changes needed?

To make data migration simpler: users won't need to change their data if it contains duplicate keys. The behavior is inspired by https://docs.aws.amazon.com/redshift/latest/dg/super-configurations.html#parsing-options-super (reject duplicate keys or keep the last occurrence).

Does this PR introduce any user-facing change?

Yes, as described in the first section.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label on Aug 28, 2024
@chenhao-db force-pushed the allow_duplicate_keys branch from 403fe5a to 5e09539 on August 28, 2024 21:47
@chenhao-db (Contributor, Author):

@cloud-fan could you help review? thanks!

@@ -89,6 +89,12 @@ class VariantExpressionEvalUtilsSuite extends SparkFunSuite {
/* offset list */ 0, 2, 4, 6,
/* field data */ primitiveHeader(INT1), 1, primitiveHeader(INT1), 2, shortStrHeader(1), '3'),
Array(VERSION, 3, 0, 1, 2, 3, 'a', 'b', 'c'))
check("""{"a": 1, "b": 2, "c": "3", "a": 4}""", Array(objectHeader(false, 1, 1),
Member:
Standard JSON can't have duplicate keys, no?

Member:
RFC 8259 says explicitly:
"The names within an object SHOULD be unique."

@chenhao-db (Contributor, Author) commented Aug 29, 2024:
I agree that a JSON object is invalid if it contains duplicate keys. However, our implementation is not required to throw an error for this invalid input. As stated in the RFC:

Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

It seems fair to follow the "many implementations".

As a side note, `from_json` also follows the last-wins policy rather than throwing an error. It is not even configurable (you cannot make it throw an error).

spark-sql (default)> select from_json('{"a": 1, "a": 2, "a": 3}', 'a int');
{"a":3}
Time taken: 1.164 seconds, Fetched 1 row(s)

Member:
Alright, I am fine with having a conf, but it should be disabled by default.

@chenhao-db (Contributor, Author):
Sure.

@chenhao-db (Contributor, Author):
@cloud-fan @HyukjinKwon could you help review? thanks!

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in 8879df5 on Sep 2, 2024
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
Closes apache#47920 from chenhao-db/allow_duplicate_keys.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Sep 21, 2024
…_, 'variant')

### What changes were proposed in this pull request?

This PR adds duplicate key support to the `from_json(_, 'variant')` query pattern. Duplicate key support [was introduced](#47920) in `parse_json`, JSON scans, and the `from_json` expressions with nested schemas, but this code path was not updated.

### Why are the changes needed?

This change makes the behavior of `from_json(_, 'variant')` consistent with every other variant construction expression.

### Does this PR introduce _any_ user-facing change?

It potentially allows users to use the `from_json(<input>, 'variant')` expression on JSON inputs with duplicate keys, depending on a config.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48177 from harshmotw-db/harshmotw-db/master.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
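
For illustration of the `from_json(_, 'variant')` pattern described in the commit above, a minimal spark-shell sketch; the config key below is an assumption (the commit message only says the behavior depends on a config):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumed config key; not named in the commit message.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "true")

// from_json with the 'variant' schema now tolerates duplicate keys as well,
// keeping the last occurrence, consistent with parse_json under the same flag.
spark.sql("""SELECT from_json('{"a": 1, "a": 2}', 'variant') AS v""").show(truncate = false)
// expected value: {"a":2}
```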
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024