[SPARK-49451] Allow duplicate keys in parse_json. #47920
Conversation
@cloud-fan could you help review? thanks!
@@ -89,6 +89,12 @@ class VariantExpressionEvalUtilsSuite extends SparkFunSuite {
/* offset list */ 0, 2, 4, 6,
/* field data */ primitiveHeader(INT1), 1, primitiveHeader(INT1), 2, shortStrHeader(1), '3'),
Array(VERSION, 3, 0, 1, 2, 3, 'a', 'b', 'c'))
check("""{"a": 1, "b": 2, "c": "3", "a": 4}""", Array(objectHeader(false, 1, 1),
Standard JSON can't have duplicate keys, no?
RFC 8259 explicitly says:

> The names within an object SHOULD be unique.
I agree that a JSON object is invalid if it contains duplicate keys. However, our implementation is not required to throw an error for this invalid input. As stated in the RFC:

> Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

It seems fair to follow the "many implementations".
As a side note, `from_json` also takes the last-win policy rather than throwing an error. It is not even configurable (you cannot make it throw an error).
spark-sql (default)> select from_json('{"a": 1, "a": 2, "a": 3}', 'a int');
{"a":3}
Time taken: 1.164 seconds, Fetched 1 row(s)
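For comparison, many general-purpose JSON parsers follow the same last-win policy. A minimal sketch with Python's built-in `json` module (used here only to illustrate the policy, not Spark's implementation):

```python
import json

# Python's json module keeps the last value for a duplicate key,
# matching the "many implementations" behavior the RFC describes.
doc = '{"a": 1, "a": 2, "a": 3}'
print(json.loads(doc))  # {'a': 3}
```

This is the same result the `from_json` example above produces: the last `"a"` wins.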
Alright, I am fine with having a conf, but it should be disabled by default.
Sure.
@cloud-fan @HyukjinKwon could you help review? thanks!
thanks, merging to master!
### What changes were proposed in this pull request?

Before the change, `parse_json` throws an error if there are duplicate keys in an input JSON object. After the change, `parse_json` keeps the last field with the same key. It doesn't affect other variant building expressions (creating a variant from struct/map/variant) because it is legal for them to contain duplicate keys. The change is guarded by a flag and disabled by default.

### Why are the changes needed?

To make data migration simpler. Users won't need to change their data if it contains duplicate keys. The behavior is inspired by https://docs.aws.amazon.com/redshift/latest/dg/super-configurations.html#parsing-options-super (reject duplicate keys or keep the last occurrence).

### Does this PR introduce _any_ user-facing change?

Yes, as described in the first section.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47920 from chenhao-db/allow_duplicate_keys.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
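The reject-or-keep-last choice described above can be sketched with Python's `json` pair hook (an illustrative sketch only; `parse_pairs` and its `allow_duplicate_keys` flag are hypothetical names, not Spark's actual config or API):

```python
import json

def parse_pairs(pairs, allow_duplicate_keys=False):
    """Illustrates the two policies: reject duplicate keys with an
    error, or keep the last occurrence of each key (last wins)."""
    result = {}
    for key, value in pairs:
        if key in result and not allow_duplicate_keys:
            raise ValueError(f"Duplicate key in JSON object: {key!r}")
        result[key] = value  # last occurrence wins
    return result

doc = '{"a": 1, "b": 2, "c": "3", "a": 4}'
# With the flag enabled, the last "a" wins; disabled, parsing fails.
print(json.loads(doc, object_pairs_hook=lambda p: parse_pairs(p, True)))
# → {'a': 4, 'b': 2, 'c': '3'}
```

The default (flag off) mirrors the pre-existing `parse_json` behavior of rejecting the input, which is why the PR keeps the new behavior behind a flag that is disabled by default.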
…_, 'variant')

### What changes were proposed in this pull request?

This PR adds duplicate key support to the `from_json(_, 'variant')` query pattern. Duplicate key support [has been introduced](#47920) in `parse_json`, JSON scans, and the `from_json` expressions with nested schemas, but this code path was not updated.

### Why are the changes needed?

This change makes the behavior of `from_json(_, 'variant')` consistent with every other variant construction expression.

### Does this PR introduce _any_ user-facing change?

It potentially allows users to use the `from_json(<input>, 'variant')` expression on JSON inputs with duplicate keys, depending on a config.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48177 from harshmotw-db/harshmotw-db/master.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>