
[SPARK-49451] Allow duplicate keys in parse_json. #47920

Closed
wants to merge 3 commits into master from chenhao-db:allow_duplicate_keys

Conversation

@chenhao-db (Contributor) commented Aug 28, 2024:

What changes were proposed in this pull request?

Before this change, `parse_json` throws an error if there are duplicate keys in an input JSON object. After this change, `parse_json` keeps the last field with a given key. It doesn't affect the other variant-building expressions (creating a variant from a struct/map/variant), because it is legal for them to contain duplicate keys.

The change is guarded by a flag and disabled by default.
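
For illustration, a minimal spark-shell sketch of the intended behavior. The config key `spark.sql.variant.allowDuplicateKeys` is an assumption for this example (the description above only says the behavior is guarded by a flag), and the exact error and output rendering may differ:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumed config key; the PR only states that the behavior is guarded by a
// flag that is disabled by default.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "true")

// With the flag enabled, the last occurrence of a duplicate key wins.
spark.sql("""SELECT parse_json('{"a": 1, "b": 2, "a": 4}') AS v""").show(truncate = false)
// expected (roughly): {"a":4,"b":2}

// With the flag disabled (the default), parse_json keeps rejecting such
// input with an error, as before this change.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "false")
```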

Why are the changes needed?

To make data migration simpler: users won't need to change their data if it contains duplicate keys. The behavior is inspired by https://docs.aws.amazon.com/redshift/latest/dg/super-configurations.html#parsing-options-super (reject duplicate keys or keep the last occurrence).

Does this PR introduce any user-facing change?

Yes, as described in the first section.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label on Aug 28, 2024
@chenhao-db force-pushed the allow_duplicate_keys branch from 403fe5a to 5e09539 on August 28, 2024 21:47
@chenhao-db (Contributor, Author):

@cloud-fan could you help review? thanks!

@@ -89,6 +89,12 @@ class VariantExpressionEvalUtilsSuite extends SparkFunSuite {
/* offset list */ 0, 2, 4, 6,
/* field data */ primitiveHeader(INT1), 1, primitiveHeader(INT1), 2, shortStrHeader(1), '3'),
Array(VERSION, 3, 0, 1, 2, 3, 'a', 'b', 'c'))
check("""{"a": 1, "b": 2, "c": "3", "a": 4}""", Array(objectHeader(false, 1, 1),
Member:
Standard JSON can't have duplicate keys, no?

Member:
RFC 8259 says explicitly:
"The names within an object SHOULD be unique."

@chenhao-db (Contributor, Author) commented Aug 29, 2024:
I agree that a JSON object is invalid if it contains duplicate keys. However, our implementation is not required to throw an error for this invalid input. As stated in the RFC:

Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

It seems fair to follow the "many implementations".

As a side note, `from_json` also follows the last-wins policy rather than throwing an error. It is not even configurable (you cannot make it throw an error).

spark-sql (default)> select from_json('{"a": 1, "a": 2, "a": 3}', 'a int');
{"a":3}
Time taken: 1.164 seconds, Fetched 1 row(s)

Member:
Alright, I am fine with having a conf, but it should be disabled by default.

@chenhao-db (Contributor, Author):
Sure.

@chenhao-db (Contributor, Author):
@cloud-fan @HyukjinKwon could you help review? thanks!

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in 8879df5 on Sep 2, 2024
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
Closes apache#47920 from chenhao-db/allow_duplicate_keys.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Sep 21, 2024
…_, 'variant')

### What changes were proposed in this pull request?

This PR adds duplicate key support to the `from_json(_, 'variant')` query pattern. Duplicate key support [was introduced](#47920) in `parse_json`, JSON scans, and the `from_json` expressions with nested schemas, but this code path was not updated.

### Why are the changes needed?

This change makes the behavior of `from_json(_, 'variant')` consistent with every other variant construction expression.

### Does this PR introduce _any_ user-facing change?

It potentially allows users to use the `from_json(<input>, 'variant')` expression on JSON inputs with duplicate keys, depending on a config.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48177 from harshmotw-db/harshmotw-db/master.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
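
For illustration of the `from_json(_, 'variant')` pattern described in the commit above, a minimal spark-shell sketch; the config key below is an assumption (the commit message only says the behavior depends on a config):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumed config key; not named in the commit message.
spark.conf.set("spark.sql.variant.allowDuplicateKeys", "true")

// from_json with the 'variant' schema now tolerates duplicate keys as well,
// keeping the last occurrence, consistent with parse_json under the same flag.
spark.sql("""SELECT from_json('{"a": 1, "a": 2}', 'variant') AS v""").show(truncate = false)
// expected value: {"a":2}
```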
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024