-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve case for RowType's field name and JSON content when CAST
#21869
Preserve case for RowType's field name and JSON content when CAST
#21869
Conversation
eff756a
to
0ac2583
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd It is a big hard to understand what the change is without an example. Would you update documentation to clarify the behavior with and without the new config property?
https://prestodb.io/docs/0.285/functions/json.html
CC: @aditi-pandit @amitkdutta @spershin Folks, we need to check which behavior is implemented in Velox and whether we need to provide both options.
0ac2583
to
2df2673
Compare
Codenotify: Notifying subscribers in CODENOTIFY files for diff 6c2bf82...8b13091.
|
Hi @mbasmanova, I have added some notes and examples to presto-docs/src/main/sphinx/functions/json.rst, trying to describe the behavior with and without the new config property. Please take a look, thanks. |
@@ -71,6 +71,28 @@ Cast from JSON | |||
SELECT CAST(JSON '{"k1": [1, 23], "k2": 456}' AS MAP(VARCHAR, JSON)); -- {k1 = JSON '[1,23]', k2 = JSON '456'} | |||
SELECT CAST(JSON '[null]' AS ARRAY(JSON)); -- [JSON 'null'] | |||
|
|||
.. note:: | |||
|
|||
When casting from ``JSON`` to ``ROW``, by default the case of field names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd Thank you for adding this doc. This is helpful.
@@ -300,6 +300,7 @@ public final class SystemSessionProperties | |||
public static final String ADD_PARTIAL_NODE_FOR_ROW_NUMBER_WITH_LIMIT = "add_partial_node_for_row_number_with_limit"; | |||
public static final String REWRITE_CASE_TO_MAP_ENABLED = "rewrite_case_to_map_enabled"; | |||
public static final String FIELD_NAMES_IN_JSON_CAST_ENABLED = "field_names_in_json_cast_enabled"; | |||
public static final String LEGACY_JSON_CAST = "legacy_json_cast"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC: @pranjalssh @tdcmeehan @majetideepak @aditi-pandit @spershin
We may need to add logic to check this session property and fail fast if native workers do not support it. Otherwise, we'll get wrong results when processing queries with Prestissimo.
I wonder if it is better to not allow changing this on a per-query basis and rather have this as a cluster-wide config. In general, session properties are not supposed to affect the results of the query. |
What is the use case for this feature? |
You mean that it's better to let it be a config property which could only be set in config.properties? |
I am in agreement that this should probably be just a config property. I also want to see if we have alignment that this is a bug fix, and should default the value instead to false, and add a release note indicating the new behavior. In general, I think if we keep it enabled, practically speaking no one will disable it, which means the bug doesn't get fixed. If someone has batch pipelines and would prefer not to regress them, they can opt to simply enabling the config. Thoughts? |
Hi @spershin. In some cases, user want to preserve the case for their row type's filed name, for example provided by @yhwang in issue #20701, when we execute the following statement:
We hope to get the following result:
rather than:
And when it comes to casting from
We want to get the following result rather than an exception:
So maybe it's better to give users who really intend to preserve the case a choice to achieve this. |
677c9e1
to
5bba02d
Compare
I agree that it is nice when default values include latest bug fixes and not legacy behavior. At the same time, it is important to avoid regressions to existing users as well.
I would argue that noone wants to introduce regressions and enabling the config is not as easy as it sounds. Folks need to have a super easy way to learn that they need to modify configs when upgrading to a new release. Wondering if we already have a section at the top of release notes that lists all the breaking changes and whether it is possible to easily access a combined list of breaking changes between any two releases. Perhaps, we could introduce a single property that says "Do not break me". When this property is set, the defaults for breaking configs are set to match legacy behavior. This would allow existing users to set a single property once and not worry about new release breaking them. At the same time new users will not have this config set and will be getting latest and greatest.
This is something I'm not sure about and would like to get more clarity. In JSON, keys are case sensitive, right? But in SQL, column names are not? Or is it that column names are not case sensitive, but struct fields are? Or is it that both column names and struct fields are case insensitive unless they are in double quotes? How do we ensure that field names in RowType preserve their case throughout the query execution? Do we rely on the planner to sort things out and refer to struct fields by index in the query plan and not by name? Do we need any support from execution (Prestissimo) for this functionality? |
5bba02d
to
182b9ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the documentation! I made a few suggestions, some nits, for improving the readability. If you have questions about my suggestions, let me know and we'll figure it out together. Thanks!
de534e7
to
d264560
Compare
@steveburnett Thanks for your suggestions. They have been fixed, please take a look when convenient. |
Hi @tdcmeehan @mbasmanova, following your guidance, property |
presto-main/src/test/java/com/facebook/presto/type/TestRowOperators.java
Show resolved
Hide resolved
2c3fbd3
to
535f4bc
Compare
Hi @rschlussel @tdcmeehan, I have dropped the deprecated test methods, and added a new exception |
3453667
to
a7a560a
Compare
presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question about the default config value, and a small comment, but generally the change looks good. Also, I feel this needs a highlight in the release notes and not just a regular release note, as it is a behavior change that could cause query failures.
presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
@rschlussel thanks for your review! I'm OK about the default config value, as it's involving the users' current behavior. @tdcmeehan What is your opinion? A highlight in the release notes sounds great! |
We can target this PR for 0.287 with it disabled by default, then change the default in advance of the cut for 0.288. |
534a263
to
3866468
Compare
@rschlussel @tdcmeehan @steveburnett I have changed the default config value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on the docs! I made some minor suggestions and nits about formatting and phrasing. Let me know what you think!
* **Type:** ``boolean`` | ||
* **Default value:** ``true`` | ||
|
||
When casting from ``JSON`` to ``ROW``, by default ignore the case of field names in RowType when casting from ``JSON`` to ``ROW`` (for legacy support), the matching would be always case-insensitive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When casting from ``JSON`` to ``ROW``, by default ignore the case of field names in RowType when casting from ``JSON`` to ``ROW`` (for legacy support), the matching would be always case-insensitive. | |
When casting from ``JSON`` to ``ROW``, ignore the case of field names in ``RowType`` when casting from ``JSON`` to ``ROW`` for legacy support so that the matching is case-insensitive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the second 'when casting from JSON
to ROW
'. It seems a little duplicate, is that ok?
3866468
to
e3505b8
Compare
@steveburnett Thanks for your detailed suggestions. They are all fixed, only one minor modification. Please take a look when available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick revision! One tiny nit I should have caught earlier, but everything else looks great.
Preserve the case of double quoted field names in RowType. When casting from ``JSON`` to ``ROW``, treat the double quoted field name as case sensitive and unquoted field name as case insensitive in field matching. Add a config property `legacy_json_cast` to control and support the legacy behavior which do not enforce the case when matching.
e3505b8
to
8b13091
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local build of docs, everything looks good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % the error handling can be cleaned up.
@@ -171,6 +173,10 @@ private static ErrorCode toErrorCode(Throwable throwable) | |||
return SLICE_TOO_LARGE.toErrorCode(); | |||
} | |||
|
|||
if (throwable instanceof InvalidTypeDefinitionException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than catch and redefine, let's just directly throw a PrestoException
with INVALID_TYPE_DEFINITION
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't throw a PrestoException from presto-common, as it's defined in presto-spi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, missed that.
Description
As discussed in #21866 and #21602, add a config property
legacy_json_cast
to control and support the legacy behavior which do not reserve the case of field name in json when casting to row type.The deferences in behavior with and without the config property could be found in presto-docs/src/main/sphinx/functions/json.rst.
Impact
Add a new config property
legacy_json_cast
whose default value istrue
to support legacy behavior. When set the propertylegacy_json_cast
tofalse
in both coordinators and workers' config file, there will be some changes in behavior of casting from a json to a row.Test Plan
CAST
#21602 (comment)Contributor checklist
Release Notes