-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhancing compatibility with legacy ORC files #21391
Conversation
Is it possible to add unit testing? I can see some related tests at testHiveFileFormats |
ok, i just did't know where to test |
@jaystarshot i add unit test for it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, deferring to a orc expert for any concerns
CC: @sdruzkin |
@tdcmeehan @sdruzkin @mbasmanova It's has been two weeks since last action, what should i do now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits. Overall, after reading through the ORC spec, this change seems fine. Pinging @sdruzkin one last time as our ORC expert, otherwise I'm happy to merge once the nits are addressed.
@@ -1136,7 +1135,11 @@ public static List<HiveColumnHandle> getPhysicalHiveColumnHandles(List<HiveColum | |||
} | |||
|
|||
List<String> columnNames = getColumnNames(types); | |||
verifyFileHasColumnNames(columnNames, path); | |||
|
|||
boolean hasColumnNames = isFileHasColumnNames(columnNames, path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean hasColumnNames = isFileHasColumnNames(columnNames, path); | |
boolean hasColumnNames = fileHasColumnNames(columnNames, path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -1166,13 +1169,13 @@ private static List<String> getColumnNames(List<OrcType> types) | |||
return types.get(0).getFieldNames(); | |||
} | |||
|
|||
private static void verifyFileHasColumnNames(List<String> physicalColumnNames, Path path) | |||
private static boolean isFileHasColumnNames(List<String> physicalColumnNames, Path path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static boolean isFileHasColumnNames(List<String> physicalColumnNames, Path path) | |
private static boolean fileHasColumnNames(List<String> physicalColumnNames, Path path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path is no longer used and can be removed.
Also interesting that the method name suggests that we check the file footer to get the column names, but this method does not read the file. Does this method just checks whether the columns names in the metastore have Hive style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
boolean hasColumnNames = true; | ||
if (!physicalColumnNames.isEmpty() && physicalColumnNames.stream().allMatch(physicalColumnName -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(physicalColumnName).matches())) { | ||
throw new PrestoException( | ||
HIVE_FILE_MISSING_COLUMN_NAMES, | ||
"ORC file does not contain column names in the footer: " + path); | ||
hasColumnNames = false; | ||
} | ||
return hasColumnNames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return physicalColumnNames.isEmpty || !physicalColumnNames.stream().allMatch(physicalColumnName -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(physicalColumnName).matches())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Also, please follow our guidelines for commits. Namely:
|
@@ -1166,13 +1169,13 @@ private static List<String> getColumnNames(List<OrcType> types) | |||
return types.get(0).getFieldNames(); | |||
} | |||
|
|||
private static void verifyFileHasColumnNames(List<String> physicalColumnNames, Path path) | |||
private static boolean isFileHasColumnNames(List<String> physicalColumnNames, Path path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path is no longer used and can be removed.
Also interesting that the method name suggests that we check the file footer to get the column names, but this method does not read the file. Does this method just checks whether the columns names in the metastore have Hive style?
public void testOrcUseColumnNamesCompatibility(int rowCount) | ||
throws Exception | ||
{ | ||
// test hive.orc.use-column-names can fallback to use hive column names, if in orc file has no real column name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has no real column name -> has no real column names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
throws Exception | ||
{ | ||
// test hive.orc.use-column-names can fallback to use hive column names, if in orc file has no real column name | ||
// only have old hive style name _col1, _clo2, _clo3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_clo2, _clo3 -> _col2, _col3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
private static List<TestColumn> getHiveColumnNameColumns() | ||
{ | ||
//Creates a new list of TestColumn objects with Hive-style column names based on their indices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a space after //
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
thanks,I will fix these |
|
9358465
to
a617dbf
Compare
Suggest revising release note entry following the Order of changes in the release notes guidelines. Perhaps something like this:
|
feb6a24
to
43bc00d
Compare
i found it's very difficult to run all check passed, each time it will have different test not passed. even though i didn't change code |
@ico01 the CI environment has become unstable lately. We are working on making it more stable again. In the meantime, maintainers will ensure tests pass and will merge. |
ok |
3720bac
to
f774aeb
Compare
Description
This PR modifies the handling of the hive.orc.use-column-names configuration setting in Presto to allow for improved compatibility with older ORC data that may not contain column names in the file. The new strategy uses ORC file column names by default and falls back to Hive schema column names when ORC names are not present.
Motivation and Context
The ability for Presto to flexibly handle column names in ORC files is critical for accessing a wide range of data, including legacy datasets that may not contain embedded column names. The original hive.orc.use-column-names configuration setting only allowed access via column names and did not account for files without them. This approach helps mitigate issues with reading old ORC data and ensures that users can access their datasets consistently without manual intervention.
Impact
The changes to hive.orc.use-column-names represent an enhancement to the existing functionality rather than a breaking change. Users specifying this option will now benefit from an intelligent fallback mechanism. There should be no performance impact as the fallback only occurs when required, and there are no changes to the public API.
Test Plan
The changes have been tested using both new and old ORC files—both those with embedded column names and those without. The Presto cluster correctly accessed column data using the appropriate naming strategy in each case.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.