enhancing compatibility with legacy ORC files #21391

Merged (1 commit) on Apr 16, 2024

Contributor

@ico01 ico01 commented Nov 15, 2023

Description

This PR modifies the handling of the hive.orc.use-column-names configuration setting in Presto to allow for improved compatibility with older ORC data that may not contain column names in the file. The new strategy uses ORC file column names by default and falls back to Hive schema column names when ORC names are not present.
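
The strategy described above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the code in this PR: the class and method names, the `_col<N>` placeholder pattern, and the case-insensitive name comparison are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the fallback strategy; not the actual Presto implementation.
final class OrcColumnResolutionSketch
{
    // Assumption: files written without real names carry placeholders such as _col0, _col1, ...
    private static final Pattern PLACEHOLDER_NAME = Pattern.compile("_col\\d+");

    private OrcColumnResolutionSketch() {}

    // Returns, for each table column, the ordinal of the matching file column, or -1 if there is none.
    static List<Integer> resolveFileOrdinals(List<String> tableColumns, List<String> fileColumns)
    {
        boolean fileHasRealNames = fileColumns.isEmpty()
                || !fileColumns.stream().allMatch(name -> PLACEHOLDER_NAME.matcher(name).matches());

        List<Integer> ordinals = new ArrayList<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            if (fileHasRealNames) {
                // Name-based lookup: the column order in the file may differ from the table.
                ordinals.add(indexOfIgnoreCase(fileColumns, tableColumns.get(i)));
            }
            else {
                // Fallback: match by position in the Hive schema.
                ordinals.add(i < fileColumns.size() ? i : -1);
            }
        }
        return ordinals;
    }

    private static int indexOfIgnoreCase(List<String> names, String target)
    {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).equalsIgnoreCase(target)) {
                return i;
            }
        }
        return -1;
    }
}
```

With table columns `id, name`, a file that stores `name, id` resolves by name, while a file that stores only `_col0, _col1` resolves by position.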

Motivation and Context

The ability for Presto to flexibly handle column names in ORC files is critical for accessing a wide range of data, including legacy datasets that may not contain embedded column names. The original hive.orc.use-column-names configuration setting only allowed access via column names and did not account for files without them. This approach helps mitigate issues with reading old ORC data and ensures that users can access their datasets consistently without manual intervention.

Impact

The changes to hive.orc.use-column-names represent an enhancement to the existing functionality rather than a breaking change. Users specifying this option will now benefit from an intelligent fallback mechanism. There should be no performance impact as the fallback only occurs when required, and there are no changes to the public API.

Test Plan

The changes have been tested using both new and old ORC files—both those with embedded column names and those without. The Presto cluster correctly accessed column data using the appropriate naming strategy in each case.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Changes
* Improve the `hive.orc.use-column-names` configuration setting so that reading ORC files without column names
  no longer fails and instead falls back to the Hive schema, enhancing compatibility with legacy ORC files.

@ico01 ico01 requested a review from a team as a code owner November 15, 2023 10:00
@ico01 ico01 requested a review from presto-oss November 15, 2023 10:00
@ico01 ico01 changed the title from "add compation for old data" to "enhancing compatibility with legacy ORC files" Nov 15, 2023
@jaystarshot
Member

Is it possible to add unit testing? I can see some related tests at testHiveFileFormats

@ico01
Contributor Author

ico01 commented Nov 16, 2023

> Is it possible to add unit testing? I can see some related tests at testHiveFileFormats

OK, I just didn't know where to add the tests.

@ico01
Contributor Author

ico01 commented Nov 16, 2023

@jaystarshot I added a unit test for it.

jaystarshot previously approved these changes Nov 16, 2023
Member

@jaystarshot jaystarshot left a comment

LGTM, deferring to an ORC expert for any concerns.

@tdcmeehan
Contributor

CC: @sdruzkin

@ico01
Contributor Author

ico01 commented Nov 30, 2023

@tdcmeehan @sdruzkin @mbasmanova It has been two weeks since the last action. What should I do now?

Contributor

@tdcmeehan tdcmeehan left a comment

Some nits. Overall, after reading through the ORC spec, this change seems fine. Pinging @sdruzkin one last time as our ORC expert, otherwise I'm happy to merge once the nits are addressed.

@@ -1136,7 +1135,11 @@ public static List<HiveColumnHandle> getPhysicalHiveColumnHandles(List<HiveColum
}

List<String> columnNames = getColumnNames(types);
-verifyFileHasColumnNames(columnNames, path);

+boolean hasColumnNames = isFileHasColumnNames(columnNames, path);
Contributor

Suggested change
-boolean hasColumnNames = isFileHasColumnNames(columnNames, path);
+boolean hasColumnNames = fileHasColumnNames(columnNames, path);

Contributor Author

fixed

@@ -1166,13 +1169,13 @@ private static List<String> getColumnNames(List<OrcType> types)
return types.get(0).getFieldNames();
}

-private static void verifyFileHasColumnNames(List<String> physicalColumnNames, Path path)
+private static boolean isFileHasColumnNames(List<String> physicalColumnNames, Path path)
Contributor

Suggested change
-private static boolean isFileHasColumnNames(List<String> physicalColumnNames, Path path)
+private static boolean fileHasColumnNames(List<String> physicalColumnNames, Path path)

Collaborator

Path is no longer used and can be removed.

Also, it is interesting that the method name suggests that we check the file footer to get the column names, but this method does not read the file. Does this method just check whether the column names in the metastore are Hive-style?

Contributor Author

fixed

Comment on lines 1174 to 1178
+boolean hasColumnNames = true;
 if (!physicalColumnNames.isEmpty() && physicalColumnNames.stream().allMatch(physicalColumnName -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(physicalColumnName).matches())) {
-throw new PrestoException(
-HIVE_FILE_MISSING_COLUMN_NAMES,
-"ORC file does not contain column names in the footer: " + path);
+hasColumnNames = false;
 }
+return hasColumnNames;
Contributor

    return physicalColumnNames.isEmpty() || !physicalColumnNames.stream().allMatch(physicalColumnName -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(physicalColumnName).matches());

Contributor Author

fixed
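
The single-expression form suggested in this thread is the logical negation of the original `if` condition, so both forms should return the same result for every input. The sketch below puts them side by side. It is an illustration only: the class name is made up, the `_col\d+` pattern is an assumed definition of `DEFAULT_HIVE_COLUMN_NAME_PATTERN`, and the unused `Path` parameter is omitted.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone comparison of the two forms discussed in this thread.
final class FileHasColumnNamesSketch
{
    // Assumption: placeholder names look like _col0, _col1, ...
    static final Pattern DEFAULT_HIVE_COLUMN_NAME_PATTERN = Pattern.compile("_col\\d+");

    private FileHasColumnNamesSketch() {}

    // Flag-based form, mirroring the revision under review.
    static boolean flagBased(List<String> physicalColumnNames)
    {
        boolean hasColumnNames = true;
        if (!physicalColumnNames.isEmpty()
                && physicalColumnNames.stream().allMatch(name -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(name).matches())) {
            hasColumnNames = false;
        }
        return hasColumnNames;
    }

    // Single-expression form: not (A and B) is rewritten as (not A) or (not B).
    static boolean singleExpression(List<String> physicalColumnNames)
    {
        return physicalColumnNames.isEmpty()
                || !physicalColumnNames.stream().allMatch(name -> DEFAULT_HIVE_COLUMN_NAME_PATTERN.matcher(name).matches());
    }
}
```

Under these assumptions, an empty list and a list with at least one non-placeholder name both count as "has column names"; only a non-empty list made up entirely of placeholders does not.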

@tdcmeehan
Contributor

tdcmeehan commented Nov 30, 2023

Also, please follow our guidelines for commits. Namely:

  1. Please squash your commits
  2. Please make sure you read through our guidelines on commit messages (although this box was checked in the contributor checklist, the current commit message does not follow them). Perhaps rename the commit message to "Allow reading of legacy ORC files with default column names".

public void testOrcUseColumnNamesCompatibility(int rowCount)
throws Exception
{
// test hive.orc.use-column-names can fallback to use hive column names, if in orc file has no real column name
Collaborator

has no real column name -> has no real column names

Contributor Author

fixed

throws Exception
{
// test hive.orc.use-column-names can fallback to use hive column names, if in orc file has no real column name
// only have old hive style name _col1, _clo2, _clo3
Collaborator

_clo2, _clo3 -> _col2, _col3

Contributor Author

fixed


private static List<TestColumn> getHiveColumnNameColumns()
{
//Creates a new list of TestColumn objects with Hive-style column names based on their indices.
Collaborator

add a space after //

Contributor Author

fixed
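
For readers unfamiliar with Hive-style placeholder names, the idea behind the test helper in this thread (naming columns after their indices) might look like the sketch below. The class and method names are hypothetical, and starting the numbering at zero (`_col0`) is an assumption; the helper in the PR works on `TestColumn` objects rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: generates index-based placeholder names such as _col0, _col1, ...
final class HiveStyleNamesSketch
{
    private HiveStyleNamesSketch() {}

    static List<String> hiveStyleNames(int columnCount)
    {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            // Assumption: numbering is zero-based.
            names.add("_col" + i);
        }
        return names;
    }
}
```

A test can write a file under these generated names and then read it back through the table's real column names to exercise the fallback path.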

@ico01
Contributor Author

ico01 commented Dec 6, 2023

Thanks, I will fix these.

linux-foundation-easycla bot commented Mar 6, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@tdcmeehan tdcmeehan self-assigned this Mar 6, 2024
@ico01 ico01 force-pushed the fix_orc_reader_4 branch 3 times, most recently from 9358465 to a617dbf Compare March 6, 2024 15:05
@steveburnett
Contributor

Suggest revising release note entry following the Order of changes in the release notes guidelines. Perhaps something like this:

== RELEASE NOTES ==

Hive Changes
* Improve the `hive.orc.use-column-names` configuration setting so that reading ORC files without column names
  no longer fails and instead falls back to the Hive schema, enhancing compatibility with legacy ORC files.

@ico01 ico01 force-pushed the fix_orc_reader_4 branch 7 times, most recently from feb6a24 to 43bc00d Compare March 8, 2024 07:37
@ico01 ico01 force-pushed the fix_orc_reader_4 branch from 43bc00d to c30cb7c Compare March 8, 2024 12:22
@ico01 ico01 force-pushed the fix_orc_reader_4 branch from c30cb7c to 3720bac Compare March 9, 2024 06:42
@ico01
Contributor Author

ico01 commented Mar 9, 2024

I found it very difficult to get all checks to pass. Each time, a different test fails, even though I didn't change the code.

@tdcmeehan
Contributor

@ico01 the CI environment has become unstable lately. We are working on making it more stable again. In the meantime, maintainers will ensure tests pass and will merge.

@ico01
Contributor Author

ico01 commented Mar 10, 2024

ok

@tdcmeehan tdcmeehan merged commit 63aa9f7 into prestodb:master Apr 16, 2024
56 checks passed
@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks
5 participants