Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce FileSystem calls in LinePageSourceFactory #18959

Merged

Conversation

pettyjamesm
Copy link
Member

Description

Avoids calling exists() and length() on TrinoInputFile inside of LinePageSourceFactory, both of which will map to some API metadata lookups like S3's HeadObject which is unnecessary since:

  1. Opening the stream will throw FileNotFoundException anyway if the file does not exist and issuing a HeadObject eagerly doesn't guarantee the object will still exist for a subsequent GetObject call
  2. Adjusting the split length based on the exact file size doesn't prevent data corruption, since the file may change size between the two calls anyway and for line based text inputs, you can simply open the stream and read until an EOF is encountered without a separate call to lookup the size.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve performance of Trino native text format readers

Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a similar change for the other formats

@pettyjamesm pettyjamesm force-pushed the reduce-linepagesource-metadata-calls branch from 1a7ef1f to 23206e3 Compare September 12, 2023 19:38
@pettyjamesm pettyjamesm marked this pull request as ready for review September 13, 2023 13:48
@findepi
Copy link
Member

findepi commented Sep 13, 2023

/test-with-secrets sha=cd6fc9fe6f4669af151c5fe0a88f2ac8480b392d

@github-actions
Copy link

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/6174060858

@pettyjamesm
Copy link
Member Author

@electrum - one more commit was added to fix the related S3InputStream behavior which will now have to deal with length == null and would previously do something very similar by consuming the stream when calling skipNBytes. Tests with secrets run passed.

Avoids calling exists() and length() on TrinoInputFile inside of
LinePageSourceFactory as part of LinePageSource creation. Both of which
will map to some API metadata lookups like S3 HeadObject which is
unnecessary since:

1. GetObject will throw FileNotFoundException anyway if the file does
   not exist and issuing a HeadObject eagerly doesn't guarantee the
   object will still exist for a subsequent GetObject call
2. Adjusting the split length based on the exact file size doesn't
   prevent data corruption, it's only meant to handle the scenario where
   encrypted objects may include the last block padding bytes into the
   file size. This can be detected by reading until an EOF is
   encountered and will similarly not be prevented by checking the
   length eagerly since the size could still change between the two
   calls.
@pettyjamesm pettyjamesm force-pushed the reduce-linepagesource-metadata-calls branch from cd6fc9f to b9fdcd9 Compare September 19, 2023 12:54
@pettyjamesm pettyjamesm merged commit 1668ef3 into trinodb:master Sep 19, 2023
@pettyjamesm pettyjamesm deleted the reduce-linepagesource-metadata-calls branch September 19, 2023 17:29
@github-actions github-actions bot added this to the 427 milestone Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

3 participants