Reuse metadata and protocol entries while retrieving the active files #19410

Merged

Contributor

@findinpath findinpath commented Oct 16, 2023

Description

This is an incremental improvement in the direction set by #18916

While retrieving the metadata & protocol entries from a multi-part checkpoint file, stop scanning the checkpoint files as soon as the metadata & protocol entries are actually found.
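The early exit described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Entry`, `MetadataAndProtocol`, `findMetadataAndProtocol`, and the `partsOpened` counter are hypothetical stand-ins, not the Trino classes, and each checkpoint part is modelled as an in-memory list rather than a Parquet file.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types exist in Trino under these names.
final class CheckpointScanSketch
{
    record Entry(String type, String payload) {}

    record MetadataAndProtocol(Entry metadata, Entry protocol) {}

    // Counts how many parts were opened, so the early exit is observable.
    static int partsOpened;

    static Optional<MetadataAndProtocol> findMetadataAndProtocol(List<List<Entry>> checkpointParts)
    {
        Entry metadata = null;
        Entry protocol = null;
        for (List<Entry> part : checkpointParts) {
            partsOpened++;
            for (Entry entry : part) {
                if (metadata == null && entry.type().equals("metadata")) {
                    metadata = entry;
                }
                else if (protocol == null && entry.type().equals("protocol")) {
                    protocol = entry;
                }
                if (metadata != null && protocol != null) {
                    // Both entries are known: the remaining parts are never opened.
                    return Optional.of(new MetadataAndProtocol(metadata, protocol));
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        List<List<Entry>> parts = List.of(
                List.of(new Entry("protocol", "p"), new Entry("metadata", "m")),
                List.of(new Entry("add", "a1")),
                List.of(new Entry("add", "a2")));
        boolean found = findMetadataAndProtocol(parts).isPresent();
        System.out.println("found=" + found + ", parts opened=" + partsOpened);
    }
}
```

In this toy setup only the first of the three parts is opened, because both entries happen to live there.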

Reuse metadata and protocol entries while retrieving the active files

The metadata & protocol entries are already read (and saved) once
when retrieving the table handle.
Reuse this information while retrieving the active files for the table.
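A minimal sketch of the reuse idea, assuming the table handle carries the two entries. All names here (`TableHandle`, `loadTableHandle`, `getActiveFiles`, `checkpointReads`, `SUPPORTED_READER_VERSION`) are hypothetical and only illustrate the data flow, not the connector's actual API.

```java
import java.util.List;

// Hypothetical sketch; these are not the Trino Delta Lake connector types.
final class ReuseEntriesSketch
{
    record MetadataEntry(String schema) {}

    record ProtocolEntry(int minReaderVersion) {}

    record TableHandle(MetadataEntry metadata, ProtocolEntry protocol) {}

    static final int SUPPORTED_READER_VERSION = 3; // assumed value for the example

    // Stands in for the number of times the checkpoint was read for metadata/protocol.
    static int checkpointReads;

    static TableHandle loadTableHandle()
    {
        checkpointReads++; // the single read of the metadata and protocol entries
        return new TableHandle(new MetadataEntry("id bigint"), new ProtocolEntry(1));
    }

    static List<String> getActiveFiles(TableHandle handle, List<String> addEntryPaths)
    {
        // The entries come from the handle; nothing is read from the checkpoint again.
        if (handle.protocol().minReaderVersion() > SUPPORTED_READER_VERSION) {
            throw new IllegalStateException("Unsupported reader version");
        }
        return addEntryPaths.stream()
                .map(path -> path + " [" + handle.metadata().schema() + "]")
                .toList();
    }

    public static void main(String[] args)
    {
        TableHandle handle = loadTableHandle();
        System.out.println(getActiveFiles(handle, List.of("part-0", "part-1")));
        System.out.println("checkpoint reads: " + checkpointReads);
    }
}
```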

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Delta Lake
* Avoid redundant reading of the checkpoint files. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@findinpath findinpath added delta-lake Delta Lake connector and removed cla-signed labels Oct 16, 2023
@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@findinpath findinpath changed the title Avoid useless scanning of multi-part checkpoint files Reuse metadata and protocol entries while retrieving the active files Oct 16, 2023
@findinpath findinpath self-assigned this Oct 16, 2023
Member

@findepi findepi left a comment

Good find

@@ -742,12 +769,12 @@ private record FileOperation(FileType fileType, String fileId, OperationType ope
     {
         public static FileOperation create(String path, OperationType operationType)
         {
-            Pattern dataFilePattern = Pattern.compile(".*?/(?<partition>key=[^/]*/)?(?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})");
+            Pattern dataFilePattern = Pattern.compile(".*?/(?<partition>key=[^/]*/)?((?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|.*.parquet)");
Member

The `.*.parquet` alternative is pretty broad.
For example, checkpoint files would also match the pattern.

Maybe instead of this,
let's turn `dataFilePattern` into an ordinary catch-all pattern: `\Q table_directory \E / ( key=value /)? [^/]+`
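The concern can be checked with a small experiment. The two pattern strings below are copied from the diff; the sample path is an assumption that follows the usual Delta Lake checkpoint naming (`_delta_log/<version>.checkpoint.parquet`), and the use of a full-string `matches()` is also assumed rather than taken from the PR.

```java
import java.util.regex.Pattern;

// Illustration only: compares the old and the proposed data file patterns from the diff.
final class BroadPatternSketch
{
    static final Pattern OLD_PATTERN = Pattern.compile(
            ".*?/(?<partition>key=[^/]*/)?(?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})");

    static final Pattern PROPOSED_PATTERN = Pattern.compile(
            ".*?/(?<partition>key=[^/]*/)?((?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|.*.parquet)");

    // Assumed sample path of a single-file checkpoint.
    static final String CHECKPOINT_PATH = "table/_delta_log/00000000000000000010.checkpoint.parquet";

    static boolean oldMatchesCheckpoint()
    {
        return OLD_PATTERN.matcher(CHECKPOINT_PATH).matches();
    }

    static boolean proposedMatchesCheckpoint()
    {
        return PROPOSED_PATTERN.matcher(CHECKPOINT_PATH).matches();
    }

    public static void main(String[] args)
    {
        System.out.println("old pattern matches checkpoint: " + oldMatchesCheckpoint());
        System.out.println("proposed pattern matches checkpoint: " + proposedMatchesCheckpoint());
    }
}
```

Under these assumptions the old pattern rejects the checkpoint path while the proposed one accepts it, so a checkpoint file would be classified as a data file.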

Contributor Author

Modified the pattern to `.*?/(?<partition>key=[^/]*/)?[^/]+` and placed it only after matching against metadata files.
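Why the ordering matters can be sketched as below. The catch-all pattern is the one quoted in the reply; the metadata-file patterns, the `Kind` enum, and `classify` are assumptions made for the example and are not taken from the PR.

```java
import java.util.regex.Pattern;

// Hypothetical classifier: metadata files are checked first, the catch-all comes last.
final class FileClassifierSketch
{
    enum Kind { LAST_CHECKPOINT, TRANSACTION_LOG_JSON, CHECKPOINT, DATA }

    // Assumed patterns for Delta Lake metadata files (including multi-part checkpoints).
    static final Pattern LAST_CHECKPOINT = Pattern.compile(".*/_delta_log/_last_checkpoint");
    static final Pattern TRANSACTION_LOG_JSON = Pattern.compile(".*/_delta_log/\\d+\\.json");
    static final Pattern CHECKPOINT = Pattern.compile(".*/_delta_log/\\d+\\.checkpoint(\\.\\d+\\.\\d+)?\\.parquet");

    // Catch-all pattern quoted in the review reply.
    static final Pattern DATA_FILE = Pattern.compile(".*?/(?<partition>key=[^/]*/)?[^/]+");

    static Kind classify(String path)
    {
        if (LAST_CHECKPOINT.matcher(path).matches()) {
            return Kind.LAST_CHECKPOINT;
        }
        if (TRANSACTION_LOG_JSON.matcher(path).matches()) {
            return Kind.TRANSACTION_LOG_JSON;
        }
        if (CHECKPOINT.matcher(path).matches()) {
            return Kind.CHECKPOINT;
        }
        // Only reached when no metadata pattern matched.
        if (DATA_FILE.matcher(path).matches()) {
            return Kind.DATA;
        }
        throw new IllegalArgumentException("Unrecognized path: " + path);
    }

    public static void main(String[] args)
    {
        System.out.println(classify("table/_delta_log/00000000000000000010.checkpoint.parquet"));
        System.out.println(classify("table/key=1/20231016_193900_00001_abcde_01234567-89ab-cdef-0123-456789abcdef"));
    }
}
```

The catch-all on its own also accepts checkpoint paths, which is why it has to be the last check in this sketch.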

tableSnapshot,
ImmutableSet.of(ADD),
Member

Remove

    if (entryTypes.contains(ADD)) {
        metadataAndProtocol = Optional.of(getCheckpointMetadataAndProtocolEntries(
                session,
                checkpointSchemaManager,
                typeManager,
                fileSystem,
                stats,
                checkpoint));

from `io.trino.plugin.deltalake.transactionlog.TableSnapshot#getCheckpointTransactionLogEntries`.

It's now dead code.

Contributor Author

Not sure about this:

    snapshot.getCheckpointTransactionLogEntries(
                    session,
                    ImmutableSet.of(PROTOCOL, TRANSACTION, ADD, REMOVE, COMMIT),
                    checkpointSchemaManager,
                    typeManager,
                    fileSystem,
                    fileFormatDataSourceStats)
            .forEach(checkpointBuilder::addLogEntry);

Member

The linked code already reads the metadata entry.
Let it read the protocol entry as well and pass both.

Typically metadata files should be accessed once, so prefer `.add(x)`
over `.addCopies(x, 1)`, so that `.addCopies` stands out as potentially
something to address.
Once the metadata & protocol entries are found, the scanning of
multi-part checkpoint files can be stopped.
@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 5736604 to 7c56c69 Compare October 16, 2023 19:39
@findinpath findinpath requested review from findepi and homar October 17, 2023 07:01
@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 7c56c69 to 6e2c8fe Compare October 17, 2023 07:54
@findinpath
Contributor Author

@findepi pls test this PR with secrets.

@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from 0061960 to f308b77 Compare October 17, 2023 10:49
@findepi
Member

findepi commented Oct 17, 2023

/test-with-secrets sha=f308b77d8670341f48b98c4503b86baea449575c

@findinpath findinpath force-pushed the findinpath/multipart-checkpoint-files branch from f308b77 to 2151f45 Compare October 17, 2023 11:56
@findinpath findinpath requested a review from findepi October 18, 2023 03:51
@findepi findepi merged commit f5b1e89 into trinodb:master Oct 18, 2023
@findepi
Member

findepi commented Oct 18, 2023

🚀

@github-actions github-actions bot added this to the 430 milestone Oct 18, 2023