-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuse metadata and protocol entries while retrieving the active files #19410
Reuse metadata and protocol entries while retrieving the active files #19410
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find
...in/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileOperations.java
Outdated
Show resolved
Hide resolved
@@ -742,12 +769,12 @@ private record FileOperation(FileType fileType, String fileId, OperationType ope | |||
{ | |||
public static FileOperation create(String path, OperationType operationType) | |||
{ | |||
Pattern dataFilePattern = Pattern.compile(".*?/(?<partition>key=[^/]*/)?(?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"); | |||
Pattern dataFilePattern = Pattern.compile(".*?/(?<partition>key=[^/]*/)?((?<queryId>\\d{8}_\\d{6}_\\d{5}_\\w{5})_(?<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|.*.parquet)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .*.parquet
is pretty broad.
For example, checkpoint files would also match the pattern.
Maybe instead of this,
let's turn dataFilePattern
into an ordinary catch all pattern \Q table_directory \E / ( key=value /)? [^/]+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the pattern to .*?/(?<partition>key=[^/]*/)?[^/]+
and placed it only after matching against metadata files.
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeFileOperations.java
Outdated
Show resolved
Hide resolved
tableSnapshot, | ||
ImmutableSet.of(ADD), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
if (entryTypes.contains(ADD)) {
metadataAndProtocol = Optional.of(getCheckpointMetadataAndProtocolEntries(
session,
checkpointSchemaManager,
typeManager,
fileSystem,
stats,
checkpoint));
from io.trino.plugin.deltalake.transactionlog.TableSnapshot#getCheckpointTransactionLogEntries
it's now dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this:
Lines 116 to 123 in b7d7ba8
snapshot.getCheckpointTransactionLogEntries( | |
session, | |
ImmutableSet.of(PROTOCOL, TRANSACTION, ADD, REMOVE, COMMIT), | |
checkpointSchemaManager, | |
typeManager, | |
fileSystem, | |
fileFormatDataSourceStats) | |
.forEach(checkpointBuilder::addLogEntry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the linked code place already reads metadata entry.
let it read protocol as well and pass both
...-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java
Outdated
Show resolved
Hide resolved
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
Typically metadata files should be accessed once, so prefer `.add(x)` over `.addCopies(x, 1)`, so that `.addCopies` stand out as potentially something to address.
Once the metadata & protocol entries are found, the scanning of multi-part checkpoint files can be stopped.
5736604
to
7c56c69
Compare
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
The `metadata` & `protocol` entries are already read (and saved) once when retrieving the table handle. Reuse this information while retrieving the active files for the table.
7c56c69
to
6e2c8fe
Compare
@findepi pls test this PR with secrets. |
0061960
to
f308b77
Compare
...ino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/TestTableSnapshot.java
Outdated
Show resolved
Hide resolved
/test-with-secrets sha=f308b77d8670341f48b98c4503b86baea449575c |
f308b77
to
2151f45
Compare
🚀 |
Description
This is an incremental improvement in the direction set by #18916
While retrieving the
metadata
&protocol
entries from a multi-part checkpoint file, stop scanning the checkpoint files as soon as the metadata & protocol entries are actually found.Reuse metadata and protocol entries while retrieving the active files
The
metadata
&protocol
entries are already read (and saved) oncewhen retrieving the table handle.
Reuse this information while retrieving the active files for the table.
Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: