-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazily load hive partition information #10215
Lazily load hive partition information #10215
Conversation
bd737f6
to
b044ed3
Compare
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java
Outdated
Show resolved
Hide resolved
426ffc3
to
d7d383b
Compare
d7d383b
to
5e0f123
Compare
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Show resolved
Hide resolved
5e0f123
to
79c8012
Compare
@sopel39 AC |
79c8012
to
34c436f
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java
Outdated
Show resolved
Hide resolved
} | ||
else { | ||
List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate); | ||
if (hiveTableHandle.getPartitionNames().isPresent()) { | ||
partitionNames = hiveTableHandle.getPartitionNames().get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd that we don't need to filter partition names while we had to do .filter(partition -> partitionMatches(partitionColumns, effectivePredicate, predicate, partition))
above.
Comment would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partitionMatches
effectively converts the partition names from String
to HivePartitionInformation
and then tries to match with the filter predicate.
Why there can be partition names and not getPartitions?
The contract is that a table can either load a partition details (if it is less than a threshold) or could maintain it as rawString
if it crosses a threshold. We don't have an intermediate state here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.
So in this branch:
partitionNames = hiveTableHandle.getPartitionNames().get();
we don't do any filtering. Does it mean we return excessive partitionNames
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't do any filtering.
We do kind of partitial filterting, like if we have a TupleDomain (with Domain on Partition columns) then the filtering would be applied at a metastore layer but we won't perform partitionMatches
(and materializing it into HivePartition
)
Does it mean we return excessive partitionNames?
Since we dont invoke partitionMatches
there is a chance that we could return excessive partition names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we dont invoke partitionMatches there is a chance that we could return excessive partition names.
That doesn't seem correct, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we don't have specify the enforcedTupleDomain
as TupleDomain
so that the filter expression is not lost. It will be applied during the next applyFilter
optimizer. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we don't have specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT ?
Depends on contract. If we say getPartitions
returns partitions that match constrant
, then it should be the case
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
@sopel39 Added comments. |
@@ -345,7 +359,20 @@ public String getTableName() | |||
return dataColumns; | |||
} | |||
|
|||
// do not serialize partitions as they are not needed on workers | |||
/** | |||
* Represents raw partition information as String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it filtered by table predicate?
Can partitionNames
be loaded independently of partitions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it filtered by table predicate?
Yes but partially.
Can partitionNames be loaded independently of partitions?
If partitions is loaded then partitionNames is reset to Optional.empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add that as a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it filtered by table predicate?
Yes but partially.
that must be documented
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
} | ||
else { | ||
List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate); | ||
if (hiveTableHandle.getPartitionNames().isPresent()) { | ||
partitionNames = hiveTableHandle.getPartitionNames().get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions
?
@@ -345,7 +359,20 @@ public String getTableName() | |||
return dataColumns; | |||
} | |||
|
|||
// do not serialize partitions as they are not needed on workers | |||
/** | |||
* Represents raw partition information as String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it filtered by table predicate?
Yes but partially.
that must be documented
// TODO this shouldn't fail | ||
assertThatThrownBy(() -> query("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3")) // may be translated to Domain.all | ||
.hasMessage(format("Query over table 'tpch.%s' can potentially read more than 1000 partitions", tableName)); | ||
assertQuery("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3", "SELECT 'bar', 3, 3"); // may be translated to Domain.all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertThat(query
as in the next assertion
} | ||
|
||
@Test | ||
public void testFilterNotDerivedFromTablePropertiesForTooManyPartitions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add this test in the prep commit, like you did with the other test.
i understand the query fails before changes?
@@ -130,7 +143,14 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor | |||
// All partition key domains will be fully evaluated, so we don't need to include those | |||
TupleDomain<ColumnHandle> remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column)); | |||
TupleDomain<ColumnHandle> enforcedTupleDomain = effectivePredicate.filter((column, domain) -> partitionColumns.contains(column)); | |||
return new HivePartitionResult(partitionColumns, partitionsIterable, compactEffectivePredicate, remainingTupleDomain, enforcedTupleDomain, hiveBucketHandle, bucketFilter); | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a javadoc
/** | ||
* Partitions will be parsed if | ||
* 1. Number of partitionNames is less than or equal to threshold value. | ||
* 2. If additional predicate is passed as a part of Constraint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document why we're making this choice.
In any case, putting this boolean in HivePartitionResult
is wrong.
The user of HivePartitionResult
should do this logic (getPartitionsAsList
or caller of it)
Remove HivePartitionResult.canParsePartitions
fielld
partitionNames = partitions.stream() | ||
.map(HivePartition::getPartitionId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wasteful. If we knot the partitions (the objects), we don't need the names anymore.
.map(partitionValues -> toPartitionName(partitionColumnNames, partitionValues)) | ||
.collect(toImmutableList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we will calculate List<HivePartition> partitionList
, the partitionNames
list won't be needed.
no need to materialize the list.
if (hiveTable.getPartitionNames().isEmpty()) { | ||
HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint())); | ||
if (partitionResult.canParsePartitions()) { | ||
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks as potentially expensive and is also being thrown away.
Document why we believe this is not a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is being seen in the current master too. Since getTableProperties
doesn't allow us to update TableHandle
this issue is seen.
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); | ||
predicate = predicate.intersect(createPredicate(partitionColumns, partitions)); | ||
|
||
if (!partitionColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can partitionColumns
be empty here? The outer block of code looks like dealing with partitioned table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of unpartitioned table
too we might get a single entry for HivePartition
, so we need this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks
Just a general question, when is |
@alexjo2144 seems the only use of |
Thanks Piotr, I ask mainly because of the check in |
may or may not |
0ef477f
to
68a9551
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); | ||
predicate = predicate.intersect(createPredicate(partitionColumns, partitions)); | ||
|
||
if (!partitionColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of unpartitioned table
too we might get a single entry for HivePartition
, so we need this check.
if (hiveTable.getPartitionNames().isEmpty()) { | ||
HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint())); | ||
if (partitionResult.canParsePartitions()) { | ||
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is being seen in the current master too. Since getTableProperties
doesn't allow us to update TableHandle
this issue is seen.
68a9551
to
ba60f3a
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
// Compute enforced and remaining TupleDomain if the partitions are loaded. | ||
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) { | ||
// All partition key domains will be fully evaluated, so we don't need to include those | ||
remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this block conditional on partitions being loaded? Only the columns list is used here, not the values of the partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the HiveTableHandle
has only raw partition names, then there is a good chance that the Domain
for partition columns partially enforced, so we compute the enforced tuple domain post materialization of HivePartition
.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Outdated
Show resolved
Hide resolved
@@ -64,6 +68,11 @@ public HivePartitionResult( | |||
return partitionColumns; | |||
} | |||
|
|||
public Optional<List<String>> getPartitionNames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bump
TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all(); | ||
|
||
// Compute enforced and remaining TupleDomain if the partitions are loaded. | ||
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition here (partitionNames list size, constraint.predicate() being present) doesn't not match the objects being used in the enclosed code (partitionColumns). I don't find the code-level documentation explanatory for this. Can you elaborate?
Also, the logic partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions
is a tricky hack, as partitionNames
being empty doesn't mean no partitions being scanned (empty list).
Split into .isEmpty
/ .isPresent
check and separate size check.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
|
||
// If the partitions are not loaded, try out if they can be loaded. | ||
if (hiveTable.getPartitions().isEmpty()) { | ||
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"post"?
did you mean
// Note that the computation is not persisted in the table handle, so can be redone many times
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
// If the partitions are not loaded, try out if they can be loaded. | ||
if (hiveTable.getPartitions().isEmpty()) { | ||
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties. | ||
// TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure this is a great idea (would rather have a new API rather than make a seemingly read-only method not be a read-only method). However, if this is something to address, we should have an issue.
Also, can this cause planning time performance degradation? @sopel39 i think it can.
Note that this new code triggers not only for queries that would fail, but also for queries where we loaded many partition names, and didn't convert them and narrow down to partition objects, out of fear that there will be too many.
Should we have a kill switch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, document why we're doing this. One can intuitively want to return "no known constraint here", and i currently wouldn't be able to explain why not.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); | ||
predicate = predicate.intersect(createPredicate(partitionColumns, partitions)); | ||
|
||
if (!partitionColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment
project( | ||
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col"))))))))); | ||
|
||
assertThatThrownBy(() -> getQueryRunner().execute(query)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the query pass only if you disable join reordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this should be a code comment
- that means the actual user problem has not been solved, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that means the actual user problem has not been solved, right?
No. The actual problem that we are trying to solve is -
Query currently fails when we apply a criteria like partition_column like '%abc%'
- if the initial number of partitions crosses the threshold - even if only a lesser number of partitions satisfies it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the test showing the successful execution of a query benefiting from the changes here?
i see one in io.trino.plugin.hive.BaseHiveConnectorTest#testPartitionPerScanLimitWithMultiplePartitionColumns
but it's simple SELECT (good test case to test)
i also see io.trino.plugin.hive.optimizer.TestHivePlans#testQueryScanningForTooManyPartitions
covering Joins, but it fails with defaults, and requires disabling join reordering to pass.
The solution should work with defaults setting, or any other setting that are probable to be used by a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have fixed for fetching table statistics - now it passes during planning while it fails during execution - due to scanning maximum partitions.
b706a93
to
741ac2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi Thanks for the feedback, have applied the comments.
project( | ||
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col"))))))))); | ||
|
||
assertThatThrownBy(() -> getQueryRunner().execute(query)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Show resolved
Hide resolved
@@ -64,6 +68,11 @@ public HivePartitionResult( | |||
return partitionColumns; | |||
} | |||
|
|||
public Optional<List<String>> getPartitionNames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah !! In case of HiveTableHandle
it is a bit strict (like it fails if we have both partitionNames
and partitions
).. we check it when we are initializing.
new Constraint( | ||
TupleDomain.all(), | ||
partitionValues -> true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constraint.alwaysTrue
sets the predicate
as Optional#empty
but we need to pass an default predicate for the API..
Can we modify Constraint.alwaysTrue
like we do for Constraint.alwaysFalse
?
|
||
// If the partitions are not loaded, try out if they can be loaded. | ||
if (hiveTable.getPartitions().isEmpty()) { | ||
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all(); | ||
|
||
// Compute enforced and remaining TupleDomain if the partitions are loaded. | ||
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes so in that case we try to compute the remaining tuple domain.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
@@ -64,6 +68,11 @@ public HivePartitionResult( | |||
return partitionColumns; | |||
} | |||
|
|||
public Optional<List<String>> getPartitionNames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why didn't you add same check in HivePartitionResult::new
?
// Apply extra filters which could not be done by getFilteredPartitionNames | ||
.map(partitionName -> parseValuesAndFilterPartition(tableName, partitionName, partitionColumns, partitionTypes, effectivePredicate, predicate)) | ||
.filter(Optional::isPresent) | ||
.map(Optional::get) | ||
.iterator(); | ||
partitionsAreLoaded = partitionNamesList.size() <= maxPartitions || constraint.predicate().isPresent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partitions are not loaded, since it's a throw-away information.
It's conceivable that hiveTableHandle
is later consumed with a slightly different logic, making the enforcedTupleDomain
untrue.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
new Constraint( | ||
TupleDomain.all(), | ||
partitionValues -> true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we modify
Constraint.alwaysTrue
like we do forConstraint.alwaysFalse
?
that's effective today, but basing on unwritten non-API assumptions sounds like a brittle hack.
.or(() -> { | ||
// We load the partitions so as to compute the predicates enforced by the table. | ||
// Note that the computation is not persisted in the table handle, so can be redone many times | ||
// TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a prescribed solution, and one which i am not sold on yet.
File a ticket about this problem, but avoiding prescribing solutions, to avoid channelling the discussion.
Link it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
project( | ||
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col"))))))))); | ||
|
||
assertThatThrownBy(() -> getQueryRunner().execute(query)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this should be a code comment
- that means the actual user problem has not been solved, right?
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
getPartitionsAsList(getPartitions(metastore, table, new Constraint(summary)))); | ||
} | ||
|
||
public boolean canPartitionsBeLoaded(HivePartitionResult partitionResult) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some callers of getPartitions
which don't then call this method before collecting the Partitions. getTableStatistics
for example, should this be checked there?
.or(() -> { | ||
// We load the partitions to compute the predicates enforced by the table. | ||
// Note that the computation is not persisted in the table handle, so can be redone many times | ||
// TODO: https://github.com/trinodb/trino/issues/10980. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change worth doing until we have that done? Seems like loading the partition information once eagerly is better than lazily doing it multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexjo2144 not sure what's your suggestion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just asking if this change to lazily load the partition information is actually an improvement until that linked issue is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or if that issue is a blocker for this to be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We lazily evaluate so that it will be loaded after all the filters are pushed to the Hive.
@alexjo2144 AC |
bdbd541
to
47f12ba
Compare
new Constraint( | ||
TupleDomain.all(), | ||
partitionValues -> true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please switch to Constraint.alwaysTrue()
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Show resolved
Hide resolved
@@ -64,6 +68,11 @@ public HivePartitionResult( | |||
return partitionColumns; | |||
} | |||
|
|||
public Optional<List<String>> getPartitionNames() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, can we have a check there?
95cc629
to
47f12ba
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult); | ||
predicate = predicate.intersect(createPredicate(partitionColumns, partitions)); | ||
|
||
if (!partitionColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
@@ -109,6 +108,7 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor | |||
.map(HiveColumnHandle::getType) | |||
.collect(toList()); | |||
|
|||
Optional<List<String>> partitionNames = hiveTableHandle.getPartitionNames(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (hiveTableHandle.getPartitions().isPresent())
then why do we pass partitionNames
further, as if we're preserving some value?
move the assignment under if (hiveTableHandle.getPartitions().isPresent())
block, and set the value to Optional.empty()
explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can set it as Optional.empty()
and no need to move the assignment.
{ | ||
Optional<List<String>> partitionNames = partitions.getPartitionNames(); | ||
Optional<List<HivePartition>> partitionList = Optional.empty(); | ||
TupleDomain<ColumnHandle> enforcedConstraint = partitions.getEffectivePredicate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enforcedConstraint
or effectivePredicate
?
it's not enforced yet...
also includes Domains on non-partitioning columns, right? (otherwise you wouldn't do partitions.getEffectivePredicate().filter((column, domain) -> partitionColumns.contains(column))
below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think it has to be TupleDomain#all
partitions.getCompactEffectivePredicate(), | ||
partitions.getEnforcedConstraint(), | ||
enforcedConstraint, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we skip the if (canPartitionsBeLoaded(partitions) || constraint.predicate().isPresent())
block above (i.e. condition was false), then this variable contains partitions.getEffectivePredicate()
which looks like not enforced and potentially containing Domains on non-partitioning columns (so something that won't be enforced)
this is here passed to HiveTableHandle#enforcedConstraint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then this variable contains partitions.getEffectivePredicate()
Actually it would contain TupleDomain#all
- but now it will be initialized to HiveTableHandle#enforcedConstraint
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Show resolved
Hide resolved
* Test to ensure filter on build side table is derived from table properties * Test to ensure query fails if it scans too many partitions
16a81d2
to
8afe1db
Compare
@findepi AC. Since there was a conflict had to rebase it. |
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Outdated
Show resolved
Hide resolved
@findepi AC |
We defer the initial loading of HivePartitionInformation if the number of partitions crosses a limit. This allows further invocation of applyFilter which could reduce the number of partitions to be scanned.
c68acf2
to
d7a6b9c
Compare
No description provided.