Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazily load hive partition information #10215

Merged

Conversation

Praveen2112
Copy link
Member

No description provided.

@Praveen2112 Praveen2112 added the WIP label Dec 7, 2021
@cla-bot cla-bot bot added the cla-signed label Dec 7, 2021
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from bd737f6 to b044ed3 Compare December 7, 2021 16:12
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 426ffc3 to d7d383b Compare December 10, 2021 06:17
@Praveen2112
Copy link
Member Author

@findepi @sopel39 AC

@Praveen2112 Praveen2112 marked this pull request as ready for review December 10, 2021 06:17
@Praveen2112 Praveen2112 changed the title Defer loading hive partition information Lazily load hive partition information Dec 10, 2021
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from d7d383b to 5e0f123 Compare December 10, 2021 06:24
@Praveen2112 Praveen2112 removed the WIP label Dec 10, 2021
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 5e0f123 to 79c8012 Compare December 10, 2021 10:59
@Praveen2112
Copy link
Member Author

@sopel39 AC

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 79c8012 to 34c436f Compare December 13, 2021 06:46
}
else {
List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate);
if (hiveTableHandle.getPartitionNames().isPresent()) {
partitionNames = hiveTableHandle.getPartitionNames().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd that we don't need to filter partition names while we had to do .filter(partition -> partitionMatches(partitionColumns, effectivePredicate, predicate, partition)) above.

Comment would be nice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.

Why there can be partition names and not getPartitions?
The contract is that a table can either load a partition details (if it is less than a threshold) or could maintain it as raw String if it crosses a threshold. We don't have an intermediate state here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.

So in this branch:

partitionNames = hiveTableHandle.getPartitionNames().get();

we don't do any filtering. Does it mean we return excessive partitionNames?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't do any filtering.

We do kind of partitial filterting, like if we have a TupleDomain (with Domain on Partition columns) then the filtering would be applied at a metastore layer but we won't perform partitionMatches (and materializing it into HivePartition)

Does it mean we return excessive partitionNames?

Since we dont invoke partitionMatches there is a chance that we could return excessive partition names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we dont invoke partitionMatches there is a chance that we could return excessive partition names.

That doesn't seem correct, does it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we don't have specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we don't have specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT ?

Depends on contract. If we say getPartitions returns partitions that match constrant, then it should be the case

@Praveen2112
Copy link
Member Author

@sopel39 Added comments.

@@ -345,7 +359,20 @@ public String getTableName()
return dataColumns;
}

// do not serialize partitions as they are not needed on workers
/**
* Represents raw partition information as String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it filtered by table predicate?
Can partitionNames be loaded independently of partitions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it filtered by table predicate?

Yes but partially.

Can partitionNames be loaded independently of partitions?

If partitions is loaded then partitionNames is reset to Optional.empty

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add that as a comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it filtered by table predicate?

Yes but partially.

that must be documented

}
else {
List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate);
if (hiveTableHandle.getPartitionNames().isPresent()) {
partitionNames = hiveTableHandle.getPartitionNames().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions?

@findepi findepi requested a review from alexjo2144 December 16, 2021 16:10
@@ -345,7 +359,20 @@ public String getTableName()
return dataColumns;
}

// do not serialize partitions as they are not needed on workers
/**
* Represents raw partition information as String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it filtered by table predicate?

Yes but partially.

that must be documented

// TODO this shouldn't fail
assertThatThrownBy(() -> query("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3")) // may be translated to Domain.all
.hasMessage(format("Query over table 'tpch.%s' can potentially read more than 1000 partitions", tableName));
assertQuery("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3", "SELECT 'bar', 3, 3"); // may be translated to Domain.all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assertThat(query as in the next assertion

}

@Test
public void testFilterNotDerivedFromTablePropertiesForTooManyPartitions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add this test in the prep commit, like you did with the other test.
i understand the query fails before changes?

@@ -130,7 +143,14 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor
// All partition key domains will be fully evaluated, so we don't need to include those
TupleDomain<ColumnHandle> remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column));
TupleDomain<ColumnHandle> enforcedTupleDomain = effectivePredicate.filter((column, domain) -> partitionColumns.contains(column));
return new HivePartitionResult(partitionColumns, partitionsIterable, compactEffectivePredicate, remainingTupleDomain, enforcedTupleDomain, hiveBucketHandle, bucketFilter);

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a javadoc

/**
* Partitions will be parsed if
* 1. Number of partitionNames is less than or equal to threshold value.
* 2. If additional predicate is passed as a part of Constraint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document why we're making this choice.

In any case, putting this boolean in HivePartitionResult is wrong.
The user of HivePartitionResult should do this logic (getPartitionsAsList or caller of it)
Remove HivePartitionResult.canParsePartitions fielld

Comment on lines 124 to 125
partitionNames = partitions.stream()
.map(HivePartition::getPartitionId)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wasteful. If we knot the partitions (the objects), we don't need the names anymore.

.map(partitionValues -> toPartitionName(partitionColumnNames, partitionValues))
.collect(toImmutableList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we will calculate List<HivePartition> partitionList, the partitionNames list won't be needed.
no need to materialize the list.

if (hiveTable.getPartitionNames().isEmpty()) {
HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint()));
if (partitionResult.canParsePartitions()) {
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks as potentially expensive and is also being thrown away.
Document why we believe this is not a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.

List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
predicate = predicate.intersect(createPredicate(partitionColumns, partitions));

if (!partitionColumns.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can partitionColumns be empty here? The outer block of code looks like dealing with partitioned table

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks

@alexjo2144
Copy link
Member

Just a general question, when is getTableProperties called for the discretePredicate relative to when applyFilter is called?

@findepi
Copy link
Member

findepi commented Dec 16, 2021

@alexjo2144 seems the only use of ConnectorTableProperties#getDiscretePredicates is in MetadataQueryOptimizer (optimizer.optimize-metadata-queries), so unrelated to applyFilter.

@alexjo2144
Copy link
Member

Thanks Piotr, I ask mainly because of the check in getTableProperties looks at partitionNames, but partitionNames is set in applyFilter. It wasn't clear to me that applyFilter happens before MetadataQueryOptimizer.

@findepi
Copy link
Member

findepi commented Dec 16, 2021

It wasn't clear to me that applyFilter happens before MetadataQueryOptimizer.

may or may not

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch 3 times, most recently from 0ef477f to 68a9551 Compare January 4, 2022 12:46
Copy link
Member Author

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
predicate = predicate.intersect(createPredicate(partitionColumns, partitions));

if (!partitionColumns.isEmpty()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.

if (hiveTable.getPartitionNames().isEmpty()) {
HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint()));
if (partitionResult.canParsePartitions()) {
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 68a9551 to ba60f3a Compare January 5, 2022 05:23
// Compute enforced and remaining TupleDomain if the partitions are loaded.
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {
// All partition key domains will be fully evaluated, so we don't need to include those
remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this block conditional on partitions being loaded? Only the columns list is used here, not the values of the partitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the HiveTableHandle has only raw partition names, then there is a good chance that the Domain for partition columns partially enforced, so we compute the enforced tuple domain post materialization of HivePartition.

@@ -64,6 +68,11 @@ public HivePartitionResult(
return partitionColumns;
}

public Optional<List<String>> getPartitionNames()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all();

// Compute enforced and remaining TupleDomain if the partitions are loaded.
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition here (partitionNames list size, constraint.predicate() being present) doesn't not match the objects being used in the enclosed code (partitionColumns). I don't find the code-level documentation explanatory for this. Can you elaborate?

Also, the logic partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions is a tricky hack, as partitionNames being empty doesn't mean no partitions being scanned (empty list).
Split into .isEmpty / .isPresent check and separate size check.


// If the partitions are not loaded, try out if they can be loaded.
if (hiveTable.getPartitions().isEmpty()) {
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"post"?
did you mean

// Note that the computation is not persisted in the table handle, so can be redone many times

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

// If the partitions are not loaded, try out if they can be loaded.
if (hiveTable.getPartitions().isEmpty()) {
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.
// TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure this is a great idea (would rather have a new API rather than make a seemingly read-only method not be a read-only method). However, if this is something to address, we should have an issue.

Also, can this cause planning time performance degradation? @sopel39 i think it can.
Note that this new code triggers not only for queries that would fail, but also for queries where we loaded many partition names, and didn't convert them and narrow down to partition objects, out of fear that there will be too many.

Should we have a kill switch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, document why we're doing this. One can intuitively want to return "no known constraint here", and i currently wouldn't be able to explain why not.

List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
predicate = predicate.intersect(createPredicate(partitionColumns, partitions));

if (!partitionColumns.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment

project(
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));

assertThatThrownBy(() -> getQueryRunner().execute(query))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the query pass only if you disable join reordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this should be a code comment
  2. that means the actual user problem has not been solved, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that means the actual user problem has not been solved, right?

No. The actual problem that we are trying to solve is -

Query currently fails when we apply a criteria like partition_column like '%abc%' - if the initial number of partitions crosses the threshold - even if only a lesser number of partitions satisfies it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the test showing the successful execution of a query benefiting from the changes here?
i see one in io.trino.plugin.hive.BaseHiveConnectorTest#testPartitionPerScanLimitWithMultiplePartitionColumns but it's simple SELECT (good test case to test)
i also see io.trino.plugin.hive.optimizer.TestHivePlans#testQueryScanningForTooManyPartitions covering Joins, but it fails with defaults, and requires disabling join reordering to pass.

The solution should work with defaults setting, or any other setting that are probable to be used by a user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have fixed for fetching table statistics - now it passes during planning while it fails during execution - due to scanning maximum partitions.

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from b706a93 to 741ac2c Compare February 7, 2022 10:57
Copy link
Member Author

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi Thanks for the feedback, have applied the comments.

project(
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));

assertThatThrownBy(() -> getQueryRunner().execute(query))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.

@@ -64,6 +68,11 @@ public HivePartitionResult(
return partitionColumns;
}

public Optional<List<String>> getPartitionNames()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah !! In case of HiveTableHandle it is a bit strict (like it fails if we have both partitionNames and partitions).. we check it when we are initializing.

Comment on lines 490 to 492
new Constraint(
TupleDomain.all(),
partitionValues -> true,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constraint.alwaysTrue sets the predicate as Optional#empty but we need to pass an default predicate for the API..

Can we modify Constraint.alwaysTrue like we do for Constraint.alwaysFalse ?


// If the partitions are not loaded, try out if they can be loaded.
if (hiveTable.getPartitions().isEmpty()) {
// Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all();

// Compute enforced and remaining TupleDomain if the partitions are loaded.
if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes so in that case we try to compute the remaining tuple domain.

@@ -64,6 +68,11 @@ public HivePartitionResult(
return partitionColumns;
}

public Optional<List<String>> getPartitionNames()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't you add same check in HivePartitionResult::new?

// Apply extra filters which could not be done by getFilteredPartitionNames
.map(partitionName -> parseValuesAndFilterPartition(tableName, partitionName, partitionColumns, partitionTypes, effectivePredicate, predicate))
.filter(Optional::isPresent)
.map(Optional::get)
.iterator();
partitionsAreLoaded = partitionNamesList.size() <= maxPartitions || constraint.predicate().isPresent();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partitions are not loaded, since it's a throw-away information.
It's conceivable that hiveTableHandle is later consumed with a slightly different logic, making the enforcedTupleDomain untrue.

Comment on lines 490 to 492
new Constraint(
TupleDomain.all(),
partitionValues -> true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modify Constraint.alwaysTrue like we do for Constraint.alwaysFalse ?

that's effective today, but basing on unwritten non-API assumptions sounds like a brittle hack.

.or(() -> {
// We load the partitions so as to compute the predicates enforced by the table.
// Note that the computation is not persisted in the table handle, so can be redone many times
// TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a prescribed solution, and one which i am not sold on yet.
File a ticket about this problem, but avoiding prescribing solutions, to avoid channelling the discussion.
Link it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

project(
tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));

assertThatThrownBy(() -> getQueryRunner().execute(query))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this should be a code comment
  2. that means the actual user problem has not been solved, right?

@Praveen2112 Praveen2112 requested a review from findepi February 8, 2022 12:55
getPartitionsAsList(getPartitions(metastore, table, new Constraint(summary))));
}

public boolean canPartitionsBeLoaded(HivePartitionResult partitionResult)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some callers of getPartitions which don't then call this method before collecting the Partitions. getTableStatistics for example, should this be checked there?

.or(() -> {
// We load the partitions to compute the predicates enforced by the table.
// Note that the computation is not persisted in the table handle, so can be redone many times
// TODO: https://github.com/trinodb/trino/issues/10980.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change worth doing until we have that done? Seems like loading the partition information once eagerly is better than lazily doing it multiple times

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexjo2144 not sure what's your suggestion here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just asking if this change to lazily load the partition information is actually an improvement until that linked issue is completed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if that issue is a blocker for this to be merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We lazily evaluate so that it will be loaded after all the filters are pushed to the Hive.

@Praveen2112
Copy link
Member Author

@alexjo2144 AC

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from bdbd541 to 47f12ba Compare February 11, 2022 08:21
Comment on lines 490 to 492
new Constraint(
TupleDomain.all(),
partitionValues -> true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please switch to Constraint.alwaysTrue()

@@ -64,6 +68,11 @@ public HivePartitionResult(
return partitionColumns;
}

public Optional<List<String>> getPartitionNames()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, can we have a check there?

@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 95cc629 to 47f12ba Compare February 15, 2022 13:14
List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
predicate = predicate.intersect(createPredicate(partitionColumns, partitions));

if (!partitionColumns.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks

@@ -109,6 +108,7 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor
.map(HiveColumnHandle::getType)
.collect(toList());

Optional<List<String>> partitionNames = hiveTableHandle.getPartitionNames();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (hiveTableHandle.getPartitions().isPresent()) then why do we pass partitionNames further, as if we're preserving some value?

move the assignment under if (hiveTableHandle.getPartitions().isPresent()) block, and set the value to Optional.empty() explicitly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can set it as Optional.empty() and no need to move the assignment.

{
Optional<List<String>> partitionNames = partitions.getPartitionNames();
Optional<List<HivePartition>> partitionList = Optional.empty();
TupleDomain<ColumnHandle> enforcedConstraint = partitions.getEffectivePredicate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforcedConstraint or effectivePredicate ?

it's not enforced yet...
also includes Domains on non-partitioning columns, right? (otherwise you wouldn't do partitions.getEffectivePredicate().filter((column, domain) -> partitionColumns.contains(column)) below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think it has to be TupleDomain#all

partitions.getCompactEffectivePredicate(),
partitions.getEnforcedConstraint(),
enforcedConstraint,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we skip the if (canPartitionsBeLoaded(partitions) || constraint.predicate().isPresent()) block above (i.e. condition was false), then this variable contains partitions.getEffectivePredicate() which looks like not enforced and potentially containing Domains on non-partitioning columns (so something that won't be enforced)
this is here passed to HiveTableHandle#enforcedConstraint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then this variable contains partitions.getEffectivePredicate()

Actually it would contain TupleDomain#all - but now it will be initialized to HiveTableHandle#enforcedConstraint

* Test to ensure filter on build side table is derived from table properties
* Test to ensure query fails if it scans too many partitions
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from 16a81d2 to 8afe1db Compare February 18, 2022 13:01
@Praveen2112
Copy link
Member Author

@findepi AC. Since there was a conflict had to rebase it.

@Praveen2112
Copy link
Member Author

@findepi AC

@Praveen2112 Praveen2112 requested a review from findepi February 28, 2022 06:07
We defer the initial loading of HivePartitionInformation if the number of
partitions crosses a limit. This allows further invocation of
applyFilter which could reduce the number of partitions to be scanned.
@Praveen2112 Praveen2112 force-pushed the praveen/hive/defer_partition branch from c68acf2 to d7a6b9c Compare March 2, 2022 06:43
@Praveen2112 Praveen2112 merged commit fe230f6 into trinodb:master Mar 3, 2022
@github-actions github-actions bot added this to the 373 milestone Mar 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants