Lazily load hive partition information #10215

Praveen2112 · 2021-12-07T13:53:45Z

No description provided.

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java

Praveen2112 · 2021-12-10T06:17:22Z

@findepi @sopel39 AC

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

Praveen2112 · 2021-12-10T10:59:26Z

@sopel39 AC

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

sopel39 · 2021-12-13T11:12:08Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

        }
        else {
-            List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate);
+            if (hiveTableHandle.getPartitionNames().isPresent()) {
+                partitionNames = hiveTableHandle.getPartitionNames().get();


It's odd that we don't need to filter partition names while we had to do .filter(partition -> partitionMatches(partitionColumns, effectivePredicate, predicate, partition)) above.

Comment would be nice

ping, I still don't understand why we don't have to filter here.
Why can there be partition names and not getPartitions?

partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.

Why can there be partition names and not getPartitions?
The contract is that a table handle either holds the loaded partition details (if the number of partitions is within a threshold) or keeps the partition names as raw Strings (if it crosses the threshold). There is no intermediate state.

partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.

So in this branch:

partitionNames = hiveTableHandle.getPartitionNames().get();

we don't do any filtering. Does it mean we return excessive partitionNames?

we don't do any filtering.

We do a kind of partial filtering: if we have a TupleDomain (with a Domain on partition columns), then the filtering is applied at the metastore layer, but we won't perform partitionMatches (and materialize the result into HivePartition).

Does it mean we return excessive partitionNames?

Since we don't invoke partitionMatches, there is a chance that we could return excessive partition names.

Since we don't invoke partitionMatches, there is a chance that we could return excessive partition names.

That doesn't seem correct, does it?

I think in this case we don't have to specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT?

I think in this case we don't have to specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT?

Depends on the contract. If we say getPartitions returns partitions that match the constraint, then it should be the case.
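
To make the concern in this thread concrete, here is a minimal, self-contained sketch in plain Java. It is not Trino code: PartialFilterSketch, coarseFilter and exactMatch are hypothetical names, and the assumption is that the metastore-style filter can only prune on an equality on the first partition key, while the exact check stands in for partitionMatches. Under those assumptions, the names that survive only the coarse filter can be a superset of the partitions that really match.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names exist in Trino.
class PartialFilterSketch
{
    private static final String SECOND_KEY = "part2=";

    // Coarse pruning, standing in for filtering at the metastore layer:
    // keeps names whose first key equals the requested value.
    static List<String> coarseFilter(List<String> partitionNames, String firstKeyValue)
    {
        String prefix = "part1=" + firstKeyValue + "/";
        return partitionNames.stream()
                .filter(name -> name.startsWith(prefix))
                .collect(Collectors.toList());
    }

    // Exact check, standing in for partitionMatches: parses the second key
    // out of the name and evaluates an arbitrary predicate on it.
    static List<String> exactMatch(List<String> partitionNames, Predicate<Integer> part2Predicate)
    {
        return partitionNames.stream()
                .filter(name -> {
                    int start = name.indexOf(SECOND_KEY) + SECOND_KEY.length();
                    return part2Predicate.test(Integer.parseInt(name.substring(start)));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> all = List.of(
                "part1=a/part2=1", "part1=a/part2=2", "part1=a/part2=3", "part1=b/part2=1");
        List<String> coarse = coarseFilter(all, "a");
        List<String> exact = exactMatch(coarse, value -> value % 2 == 1);
        System.out.println("after coarse filter: " + coarse); // three names
        System.out.println("after exact match:   " + exact);  // two names
    }
}
```

If only the coarse list is kept, reporting the partition-column domain as fully enforced would overstate what was filtered, which is the contract question raised above.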

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

Praveen2112 · 2021-12-14T13:19:47Z

@sopel39 Added comments.

sopel39 · 2021-12-16T12:49:21Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java

@@ -345,7 +359,20 @@ public String getTableName()
        return dataColumns;
    }

-    // do not serialize partitions as they are not needed on workers
+    /**
+     * Represents raw partition information as String


Is it filtered by table predicate?
Can partitionNames be loaded independently of partitions?

Is it filtered by table predicate?

Yes but partially.

Can partitionNames be loaded independently of partitions?

If partitions are loaded, then partitionNames is reset to Optional.empty.

Could you add that as a comment?

Is it filtered by table predicate?

Yes but partially.

that must be documented
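
The two-state contract described in this thread (either raw, partially filtered partition names or loaded partitions, never both) can be sketched as follows. This is an illustrative stand-in, not the actual HiveTableHandle: the class HandleSketch, the method withPartitions and the use of String for loaded partitions are assumptions made to keep the example self-contained.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a table handle; not the actual HiveTableHandle.
final class HandleSketch
{
    // Raw partition names, only partially filtered by the table predicate.
    private final Optional<List<String>> partitionNames;
    // Parsed and fully filtered partitions (String is a placeholder type here).
    private final Optional<List<String>> partitions;

    HandleSketch(Optional<List<String>> partitionNames, Optional<List<String>> partitions)
    {
        // The two representations are mutually exclusive.
        if (partitionNames.isPresent() && partitions.isPresent()) {
            throw new IllegalArgumentException("partitionNames and partitions cannot both be present");
        }
        this.partitionNames = partitionNames.map(List::copyOf);
        this.partitions = partitions.map(List::copyOf);
    }

    // Loading partitions resets partitionNames to Optional.empty().
    HandleSketch withPartitions(List<String> loadedPartitions)
    {
        return new HandleSketch(Optional.empty(), Optional.of(loadedPartitions));
    }

    Optional<List<String>> getPartitionNames()
    {
        return partitionNames;
    }

    Optional<List<String>> getPartitions()
    {
        return partitions;
    }

    public static void main(String[] args)
    {
        HandleSketch namesOnly = new HandleSketch(Optional.of(List.of("p=1", "p=2")), Optional.empty());
        HandleSketch loaded = namesOnly.withPartitions(List.of("p=1"));
        System.out.println(loaded.getPartitionNames().isPresent()); // false
        System.out.println(loaded.getPartitions().isPresent());     // true
    }
}
```

Putting the check in the constructor means the invariant holds however the handle is created, which is also what a comment on the field would need to state.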

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

sopel39 · 2021-12-16T12:55:43Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

        }
        else {
-            List<String> partitionNames = getFilteredPartitionNames(metastore, identity, tableName, partitionColumns, compactEffectivePredicate);
+            if (hiveTableHandle.getPartitionNames().isPresent()) {
+                partitionNames = hiveTableHandle.getPartitionNames().get();


ping, I still don't understand why we don't have to filter here.
Why can there be partition names and not getPartitions?

findepi · 2021-12-16T16:11:50Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java

@@ -345,7 +359,20 @@ public String getTableName()
        return dataColumns;
    }

-    // do not serialize partitions as they are not needed on workers
+    /**
+     * Represents raw partition information as String


Is it filtered by table predicate?

Yes but partially.

that must be documented

findepi · 2021-12-16T16:12:50Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java

-        // TODO this shouldn't fail
-        assertThatThrownBy(() -> query("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3")) // may be translated to Domain.all
-                .hasMessage(format("Query over table 'tpch.%s' can potentially read more than 1000 partitions", tableName));
+        assertQuery("SELECT * FROM " + tableName + " WHERE part1 % 400 = 3", "SELECT 'bar', 3, 3"); // may be translated to Domain.all


use assertThat(query as in the next assertion

findepi · 2021-12-16T16:14:13Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

+    }
+
+    @Test
+    public void testFilterNotDerivedFromTablePropertiesForTooManyPartitions()


You can add this test in the prep commit, like you did with the other test.
i understand the query fails before changes?

findepi · 2021-12-16T16:16:30Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

@@ -130,7 +143,14 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor
        // All partition key domains will be fully evaluated, so we don't need to include those
        TupleDomain<ColumnHandle> remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column));
        TupleDomain<ColumnHandle> enforcedTupleDomain = effectivePredicate.filter((column, domain) -> partitionColumns.contains(column));
-        return new HivePartitionResult(partitionColumns, partitionsIterable, compactEffectivePredicate, remainingTupleDomain, enforcedTupleDomain, hiveBucketHandle, bucketFilter);
+
+        /**


not a javadoc

findepi · 2021-12-16T16:18:05Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+        /**
+         * Partitions will be parsed if
+         *  1. Number of partitionNames is less than or equal to threshold value.
+         *  2. If additional predicate is passed as a part of Constraint.


Document why we're making this choice.

In any case, putting this boolean in HivePartitionResult is wrong.
The user of HivePartitionResult should do this logic (getPartitionsAsList or the caller of it).
Remove the HivePartitionResult.canParsePartitions field.

findepi · 2021-12-16T16:19:01Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+            partitionNames = partitions.stream()
+                    .map(HivePartition::getPartitionId)


This looks wasteful. If we know the partitions (the objects), we don't need the names anymore.

findepi · 2021-12-16T16:21:43Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

                .map(partitionValues -> toPartitionName(partitionColumnNames, partitionValues))
+                .collect(toImmutableList());


since we will calculate List<HivePartition> partitionList, the partitionNames list won't be needed.
no need to materialize the list.

findepi · 2021-12-16T16:23:47Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+        if (hiveTable.getPartitionNames().isEmpty()) {
+            HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint()));
+            if (partitionResult.canParsePartitions()) {
+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);


This looks potentially expensive, and the result is also being thrown away.
Document why we believe this is not a problem.

I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.

findepi · 2021-12-16T16:24:27Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
+                predicate = predicate.intersect(createPredicate(partitionColumns, partitions));
+
+                if (!partitionColumns.isEmpty()) {


can partitionColumns be empty here? The outer block of code looks like dealing with partitioned table

In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.

Add a comment

Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks

alexjo2144 · 2021-12-16T17:25:58Z

Just a general question, when is getTableProperties called for the discretePredicate relative to when applyFilter is called?

findepi · 2021-12-16T20:02:18Z

@alexjo2144 seems the only use of ConnectorTableProperties#getDiscretePredicates is in MetadataQueryOptimizer (optimizer.optimize-metadata-queries), so unrelated to applyFilter.

alexjo2144 · 2021-12-16T21:00:19Z

Thanks Piotr, I ask mainly because of the check in getTableProperties looks at partitionNames, but partitionNames is set in applyFilter. It wasn't clear to me that applyFilter happens before MetadataQueryOptimizer.

findepi · 2021-12-16T21:50:28Z

It wasn't clear to me that applyFilter happens before MetadataQueryOptimizer.

may or may not

Praveen2112

@sopel39 , @findepi AC

Praveen2112 · 2022-01-04T12:36:13Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
+                predicate = predicate.intersect(createPredicate(partitionColumns, partitions));
+
+                if (!partitionColumns.isEmpty()) {


In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.

Praveen2112 · 2022-01-04T12:43:43Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+        if (hiveTable.getPartitionNames().isEmpty()) {
+            HivePartitionResult partitionResult = partitionManager.getPartitions(metastore, new HiveIdentity(session), table, new Constraint(hiveTable.getEnforcedConstraint()));
+            if (partitionResult.canParsePartitions()) {
+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);


I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

alexjo2144 · 2022-01-05T18:42:02Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+        // Compute enforced and remaining TupleDomain if the partitions are loaded.
+        if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {
+            // All partition key domains will be fully evaluated, so we don't need to include those
+            remainingTupleDomain = effectivePredicate.filter((column, domain) -> !partitionColumns.contains(column));


Why is this block conditional on partitions being loaded? Only the columns list is used here, not the values of the partitions.

If the HiveTableHandle has only raw partition names, then there is a good chance that the Domain for partition columns is only partially enforced, so we compute the enforced tuple domain after materializing the HivePartition objects.
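
A rough sketch of that reasoning, using a plain Map in place of TupleDomain. Everything here is hypothetical (EnforcedDomainSketch and its methods are not Trino APIs), an empty map plays the role of TupleDomain.all(), and the choice to keep the whole predicate as remaining when partitions are not loaded is an assumption of this example rather than something the diff shows.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; a Map<column, domain description> stands in for TupleDomain.
class EnforcedDomainSketch
{
    // Keeps only the entries whose column is (or is not) a partition column.
    static Map<String, String> filter(Map<String, String> predicate, Set<String> partitionColumns, boolean keepPartitionColumns)
    {
        Map<String, String> result = new LinkedHashMap<>();
        predicate.forEach((column, domain) -> {
            if (partitionColumns.contains(column) == keepPartitionColumns) {
                result.put(column, domain);
            }
        });
        return result;
    }

    // An empty map plays the role of TupleDomain.all(): nothing is claimed as enforced
    // while only raw, partially filtered partition names are held.
    static Map<String, String> enforced(Map<String, String> predicate, Set<String> partitionColumns, boolean partitionsLoaded)
    {
        return partitionsLoaded ? filter(predicate, partitionColumns, true) : Map.of();
    }

    // Assumption: when partitions are not loaded, the whole predicate still has to be applied later.
    static Map<String, String> remaining(Map<String, String> predicate, Set<String> partitionColumns, boolean partitionsLoaded)
    {
        return partitionsLoaded ? filter(predicate, partitionColumns, false) : predicate;
    }

    public static void main(String[] args)
    {
        Map<String, String> predicate = new LinkedHashMap<>();
        predicate.put("part1", "= 3");
        predicate.put("str_col", "= 'bar'");
        Set<String> partitionColumns = Set.of("part1");

        System.out.println(enforced(predicate, partitionColumns, true));   // {part1== 3}
        System.out.println(remaining(predicate, partitionColumns, true));  // {str_col== 'bar'}
        System.out.println(enforced(predicate, partitionColumns, false));  // {}
    }
}
```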

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java

findepi · 2022-02-01T09:15:11Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java

@@ -64,6 +68,11 @@ public HivePartitionResult(
        return partitionColumns;
    }

+    public Optional<List<String>> getPartitionNames()


findepi · 2022-02-01T09:20:57Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+        TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all();
+
+        // Compute enforced and remaining TupleDomain if the partitions are loaded.
+        if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {


The condition here (partitionNames list size, constraint.predicate() being present) doesn't match the objects being used in the enclosed code (partitionColumns). I don't find the code-level documentation explanatory for this. Can you elaborate?

Also, the logic partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions is a tricky hack, as partitionNames being empty doesn't mean no partitions being scanned (empty list).
Split into .isEmpty / .isPresent check and separate size check.
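
For illustration, the difference between the two forms can be sketched in plain Java. The names PartitionNamesCheckSketch, conflated and explicit are hypothetical, and the value to return when the names are unknown is passed in as a parameter only because the review does not say what it should be.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the pattern under review; not Trino code.
class PartitionNamesCheckSketch
{
    // Pattern under review: an absent list is treated as an empty one, so
    // "names not known" and "zero names" both pass the size check.
    static boolean conflated(Optional<List<String>> partitionNames, int maxPartitions)
    {
        return partitionNames.orElseGet(Collections::emptyList).size() <= maxPartitions;
    }

    // Explicit variant: absence is handled on its own, separately from the size check.
    static boolean explicit(Optional<List<String>> partitionNames, int maxPartitions, boolean resultWhenUnknown)
    {
        if (partitionNames.isEmpty()) {
            return resultWhenUnknown;
        }
        return partitionNames.get().size() <= maxPartitions;
    }

    public static void main(String[] args)
    {
        // With unknown names, the conflated form silently answers "within the limit".
        System.out.println(conflated(Optional.empty(), 0));            // true
        System.out.println(explicit(Optional.empty(), 0, false));      // false
        System.out.println(explicit(Optional.of(List.of("p=1")), 1, false)); // true
    }
}
```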

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

findepi · 2022-02-01T09:29:51Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+
+            // If the partitions are not loaded, try out if they can be loaded.
+            if (hiveTable.getPartitions().isEmpty()) {
+                // Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.


"post"?
did you mean

// Note that the computation is not persisted in the table handle, so can be redone many times

?

findepi · 2022-02-01T09:31:52Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+            // If the partitions are not loaded, try out if they can be loaded.
+            if (hiveTable.getPartitions().isEmpty()) {
+                // Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.
+                // TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required.


i am not sure this is a great idea (would rather have a new API rather than make a seemingly read-only method not be a read-only method). However, if this is something to address, we should have an issue.

Also, can this cause planning time performance degradation? @sopel39 i think it can.
Note that this new code triggers not only for queries that would fail, but also for queries where we loaded many partition names, and didn't convert them and narrow down to partition objects, out of fear that there will be too many.

Should we have a kill switch?

Also, document why we're doing this. One can intuitively want to return "no known constraint here", and i currently wouldn't be able to explain why not.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

findepi · 2022-02-01T09:40:01Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
+                predicate = predicate.intersect(createPredicate(partitionColumns, partitions));
+
+                if (!partitionColumns.isEmpty()) {


Add a comment

findepi · 2022-02-01T09:44:11Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

+                                                        project(
+                                                                tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));
+
+        assertThatThrownBy(() -> getQueryRunner().execute(query))


Why does the query pass only if you disable join reordering?

When we disable join reordering, the table statistics are not computed during planning, so the query passes the planning phase but fails during execution.

this should be a code comment

that means the actual user problem has not been solved, right?

that means the actual user problem has not been solved, right?

No. The actual problem that we are trying to solve is -

A query currently fails when we apply a criterion such as partition_column like '%abc%' if the initial number of partitions crosses the threshold, even if only a smaller number of partitions satisfies it.

Where is the test showing the successful execution of a query benefiting from the changes here?
i see one in io.trino.plugin.hive.BaseHiveConnectorTest#testPartitionPerScanLimitWithMultiplePartitionColumns but it's simple SELECT (good test case to test)
i also see io.trino.plugin.hive.optimizer.TestHivePlans#testQueryScanningForTooManyPartitions covering Joins, but it fails with defaults, and requires disabling join reordering to pass.

The solution should work with default settings, or any other settings that are likely to be used by a user.

We have fixed the fetching of table statistics: the query now passes planning but fails during execution, because it scans more than the maximum number of partitions.
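
The problem statement in this thread (a query fails on the initial partition count even though a filter such as partition_column like '%abc%' would leave only a few partitions) can be sketched as below. This is a simplified model, not the connector's logic: DeferredLimitSketch, eager, deferred and checkLimit are hypothetical names, and the error text only loosely mirrors the message quoted in the tests above.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration; not Trino code.
class DeferredLimitSketch
{
    // Eager behaviour: fail as soon as the initial name count exceeds the limit.
    static List<String> eager(List<String> names, Predicate<String> filter, int maxPartitions)
    {
        checkLimit(names.size(), maxPartitions);
        return names.stream().filter(filter).collect(Collectors.toList());
    }

    // Deferred behaviour: apply the filter first, then enforce the limit on what is left.
    static List<String> deferred(List<String> names, Predicate<String> filter, int maxPartitions)
    {
        List<String> matching = names.stream().filter(filter).collect(Collectors.toList());
        checkLimit(matching.size(), maxPartitions);
        return matching;
    }

    static void checkLimit(int count, int maxPartitions)
    {
        if (count > maxPartitions) {
            throw new IllegalStateException("Query can potentially read more than " + maxPartitions + " partitions");
        }
    }

    public static void main(String[] args)
    {
        List<String> names = List.of("p=abc1", "p=xyz1", "p=xyz2", "p=abc2", "p=xyz3");
        Predicate<String> likeAbc = name -> name.contains("abc");

        System.out.println(deferred(names, likeAbc, 2)); // [p=abc1, p=abc2]
        try {
            eager(names, likeAbc, 2);
        }
        catch (IllegalStateException e) {
            System.out.println("eager check failed: " + e.getMessage());
        }
    }
}
```

The deferred variant still fails when the filtered set is too large, which matches the remaining execution-time failure discussed above.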

Praveen2112

@findepi Thanks for the feedback, have applied the comments.

Praveen2112 · 2022-02-07T07:51:52Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

+                                                        project(
+                                                                tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));
+
+        assertThatThrownBy(() -> getQueryRunner().execute(query))


When we disable join reordering, the table statistics are not computed during planning, so the query passes the planning phase but fails during execution.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java

Praveen2112 · 2022-02-07T07:59:14Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java

@@ -64,6 +68,11 @@ public HivePartitionResult(
        return partitionColumns;
    }

+    public Optional<List<String>> getPartitionNames()


Yeah! In the case of HiveTableHandle it is a bit strict (it fails if we have both partitionNames and partitions); we check this during initialization.

Praveen2112 · 2022-02-07T08:06:34Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                    new Constraint(
+                            TupleDomain.all(),
+                            partitionValues -> true,


Constraint.alwaysTrue sets the predicate as Optional#empty, but we need to pass a default predicate for the API.

Can we modify Constraint.alwaysTrue like we do for Constraint.alwaysFalse ?

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

Praveen2112 · 2022-02-07T09:09:03Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+
+            // If the partitions are not loaded, try out if they can be loaded.
+            if (hiveTable.getPartitions().isEmpty()) {
+                // Since getTableProperties doesn't allow us to update ConnectorTableHandle, we might throw away this post creating TableProperties.


Praveen2112 · 2022-02-07T09:18:29Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+        TupleDomain<ColumnHandle> enforcedTupleDomain = TupleDomain.all();
+
+        // Compute enforced and remaining TupleDomain if the partitions are loaded.
+        if (partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions || constraint.predicate().isPresent()) {


Yes so in that case we try to compute the remaining tuple domain.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

findepi · 2022-02-07T16:01:58Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java

@@ -64,6 +68,11 @@ public HivePartitionResult(
        return partitionColumns;
    }

+    public Optional<List<String>> getPartitionNames()


Why didn't you add same check in HivePartitionResult::new?

findepi · 2022-02-07T16:04:35Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

                    // Apply extra filters which could not be done by getFilteredPartitionNames
                    .map(partitionName -> parseValuesAndFilterPartition(tableName, partitionName, partitionColumns, partitionTypes, effectivePredicate, predicate))
                    .filter(Optional::isPresent)
                    .map(Optional::get)
                    .iterator();
+            partitionsAreLoaded = partitionNamesList.size() <= maxPartitions || constraint.predicate().isPresent();


Partitions are not loaded, since it's a throw-away information.
It's conceivable that hiveTableHandle is later consumed with a slightly different logic, making the enforcedTupleDomain untrue.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

findepi · 2022-02-07T16:07:42Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                    new Constraint(
+                            TupleDomain.all(),
+                            partitionValues -> true,


Can we modify Constraint.alwaysTrue like we do for Constraint.alwaysFalse ?

that's effective today, but relying on unwritten non-API assumptions sounds like a brittle hack.

findepi · 2022-02-07T16:09:27Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                    .or(() -> {
+                        // We load the partitions so as to compute the predicates enforced by the table.
+                        // Note that the computation is not persisted in the table handle, so can be redone many times
+                        // TODO: Allow ConnectorMetadata#getTableProperties to update ConnectorTableHandle if required.


That's a prescribed solution, and one which i am not sold on yet.
File a ticket about this problem, but avoiding prescribing solutions, to avoid channelling the discussion.
Link it here.

findepi · 2022-02-07T16:11:03Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

+                                                        project(
+                                                                tableScan("table_unpartitioned", Map.of("R_STR_COL", "str_col", "R_INT_COL", "int_col")))))))));
+
+        assertThatThrownBy(() -> getQueryRunner().execute(query))


this should be a code comment

that means the actual user problem has not been solved, right?

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

alexjo2144 · 2022-02-08T20:17:31Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

+                getPartitionsAsList(getPartitions(metastore, table, new Constraint(summary))));
+    }
+
+    public boolean canPartitionsBeLoaded(HivePartitionResult partitionResult)


There are some callers of getPartitions which don't then call this method before collecting the Partitions. getTableStatistics for example, should this be checked there?

alexjo2144 · 2022-02-08T20:20:27Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                    .or(() -> {
+                        // We load the partitions to compute the predicates enforced by the table.
+                        // Note that the computation is not persisted in the table handle, so can be redone many times
+                        // TODO: https://github.com/trinodb/trino/issues/10980.


Is this change worth doing until we have that done? Seems like loading the partition information once eagerly is better than lazily doing it multiple times

@alexjo2144 not sure what's your suggestion here?

I'm just asking if this change to lazily load the partition information is actually an improvement until that linked issue is completed.

Or if that issue is a blocker for this to be merged.

We evaluate lazily so that the partitions are loaded only after all the filters have been pushed down to Hive.

Praveen2112 · 2022-02-10T14:22:03Z

@alexjo2144 AC

findepi · 2022-02-14T11:31:24Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                    new Constraint(
+                            TupleDomain.all(),
+                            partitionValues -> true,


please switch to Constraint.alwaysTrue()

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

findepi · 2022-02-14T11:37:09Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java

@@ -64,6 +68,11 @@ public HivePartitionResult(
        return partitionColumns;
    }

+    public Optional<List<String>> getPartitionNames()


Still, can we have a check there?

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

findepi · 2022-02-17T08:34:17Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

+                List<HivePartition> partitions = partitionManager.getPartitionsAsList(partitionResult);
+                predicate = predicate.intersect(createPredicate(partitionColumns, partitions));
+
+                if (!partitionColumns.isEmpty()) {


Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

findepi · 2022-02-17T08:40:56Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

@@ -109,6 +108,7 @@ public HivePartitionResult getPartitions(SemiTransactionalHiveMetastore metastor
                .map(HiveColumnHandle::getType)
                .collect(toList());

+        Optional<List<String>> partitionNames = hiveTableHandle.getPartitionNames();


if (hiveTableHandle.getPartitions().isPresent()) then why do we pass partitionNames further, as if we're preserving some value?

move the assignment under if (hiveTableHandle.getPartitions().isPresent()) block, and set the value to Optional.empty() explicitly

We can set it as Optional.empty() and no need to move the assignment.

findepi · 2022-02-17T08:41:33Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

    {
+        Optional<List<String>> partitionNames = partitions.getPartitionNames();
+        Optional<List<HivePartition>> partitionList = Optional.empty();
+        TupleDomain<ColumnHandle> enforcedConstraint = partitions.getEffectivePredicate();


enforcedConstraint or effectivePredicate ?

it's not enforced yet...
also includes Domains on non-partitioning columns, right? (otherwise you wouldn't do partitions.getEffectivePredicate().filter((column, domain) -> partitionColumns.contains(column)) below)

Yeah. I think it has to be TupleDomain#all

findepi · 2022-02-17T08:44:28Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

                partitions.getCompactEffectivePredicate(),
-                partitions.getEnforcedConstraint(),
+                enforcedConstraint,


if we skip the if (canPartitionsBeLoaded(partitions) || constraint.predicate().isPresent()) block above (i.e. condition was false), then this variable contains partitions.getEffectivePredicate() which looks like not enforced and potentially containing Domains on non-partitioning columns (so something that won't be enforced)
this is here passed to HiveTableHandle#enforcedConstraint

then this variable contains partitions.getEffectivePredicate()

Actually it would contain TupleDomain#all - but now it will be initialized to HiveTableHandle#enforcedConstraint

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java

Praveen2112 · 2022-02-18T13:01:29Z

@findepi AC. Since there was a conflict, I had to rebase it.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

Praveen2112 · 2022-02-28T06:07:28Z

@findepi AC

Praveen2112 added the WIP label Dec 7, 2021

cla-bot bot added the cla-signed label Dec 7, 2021

Praveen2112 force-pushed the praveen/hive/defer_partition branch from bd737f6 to b044ed3 Compare December 7, 2021 16:12

findepi reviewed Dec 8, 2021

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java

sopel39 reviewed Dec 8, 2021

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 426ffc3 to d7d383b Compare December 10, 2021 06:17

Praveen2112 marked this pull request as ready for review December 10, 2021 06:17

Praveen2112 changed the title ~~Defer loading hive partition information~~ Lazily load hive partition information Dec 10, 2021

Praveen2112 force-pushed the praveen/hive/defer_partition branch from d7d383b to 5e0f123 Compare December 10, 2021 06:24

Praveen2112 removed the WIP label Dec 10, 2021

sopel39 reviewed Dec 10, 2021

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 5e0f123 to 79c8012 Compare December 10, 2021 10:59

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 79c8012 to 34c436f Compare December 13, 2021 06:46

sopel39 reviewed Dec 13, 2021

sopel39 reviewed Dec 16, 2021

findepi requested a review from alexjo2144 December 16, 2021 16:10

findepi reviewed Dec 16, 2021

Praveen2112 force-pushed the praveen/hive/defer_partition branch 3 times, most recently from 0ef477f to 68a9551 Compare January 4, 2022 12:46

Praveen2112 commented Jan 4, 2022

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 68a9551 to ba60f3a Compare January 5, 2022 05:23

alexjo2144 reviewed Jan 5, 2022

findepi reviewed Feb 1, 2022

Praveen2112 force-pushed the praveen/hive/defer_partition branch from b706a93 to 741ac2c Compare February 7, 2022 10:57

Praveen2112 commented Feb 7, 2022

Praveen2112 requested review from findepi and alexjo2144 February 7, 2022 10:58

findepi reviewed Feb 7, 2022

findepi requested changes Feb 7, 2022

Praveen2112 requested a review from findepi February 8, 2022 12:55

alexjo2144 reviewed Feb 8, 2022

findepi mentioned this pull request Feb 9, 2022

ConnectorMetadata #getTableStatistics and #getTableProperties may do repeated throw-away computation #10980

Open

Praveen2112 force-pushed the praveen/hive/defer_partition branch from bdbd541 to 47f12ba Compare February 11, 2022 08:21

findepi reviewed Feb 14, 2022

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 95cc629 to 47f12ba Compare February 15, 2022 13:14

findepi reviewed Feb 17, 2022

findepi mentioned this pull request Feb 17, 2022

Flexible TupleDomain.intersect type #11078

Merged

More coverage to TestHivePlans

56ccfa9

* Test to ensure filter on build side table is derived from table properties
* Test to ensure query fails if it scans too many partitions

Praveen2112 force-pushed the praveen/hive/defer_partition branch from 16a81d2 to 8afe1db Compare February 18, 2022 13:01

findepi reviewed Feb 18, 2022

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java

Praveen2112 requested a review from findepi February 28, 2022 06:07

findepi approved these changes Feb 28, 2022

Praveen2112 added 2 commits March 2, 2022 12:12

Defer loading hive partition if number of partitions crosses limit

3a6e1da

We defer the initial loading of HivePartitionInformation if the number of partitions crosses a limit. This allows further invocation of applyFilter which could reduce the number of partitions to be scanned.
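
To illustrate the idea in this commit message, here is a minimal, self-contained sketch of deferred partition loading. It is not the code from this PR: `DeferredPartitionSketch`, `Handle`, `load`, `applyFilter`, and the simplified `parse` are all hypothetical names, and the name parser assumes plain `key=value/key=value` names without Hive's escaping. It only shows the contract discussed in review: a handle holds either raw partition names (count above the limit) or parsed partition values (count within the limit), and a later filter can shrink a deferred list enough for it to be loaded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names come from the Trino code base.
final class DeferredPartitionSketch {
    // Stand-in for a table handle: exactly one of the two fields is non-null.
    static final class Handle {
        final List<String> rawNames;            // deferred: unparsed partition names
        final List<List<String>> parsedValues;  // loaded: parsed partition values

        Handle(List<String> rawNames, List<List<String>> parsedValues) {
            this.rawNames = rawNames;
            this.parsedValues = parsedValues;
        }

        boolean isDeferred() {
            return rawNames != null;
        }
    }

    // Simplified parser: "ds=2021-12-01/country=US" -> [2021-12-01, US].
    // Real Hive partition names also escape special characters; ignored here.
    static List<String> parse(String partitionName) {
        List<String> values = new ArrayList<>();
        for (String part : partitionName.split("/")) {
            values.add(part.substring(part.indexOf('=') + 1));
        }
        return values;
    }

    // Parse eagerly only when the partition count is within the limit.
    static Handle load(List<String> names, int limit) {
        if (names.size() > limit) {
            return new Handle(names, null);
        }
        return new Handle(null, names.stream()
                .map(DeferredPartitionSketch::parse)
                .collect(Collectors.toList()));
    }

    // Filter on partition values; a deferred handle is re-checked against the limit afterwards.
    static Handle applyFilter(Handle handle, Predicate<List<String>> predicate, int limit) {
        if (handle.isDeferred()) {
            List<String> remaining = handle.rawNames.stream()
                    .filter(name -> predicate.test(parse(name)))
                    .collect(Collectors.toList());
            return load(remaining, limit);
        }
        return new Handle(null, handle.parsedValues.stream()
                .filter(predicate)
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ds=1/country=US", "ds=2/country=US", "ds=3/country=IN");
        Handle deferred = load(names, 2);
        Handle filtered = applyFilter(deferred, values -> values.get(1).equals("US"), 2);
        System.out.println(deferred.isDeferred() + " -> " + filtered.parsedValues);
    }
}
```

In this sketch the filter on a deferred handle is applied to names that are parsed on the fly, so no unfiltered names survive the call. Whether the actual implementation filters at the same point is the question raised in the review thread and is not settled by this example.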

Avoid static import of TupleDomain#none

d7a6b9c

Praveen2112 force-pushed the praveen/hive/defer_partition branch from c68acf2 to d7a6b9c Compare March 2, 2022 06:43

Praveen2112 merged commit fe230f6 into trinodb:master Mar 3, 2022

github-actions bot added this to the 373 milestone Mar 3, 2022

mosabua mentioned this pull request Mar 3, 2022

Add Trino 373 release notes #11290

Merged

		partitionNames = partitions.stream()
		.map(HivePartition::getPartitionId)

		.map(partitionValues -> toPartitionName(partitionColumnNames, partitionValues))
		.collect(toImmutableList());

Praveen2112 commented Dec 7, 2021

Praveen2112 commented Dec 10, 2021

Praveen2112 commented Dec 10, 2021

Praveen2112 commented Dec 14, 2021

alexjo2144 commented Dec 16, 2021

findepi commented Dec 16, 2021

alexjo2144 commented Dec 16, 2021

findepi commented Dec 16, 2021

Praveen2112 left a comment

Praveen2112 left a comment

Praveen2112 commented Feb 10, 2022
