
[SPARK-11678] [SQL] Partition discovery should stop at the root path of the table. #9651

Closed

wants to merge 7 commits
Conversation

yhuai
Contributor

@yhuai yhuai commented Nov 12, 2015

https://issues.apache.org/jira/browse/SPARK-11678

This PR passes the root paths of the table to the partition discovery logic, so that partition discovery stops at those root paths instead of walking all the way up to the root of the file system.
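The idea can be sketched as an upward walk that collects `key=value` directory components and stops at any known root path. This is purely illustrative (the function name and logic below are my own, not Spark's actual implementation):

```scala
// Illustrative sketch only (not Spark's actual code): walk up from a leaf
// directory, collecting key=value partition components, and stop as soon
// as one of the table's root paths is reached.
def discoverPartitions(leaf: String, rootPaths: Set[String]): Seq[(String, String)] = {
  val columns = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  var current = leaf.stripSuffix("/")
  while (current.contains('/') && !rootPaths.contains(current)) {
    val name = current.substring(current.lastIndexOf('/') + 1)
    name.split("=", 2) match {
      case Array(k, v) => columns += ((k, v))
      case _           => // not a key=value component; keep walking up
    }
    current = current.substring(0, current.lastIndexOf('/'))
  }
  columns.reverse.toSeq // outermost partition directory first
}
```

With the table root as the only root path, the walk stops at the table directory rather than continuing to the file system root, so a directory like `something=true` above the table root is never mistaken for a partition column.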

@yhuai
Contributor Author

yhuai commented Nov 12, 2015

@liancheng @viirya can you guys review it?

paths.map(new Path(_)),
defaultPartitionName,
true,
Set(new Path("hdfs://host:9000/path/something=true/table")))
Contributor Author

This will fail without the change.

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45707 has finished for PR 9651 at commit e406792.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

chopped.getParent == null ||
rootPaths.contains(basePath)

if (maybeColumn.isDefined && !rootPaths.contains(basePath)) {
Member

Since we stop when rootPaths.contains(basePath) is true, and in that case columns is not modified and the content of maybeColumn does not matter, maybe we can skip parsePartitionColumn too?

Contributor

We can do it in this way:

if (rootPaths.contains(basePath)) {
  finished = true
} else {
  val maybeColumn = parsePartitionColumn(chopped.getName, defaultPartitionName, typeInference)
  maybeColumn.foreach(columns += _)
  basePath = chopped
  chopped = chopped.getParent
  finished = (maybeColumn.isEmpty && columns.nonEmpty) || chopped.getParent == null
}

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45715 has finished for PR 9651 at commit 28a1227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -294,7 +294,7 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
// If the "part = 1" filter gets pushed down, this query will throw an exception since
// "part" is not a valid column in the actual Parquet file
checkAnswer(
- sqlContext.read.parquet(path).filter("part = 1"),
+ sqlContext.read.parquet(dir.getCanonicalPath).filter("part = 1"),
Contributor

Why do we need getCanonicalPath here?

Contributor Author

path is a partition dir, and if we load that single dir, I am not sure we should attach part as a column of the table.

@liancheng
Contributor

Seems that this PR breaks another existing feature, namely explicitly specifying a subset of partitions. E.g.:

sqlContext.read.parquet("base/p1=a/p2=1", "base/p1=a/p2=2")
sqlContext.read.parquet("base/year=201?/month=10")

In the second case, the glob pattern is expanded into multiple input paths, and these expanded paths are treated as root paths, which prevents partition discovery.
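The regression can be illustrated with a small self-contained sketch (a hypothetical helper, not Spark code): if every expanded glob path is its own root path, the upward walk from it stops immediately and finds no partition columns, whereas a single base root would find them:

```scala
// Illustrative sketch only: collect partition column names found while
// walking upward from a leaf directory until a root path is reached.
def columnsAbove(leaf: String, rootPaths: Set[String]): Seq[String] = {
  var current = leaf.stripSuffix("/")
  val cols = scala.collection.mutable.ArrayBuffer.empty[String]
  while (current.contains('/') && !rootPaths.contains(current)) {
    val name = current.substring(current.lastIndexOf('/') + 1)
    if (name.contains('=')) cols += name.takeWhile(_ != '=')
    current = current.substring(0, current.lastIndexOf('/'))
  }
  cols.reverse.toSeq
}
```

When the glob `base/p1=a/p2=*` is expanded and each expanded directory is made a root path, `p1` and `p2` are never discovered; with `base` as the single root, both are.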

@yhuai
Contributor Author

yhuai commented Nov 13, 2015

Need to work on the docs and comments.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45802 has finished for PR 9651 at commit d784a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor Author

yhuai commented Nov 13, 2015

test this please

@@ -144,7 +149,7 @@ private[sql] object PartitioningUtils {
* Literal.create("hello", StringType),
* Literal.create(3.14, FloatType)))
* }}}
- * and the base path:
+ * and the path when we stop the discovery is:
* {{{
* /path/to/partition
Contributor

Should we add the hdfs://<host>:<port> part? I think basePath is required to be a canonical/qualified HDFS path; it would be better to document this explicitly.

Contributor Author

Done

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45809 has finished for PR 9651 at commit d784a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// If the user does not provide basePath, we will just use paths.
val pathSet = paths.toSet
pathSet.map(p => new Path(p))
}
Contributor

Considering that parsePartitions asserts basePaths.distinct.size == 1, users should either provide a basePath or pass only a single input path, right?

Contributor

Ah, actually the basePaths in parsePartitions is something else entirely that just happens to share the name.

@liancheng
Contributor

retest this please

@liancheng
Contributor

retest this please...

@liancheng
Contributor

The previous LDASuite test failures were because of flaky tests.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45828 has finished for PR 9651 at commit d784a52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45838 has finished for PR 9651 at commit 240bcf3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

retest this please

@liancheng
Contributor

Hopefully PR #9677 fixes the flaky MLlib tests.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45840 has finished for PR 9651 at commit 240bcf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 7b5d905 Nov 13, 2015
asfgit pushed a commit that referenced this pull request Nov 13, 2015
…f the table.

https://issues.apache.org/jira/browse/SPARK-11678

This PR passes the root paths of the table to the partition discovery logic, so that partition discovery stops at those root paths instead of walking all the way up to the root of the file system.

Author: Yin Huai <yhuai@databricks.com>

Closes #9651 from yhuai/SPARK-11678.

(cherry picked from commit 7b5d905)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng
Contributor

Thanks, merged to master and branch-1.6.

cc @marmbrus

dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015
asfgit pushed a commit that referenced this pull request May 5, 2016
…s a Path to Parquet File

#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to `/path/a=1`, we do not take `a` as a partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we do take `a` as a partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base paths are the set of paths considered the base directories of the input datasets. The partition discovery logic stops as soon as it reaches any base path.

By default, the paths of the dataset provided by users are the base paths. Below are three typical cases:
**Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path is `/path/something=true/`, and the returned DataFrame does not contain a column `something`.
**Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path is still `/path/something=true/`, and the returned DataFrame again does not contain a column `something`.
**Case 3** `sqlContext.read.parquet("/path/")`: the base path is `/path/`, and the returned DataFrame does contain the column `something`.

Users can also override the base path by setting `basePath` in the options passed to the data source. For example, with
`sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`,
the returned DataFrame contains the column `something`.
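The three cases and the `basePath` override could be sketched as follows. The `resolveBasePaths` helper and its file heuristic are hypothetical, for illustration only, not Spark's actual resolution code:

```scala
// Illustrative sketch only (not Spark's actual implementation) of how
// effective base paths could be resolved: an explicit basePath option
// wins; otherwise each user-supplied path is its own base path, and a
// path that looks like a file contributes its parent directory.
def resolveBasePaths(userPaths: Seq[String], basePathOption: Option[String]): Set[String] =
  basePathOption match {
    case Some(base) => Set(base.stripSuffix("/"))
    case None =>
      userPaths.map { p =>
        val trimmed = p.stripSuffix("/")
        val name = trimmed.substring(trimmed.lastIndexOf('/') + 1)
        // Crude file heuristic for the sketch: a dot in the last component.
        if (name.contains('.')) trimmed.substring(0, trimmed.lastIndexOf('/'))
        else trimmed
      }.toSet
  }
```

Under this sketch, Case 1 and Case 2 both resolve to `/path/something=true` as the base path (so `something` is not a column), while an explicit `basePath` of `/path/` moves the base path up so that `something` is discovered.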

The related PRs:
- #9651
- #10211

#### How was this patch tested?
Added a couple of test cases

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12828 from gatorsmile/readPartitionedTable.
asfgit pushed a commit that referenced this pull request May 5, 2016
…s a Path to Parquet File

(cherry picked from commit ef55e46)
Signed-off-by: Yin Huai <yhuai@databricks.com>