[SPARK-11678] [SQL] Partition discovery should stop at the root path of the table. #9651
Conversation
@liancheng @viirya can you guys review it?
```scala
paths.map(new Path(_)),
defaultPartitionName,
true,
Set(new Path("hdfs://host:9000/path/something=true/table")))
```
This will fail without the change.
Test build #45707 has finished for PR 9651 at commit
```scala
chopped.getParent == null ||
rootPaths.contains(basePath)
```

```scala
if (maybeColumn.isDefined && !rootPaths.contains(basePath)) {
```
As we will stop when `rootPaths.contains(basePath) == true`, and in this case `columns` will not be modified and we don't care about the content of `maybeColumn`, maybe we can skip `parsePartitionColumn` too?
We can do it in this way:

```scala
if (rootPaths.contains(basePath)) {
  finished = true
} else {
  val maybeColumn =
    parsePartitionColumn(chopped.getName, defaultPartitionName, typeInference)
  maybeColumn.foreach(columns += _)
  basePath = chopped
  chopped = chopped.getParent
  finished = (maybeColumn.isEmpty && columns.nonEmpty) || chopped.getParent == null
}
```
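The stop-at-root walk above can be simulated without Spark. Below is a hedged, Spark-free Python sketch of the same loop: `parse_partition_column` is a simplified stand-in that only recognizes `name=value` directory segments (the real Spark version also handles `defaultPartitionName` and type inference), and paths are plain strings rather than Hadoop `Path` objects.

```python
import posixpath

def parse_partition_column(dir_name):
    """Return (name, value) if dir_name looks like 'name=value', else None."""
    name, sep, value = dir_name.partition("=")
    if sep and name and value:
        return (name, value)
    return None

def discover(leaf, root_paths):
    """Walk from leaf toward the filesystem root, collecting partition
    columns, but stop as soon as one of root_paths is reached."""
    columns = []
    chopped = leaf
    finished = False
    while not finished:
        if chopped in root_paths:
            finished = True
        else:
            maybe_column = parse_partition_column(posixpath.basename(chopped))
            if maybe_column is not None:
                columns.append(maybe_column)
            parent = posixpath.dirname(chopped)
            # Stop when we leave the run of name=value dirs, or hit the root.
            finished = (maybe_column is None and bool(columns)) or parent in ("", "/")
            chopped = parent
    columns.reverse()  # collected leaf-first; report root-first
    return columns
```

With `root_paths = {"/path/something=true/table"}` the walk stops immediately, so `something` is never treated as a partition column, which is what the test case earlier in the conversation asserts.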
Test build #45715 has finished for PR 9651 at commit
```diff
@@ -294,7 +294,7 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // If the "part = 1" filter gets pushed down, this query will throw an exception since
       // "part" is not a valid column in the actual Parquet file
       checkAnswer(
-        sqlContext.read.parquet(path).filter("part = 1"),
+        sqlContext.read.parquet(dir.getCanonicalPath).filter("part = 1"),
```
Why do we need `getCanonicalPath` here?
`path` is a partition dir and if we load that single dir, I am not sure we should attach `part` as a column to your table.
Seems that this PR breaks another existing feature, namely explicitly specifying a subset of partitions. E.g.:

```scala
sqlContext.read.parquet("base/p1=a/p2=1", "base/p1=a/p2=2")
sqlContext.read.parquet("base/year=201?/month=10")
```

In the second case, the glob pattern is expanded into multiple input paths. And these paths are considered to be root paths, which prevents partition discovery.
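One way to picture the glob problem: after expansion, each matched directory becomes its own root path, so discovery stops before the `year`/`month` columns are ever seen. A natural alternative would be to anchor discovery at the deepest common ancestor of the expanded paths. The `common_base` helper below is purely hypothetical, for illustration; the PR series actually addressed this by letting users pass an explicit `basePath` option instead.

```python
def common_base(paths):
    """Deepest common ancestor directory of several absolute POSIX paths.
    Hypothetical helper for illustration only, not Spark code."""
    split = [p.strip("/").split("/") for p in paths]
    prefix = []
    # zip stops at the shortest path, so we never run past any input.
    for segments in zip(*split):
        if all(s == segments[0] for s in segments):
            prefix.append(segments[0])
        else:
            break
    return "/" + "/".join(prefix)
```

For the expanded glob `base/year=201?/month=10`, `common_base(["/base/year=2015/month=10", "/base/year=2016/month=10"])` yields `/base`, from which the `year` and `month` columns would be discoverable again.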
Need to work on the doc and comments.
Test build #45802 has finished for PR 9651 at commit
test this please
```diff
@@ -144,7 +149,7 @@ private[sql] object PartitioningUtils {
    *   Literal.create("hello", StringType),
    *   Literal.create(3.14, FloatType)))
    * }}}
-   * and the base path:
+   * and the path when we stop the discovery is:
    * {{{
    * /path/to/partition
```
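For context on the `Literal.create(...)` values in the doc comment above: each raw directory segment value gets parsed into a typed literal. A rough, hedged Python sketch of that inference order is below (integers before floats before strings; the real Spark code also handles longs, decimals, dates, and the `defaultPartitionName` null marker):

```python
def infer_partition_value(raw):
    """Guess a typed value for a raw partition string.
    Simplified sketch: tries int, then float, then falls back to the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw
```

So `infer_partition_value("hello")` stays a string while `infer_partition_value("3.14")` becomes a float, mirroring the `StringType`/`FloatType` literals in the example.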
Should we add the `hdfs://<host>:<port>` part? I think `basePath` is required to be a canonical/qualified HDFS path; it would be better to document this explicitly.
Done
Test build #45809 has finished for PR 9651 at commit
```scala
  // If the user does not provide basePath, we will just use paths.
  val pathSet = paths.toSet
  pathSet.map(p => new Path(p))
}
```
Considering `parsePartitions` asserts that `basePaths.distinct.size == 1`, users should either provide a `basePath` or pass only a single input path, right?
Ah, actually the `basePaths` in `parsePartitions` is something completely different that happens to have the same name.
retest this please
retest this please...
Test build #45828 has finished for PR 9651 at commit
Test build #45838 has finished for PR 9651 at commit
retest this please
Hopefully PR #9677 fixes the flaky MLlib tests.
Test build #45840 has finished for PR 9651 at commit
…f the table.

https://issues.apache.org/jira/browse/SPARK-11678

The change of this PR is to pass the root paths of the table to the partition discovery logic. So, the process of partition discovery stops at those root paths instead of going all the way to the root path of the file system.

Author: Yin Huai <yhuai@databricks.com>

Closes #9651 from yhuai/SPARK-11678.

(cherry picked from commit 7b5d905)
Signed-off-by: Cheng Lian <lian@databricks.com>
Thanks, merged to master and branch-1.6. cc @marmbrus
…s a Path to Parquet File

#### What changes were proposed in this pull request?

When we load a dataset, if we set the path to `/path/a=1`, we will not take `a` as the partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we take `a` as the partitioning column and it shows up in the schema. This PR fixes that behavior inconsistency.

The base path contains a set of paths that are considered the base dirs of the input datasets. The partition discovery logic will make sure it stops when it reaches any base path. By default, the paths of the dataset provided by users are the base paths. Below are three typical cases:

**Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path will be `/path/something=true/`, and the returned DataFrame will not contain a column `something`.

**Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path will still be `/path/something=true/`, and the returned DataFrame will also not contain a column `something`.

**Case 3** `sqlContext.read.parquet("/path/")`: the base path will be `/path/`, and the returned DataFrame will have the column `something`.

Users can also override the base path by setting `basePath` in the options passed to the data source. For example, with `sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`, the returned DataFrame will have the column `something`.

The related PRs:
- #9651
- #10211

#### How was this patch tested?

Added a couple of test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12828 from gatorsmile/readPartitionedTable.
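The three cases above (plus the `basePath` override) can be condensed into a small, Spark-free Python sketch. The `is-a-file` test here is a crude assumption for illustration, treating any last segment containing a dot as a file, whereas Spark actually consults the filesystem:

```python
import posixpath

def base_paths(input_paths, base_path_option=None):
    """Resolve the set of base paths for partition discovery.
    Assumption: a path whose last segment contains '.' is a file
    (real Spark checks the filesystem instead)."""
    if base_path_option is not None:
        return {base_path_option.rstrip("/") or "/"}
    result = set()
    for p in input_paths:
        p = p.rstrip("/") or "/"
        if "." in posixpath.basename(p):  # crude file check (assumption)
            result.add(posixpath.dirname(p))  # a file's base is its directory
        else:
            result.add(p)
    return result
```

Cases 1 and 2 both resolve to `/path/something=true`, case 3 resolves to `/path`, and passing `base_path_option="/path/"` overrides case 1's default so `something` becomes a partition column again.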
https://issues.apache.org/jira/browse/SPARK-11678
The change of this PR is to pass the root paths of the table to the partition discovery logic. So, the process of partition discovery stops at those root paths instead of going all the way to the root path of the file system.