[SPARK-22389][SQL] data source v2 partitioning reporting interface #20201
Conversation
Test build #85852 has finished for PR 20201.

This looks very exciting to me.
Force-pushed from be14e3b to ff5b650.
Test build #86174 has started for PR 20201.
Force-pushed from ff5b650 to 713140a.
Test build #86177 has started for PR 20201.

retest this please
Test build #86248 has finished for PR 20201.

retest this please

Test build #86255 has finished for PR 20201.
```scala
case e: ShuffleExchangeExec => e
}.isEmpty)
```

```scala
val groupByColAB = df.groupBy('a, 'b).agg(count("*"))
```
Try `df.groupBy('a + 'b).agg(count("*")).show()`. At least, it should not fail, even if we do not support complex `ClusteredDistribution` expressions.
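For concreteness, a minimal sketch combining the quoted test fragment with the check suggested above. It assumes a test suite where `df` reads from the partition-reporting source with integer columns `a` and `b`, and where `spark.implicits._` is in scope for the `'a` symbol syntax; the scaffolding is hypothetical, only the v2 behavior under test comes from this PR:

```scala
import org.apache.spark.sql.functions.count
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// The source reports clustering on (a, b), so this aggregation should
// need no exchange in the physical plan:
val groupByColAB = df.groupBy('a, 'b).agg(count("*"))
assert(groupByColAB.queryExecution.executedPlan.collect {
  case e: ShuffleExchangeExec => e
}.isEmpty)

// A complex grouping expression is not supported by ClusteredDistribution,
// so Spark may insert a shuffle here, but the query must still succeed:
df.groupBy('a + 'b).agg(count("*")).show()
```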
```java
 */
@InterfaceStability.Evolving
public class ClusteredDistribution implements Distribution {
  public String[] clusteredColumns;
```
Need to emphasize that these columns are order-insensitive.
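To illustrate, a sketch of a `Partitioning` implementation that treats the clustered columns as order-insensitive by comparing them as sets; the class and its constructor are hypothetical, while the interfaces and method signatures match the ones quoted in this PR:

```scala
import org.apache.spark.sql.sources.v2.reader.{ClusteredDistribution, Distribution, Partitioning}

// Hypothetical: a source that hash-clusters its output rows by `columns`.
class HashClusteredPartitioning(columns: Array[String], parts: Int)
    extends Partitioning {

  override def numPartitions(): Int = parts

  // Order-insensitive comparison: clustering on (a, b) satisfies a
  // required clustering on (b, a) just as well.
  override def satisfy(distribution: Distribution): Boolean = distribution match {
    case c: ClusteredDistribution => columns.toSet == c.clusteredColumns.toSet
    case _ => false
  }
}
```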
```scala
import org.apache.spark.sql.sources.v2.reader.{ClusteredDistribution, Partitioning}

/**
 * An adapter from public data source partitioning to catalyst internal partitioning.
```
Partitioning
```java
 * the data ordering inside one partition (the output records of a single {@link ReadTask}).
 *
 * The instance of this interface is created and provided by Spark, then consumed by
 * {@link Partitioning#satisfy(Distribution)}. This means users don't need to implement
```
`users` -> `data source developers`
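In other words, only `Partitioning` is meant to be implemented by data source developers; `Distribution` instances come from Spark. A hypothetical illustration of that direction of the contract, where `reportedPartitioning` stands for whatever the reader returned from `outputPartitioning()` and the array constructor of `ClusteredDistribution` is assumed from the class shown above:

```scala
import org.apache.spark.sql.sources.v2.reader.ClusteredDistribution

// Spark, not the data source, builds the required Distribution and asks
// the reported Partitioning whether the shuffle can be skipped:
val required = new ClusteredDistribution(Array("a", "b"))
val needsShuffle = !reportedPartitioning.satisfy(required)
```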
```java
import org.apache.spark.annotation.InterfaceStability;

/**
 * An interface to represent output data partitioning for a data source, which is returned by
```
`output` -> `the output`
```java
/**
 * An interface to represent output data partitioning for a data source, which is returned by
 * {@link SupportsReportPartitioning#outputPartitioning()}. Note that this should work like a
 * snapshot, once created, it should be deterministic and always report same number of partitions
```
`, once` -> `. Once`
Force-pushed from 713140a to 28987a7.
Test build #86483 has finished for PR 20201.
```java
 * An interface to represent the output data partitioning for a data source, which is returned by
 * {@link SupportsReportPartitioning#outputPartitioning()}. Note that this should work like a
 * snapshot. Once created, it should be deterministic and always report same number of partitions
 * and same "satisfy" result for a certain distribution.
```
`same number` -> `the same number`

`and same "satisfy" result` -> `and the same "satisfy" result`
```java
public interface Partitioning {

  /**
   * Returns the number of partitions/{@link ReadTask}s the data source outputs.
```
`Returns the number of partitions (i.e., {@link ReadTask}s) that the data source outputs.`
```java
 * recommended to check every Spark new release and support new distributions if possible, to
 * avoid shuffle at Spark side for more cases.
 */
boolean satisfy(Distribution d);
```
`d` -> `distribution`
```java
}

@Override
public boolean satisfy(Distribution d) {
```
ditto
LGTM except a few minor comments.
Test build #86489 has finished for PR 20201.
Thanks! Merged to master/2.3.
## What changes were proposed in this pull request?

A new interface which allows a data source to report its partitioning and avoid a shuffle on the Spark side. The design closely mirrors the internal distribution/partitioning framework: Spark defines a `Distribution` interface and several concrete implementations, and asks the data source to report a `Partitioning`; the `Partitioning` tells Spark whether it can satisfy a given `Distribution` or not.

## How was this patch tested?

New test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20201 from cloud-fan/partition-reporting.

(cherry picked from commit 51eb750)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@cloud-fan, please ping me to review PRs for DataSourceV2. Our new table format uses it and we're preparing some changes, so I want to make sure we're heading in the same direction for this. |
Ah, sorry, I missed this, but it's not too late for post-hoc reviews; any comments are welcome!
What changes were proposed in this pull request?

A new interface which allows a data source to report its partitioning and avoid a shuffle on the Spark side. The design closely mirrors the internal distribution/partitioning framework: Spark defines a `Distribution` interface and several concrete implementations, and asks the data source to report a `Partitioning`; the `Partitioning` tells Spark whether it can satisfy a given `Distribution` or not.

How was this patch tested?

New test.
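Putting the pieces together, a sketch of a reader opting into the new interface, using the reader-side names as they existed at the time of this PR (`DataSourceV2Reader`, `ReadTask`; these may differ in later releases). The concrete class, schema, and data layout are hypothetical:

```scala
import java.util.{ArrayList, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader._
import org.apache.spark.sql.types.StructType

// Hypothetical reader whose underlying store is clustered by column `a`.
class ClusteredReader extends DataSourceV2Reader with SupportsReportPartitioning {

  override def readSchema(): StructType =
    new StructType().add("a", "int").add("b", "int")

  // One ReadTask per physical partition of the underlying store; the
  // layout guarantees that rows with the same `a` value land in the
  // same task, which is what outputPartitioning() promises below.
  override def createReadTasks(): JList[ReadTask[Row]] = {
    val tasks = new ArrayList[ReadTask[Row]]()
    // ... add one ReadTask per partition of the underlying store ...
    tasks
  }

  override def outputPartitioning(): Partitioning = new Partitioning {
    override def numPartitions(): Int = 2

    // Data clustered by `a` also satisfies any required clustering whose
    // column set includes `a`, so Spark can skip the shuffle in those cases.
    override def satisfy(distribution: Distribution): Boolean =
      distribution match {
        case c: ClusteredDistribution => c.clusteredColumns.contains("a")
        case _ => false
      }
  }
}
```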