[SPARK-29248][SQL] Pass in number of partitions to WriteBuilder #25945
Conversation
cc @rdblue @cloud-fan as it proposes a change to the DSv2 API.
The use case seems reasonable to me, as does the approach of adding the number of partitions with a method that is defaulted. I'd like to make sure that all code paths call this method in tests. Could you update the InMemoryTable test class so that it throws an exception if this is not called before a write operation commits? That will ensure in tests that all code paths that commit to a table call this correctly.
 * @return a new builder with the `numPartitions`. By default it returns `this`, which means
 * the given `numPartitions` is ignored. Please override this method to take the
 * `numPartitions`.
 */
default WriteBuilder withNumPartitions(int numPartitions) {
  return this;
}
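For illustration, here is a self-contained Scala sketch of how this defaulted hook is meant to be used. The traits are toy stand-ins mirroring the DSv2 builder shape in this PR (the real interfaces live in Spark), and the helper name buildBatchWrite is an assumption:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Toy stand-ins: defaulted hooks, as in the PR, so sources that don't care
// about a piece of information can simply ignore it.
trait BatchWrite
trait WriteBuilder {
  def withQueryId(queryId: String): WriteBuilder = this
  def withInputDataSchema(schema: StructType): WriteBuilder = this
  def withNumPartitions(numPartitions: Int): WriteBuilder = this // added by this PR
  def buildForBatch(): BatchWrite
}

// The Spark-side write path would thread the partition count through before
// building the write.
def buildBatchWrite(
    builder: WriteBuilder,
    queryId: String,
    schema: StructType,
    rdd: RDD[InternalRow]): BatchWrite =
  builder
    .withQueryId(queryId)
    .withInputDataSchema(schema)
    .withNumPartitions(rdd.getNumPartitions)
    .buildForBatch()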
I'm OK with the approach here, but I just want to share a few thoughts about how to make the API better. The use case is: there is some additional information (input schema, numPartitions, etc.) that Spark should always provide, and an implementation only needs to write extra code if it needs to access that information.
With the current API, we can:
- add more additional information in future versions without breaking backward compatibility;
- let users override withNumPartitions and the other methods only if they need to access the additional information.
But there is one drawback: we need to take extra effort to make sure the additional information is actually provided by Spark. It would be better to guarantee this at compile time.
I think we can improve this API a little bit. For Table#newWriteBuilder, we can define it as:

WriteBuilder newWriteBuilder(CaseInsensitiveStringMap options, WriteInfo info);

where WriteInfo is an interface providing the additional information:

interface WriteInfo {
  String queryId();
  StructType inputDataSchema();
  ...
}
WriteInfo is implemented by Spark and called by data source implementations, so we can add more methods in future versions without breaking backward compatibility. It also ensures, at compile time, that Spark always provides the additional information.
If you think this makes sense, we can do it in a follow-up.
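For concreteness, a minimal Scala sketch of the proposed shape; these are toy stand-ins for illustration, not the real Spark interfaces, and the method list simply follows the comment above:

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// All the information Spark must supply, gathered behind one interface.
trait WriteInfo {
  def queryId: String
  def inputDataSchema: StructType
  def numPartitions: Int // new methods can be appended without breaking sources
}

trait WriteBuilder

trait Table {
  // Spark has to construct a complete WriteInfo before it can call this at
  // all, which is the compile-time guarantee described above.
  def newWriteBuilder(options: CaseInsensitiveStringMap, info: WriteInfo): WriteBuilder
}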
I agree with you that the WriteInfo approach has better compile-time guarantees. I actually started implementing the change that way, but then felt it was maybe too big a change and that I should focus on numPartitions.
I'm happy to change it in a follow-up PR, if that works for everyone.
If we want to take this approach, then let's do it now, before a release. Otherwise we should use the original implementation that adds an extra method, because that is a compatible change.
@edrevo, can you implement this approach here? I think adding numPartitions is really a small change; we can make the main focus of this PR improving this API.
@rdblue, I have modified the tests.
@cloud-fan, after today's Community Sync it seems that it is better to move forward with this PR instead of #25990. You did mention there was an alternative trait/interface where the number of physical partitions could be reported, but I didn't get the specific name. Were you referring to …?
Thanks for all the help with this, by the way! Much appreciated!
Yea, we should add the method to …
@@ -117,12 +119,32 @@ class InMemoryTable(
      this
    }

    override def buildForBatch(): BatchWrite = writer

    override def withQueryId(queryId: String): WriteBuilder = {
      assert(!queryIdProvided, "queryId provided twice")
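This is the guard @rdblue asked for earlier: the test table fails fast if Spark skips a hook or calls it twice. A self-contained toy version of the pattern, where WriteBuilder is a simplified stand-in for the real interface and CheckedWriteBuilder is invented for this sketch:

trait WriteBuilder {
  def withQueryId(queryId: String): WriteBuilder = this
  def withNumPartitions(numPartitions: Int): WriteBuilder = this
  def buildForBatch(): Unit
}

class CheckedWriteBuilder extends WriteBuilder {
  private var queryIdProvided = false
  private var numPartitionsProvided = false

  override def withQueryId(queryId: String): WriteBuilder = {
    assert(!queryIdProvided, "queryId provided twice")
    queryIdProvided = true
    this
  }

  override def withNumPartitions(numPartitions: Int): WriteBuilder = {
    assert(!numPartitionsProvided, "numPartitions provided twice")
    numPartitionsProvided = true
    this
  }

  override def buildForBatch(): Unit = {
    // Fails the test run if any write path skipped one of the hooks.
    assert(queryIdProvided, "queryId was never provided")
    assert(numPartitionsProvided, "numPartitions was never provided")
  }
}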
Later on, can we continue the work in #25990? It still has value in giving stronger guarantees, so that implementations don't need to do checks like this.
ok to test
@@ -36,8 +36,8 @@ class MicroBatchWrite(eppchId: Long, val writeSupport: StreamingWrite) extends B
     writeSupport.abort(eppchId, messages)
   }

-  override def createBatchWriterFactory(): DataWriterFactory = {
-    new MicroBatchWriterFactory(eppchId, writeSupport.createStreamingWriterFactory())
+  override def createBatchWriterFactory(numPartitions: Int): DataWriterFactory = {
Sorry to come up with this at the last minute: can we create a PhysicalWriteInfo interface, in case we want to add more physical information in the future?
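A minimal sketch of the suggested shape, as a Scala stand-in for illustration (the actual interface was added later through #25990):

// Physical properties of the write; only the partition count for now, with
// room to add more methods in future versions.
trait PhysicalWriteInfo {
  def numPartitions: Int
}

// Spark would construct and pass this to createBatchWriterFactory, so data
// sources only ever read from it.
case class PhysicalWriteInfoImpl(numPartitions: Int) extends PhysicalWriteInfo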
No problem! Since we still want to move forward with the interface-based approach, I've decided to evolve #25990 to include both PhysicalWriteInfo and WriteInfo.
Test build #113832 has finished for PR 25945 at commit …
Closing this in favor of #25990.
What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions. However, when someone is implementing a WriteBuilder, we only pass them the schema, not the number of partitions. This is an asymmetrical developer experience.
Why are the changes needed?
Passing the number of partitions to the WriteBuilder would enable data sources to provision their write targets before starting to write. An example is sketched below.
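As one hedged illustration (the concrete sink below is invented for this example, not taken from the PR): knowing the partition count before any task runs lets a sink create one write target per partition in a single provisioning step.

import java.nio.file.{Files, Paths}

// Create one output directory per Spark write task up front, so individual
// tasks never race to set up their own targets mid-write.
def provisionShardDirectories(basePath: String, numPartitions: Int): Unit =
  (0 until numPartitions).foreach { i =>
    Files.createDirectories(Paths.get(basePath, f"shard-$i%05d"))
  }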
Does this PR introduce any user-facing change?
No
How was this patch tested?
I ran the tests, but I am getting an OOM error, so I haven't been able to run the full suite.