[SPARK-29248][SQL] Pass in number of partitions to WriteBuilder #25945
Conversation
cc @rdblue @cloud-fan as it proposes a change to the DSv2 API.
The use case seems reasonable to me, as does the approach of adding the number of partitions with a method that is defaulted. I'd like to make sure that all code paths call this method in tests. Could you update the InMemoryTable test class so that it throws an exception if this is not called before a write operation commits? That will ensure in tests that all code paths that commit to a table call this correctly.
 * @return a new builder with the `numPartitions`. By default it returns `this`, which means
 * the given `numPartitions` is ignored. Please override this method to take the
 * `numPartitions`.
 */
default WriteBuilder withNumPartitions(int numPartitions) {
  return this;
}
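For illustration, here is a self-contained Scala sketch of how this defaulted hook is meant to be used. The traits are toy stand-ins mirroring the DSv2 builder shape in this PR (the real interfaces live in Spark), and the helper name buildBatchWrite is an assumption:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Toy stand-ins: defaulted hooks, as in the PR, so sources that don't care
// about a piece of information can simply ignore it.
trait BatchWrite
trait WriteBuilder {
  def withQueryId(queryId: String): WriteBuilder = this
  def withInputDataSchema(schema: StructType): WriteBuilder = this
  def withNumPartitions(numPartitions: Int): WriteBuilder = this // added by this PR
  def buildForBatch(): BatchWrite
}

// The Spark-side write path would thread the partition count through before
// building the write.
def buildBatchWrite(
    builder: WriteBuilder,
    queryId: String,
    schema: StructType,
    rdd: RDD[InternalRow]): BatchWrite =
  builder
    .withQueryId(queryId)
    .withInputDataSchema(schema)
    .withNumPartitions(rdd.getNumPartitions)
    .buildForBatch()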
I'm OK with the approach here, but I just want to share a few thoughts about how to make the API better. The use case is: there is some additional information (input schema, numPartitions, etc.) that Spark should always provide, and an implementation only needs to write extra code if it needs to access that information.
With the current API, we can:
- add more additional information in future versions without breaking backward compatibility;
- let users override withNumPartitions and the other methods only if they need to access the additional information.
But there is one drawback: we need to take extra effort to make sure the additional information is actually provided by Spark. It would be better to guarantee this at compile time.
I think we can improve this API a little bit. For Table#newWriteBuilder, we can define it as:

WriteBuilder newWriteBuilder(CaseInsensitiveStringMap options, WriteInfo info);

where WriteInfo is an interface providing the additional information:

interface WriteInfo {
  String queryId();
  StructType inputDataSchema();
  ...
}
WriteInfo is implemented by Spark and called by data source implementations, so we can add more methods in future versions without breaking backward compatibility. It also ensures, at compile time, that Spark always provides the additional information.
If you think this makes sense, we can do it in a follow-up.
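For concreteness, a minimal Scala sketch of the proposed shape; these are toy stand-ins for illustration, not the real Spark interfaces, and the method list simply follows the comment above:

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// All the information Spark must supply, gathered behind one interface.
trait WriteInfo {
  def queryId: String
  def inputDataSchema: StructType
  def numPartitions: Int // new methods can be appended without breaking sources
}

trait WriteBuilder

trait Table {
  // Spark has to construct a complete WriteInfo before it can call this at
  // all, which is the compile-time guarantee described above.
  def newWriteBuilder(options: CaseInsensitiveStringMap, info: WriteInfo): WriteBuilder
}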
I agree with you that the WriteInfo approach has better compile-time guarantees. I actually started implementing the change that way, but then felt it was maybe too big a change and that I should focus on numPartitions.
I'm happy to change it in a follow-up PR, if that works for everyone.
If we want to take this approach, then let's do it now, before a release. Otherwise we should use the original implementation that adds an extra method, because that is a compatible change.
@edrevo, can you implement this approach here? I think adding numPartitions is really a small change; we can make the main focus of this PR improving this API.
@rdblue, I have modified the tests.
@cloud-fan, after today's Community Sync it seems that it is better to move forward with this PR instead of #25990. You did mention there was an alternative trait/interface where the number of physical partitions could be reported, but I didn't get the specific name. Were you referring to …?
Thanks for all the help with this, by the way! Much appreciated!
Yea, we should add the method to …
@@ -117,12 +119,32 @@ class InMemoryTable(
      this
    }

    override def buildForBatch(): BatchWrite = writer

    override def withQueryId(queryId: String): WriteBuilder = {
      assert(!queryIdProvided, "queryId provided twice")
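This is the guard @rdblue asked for earlier: the test table fails fast if Spark skips a hook or calls it twice. A self-contained toy version of the pattern, where WriteBuilder is a simplified stand-in for the real interface and CheckedWriteBuilder is invented for this sketch:

trait WriteBuilder {
  def withQueryId(queryId: String): WriteBuilder = this
  def withNumPartitions(numPartitions: Int): WriteBuilder = this
  def buildForBatch(): Unit
}

class CheckedWriteBuilder extends WriteBuilder {
  private var queryIdProvided = false
  private var numPartitionsProvided = false

  override def withQueryId(queryId: String): WriteBuilder = {
    assert(!queryIdProvided, "queryId provided twice")
    queryIdProvided = true
    this
  }

  override def withNumPartitions(numPartitions: Int): WriteBuilder = {
    assert(!numPartitionsProvided, "numPartitions provided twice")
    numPartitionsProvided = true
    this
  }

  override def buildForBatch(): Unit = {
    // Fails the test run if any write path skipped one of the hooks.
    assert(queryIdProvided, "queryId was never provided")
    assert(numPartitionsProvided, "numPartitions was never provided")
  }
}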
Later on, can we continue the work in #25990? It still has value in giving stronger guarantees, so that implementations don't need to do checks like this.
ok to test
@@ -36,8 +36,8 @@ class MicroBatchWrite(eppchId: Long, val writeSupport: StreamingWrite) extends B
     writeSupport.abort(eppchId, messages)
   }

-  override def createBatchWriterFactory(): DataWriterFactory = {
-    new MicroBatchWriterFactory(eppchId, writeSupport.createStreamingWriterFactory())
+  override def createBatchWriterFactory(numPartitions: Int): DataWriterFactory = {
Sorry to come up with this at the last minute: can we create a PhysicalWriteInfo interface, in case we want to add more physical information in the future?
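A minimal sketch of the suggested shape, as a Scala stand-in for illustration (the actual interface was added later through #25990):

// Physical properties of the write; only the partition count for now, with
// room to add more methods in future versions.
trait PhysicalWriteInfo {
  def numPartitions: Int
}

// Spark would construct and pass this to createBatchWriterFactory, so data
// sources only ever read from it.
case class PhysicalWriteInfoImpl(numPartitions: Int) extends PhysicalWriteInfo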
No problem! Since we still want to move forward with the interface-based approach, I've decided to evolve #25990 to include both PhysicalWriteInfo and WriteInfo.
Test build #113832 has finished for PR 25945 at commit …
Closing this in favor of #25990.
What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions. However, when someone is implementing a WriteBuilder, we only pass them the schema, not the number of partitions. This is an asymmetrical developer experience.
Why are the changes needed?
Passing the number of partitions to the WriteBuilder would enable data sources to provision their write targets before starting to write. An example is sketched below.
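As one hedged illustration (the concrete sink below is invented for this example, not taken from the PR): knowing the partition count before any task runs lets a sink create one write target per partition in a single provisioning step.

import java.nio.file.{Files, Paths}

// Create one output directory per Spark write task up front, so individual
// tasks never race to set up their own targets mid-write.
def provisionShardDirectories(basePath: String, numPartitions: Int): Unit =
  (0 until numPartitions).foreach { i =>
    Files.createDirectories(Paths.get(basePath, f"shard-$i%05d"))
  }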
Does this PR introduce any user-facing change?
No
How was this patch tested?
I ran the tests, but I am getting an OOM error, so I haven't been able to run the full suite.