[SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF. #19349
Conversation
Test build #82175 has finished for PR 19349 at commit
The performance test I did locally, based on @BryanCutler's (#18659 (comment)), is as follows:

from pyspark.sql.functions import *
from pyspark.sql.types import *

@udf(DoubleType())
def my_udf(p1, p2):
    from math import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

@pandas_udf(DoubleType())
def my_pandas_udf(p1, p2):
    from numpy import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())

df_udf = df.withColumn("p", my_udf(col("p1"), col("p2")))
df_pandas_udf = df.withColumn("p", my_pandas_udf(col("p1"), col("p2")))

Normal UDF:
Vectorized UDF before this patch:
Vectorized UDF after this patch:
Environment:

Updated commands because the configuration to enable Arrow stream format was removed.
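For reference, a minimal sketch of how such timings could be collected, reusing df_udf and df_pandas_udf from the snippet above. This is an illustration only, not necessarily the commands behind the numbers reported in this thread:

```python
import time
from pyspark.sql.functions import sum as sum_

def benchmark(df, n=5):
    # Force full evaluation of the UDF column and report the best wall-clock time.
    times = []
    for _ in range(n):
        start = time.time()
        df.select(sum_("p")).collect()
        times.append(time.time() - start)
    return min(times)

print("Normal UDF:     %.3fs" % benchmark(df_udf))
print("Vectorized UDF: %.3fs" % benchmark(df_pandas_udf))
```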
Test build #82180 has finished for PR 19349 at commit
retest this please
Test build #82183 has finished for PR 19349 at commit
Test build #82184 has finished for PR 19349 at commit
python/pyspark/serializers.py
for series in iterator:
    batch = _create_batch(series)
    if writer is None:
        write_int(0, stream)
Shall we add a new entry in SpecialLengths and use it here instead of 0?
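For context, SpecialLengths is the set of negative length markers in python/pyspark/serializers.py. A new entry for this purpose might look like the sketch below; the START_ARROW_STREAM name and the -6 value are assumptions for illustration, not the code in this PR:

```python
class SpecialLengths(object):
    END_OF_DATA_SECTION = -1
    PYTHON_EXCEPTION_THROWN = -2
    TIMING_DATA = -3
    END_OF_STREAM = -4
    NULL = -5
    START_ARROW_STREAM = -6  # hypothetical marker signalling the start of an Arrow stream
```

Using a named marker instead of a bare 0 would make the handshake self-documenting on both the JVM and Python sides.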
  .internal()
  .doc("When using Apache Arrow, use Arrow stream protocol if possible.")
  .booleanConf
  .createWithDefault(false)
Are there any known problems? I think we should enable it by default; otherwise most users can't benefit from it.
  private var closed = false

  context.addTaskCompletionListener { _ =>
    // todo: we need something like `read.end()`, which release all the resources, but leave
cc @BryanCutler for Arrow-side issues.
I think that ArrowStreamReader.close() should not close the input stream. I filed https://issues.apache.org/jira/browse/ARROW-1613 to fix this.
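Until ARROW-1613 lands, one common workaround (a sketch only, not what this PR does) would be to hand the reader a wrapper whose close() is a no-op, so closing the reader cannot close the caller-owned stream from the worker:

```scala
import java.io.{FilterInputStream, InputStream}

// Hypothetical wrapper: allows ArrowStreamReader.close() to be called without
// closing the underlying stream, which the caller still owns.
class NonClosingInputStream(in: InputStream) extends FilterInputStream(in) {
  override def close(): Unit = {
    // Intentionally left empty: do not close the wrapped stream.
  }
}
```

The alternative, which the JIRA pursues, is to fix ArrowStreamReader itself so that close() releases Arrow buffers without closing the caller-owned input stream.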
LGTM
Nice job on refactoring.
python/pyspark/serializers.py
    arrs = [pa.Array.from_pandas(cast_series(s, t), mask=s.isnull(), type=t) for s, t in series]
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in xrange(len(arrs))])


class ArrowPandasSerializer(ArrowSerializer):
Do we need to keep ArrowPandasSerializer? I don't see it used anywhere other than in the pandas udf.
Thanks! I'll remove it.
  batch.setNumRows(root.getRowCount)
  batch
} else {
  read()
Is loadNextBatch a blocking action that returns false only when there are no more batches? It looks like we call read() again if no batch is loaded, so is loadNextBatch an async action that can return false if the batch is not ready yet? If it takes too long for the batch to be ready, can the recursive read be an issue?
I also wonder whether the recursive read may cause a StackOverflowException. Can we implement it as a loop? Or can we ensure it does not cause a StackOverflowException?
I believe loadNextBatch is a blocking action. Here's the single-line comment from the source code of the method: "Returns true if a batch was read, false on EOS". cc @BryanCutler Could you confirm this?
Oh, it may not incur a StackOverflowException, as batchLoaded is false now and we won't enter the if at line 153.
@kiszk I might be missing something, but I don't think a StackOverflowException happens, because of the protocol used to communicate with the Python worker.
@ueshin Do you mind adding a comment like:

} else {
  // Reach end of stream. Call `read()` again to read control data.
  read()
}
@viirya Sure, I'll add the comment. Thanks!
python/pyspark/serializers.py
        table = pa.Table.from_batches([batch])
        yield [c.to_pandas() for c in table.itercolumns()]

    def dump_stream(self, iterator, stream):
Maybe add a few comments for dump_stream and load_stream, like ArrowPandasSerializer has.
Sure, I'll add comments.
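For illustration, a sketch of what such commented dump_stream / load_stream methods could look like on the stream-based serializer. This is not the PR's exact code: _create_batch is assumed to be the batch-building helper quoted in the diff above, SpecialLengths.START_ARROW_STREAM is the hypothetical marker sketched earlier, and the pyarrow calls assume a recent pyarrow providing RecordBatchStreamWriter and ipc.open_stream:

```python
import pyarrow as pa
from pyspark.serializers import Serializer, SpecialLengths, write_int

class ArrowStreamPandasSerializer(Serializer):
    """
    Serializes pandas.Series as Arrow data using the Arrow streaming format.
    """

    def dump_stream(self, iterator, stream):
        # Make an Arrow RecordBatch from each group of pandas.Series and write
        # the batches out as a single Arrow stream.
        writer = None
        try:
            for series in iterator:
                batch = _create_batch(series)  # helper from the diff above
                if writer is None:
                    write_int(SpecialLengths.START_ARROW_STREAM, stream)
                    writer = pa.RecordBatchStreamWriter(stream, batch.schema)
                writer.write_batch(batch)
        finally:
            if writer is not None:
                writer.close()

    def load_stream(self, stream):
        # Read an Arrow stream and yield a list of pandas.Series per batch.
        reader = pa.ipc.open_stream(stream)
        for batch in reader:
            table = pa.Table.from_batches([batch])
            yield [c.to_pandas() for c in table.itercolumns()]
```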
      context: TaskContext): WriterThread = {
    new WriterThread(env, worker, inputIterator, partitionIndex, context) {

      override def writeCommand(dataOut: DataOutputStream): Unit = {
Looks like this implementation is no different from the writeCommand in ArrowPythonRunner? If so, I think we don't need to duplicate it.
Sure, I'll try to avoid the duplication.
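One way to avoid the duplication, sketched here under assumed names (this is not the actual refactoring in the PR), is to hoist the shared command-writing logic into a helper that both runners call:

```scala
import java.io.DataOutputStream

// Hypothetical helper: both runner implementations delegate here instead of
// each carrying its own copy of the command-writing logic.
object UDFCommandWriter {
  def writeCommand(dataOut: DataOutputStream, command: Array[Byte]): Unit = {
    // Length-prefixed write of the serialized Python UDF command.
    dataOut.writeInt(command.length)
    dataOut.write(command)
  }
}
```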
  root.close()
  allocator.close()
  closed = true
}
I think we need to write out END_OF_DATA_SECTION after all the data are written out?
nvm. ArrowStreamPandasSerializer is not a FramedSerializer.
Looks pretty good to me. The other few pending comments from me were basically a subset of @viirya's.
}

def writeCommand(dataOut: DataOutputStream): Unit
def writeIteratorToStream(dataOut: DataOutputStream): Unit
I'd leave a few comments for the methods that should be implemented here.
Sure, I'll add comments.
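The kind of comments being asked for might look like this sketch; the class name and wording are illustrative, not the final comments added in the PR:

```scala
import java.io.DataOutputStream

// Hypothetical sketch of doc comments on the abstract writer-thread methods.
abstract class WriterThreadSketch {
  /** Writes the serialized Python command (the pickled UDFs) to the worker. */
  def writeCommand(dataOut: DataOutputStream): Unit

  /** Writes the input data of the current partition to the worker's stream. */
  def writeIteratorToStream(dataOut: DataOutputStream): Unit
}
```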
Test build #82214 has finished for PR 19349 at commit
LGTM
Test build #82221 has finished for PR 19349 at commit
LGTM too. Should be good to go.
Test build #82223 has finished for PR 19349 at commit
Test build #82231 has finished for PR 19349 at commit
Merged to master.
What changes were proposed in this pull request?
Currently we use the Arrow File format to communicate with the Python worker when invoking a vectorized UDF, but we can use the Arrow Stream format instead.
This PR replaces the Arrow File format with the Arrow Stream format.
How was this patch tested?
Existing tests.