[SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF. #19349
Conversation
Test build #82175 has finished for PR 19349 at commit
The performance test I did locally, based on @BryanCutler's (#18659 (comment)), is as follows:

from pyspark.sql.functions import *
from pyspark.sql.types import *

@udf(DoubleType())
def my_udf(p1, p2):
    from math import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

@pandas_udf(DoubleType())
def my_pandas_udf(p1, p2):
    from numpy import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())

df_udf = df.withColumn("p", my_udf(col("p1"), col("p2")))
df_pandas_udf = df.withColumn("p", my_pandas_udf(col("p1"), col("p2")))

Normal UDF:
Vectorized UDF before this patch:
Vectorized UDF after this patch:
Environment:

Updated commands because the configuration to enable Arrow stream format was removed.
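For reference, a minimal sketch of how such timings could be collected, reusing df_udf and df_pandas_udf from the snippet above. This is an illustration only, not necessarily the commands behind the numbers reported in this thread:

```python
import time
from pyspark.sql.functions import sum as sum_

def benchmark(df, n=5):
    # Force full evaluation of the UDF column and report the best wall-clock time.
    times = []
    for _ in range(n):
        start = time.time()
        df.select(sum_("p")).collect()
        times.append(time.time() - start)
    return min(times)

print("Normal UDF:     %.3fs" % benchmark(df_udf))
print("Vectorized UDF: %.3fs" % benchmark(df_pandas_udf))
```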
Test build #82180 has finished for PR 19349 at commit
retest this please
Test build #82183 has finished for PR 19349 at commit
Test build #82184 has finished for PR 19349 at commit
python/pyspark/serializers.py
for series in iterator:
    batch = _create_batch(series)
    if writer is None:
        write_int(0, stream)
Shall we add a new entry in SpecialLengths and use it here instead of 0?
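For context, SpecialLengths is the set of negative length markers in python/pyspark/serializers.py. A new entry for this purpose might look like the sketch below; the START_ARROW_STREAM name and the -6 value are assumptions for illustration, not the code in this PR:

```python
class SpecialLengths(object):
    END_OF_DATA_SECTION = -1
    PYTHON_EXCEPTION_THROWN = -2
    TIMING_DATA = -3
    END_OF_STREAM = -4
    NULL = -5
    START_ARROW_STREAM = -6  # hypothetical marker signalling the start of an Arrow stream
```

Using a named marker instead of a bare 0 would make the handshake self-documenting on both the JVM and Python sides.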
  .internal()
  .doc("When using Apache Arrow, use Arrow stream protocol if possible.")
  .booleanConf
  .createWithDefault(false)
Are there any known problems? I think we should enable it by default; otherwise most users can't benefit from it.
  private var closed = false

  context.addTaskCompletionListener { _ =>
    // todo: we need something like `read.end()`, which release all the resources, but leave
cc @BryanCutler for Arrow-side issues.
I think that ArrowStreamReader.close() should not close the input stream. I filed https://issues.apache.org/jira/browse/ARROW-1613 to fix this.
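Until ARROW-1613 lands, one common workaround (a sketch only, not what this PR does) would be to hand the reader a wrapper whose close() is a no-op, so closing the reader cannot close the caller-owned stream from the worker:

```scala
import java.io.{FilterInputStream, InputStream}

// Hypothetical wrapper: allows ArrowStreamReader.close() to be called without
// closing the underlying stream, which the caller still owns.
class NonClosingInputStream(in: InputStream) extends FilterInputStream(in) {
  override def close(): Unit = {
    // Intentionally left empty: do not close the wrapped stream.
  }
}
```

The alternative, which the JIRA pursues, is to fix ArrowStreamReader itself so that close() releases Arrow buffers without closing the caller-owned input stream.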
LGTM
Nice job on refactoring.
python/pyspark/serializers.py
    arrs = [pa.Array.from_pandas(cast_series(s, t), mask=s.isnull(), type=t) for s, t in series]
    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in xrange(len(arrs))])


class ArrowPandasSerializer(ArrowSerializer):
Do we need to keep ArrowPandasSerializer? I don't see it used anywhere other than in the pandas udf.
Thanks! I'll remove it.
  batch.setNumRows(root.getRowCount)
  batch
} else {
  read()
Is loadNextBatch a blocking action that returns false only when there are no more batches? It looks like we call read() again if no batch is loaded, so is loadNextBatch an async action that can return false if the batch is not ready yet? If it takes too long for the batch to be ready, can the recursive read be an issue?
I also wonder whether the recursive read may cause a StackOverflowException. Can we implement it as a loop? Or can we ensure it does not cause a StackOverflowException?
I believe loadNextBatch is a blocking action. Here's the single-line comment from the source code of the method: "Returns true if a batch was read, false on EOS". cc @BryanCutler Could you confirm this?
Oh, it may not incur a StackOverflowException, as batchLoaded is false now and we won't enter the if at line 153.
@kiszk I might be missing something, but I don't think a StackOverflowException happens, because of the protocol used to communicate with the Python worker.
@ueshin Do you mind adding a comment like:

} else {
  // Reach end of stream. Call `read()` again to read control data.
  read()
}
@viirya Sure, I'll add the comment. Thanks!
python/pyspark/serializers.py
        table = pa.Table.from_batches([batch])
        yield [c.to_pandas() for c in table.itercolumns()]

    def dump_stream(self, iterator, stream):
Maybe add a few comments for dump_stream and load_stream, like ArrowPandasSerializer has.
Sure, I'll add comments.
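For illustration, a sketch of what such commented dump_stream / load_stream methods could look like on the stream-based serializer. This is not the PR's exact code: _create_batch is assumed to be the batch-building helper quoted in the diff above, SpecialLengths.START_ARROW_STREAM is the hypothetical marker sketched earlier, and the pyarrow calls assume a recent pyarrow providing RecordBatchStreamWriter and ipc.open_stream:

```python
import pyarrow as pa
from pyspark.serializers import Serializer, SpecialLengths, write_int

class ArrowStreamPandasSerializer(Serializer):
    """
    Serializes pandas.Series as Arrow data using the Arrow streaming format.
    """

    def dump_stream(self, iterator, stream):
        # Make an Arrow RecordBatch from each group of pandas.Series and write
        # the batches out as a single Arrow stream.
        writer = None
        try:
            for series in iterator:
                batch = _create_batch(series)  # helper from the diff above
                if writer is None:
                    write_int(SpecialLengths.START_ARROW_STREAM, stream)
                    writer = pa.RecordBatchStreamWriter(stream, batch.schema)
                writer.write_batch(batch)
        finally:
            if writer is not None:
                writer.close()

    def load_stream(self, stream):
        # Read an Arrow stream and yield a list of pandas.Series per batch.
        reader = pa.ipc.open_stream(stream)
        for batch in reader:
            table = pa.Table.from_batches([batch])
            yield [c.to_pandas() for c in table.itercolumns()]
```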
      context: TaskContext): WriterThread = {
    new WriterThread(env, worker, inputIterator, partitionIndex, context) {

      override def writeCommand(dataOut: DataOutputStream): Unit = {
Looks like this implementation is no different from the writeCommand in ArrowPythonRunner? If so, I think we don't need to duplicate it.
Sure, I'll try to avoid the duplication.
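One way to avoid the duplication, sketched here under assumed names (this is not the actual refactoring in the PR), is to hoist the shared command-writing logic into a helper that both runners call:

```scala
import java.io.DataOutputStream

// Hypothetical helper: both runner implementations delegate here instead of
// each carrying its own copy of the command-writing logic.
object UDFCommandWriter {
  def writeCommand(dataOut: DataOutputStream, command: Array[Byte]): Unit = {
    // Length-prefixed write of the serialized Python UDF command.
    dataOut.writeInt(command.length)
    dataOut.write(command)
  }
}
```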
  root.close()
  allocator.close()
  closed = true
}
I think we need to write out END_OF_DATA_SECTION after all the data are written out?
nvm. ArrowStreamPandasSerializer is not a FramedSerializer.
Looks pretty good to me. The other few pending comments from me were basically a subset of @viirya's.
}

def writeCommand(dataOut: DataOutputStream): Unit
def writeIteratorToStream(dataOut: DataOutputStream): Unit
I'd leave a few comments for the methods that should be implemented here.
Sure, I'll add comments.
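The kind of comments being asked for might look like this sketch; the class name and wording are illustrative, not the final comments added in the PR:

```scala
import java.io.DataOutputStream

// Hypothetical sketch of doc comments on the abstract writer-thread methods.
abstract class WriterThreadSketch {
  /** Writes the serialized Python command (the pickled UDFs) to the worker. */
  def writeCommand(dataOut: DataOutputStream): Unit

  /** Writes the input data of the current partition to the worker's stream. */
  def writeIteratorToStream(dataOut: DataOutputStream): Unit
}
```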
Test build #82214 has finished for PR 19349 at commit
LGTM
Test build #82221 has finished for PR 19349 at commit
LGTM too. Should be good to go.
Test build #82223 has finished for PR 19349 at commit
Test build #82231 has finished for PR 19349 at commit
Merged to master.
What changes were proposed in this pull request?
Currently we use the Arrow File format to communicate with the Python worker when invoking a vectorized UDF, but we can use the Arrow Stream format instead.
This PR replaces the Arrow File format with the Arrow Stream format.
How was this patch tested?
Existing tests.