[SPARK-21190][PYSPARK] Python Vectorized UDFs #18659
Conversation
The following was used to test performance locally:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, rand, udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("vectorized_udfs").getOrCreate()

    vectorize = True
    if vectorize:
        from numpy import log, exp
    else:
        from math import log, exp

    def my_func(p1, p2):
        w = 0.5
        return exp(log(p1) + log(p2) - log(w))

    df = spark.range(1 << 24, numPartitions=16).toDF("id") \
        .withColumn("p1", rand()).withColumn("p2", rand())

    my_udf = udf(my_func, DoubleType(), vectorized=vectorize)
    df.withColumn("p", my_udf(col("p1"), col("p2")))

** Updated with using
Some comments on the performance above
Test build #79680 has finished for PR 18659 at commit
Test build #79682 has finished for PR 18659 at commit
    val genericRowData = fields.map { field =>
      field.getAccessor.getObject(_index)
    }.toArray[Any]
How about using SpecificInternalRow to improve performance? I think that it could eliminate some boxing/unboxing. The following is a snippet for this usage.
    val fieldTypes = fields.map {
      case _: NullableIntVector => IntegerType
      case _: NullableFloat8Vector => DoubleType
      ...
    }
    val row = new SpecificInternalRow(fieldTypes)
    fields.zipWithIndex.foreach { case (field, i) =>
      field match {
        case v: NullableIntVector =>
          row.setInt(i, v.getAccessor.get(_index))
        case v: NullableFloat8Vector =>
          row.setDouble(i, v.getAccessor.get(_index))
        ...
      }
    }
Thanks @kiszk , I'll give that a shot and see if it helps!
I have implemented arrow -> unsafe row conversions in:
icexelloss@8f38c15#diff-52cca47e7a940849b28d476ddf99d65eR575
This reuses the row object and doesn't do boxing. Hopefully it's useful to you as well?
@BryanCutler As @cloud-fan suggested here, it would be good to create a ColumnarBatch with ArrowColumnVector and get an iterator from it. It looks like a simpler implementation.
cc: @ueshin
The following is a code snippet.
    new Iterator[InternalRow] {
      private val _allocator = new RootAllocator(Long.MaxValue)
      private var _reader: ArrowFileReader = _
      private var _root: VectorSchemaRoot = _
      private var _index = 0
      private var _iterator: java.util.Iterator[ColumnarBatch.Row] = _

      loadNextBatch()

      override def hasNext: Boolean =
        _root != null && _index < _root.getRowCount && _iterator.hasNext

      override def next(): InternalRow = {
        // return the current row before possibly advancing to the next batch
        val row = _iterator.next()
        _index += 1
        if (_index >= _root.getRowCount) {
          _index = 0
          loadNextBatch()
          if (!hasNext) {
            close()
          }
        }
        row
      }

      ...

      private def loadNextBatch(): Unit = {
        closeReader()
        if (iter.hasNext) {
          val in = new ByteArrayReadableSeekableByteChannel(iter.next().asPythonSerializable)
          _reader = new ArrowFileReader(in, _allocator)
          _root = _reader.getVectorSchemaRoot // throws IOException
          _reader.loadNextBatch() // throws IOException
          val columnarBatch = ColumnarBatch.allocateArrow(
            _root.getFieldVectors.asInstanceOf[java.util.List[ValueVector]],
            ArrowUtils.fromArrowSchema(_root.getSchema), _root.getRowCount)
          _iterator = columnarBatch.rowIterator
        }
      }
    }
    public final class ColumnarBatch {
      ...
      public static ColumnarBatch allocateArrow(List<ValueVector> vectors, StructType schema, int maxRows) {
        // need to implement the following constructor for ArrowColumnVector
        return new ColumnarBatch(vectors, schema, maxRows);
      }
      ...
    }
Thanks @kiszk , I'm giving it a try!
      return columns;
    }

    public static ColumnarBatch createReadOnly(
@ueshin I made some changes here to allow for use with ArrowColumnVectors. I was thinking of putting these in a separate JIRA because they can be used regardless of what is done with vectorized UDFs. What do you think?
@BryanCutler I agree with you, let's separate it from this PR.
OK, will do. I created https://issues.apache.org/jira/browse/SPARK-21583 for this.
Test build #80030 has finished for PR 18659 at commit
46e4112 to 912143e (compare)
Test build #80264 has finished for PR 18659 at commit
Test build #80265 has finished for PR 18659 at commit
a01a2d3 to 38474d8 (compare)
Test build #81138 has finished for PR 18659 at commit
38474d8 to cc7ed5a (compare)
Test build #81321 has finished for PR 18659 at commit
Test build #81478 has finished for PR 18659 at commit
1503fa0 to fdea603 (compare)
fdea603 to 4f6c950 (compare)
python/pyspark/sql/functions.py (outdated)

    @@ -2112,7 +2113,7 @@ def wrapper(*args):

     @since(1.3)
    -def udf(f=None, returnType=StringType()):
    +def udf(f=None, returnType=StringType(), vectorized=False):
@felixcheung does this fit your idea for a more generic decorator? Not exclusively labeled as pandas_udf, just enable vectorization with a flag, e.g. @udf(DoubleType(), vectorized=True)
I think @pandas_udf(DoubleType()) is better than @udf(DoubleType(), vectorized=True); it is more concise.
As we discussed in the email, we should also accept the data type as a string.
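For illustration, a minimal sketch (not code from this PR) of what accepting the return type as a string might look like; the decorator form and the DDL-style name "double" are assumptions here:

```python
from pyspark.sql.functions import pandas_udf

# Hypothetical: return type given as a string instead of DoubleType()
@pandas_udf("double")
def plus_one(v):
    # v would be a pandas.Series; the result is a Series of the same length
    return v + 1
```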
And also **kwargs to bring the size information.
It seems like the consensus is for pandas_udf and I'm fine with that too. I'll make that change and the others brought up here.
Cool!
    val outputRowIterator = ArrowConverters.fromPayloadIterator(
      outputIterator.map(new ArrowPayload(_)), context)

    assert(schemaOut.equals(outputRowIterator.schema))
@felixcheung, I think you had also brought up checking that the return type matches what was defined in the UDF. This is done here.
Test build #81479 has finished for PR 18659 at commit
python/pyspark/serializers.py (outdated)

        series = [series]
    series = [(s, None) if not isinstance(s, (list, tuple)) else s for s in series]
    arrs = [pa.Array.from_pandas(s[0], type=s[1], mask=s[0].isnull()) for s in series]
    batch = pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in range(len(arrs))])
I'd use xrange.
python/pyspark/serializers.py (outdated)

    if not isinstance(series, (list, tuple)) or \
            (len(series) == 2 and isinstance(series[1], pa.DataType)):
        series = [series]
    series = [(s, None) if not isinstance(s, (list, tuple)) else s for s in series]
I'd use generator comprehension.
That would work, but does it help much since series will already be a list or tuple?
Yea, it actually affects the performance because we can avoid an extra loop:

    def im_map(x):
        print("I am map %s" % x)
        return x

    def im_gen(x):
        print("I am gen %s" % x)
        return x

    def im_list(x):
        print("I am list %s" % x)
        return x

    items = list(range(3))
    map(im_map, [im_list(item) for item in items])
    map(im_map, (im_gen(item) for item in items))
And this actually affects the performance, to my knowledge:

    import time

    items = list(xrange(int(1e8)))
    for _ in xrange(10):
        s = time.time()
        _ = map(lambda x: x, [item for item in items])
        print "I am list comprehension with a list: %s" % (time.time() - s)

        s = time.time()
        _ = map(lambda x: x, (item for item in items))
        print "I am generator expression with a list: %s" % (time.time() - s)
This gives me ~13% improvement in Python 2
This might not be a big deal, but I usually use a generator if it is iterated once and then discarded. It should consume less memory too, since a list comprehension is fully evaluated first, as far as I know.
Thanks @HyukjinKwon , I suppose if there are more than a few series then it might make some difference. In that case, every little bit helps so sounds good to me!
python/pyspark/serializers.py (outdated)

    reader = pa.RecordBatchFileReader(pa.BufferReader(obj))
    batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]
    # NOTE: a 0-parameter pandas_udf will produce an empty batch that can have num_rows set
    num_rows = sum([batch.num_rows for batch in batches])
I'd use generator comprehension here too.
I guess this makes sense because it's a summation; there's no sense in making a list and then adding it all up.
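A minimal sketch of that suggestion applied to the line under review (names taken from the snippet above):

```python
# Generator expression: no intermediate list is built before summing
num_rows = sum(batch.num_rows for batch in batches)
```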
python/pyspark/serializers.py (outdated)

    """
    import pyarrow as pa
    reader = pa.RecordBatchFileReader(pa.BufferReader(obj))
    batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]
And xrange here too.
What if users installed an older version of pyarrow? Shall we throw an exception and ask them to upgrade, or work around the type casting issue?
Test build #81945 has finished for PR 18659 at commit
Thanks for the reviews @ueshin @viirya and @HyukjinKwon! I updated with your comments.
@cloud-fan, with regard to handling problems that might come up when using different versions of Arrow, I think we should first decide on a minimum supported version, then maybe we could put that version of pyarrow as a requirement for PySpark. If we decide on 0.4.1, which we currently use, then we should probably work around the type casting issue and make sure this PR works with that version.
Test build #81955 has finished for PR 18659 at commit
OK, let's work around the type casting issue and discuss the Arrow upgrade later.
    * \ /
    * \ socket (input of UDF)
    * \ /
    * upstream (from child)
Is Upstream better?
I think upstream is fine.
Maybe it's just that seeing Downstream capitalized above made me uncomfortable; forgive me.
That's fine, but either looks fine and it's not a big deal.
@ueshin I haven't had much luck with the casting workaround: it appears that it forces a copy for floating point -> integer and then checks for NaNs, so I get the error.
@BryanCutler Hmm, I'm not exactly sure why it doesn't work (or why mine works), but I guess we can use:
Thanks @ueshin, that works to allow the tests to pass. I do worry that it might cause some other issues and I would much prefer we upgrade Arrow to handle this, but I'll push this and we can discuss.
Test build #82042 has finished for PR 18659 at commit
Test build #82053 has finished for PR 18659 at commit
""" | ||
|
||
def __init__(self): | ||
super(ArrowPandasSerializer, self).__init__() |
Do we need this?
No, that was leftover; I'll remove it in a follow-up.
LGTM, merging to master! We can address the remaining minor comments in follow-ups, and have new PRs to remove the 0-parameter UDF and use the Arrow streaming protocol.
Thanks @cloud-fan @ueshin and others who reviewed! I'll make follow-ups to disable 0-parameter UDFs and complete the docs for this.
What changes were proposed in this pull request?
This PR adds vectorized UDFs to the Python API.
Proposed API
Introduce a flag to turn on vectorization for a defined UDF. Usage is the same as for normal UDFs; for example, see the sketch below.
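A minimal sketch of the proposed API (not the description's original snippet), reusing my_func and df from the performance script quoted near the top of this conversation; the pandas_udf import path reflects the reviewer consensus above and is an assumption here:

```python
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import DoubleType

# Flag-based form originally proposed in this PR:
my_vec_udf = udf(my_func, DoubleType(), vectorized=True)

# or the pandas_udf spelling that reviewers preferred:
my_vec_udf = pandas_udf(my_func, DoubleType())

# Applied like any other UDF:
df.withColumn("p", my_vec_udf(col("p1"), col("p2")))
```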
0-parameter UDFs
pandas_udf functions can declare an optional **kwargs argument; when the UDF is evaluated, it will contain a key "size" that gives the required length of the output, for example as sketched below.
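A minimal sketch of a 0-parameter UDF (not the description's original snippet); the "size" key comes from the sentence above, everything else is illustrative:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

def zeros(**kwargs):
    # kwargs["size"] gives the required length of the output Series
    return pd.Series([0.0] * kwargs["size"])

zeros_udf = pandas_udf(zeros, DoubleType())
df.withColumn("zeros", zeros_udf())  # df as defined earlier
```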
How was this patch tested?
Added new unit tests in pyspark.sql that are enabled if pyarrow and Pandas are available.
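For illustration, a hypothetical sketch (not the PR's actual test code) of how such tests are typically gated on the optional pyarrow and pandas dependencies:

```python
import unittest

try:
    import pandas   # noqa: F401
    import pyarrow  # noqa: F401
    _have_arrow_and_pandas = True
except ImportError:
    _have_arrow_and_pandas = False

@unittest.skipIf(not _have_arrow_and_pandas, "pyarrow or pandas not installed")
class VectorizedUDFTests(unittest.TestCase):
    def test_basic(self):
        # hypothetical placeholder: compare a pandas_udf result against the
        # equivalent row-at-a-time UDF result
        pass
```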
TODO