
[SPARK-22003][SQL] support array column in vectorized reader with UDF #19230

Closed
wants to merge 4 commits into from

Conversation

liufengdb

What changes were proposed in this pull request?

The UDF needs to deserialize the UnsafeRow. When the column type is Array, the get method of ColumnVector (which is used by the vectorized reader) is called, but that method is not implemented for array types.
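As a rough illustration of the missing piece, here is a sketch of the generic accessor with an ArrayType branch added. The shape follows the Spark 2.2-era ColumnVector API, but this is an approximation of the fix, not the merged diff:

// Sketch only: the generic accessor gains an ArrayType branch instead of
// falling through to "not implemented"; scalar branches abbreviated.
public Object get(int ordinal, DataType dataType) {
  if (dataType instanceof ArrayType) {
    return getArray(ordinal); // previously unimplemented, which broke UDFs over array columns
  } else if (dataType instanceof IntegerType) {
    return getInt(ordinal);
  } else if (dataType instanceof DoubleType) {
    return getDouble(ordinal);
  }
  // ... other supported types elided ...
  throw new UnsupportedOperationException("Datatype not supported " + dataType);
}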

How was this patch tested?

A unit test for ColumnVector was added; see the discussion below.


@viirya
Member

viirya commented Sep 14, 2017

Add a test for it?

} else if (dt instanceof StringType) {
  for (int i = 0; i < length; i++) {
    if (!data.isNullAt(offset + i)) {
      list[i] = getUTF8String(i).toString();
Member

This looks suspicious. Why did we get a String here before? It seems we should get a UTF8String.

Contributor

This looks like a bug.
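If so, the corrected branch would presumably read the internal representation directly, along these lines (a sketch reusing the names from the snippet above, not the merged diff):

} else if (dt instanceof StringType) {
  for (int i = 0; i < length; i++) {
    if (!data.isNullAt(offset + i)) {
      // Keep Spark's internal UTF8String rather than converting to
      // java.lang.String, and read through `data` at the element's offset.
      list[i] = data.getUTF8String(offset + i);
    }
  }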

 for (int i = 0; i < length; i++) {
   if (!data.isNullAt(offset + i)) {
-    list[i] = data.getDouble(offset + i);
+    list[i] = getAtMethod.call(i);
Contributor

Can we just call get(i + offset, dt)? The getAtMethod seems not very useful, as we still need to go through the if-else branches in get every time.

Author

Shouldn't it be get(i, dt)? I updated it anyway.

Contributor

Yea, it should be get(i, dt).
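With that settled, the per-type copy presumably collapses to the generic accessor, along these lines (sketch only, not the merged diff):

for (int i = 0; i < length; i++) {
  if (!data.isNullAt(offset + i)) {
    // Indices on the array wrapper are relative to its own offset, hence
    // get(i, dt) rather than get(i + offset, dt); get dispatches on dt.
    list[i] = get(i, dt);
  }
}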

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81759 has finished for PR 19230 at commit adbaeab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Since ColumnVector is only used by the vectorized Parquet reader, and the reader currently doesn't support nested types, I can't think of an end-to-end regression test. However, we can still have a unit test for ColumnVector.

@viirya
Member

viirya commented Sep 15, 2017

Yea, we should add a unit test for it.

@liufengdb
Author

@viirya @cloud-fan unit test updated.

@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
Member

We don't use this now.

Member

@liufengdb I think we don't need this import now?

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81835 has finished for PR 19230 at commit 19502f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Populate it with arrays [0], [1, 2], [], [3, 4, 5]
testVector.putArray(0, 0, 1)
testVector.putArray(1, 1, 2)
testVector.putArray(2, 2, 0)
Member

I think it doesn't affect the result, but it looks like the third array should be testVector.putArray(2, 3, 0)?

Contributor

+1
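For reference, putArray takes (rowId, offset, length), where offset and length index into the shared child data 0, 1, 2, 3, 4, 5. The populate sequence with the suggested fix would read (the fourth call is inferred from the comment above, not quoted from the PR):

testVector.putArray(0, 0, 1) // [0]
testVector.putArray(1, 1, 2) // [1, 2]
testVector.putArray(2, 3, 0) // [] (offset 3 keeps the offsets consistent)
testVector.putArray(3, 3, 3) // [3, 4, 5]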

@viirya
Member

viirya commented Sep 16, 2017

@liufengdb The PR description reads like an end-to-end failure. I'm curious: are you hitting this failure in an end-to-end case?

@@ -158,7 +158,7 @@ private static void appendValue(WritableColumnVector dst, DataType t, Object o)
       dst.getChildColumn(0).appendInt(c.months);
       dst.getChildColumn(1).appendLong(c.microseconds);
     } else if (t instanceof DateType) {
-      dst.appendInt(DateTimeUtils.fromJavaDate((Date)o));
+      dst.appendInt((int) DateTimeUtils.fromJavaDate((Date)o));
Contributor

Is the cast necessary?

@cloud-fan
Contributor

LGTM except for some minor comments.

@viirya
Member

viirya commented Sep 16, 2017

LGTM too.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81850 has finished for PR 19230 at commit 5cbf978.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
Member

Please revert it.

Author

oops, reverted it.

    assert(array.get(1, schema).asInstanceOf[ColumnarBatch.Row].get(0, IntegerType) === 456)
    assert(array.get(1, schema).asInstanceOf[ColumnarBatch.Row].get(1, DoubleType) === 5.67)
  }
}
Member

Is it better to add a test for map, too?

Author

liufengdb commented Sep 18, 2017

Member

I see.
Does your change expect that this call eventually throws an exception for a Map element in an array?
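If that's the expectation, a minimal check could assert the exception directly, e.g. (a sketch assuming a ScalaTest-style suite and the array value from the snippet above; this is not the PR's actual test):

// Maps are not supported by the vectorized reader, so reading a map
// element out of the array should throw.
intercept[UnsupportedOperationException] {
  array.get(0, MapType(IntegerType, IntegerType))
}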

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81861 has finished for PR 19230 at commit 5ea4e89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@cloud-fan
Contributor

retest this please

@kiszk
Member

kiszk commented Sep 18, 2017

Can we add test code for a null row in a column for each type?

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81872 has finished for PR 19230 at commit 5ea4e89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81877 has finished for PR 19230 at commit 5ea4e89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

asfgit closed this in 3b049ab Sep 18, 2017