
[SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors #18787

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

This PR allows creating a ColumnarBatch from ReadOnlyColumnVectors, where previously a columnar batch could only allocate its vectors internally. This is useful for using ArrowColumnVectors in batch form to do row-based iteration. Also added ArrowConverters.fromPayloadIterator, which converts an ArrowPayload iterator to an InternalRow iterator and uses a ColumnarBatch internally.

How was this patch tested?

Added a new unit test for creating a ColumnarBatch with ReadOnlyColumnVectors and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using toPayloadIterator and fromPayloadIterator.
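The rows -> ArrowPayload -> rows roundtrip that the test verifies can be modeled without Spark or Arrow. The sketch below uses plain lists as stand-in "payloads" (all names here are hypothetical, not Spark's API); it only illustrates the property being tested: batching and then flattening recovers the original rows.

```java
import java.util.ArrayList;
import java.util.List;

public class RoundTripSketch {
    // Stand-in for toPayloadIterator: split rows into fixed-size batches.
    static List<List<Integer>> toPayloads(List<Integer> rows, int batchSize) {
        List<List<Integer>> payloads = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            payloads.add(new ArrayList<>(
                rows.subList(i, Math.min(i + batchSize, rows.size()))));
        }
        return payloads;
    }

    // Stand-in for fromPayloadIterator: flatten batches back into rows.
    static List<Integer> fromPayloads(List<List<Integer>> payloads) {
        List<Integer> rows = new ArrayList<>();
        for (List<Integer> payload : payloads) {
            rows.addAll(payload);
        }
        return rows;
    }
}
```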

@SparkQA

SparkQA commented Jul 31, 2017

Test build #80094 has finished for PR 18787 at commit f35b92c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80099 has finished for PR 18787 at commit 43214b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80108 has finished for PR 18787 at commit f906156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ReadOnlyColumnVector[] columns,
int numRows) {
for (ReadOnlyColumnVector c: columns) {
assert(c.capacity >= numRows);
Member

Is there any good way to move this assert into another loop? I am afraid that in production, where assertions are disabled, this becomes a loop with no body that is still executed.

Member Author

Maybe this should throw an exception then?
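One way to realize that suggestion (a sketch, not the code that was merged): validate each column's capacity with an explicit exception, so the check also runs in production where Java `assert` statements are disabled by default. The method name and the `int[]` signature are simplifications of the real `ReadOnlyColumnVector[]` parameter.

```java
public class CapacityCheck {
    // Throws even when the JVM runs without -ea, unlike a Java assert.
    static void checkCapacities(int[] capacities, int numRows) {
        for (int c : capacities) {
            if (c < numRows) {
                throw new IllegalArgumentException(
                    "column capacity " + c + " is smaller than numRows " + numRows);
            }
        }
    }
}
```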


public static ColumnarBatch createReadOnly(
StructType schema,
ReadOnlyColumnVector[] columns,
Member

Do we need to restrict this to only ReadOnlyColumnVector?


Is it necessary? What impact will it cause?

Member Author

It doesn't need to be restricted, but if they are ReadOnlyColumnVectors then they are already populated and it is safe to call setNumRows(numRows) here. If this accepted any ColumnVector, someone could pass in unallocated vectors and cause issues.

return batch;
}

private static ColumnarBatch create(StructType schema, ColumnVector[] columns, int capacity) {
Member Author

@ueshin, if we want to allow creating a ColumnarBatch from any array of ColumnVectors, we could make this method public; it doesn't call setNumRows or assume the vectors are already allocated.

@BryanCutler
Member Author

@cloud-fan @icexelloss, this just adds the ability to create a ColumnarBatch with a row iterator from Arrow data. It should be usable for any vectorized UDF implementation, and I already tried it out in #18659 and it works quite well. Let me know if it works for you, thanks!

close()
}

private var _batch: ColumnarBatch = _
Member Author

TODO: not needed

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80430 has finished for PR 18787 at commit 23d19df.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

jenkins retest this please

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80438 has finished for PR 18787 at commit 23d19df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

int numRows) {
assert(schema.length() == columns.length);
ColumnarBatch batch = new ColumnarBatch(schema, columns, numRows);
batch.setNumRows(numRows);
Contributor

Do we need to check each ReadOnlyColumnVector has numRows?

Member Author

The ArrowColumnVector.valueCount here would need to be moved to ReadOnlyColumnVector, where it could take the place of capacity. If @ueshin thinks that's ok to do here, I can add that.

ReadOnlyColumnVector[] columns,
int numRows) {
assert(schema.length() == columns.length);
ColumnarBatch batch = new ColumnarBatch(schema, columns, numRows);
Contributor

Why is the capacity set to numRows inside the constructor if we still need to call batch.setNumRows() manually?

Member Author

The max capacity only has meaning when allocating ColumnVectors, so it doesn't really do anything for read-only vectors. You need to call setNumRows to tell the batch how many rows there are for the given columns; it doesn't look at the capacity of the individual vectors.
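The distinction can be sketched with a toy batch class (a hypothetical stand-in, not Spark's ColumnarBatch): capacity is an allocation-time bound, while numRows must be declared separately because pre-populated, read-only vectors never go through the allocation path.

```java
public class ToyBatch {
    final int capacity;      // upper bound used when vectors are allocated internally
    private int numRows = 0; // how many rows are actually valid for reading

    ToyBatch(int capacity) {
        this.capacity = capacity;
    }

    // Must be called explicitly: the batch does not inspect the vectors
    // themselves to discover how many rows they hold.
    void setNumRows(int numRows) {
        if (numRows > capacity) {
            throw new IllegalArgumentException("numRows exceeds capacity");
        }
        this.numRows = numRows;
    }

    int numRows() {
        return numRows;
    }
}
```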

@BryanCutler
Member Author

@ueshin @cloud-fan , what are your thoughts on merging this to enable ArrowColumnVector to be used in a batch?

@cloud-fan
Contributor

Actually I think ReadOnlyColumnVector may not be a good abstraction. Ideally ColumnVector should be read only, and then we have a MutableColumnVector with write interfaces. @ueshin is working on it and will send the PR soon, can we hold this patch for a while? Thanks!

@BryanCutler
Member Author

Yes, I agree with changing the interfaces as you suggest @cloud-fan. Is there currently a JIRA open for that? I'm ok with holding off if it's planned soon, but I would like to get started on SPARK-20791, which will create a Spark DataFrame from Pandas with Arrow and also depends on this. I don't think the changes you are suggesting would affect this PR much, just the names of the classes used. Any chance we can merge this first?

@BryanCutler
Member Author

Updated to use the new API for ColumnarBatch, please take a look @ueshin @cloud-fan

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81122 has finished for PR 18787 at commit a90a71b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

new ArrowRowIterator {
private var reader: ArrowFileReader = null
private var schemaRead = StructType(Seq.empty)
private var rowIter = if (payloadIter.hasNext) nextBatch() else Iterator.empty
Member

We can simply put Iterator.empty here.

Member Author

nextBatch() returns the row iterator, so rowIter needs to be initialized here to the rows of the first batch.

Member

Never mind, I thought the first call to hasNext would initialize it.
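The initialization discussed above can be sketched outside Spark: a row iterator built over an iterator of batches has to seed its inner iterator from the first batch. Here plain lists stand in for Arrow batches, and the class name is made up.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BatchRowIterator implements Iterator<Integer> {
    private final Iterator<List<Integer>> batches;
    private Iterator<Integer> rowIter;

    BatchRowIterator(Iterator<List<Integer>> batches) {
        this.batches = batches;
        // Seed from the first batch, mirroring the Scala code above; starting
        // from an empty iterator alone would need extra advance logic.
        this.rowIter = batches.hasNext() ? nextBatch() : Collections.emptyIterator();
    }

    private Iterator<Integer> nextBatch() {
        return batches.next().iterator();
    }

    @Override
    public boolean hasNext() {
        // Skip any exhausted (or empty) batches before answering.
        while (!rowIter.hasNext() && batches.hasNext()) {
            rowIter = nextBatch();
        }
        return rowIter.hasNext();
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return rowIter.next();
    }
}
```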

@@ -1261,4 +1264,55 @@ class ColumnarBatchSuite extends SparkFunSuite {
s"vectorized reader"))
}
}

test("create read-only batch") {
Member

create a columnar batch from Arrow column vectors or something?

batch.getRow(100)
}

columnVectors.foreach(_.close())
Member

We can use batch.close() here.


val schema = StructType(Seq(StructField("int", IntegerType)))

val batch = new ColumnarBatch(schema, Array[ColumnVector](new ArrowColumnVector(vector)), 11)
Member

Btw, do we need to use ColumnarBatch for this test?
I guess we can simply create Iterator[InternalRow] and use it.

Member Author

You mean just calling something like new ColumnarBatch(..).rowIterator()? We still need to set the number of rows in the batch, I believe.

Member Author

Oh, you mean not using ArrowColumnVector at all and just making an Iterator[InternalRow] some other way? That would probably work, but I figured why not test out the columnar batch this way also.

Member

Yes, I meant your second comment.
We do test the columnar batch with ArrowColumnVector in ColumnarBatchSuite and also we use it in ArrowConverters.fromPayloadIterator(), so I thought we don't need to use it here.

test("roundtrip payloads") {
val allocator = ArrowUtils.rootAllocator.newChildAllocator("int", 0, Long.MaxValue)
val vector = ArrowUtils.toArrowField("int", IntegerType, nullable = true)
.createVector(allocator).asInstanceOf[NullableIntVector]
Member

Should the allocator and the vector be closed at the end of this test?

Member Author
@BryanCutler BryanCutler Aug 29, 2017

Yes, thanks for catching that. I'm closing them now.

@SparkQA

SparkQA commented Aug 29, 2017

Test build #81225 has finished for PR 18787 at commit 3fcdec5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@ueshin I updated and had a couple of questions on your comments, please take a look, thanks!

@BryanCutler
Member Author

@ueshin I updated the test to use a seq of Rows now

@ueshin
Member

ueshin commented Aug 31, 2017

LGTM, pending Jenkins.

@SparkQA

SparkQA commented Aug 31, 2017

Test build #81270 has finished for PR 18787 at commit ffcbf75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Aug 31, 2017

Thanks! merging to master.

@asfgit asfgit closed this in 964b507 Aug 31, 2017
@BryanCutler
Member Author

Thanks @ueshin!

}

intercept[java.lang.AssertionError] {
batch.getRow(100)
Member

Member Author

Hmm, that is strange. I'll take a look, thanks.

Member

Member Author

It's probably because the assert is being compiled out. This should probably not be in the test then.

Member
@dongjoon-hyun dongjoon-hyun Aug 31, 2017

Then, please check the error message here. Please ignore this.

Member Author
@BryanCutler BryanCutler Aug 31, 2017

I think the problem is that if the Java assertion is compiled out, then no error is produced and the test fails.

Member Author

I just made #19098 to remove this check - it's not really testing the functionality added here anyway, but maybe another test should be added for checking index-out-of-bounds errors.
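Whether a Java `assert` runs is a JVM launch option (`-ea`/`-da`), not a compile-time setting; the probe below (standalone, unrelated to Spark's classes) detects it at runtime, which explains why an `intercept[AssertionError]` expectation can fail when tests run without `-ea`.

```java
public class AssertProbe {
    // Returns true only if `assert` statements in this class actually execute.
    static boolean assertExecuted() {
        boolean[] ran = {false};
        assert ran[0] = true; // side effect occurs only when assertions are enabled
        return ran[0];
    }

    public static void main(String[] args) {
        System.out.println("asserts run: " + assertExecuted());
        System.out.println("JVM setting: " + AssertProbe.class.desiredAssertionStatus());
    }
}
```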

@BryanCutler BryanCutler deleted the arrow-ColumnarBatch-support-SPARK-21583 branch March 6, 2018 23:36
8 participants