[SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors #18787
Conversation
Test build #80094 has finished for PR 18787 at commit
Test build #80099 has finished for PR 18787 at commit
Test build #80108 has finished for PR 18787 at commit
```java
    ReadOnlyColumnVector[] columns,
    int numRows) {
  for (ReadOnlyColumnVector c : columns) {
    assert(c.capacity >= numRows);
```
Is there a good way to move this assert into the other loop? I am afraid that when assertions are disabled in production, this still executes as a loop with no body.
Maybe this should throw an exception then?
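As a minimal sketch of the suggestion above (hypothetical names, not the actual Spark code): replacing the assert-only loop with an explicit check keeps the validation alive in production, where `assert` is compiled out unless the JVM runs with `-ea`.

```java
// Sketch: validate that every column can hold numRows by throwing an
// exception instead of relying on `assert`, which is a no-op unless the
// JVM is started with -ea. The names here (checkCapacities, the int[]
// of capacities) are illustrative stand-ins, not Spark's API.
final class BatchChecks {
    static void checkCapacities(int[] capacities, int numRows) {
        for (int capacity : capacities) {
            if (capacity < numRows) {
                throw new IllegalArgumentException(
                    "Column capacity " + capacity + " is smaller than numRows " + numRows);
            }
        }
    }

    public static void main(String[] args) {
        checkCapacities(new int[] {16, 16, 32}, 10);  // all capacities >= 10: ok
        boolean threw = false;
        try {
            checkCapacities(new int[] {16, 8}, 10);   // 8 < 10: should throw
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        System.out.println(threw);  // prints true
    }
}
```

The check costs one branch per column either way, so there is little reason to prefer the removable assert here.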
```java
public static ColumnarBatch createReadOnly(
    StructType schema,
    ReadOnlyColumnVector[] columns,
```
Do we need to restrict this to only ReadOnlyColumnVector?
Is it necessary? What impact will it cause?
It doesn't need to be restricted, but if they are ReadOnlyColumnVectors then it means they are already populated and it is safe to call setNumRows(numRows) here. If it took in any ColumnVector then it might cause issues by someone passing in unallocated vectors.
```java
  return batch;
}

private static ColumnarBatch create(StructType schema, ColumnVector[] columns, int capacity) {
```
@ueshin, if we want to allow creating a ColumnarBatch from any array of ColumnVectors, then we could make this public, as it doesn't call setNumRows and assumes they are already allocated.

@cloud-fan @icexelloss, this just adds the ability to create a ColumnarBatch
```scala
  close()
}

private var _batch: ColumnarBatch = _
```
TODO: not needed
Test build #80430 has finished for PR 18787 at commit

jenkins retest this please

Test build #80438 has finished for PR 18787 at commit
```java
    int numRows) {
  assert(schema.length() == columns.length);
  ColumnarBatch batch = new ColumnarBatch(schema, columns, numRows);
  batch.setNumRows(numRows);
```
Do we need to check that each ReadOnlyColumnVector has at least numRows?
```java
    ReadOnlyColumnVector[] columns,
    int numRows) {
  assert(schema.length() == columns.length);
  ColumnarBatch batch = new ColumnarBatch(schema, columns, numRows);
```
Why is the capacity set to numRows inside the ctor, but we still need to call batch.setNumRows() manually?
The max capacity only has meaning when allocating ColumnVectors, so it doesn't really do anything for read-only vectors. You need to call setNumRows to tell the batch how many rows there are for the given columns; it doesn't look at the capacity in the individual vectors.
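The capacity-vs-numRows distinction described above can be sketched with a toy model (this is not Spark's ColumnarBatch; the class and field names are illustrative):

```java
// Toy model of the split: capacity bounds what the vectors *could* hold
// (only meaningful when the batch allocates its own vectors), while numRows
// records how many rows are actually valid and must be set explicitly once
// pre-populated vectors are handed in.
final class ToyBatch {
    final int capacity;   // allocation size; irrelevant for pre-populated vectors
    private int numRows;  // valid row count, set after population

    ToyBatch(int capacity) { this.capacity = capacity; }

    void setNumRows(int numRows) {
        if (numRows > capacity) {
            throw new IllegalArgumentException("numRows exceeds capacity");
        }
        this.numRows = numRows;
    }

    int numRows() { return numRows; }

    public static void main(String[] args) {
        ToyBatch batch = new ToyBatch(11);
        batch.setNumRows(11);                 // the ctor only records capacity
        System.out.println(batch.numRows());  // prints 11
    }
}
```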
@ueshin @cloud-fan, what are your thoughts on merging this to enable

Actually I think

Yes, I agree with changing the interfaces as you suggest @cloud-fan, is there currently a JIRA open for that? I'm ok with holding off if it's planned to be soon, but I would like to get started on SPARK-20791 that will create a Spark DataFrame from Pandas with Arrow, which depends on this also. I don't think the changes you are suggesting would affect this PR much, just renaming the classes used. Any chance we can merge this first?

Updated to use the new API for ColumnarBatch, please take a look @ueshin @cloud-fan
Test build #81122 has finished for PR 18787 at commit
```scala
new ArrowRowIterator {
  private var reader: ArrowFileReader = null
  private var schemaRead = StructType(Seq.empty)
  private var rowIter = if (payloadIter.hasNext) nextBatch() else Iterator.empty
```
We can simply put Iterator.empty here.
nextBatch() returns the row iterator, so rowIter needs to be initialized here to the first row in the first batch.
nvm, I thought the first call of hasNext would initialize it.
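The iterator-of-batches pattern being discussed can be sketched self-contained (a hypothetical analogue over lists of ints, not the actual ArrowRowIterator): the first batch is loaded eagerly in the constructor, mirroring how rowIter is initialized to nextBatch() when payloadIter.hasNext, and hasNext rolls over to the next batch when the current one runs out.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Flattens an iterator of batches into an iterator of rows.
final class BatchRowIterator implements Iterator<Integer> {
    private final Iterator<List<Integer>> batchIter;
    private Iterator<Integer> rowIter;

    BatchRowIterator(Iterator<List<Integer>> batchIter) {
        this.batchIter = batchIter;
        // Eager init: load the first batch if there is one, else stay empty.
        this.rowIter = batchIter.hasNext() ? batchIter.next().iterator()
                                           : Collections.emptyIterator();
    }

    @Override public boolean hasNext() {
        // Skip over exhausted (or empty) batches until a row is found.
        while (!rowIter.hasNext() && batchIter.hasNext()) {
            rowIter = batchIter.next().iterator();
        }
        return rowIter.hasNext();
    }

    @Override public Integer next() { return rowIter.next(); }

    public static void main(String[] args) {
        Iterator<List<Integer>> batches =
            java.util.Arrays.asList(java.util.Arrays.asList(1, 2),
                                    java.util.Arrays.asList(3)).iterator();
        BatchRowIterator it = new BatchRowIterator(batches);
        StringBuilder out = new StringBuilder();
        while (it.hasNext()) out.append(it.next());
        System.out.println(out);  // prints 123
    }
}
```

Initializing to Iterator.empty would also work, but only if hasNext is guaranteed to be called before next, which the eager initialization avoids relying on.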
```scala
@@ -1261,4 +1264,55 @@ class ColumnarBatchSuite extends SparkFunSuite {
        s"vectorized reader"))
    }
  }

  test("create read-only batch") {
```
create a columnar batch from Arrow column vectors, or something?
```scala
    batch.getRow(100)
  }

  columnVectors.foreach(_.close())
```
We can use batch.close() here.
```scala
val schema = StructType(Seq(StructField("int", IntegerType)))

val batch = new ColumnarBatch(schema, Array[ColumnVector](new ArrowColumnVector(vector)), 11)
```
Btw, do we need to use ColumnarBatch for this test? I guess we can simply create Iterator[InternalRow] and use it.
You mean just calling something like new ColumnarBatch(..).rowIterator()? We still need to set the number of rows in the batch, I believe.
Oh, you mean not using ArrowColumnVector at all and just making an Iterator[InternalRow] some other way? That would probably work, but I figured why not test out the columnar batch this way also.
Yes, I meant your second comment. We do test the columnar batch with ArrowColumnVector in ColumnarBatchSuite, and we also use it in ArrowConverters.fromPayloadIterator(), so I thought we don't need to use it here.
```scala
test("roundtrip payloads") {
  val allocator = ArrowUtils.rootAllocator.newChildAllocator("int", 0, Long.MaxValue)
  val vector = ArrowUtils.toArrowField("int", IntegerType, nullable = true)
    .createVector(allocator).asInstanceOf[NullableIntVector]
```
Should the allocator and the vector be closed at the end of this test?
yes, thanks for catching that. I close them now.
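A standard way to make this kind of cleanup leak-proof is try-with-resources, since Arrow's allocators and vectors implement AutoCloseable. A self-contained sketch (FakeAllocator/FakeVector are stand-ins for the real Arrow classes, which aren't on a stock classpath):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates try-with-resources for the close-at-end-of-test concern:
// resources are closed in reverse declaration order, even if the body throws.
final class CloseDemo {
    static final List<String> closed = new ArrayList<>();

    static final class FakeAllocator implements AutoCloseable {
        @Override public void close() { closed.add("allocator"); }
    }
    static final class FakeVector implements AutoCloseable {
        @Override public void close() { closed.add("vector"); }
    }

    public static void main(String[] args) {
        try (FakeAllocator allocator = new FakeAllocator();
             FakeVector vector = new FakeVector()) {
            // ... the test body would use allocator/vector here ...
        }
        // Vector (declared last) closes first, then the allocator.
        System.out.println(closed);  // prints [vector, allocator]
    }
}
```

Closing the vector before its allocator matters in the real code too, since an allocator typically checks for outstanding child allocations when it closes.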
Test build #81225 has finished for PR 18787 at commit

@ueshin I updated and had a couple of questions on your comments, please take a look, thanks!

@ueshin I updated the test to use a seq of Rows now

LGTM, pending Jenkins.

Test build #81270 has finished for PR 18787 at commit

Thanks! merging to master.

Thanks @ueshin!
```scala
}

intercept[java.lang.AssertionError] {
  batch.getRow(100)
```
Hi, @BryanCutler and @ueshin. This seems to make the master branch fail. Could you take a look once more? Thank you in advance!
Hmm, that is strange. I'll take a look, thanks.
Thanks! It seems to happen with Maven only; sbt-hadoop-2.6 passed.
It's probably because the assert is being compiled out. This should probably not be in the test then.
Then, please check the error message here. Please ignore this.
I think the problem is that if the Java assertion is compiled out, then no error is produced and the test fails.
I just made #19098 to remove this check - it's not really testing the functionality added here anyway, but maybe another test should be added for checking index-out-of-bounds errors.
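A sketch of why the `intercept[AssertionError]` test was fragile and what a deterministic bounds check looks like (`RowAccess.getRow` is a hypothetical stand-in, not ColumnarBatch's actual method): Java `assert` statements only fire when the JVM runs with `-ea`, so a test expecting AssertionError passes or fails depending on build-tool JVM flags — consistent with the sbt-passes/Maven-fails behavior seen above — while an explicit check is always enforced.

```java
// An explicit bounds check throws regardless of -ea, so a test asserting
// on out-of-range access behaves the same under every build tool.
final class RowAccess {
    private final int numRows;

    RowAccess(int numRows) { this.numRows = numRows; }

    int getRow(int rowId) {
        if (rowId < 0 || rowId >= numRows) {  // always enforced, unlike assert
            throw new IndexOutOfBoundsException("rowId " + rowId + " of " + numRows);
        }
        return rowId;
    }

    public static void main(String[] args) {
        RowAccess batch = new RowAccess(11);
        boolean threw = false;
        try {
            batch.getRow(100);  // out of range: 100 >= 11
        } catch (IndexOutOfBoundsException e) {
            threw = true;
        }
        System.out.println(threw);  // prints true
    }
}
```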
What changes were proposed in this pull request?

This PR allows the creation of a ColumnarBatch from ReadOnlyColumnVectors, where previously a columnar batch could only allocate vectors internally. This is useful for using ArrowColumnVectors in a batch form to do row-based iteration. Also added ArrowConverters.fromPayloadIterator, which converts an ArrowPayload iterator to an InternalRow iterator and uses a ColumnarBatch internally.

How was this patch tested?

Added a new unit test for creating a ColumnarBatch with ReadOnlyColumnVectors, and a test to verify the roundtrip of rows -> ArrowPayload -> rows using toPayloadIterator and fromPayloadIterator.